The Structured Text Toolkit Project -- Introduction

javadoc with linked source code.

2003-12-02 -- Source code will be available shortly.

The project focuses on an alternative to XML for structured text. At the basic level it aims to constrain the text as little as possible: text only needs to be indented correctly to form a valid structure. There are no 'types' of nodes such as 'element', 'attribute' or 'comment' -- only text.
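
To make the "indentation only" idea concrete, here is a minimal sketch of how indented text could be turned into a tree of untyped nodes. It is not the toolkit's parser: the names IndentSketch, Node and parse are hypothetical, and the sketch assumes that indentation consists of spaces and that blank lines carry no structure.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class IndentSketch {
    /** Hypothetical node: one line of text plus child nodes, with no node types. */
    static class Node {
        final String text;
        final List<Node> children = new ArrayList<>();
        Node(String text) { this.text = text; }
    }

    /** Parses indented text into a tree; the returned root has empty text. */
    static Node parse(String source) {
        Node root = new Node("");
        Deque<Node> open = new ArrayDeque<>();       // currently open ancestors
        Deque<Integer> indents = new ArrayDeque<>(); // indentation of each open ancestor
        open.push(root);
        indents.push(-1);                            // the root sits above every line
        for (String line : source.split("\n")) {
            if (line.trim().isEmpty()) continue;     // assumption: blank lines are ignored
            int indent = 0;
            while (line.charAt(indent) == ' ') indent++;
            // Close every open node that is not indented less than this line.
            while (indents.peek() >= indent) {
                open.pop();
                indents.pop();
            }
            Node node = new Node(line.trim());
            open.peek().children.add(node);
            open.push(node);
            indents.push(indent);
        }
        return root;
    }

    /** Prints the tree again, two spaces per level. */
    static void dump(Node node, String prefix) {
        for (Node child : node.children) {
            System.out.println(prefix + child.text);
            dump(child, prefix + "  ");
        }
    }

    public static void main(String[] args) {
        dump(parse("com\n  saelist\n    stx\n    util\n-error\n"), "");
    }
}
```

The stack of open ancestors is the only state the sketch needs, which is one way to see how little the format asks of the text.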

On top of this basic layer, features such as removing comments and allowing name=value notation can be added by activating (or writing) a set of operators that traverse the structure and process each node.
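
As a rough illustration of the operator idea, the sketch below applies a comment-stripping operator to every node in a depth-first traversal. The Operator interface, the traverse method and the Node class are assumptions made for this example, not the toolkit's API. The sketch also assumes that "#" always starts a comment and that a node consisting only of a comment is dropped together with its children.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class OperatorSketch {
    /** Hypothetical node: mutable text plus children. */
    static class Node {
        String text;
        final List<Node> children = new ArrayList<>();
        Node(String text) { this.text = text; }
        Node add(Node child) { children.add(child); return this; }
    }

    /** Hypothetical operator contract: process one node, return false to drop it. */
    interface Operator {
        boolean apply(Node node);
    }

    /** Cuts off a trailing "# ..." comment; nodes that were only a comment are dropped. */
    static final Operator STRIP_COMMENTS = node -> {
        int hash = node.text.indexOf('#');
        if (hash >= 0) node.text = node.text.substring(0, hash).trim();
        return !node.text.isEmpty();
    };

    /** Applies the operator depth-first to every descendant of parent. */
    static void traverse(Node parent, Operator op) {
        for (Iterator<Node> it = parent.children.iterator(); it.hasNext(); ) {
            Node child = it.next();
            if (op.apply(child)) traverse(child, op);
            else it.remove();
        }
    }

    public static void main(String[] args) {
        Node root = new Node("")
            .add(new Node("# root logger"))
            .add(new Node("-error"));
        traverse(root, STRIP_COMMENTS);
        for (Node child : root.children) System.out.println(child.text);
    }
}
```

Further operators, for example one that expands name=value notation, could be run over the same structure one after another.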

Schema validation based on regular expressions, XPath evaluation, and resolution of local references within files are among the higher-level features of the toolkit.

The project currently has functioning but unoptimized code for all of these features, although a great deal of work remains to be done.

A few example applications of the toolkit will hopefully help to clarify the direction of the project.

Example 1. Controlling the log-level at arbitrary points in the hierarchy of `log4j` loggers.

This simple example uses indented text to configure a Log4j logger hierarchy. Log4j loggers can be configured using a Java properties file, but I find the following format somewhat more comfortable, since the indentation reflects the hierarchy of the loggers.

    # Level settings for log4J.
    # level must be one of fatal, error, warn, info, debug or all.

    # root logger
    -error

    com
      saelist
        stx
          -info
          Pair=-error
          parser
            LstxParser=-error
        command
          AbstractCommand=-error
          ReplaceCommand=-info
          SmtpCommand=-info
          TransformCommand=-info
        util
          Strings=-info
          Net

Since logger names do not usually start with a "-", it is used to distinguish log levels from loggers. The Java code that interprets this file using the toolkit is given below, mainly to demonstrate its brevity:

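    protected static void setLogLevels(String filename) throws IOException {
      Pair logConfig = LstxParser.parse(Strings.loadFile(filename));
      // Select every node whose text starts with "-", i.e. every level marker.
      for (Iterator levels = logConfig.select("//*[starts-with(., '-')]"); levels.hasNext(); ) {
        Pair level = (Pair) levels.next();
        // Strip the leading "-" to obtain the level name.
        Level logLevel = toLevel(level.getText().substring(1));
        // The marker's parent names the logger that the level applies to.
        getLogger(pathOf(level.getParent(), ".")).setLevel(logLevel);
      }
    }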

The getLogger method returns the rootLogger for the empty string and the relevant logger for other strings.

The pathOf method returns the path from the root to the given node. For the '-error' under Pair, the node passed in is Pair, so it returns com.saelist.stx.Pair.

The toLevel method converts the string (error, debug etc) into a Level object.
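
The path helper is not shown in the listing, so the following is only a guess at how it could work. The Node class is a hypothetical stand-in for the toolkit's Pair that offers just getText() and getParent(), and the sketch assumes that the root is the one node without a parent and contributes nothing to the path.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class PathSketch {
    /** Hypothetical stand-in for Pair: text plus a link to the parent node. */
    static class Node {
        private final String text;
        private final Node parent;
        Node(String text, Node parent) { this.text = text; this.parent = parent; }
        String getText() { return text; }
        Node getParent() { return parent; }
    }

    /** Joins the texts from the top-level ancestor down to node; the root yields "". */
    static String pathOf(Node node, String separator) {
        Deque<String> names = new ArrayDeque<>();
        // Walk upwards and stop before the unnamed root (the node without a parent).
        for (Node n = node; n != null && n.getParent() != null; n = n.getParent()) {
            names.push(n.getText());                 // push puts ancestors in front
        }
        return String.join(separator, names);
    }

    public static void main(String[] args) {
        Node root = new Node("", null);
        Node com = new Node("com", root);
        Node saelist = new Node("saelist", com);
        Node stx = new Node("stx", saelist);
        Node pair = new Node("Pair", stx);
        Node level = new Node("-error", pair);
        System.out.println(pathOf(level.getParent(), "."));
    }
}
```

Because the root yields the empty string, a level marker at the top of the file ends up at the root logger, which matches the behaviour described for getLogger.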

Example 2. Validation schema

A schema to validate email messages might look like this:

    def
      email-address=[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+

    message *
      to +={def/email-address}
      cc *={def/email-address}
      bcc *={def/email-address}
      from={def/email-address}
      subject=.+
      attachment *
         mime-type ?=
         file=.+                            # Paths to files to be attached.
      is-html ?=true|false
      text
        .+ *

This fragment uses references to avoid cluttering the schema with repeated copies of the email-address regular expression. Dereferencing is accomplished by running a dereferencing operator over the nodes.

Notice that x=y is equivalent to

    x
      y

There is no qualitative difference between names and values.
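
The equivalence can be sketched as a small rewrite that turns a node "x=y" into a node x whose first child is y. The names NameValueSketch, Node and expand are hypothetical. The sketch assumes that the text is split at the first "=", that an empty value adds no child, and that the value itself is not split again.

```java
import java.util.ArrayList;
import java.util.List;

class NameValueSketch {
    /** Hypothetical node: mutable text plus children. */
    static class Node {
        String text;
        final List<Node> children = new ArrayList<>();
        Node(String text) { this.text = text; }
    }

    /** Rewrites every "x=y" in the subtree as x with y as its first child. */
    static void expand(Node node) {
        // Handle the existing children first so that the new value node is left alone.
        for (Node child : node.children) expand(child);
        int eq = node.text.indexOf('=');
        if (eq > 0) {
            String value = node.text.substring(eq + 1);
            node.text = node.text.substring(0, eq);
            if (!value.isEmpty()) node.children.add(0, new Node(value));
        }
    }

    public static void main(String[] args) {
        Node node = new Node("Pair=-error");
        expand(node);
        System.out.println(node.text);
        System.out.println("  " + node.children.get(0).text);
    }
}
```

After the rewrite, code that walks the tree no longer needs to know which of the two notations the author used.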

The schema is based on regular expressions on two levels. Each element has a qualifier expression that constrains how it may be composed and a quantifier that constrains how often it may occur (?, + and * denote optional, one-or-more and zero-or-more; the default is exactly once).
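
The two levels can be illustrated with a small check of one schema rule against the children of a node. This is a sketch under assumptions, not the toolkit's validator: the Rule class is hypothetical, children are simplified to name/value string pairs, and a space stands for the default "exactly once" quantifier.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class QuantifierSketch {
    /** Hypothetical rule: child name, quantifier ('?', '+', '*' or ' '), value expression. */
    static class Rule {
        final String name;
        final char quantifier;
        final Pattern value;
        Rule(String name, char quantifier, String valueRegex) {
            this.name = name;
            this.quantifier = quantifier;
            this.value = Pattern.compile(valueRegex);
        }
    }

    /** Quantifier level: is this number of occurrences allowed? */
    static boolean countAllowed(char quantifier, int count) {
        switch (quantifier) {
            case '?': return count <= 1;   // optional
            case '+': return count >= 1;   // one or more
            case '*': return true;         // zero or more
            default:  return count == 1;   // exactly once
        }
    }

    /** Checks the {name, value} children of one node against a single rule. */
    static boolean satisfies(Rule rule, List<String[]> children) {
        int count = 0;
        for (String[] child : children) {
            if (!child[0].equals(rule.name)) continue;
            // Qualifier level: the whole value must match the rule's expression.
            if (!rule.value.matcher(child[1]).matches()) return false;
            count++;
        }
        return countAllowed(rule.quantifier, count);
    }

    public static void main(String[] args) {
        List<String[]> message = Arrays.asList(
            new String[] {"to", "someone@somewhere.org"},
            new String[] {"subject", "something"});
        System.out.println(satisfies(new Rule("to", '+', ".+@.+"), message));
        System.out.println(satisfies(new Rule("is-html", '?', "true|false"), message));
    }
}
```

A full validator would also recurse into nested elements such as attachment and report which rule failed; the sketch only returns a yes or no for a single rule.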

A valid email message could be

    message
      to:someone@somewhere.org
      from:me@here.org
      subject:something
      attachment
        mime-type:image/jpeg
        file:a/b/c.jpg
      attachment
        file:d/e/f.gif
      text
        Hi there
        Some message,...
        that is it!
        enjoy.

Applying the schema to the data is up to the code. There is no automatic binding as in XML DTDs.

Example 3. Templates for transformation

The following snippet demonstrates a simple templating engine built on top of the toolkit.

    template
      message
        to={job/email}
        cc=sent-cvs@peersoft.de
        from=hannes@textsoft.de
        subject=Contract role - {job/title} (reference: {job/reference})
        attachment
           mime-type=application/msword
           file=/home/hannes/cv-hannes.doc
        text
          Hello {job/first-name}
          I noted with interest [...]

This template is applied to the following structure to yield an SMTP message that is valid according to the schema in Example 2.

    job
      title=OO Developer
      job-type=Contract
      text=An OO Developer is required [...]
      location=City, London
      start=ASAP
      agency=JM Contracts
      contact=Sarah Brown
      first-name=Sarah
      last-name=Brown
      email=sarahb@jmms.co.uk
      reference=JSJMC-FESB14
      posted=26/11/2003 10:57:31
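
How the {job/...} references get filled in is not spelled out here, so the following is only a minimal sketch of the substitution step. It assumes that the job structure has already been flattened into a map keyed by path (for example "job/title"); the class TemplateSketch and the method fill are hypothetical, and unknown references are left untouched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemplateSketch {
    /** Matches one {path} reference and captures the path. */
    private static final Pattern REFERENCE = Pattern.compile("\\{([^{}]+)\\}");

    /** Replaces each {path} in the line with its value; unknown paths stay as they are. */
    static String fill(String line, Map<String, String> values) {
        Matcher m = REFERENCE.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = value != null ? value : m.group();
            // quoteReplacement keeps "$" and "\" in the data from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> job = new HashMap<>();
        job.put("job/title", "OO Developer");
        job.put("job/reference", "JSJMC-FESB14");
        System.out.println(
            fill("subject=Contract role - {job/title} (reference: {job/reference})", job));
    }
}
```

Applying fill to every node of the template would produce the message structure, which could then be checked against a schema like the one in Example 2.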

Conclusions

- Data structures that intelligently represent the input space can reduce the complexity of the code drastically.

- Using schemas to offload and standardize much of the input validation can also help reduce the volume of code and allow it to focus on the normal use case and away from the handling of exceptions.

- In many cases, using indentation rather than free-form tagged text results in clearer, less cluttered text.

- The approach taken by the toolkit is to treat the structure at the lowest level as a tree of homogeneous nodes and to leave it up to higher levels and application code to create distinctions such as tags, text, attributes, comments, processing instructions, etc. At the lowest level, Pair and Pairlist are simply abstract data types akin to Map and List.