specification: Unotal - Syntax For Information

specification	S. Ram
	Stefan L. Ram
	February 11, 2006

Unotal - Syntax For Information

Abstract

This memo describes the notation "Unotal". Unotal can be used to represent textual as well as structured information. A Unotal standard interpretation can be used to interpret a Unotal room as an assertion. These denotational semantics of Unotal, however, are not part of this specification, which specifies only the syntax of Unotal. This specification does not contain a tutorial-like introduction or examples, which are available elsewhere.

1. Introduction
2. Status of productions and natural language text
3. Character Set and Encoding
4. Token Grammar
    4.1 White Space Tokens
    4.2 Free String Tokens
    4.3 Single Tokens
    4.4 Bracketed String Tokens
    4.5 Tokens
5. Base Structure
    5.1 Base Strings
    5.2 Base Rooms
    5.3 Base Expressions
6. Extended Structure
    6.1 Concatenation
    6.2 Assignments Process
    6.3 Comments
    6.4 Namespaces
    6.5 Types
    6.6 Rooms
    6.7 Expressions
§ Author's Address
A. Acknowledgements

1. Introduction

This specification intentionally contains little or no semantics, examples, explanations, or rationales. These might be added as separate documents. This document is intended to be a reference for the Unotal syntax. It is not a tutorial and might not be the best text for a first introduction to Unotal.

2. Status of productions and natural language text

A Unotal unit is a tuple of Unicode characters, encoded with UTF-8, as described by the rules of this specification. These rules cannot be expressed by EBNF productions alone. Additional restrictions and explanations given in the English text of this specification apply, too.

3. Character Set and Encoding

The characters "%d" in the EBNF-productions precede a decimal number to specify the character with the code point given by the number.

<any character>        ::= %d0 - %d2097151.
<line feed>            ::= %d10.
<form feed>            ::= %d12.
<carriage return>      ::= %d13.
<end of file>          ::= %d29.
<space>                ::= %d32.
<exclamation mark>     ::= %d33.
<number sign>          ::= %d35.
<percent sign>         ::= %d37.
<ampersand>            ::= %d38.
<left parenthesis>     ::= %d40.
<right parenthesis>    ::= %d41.
<plus sign>            ::= %d43.
<comma>                ::= %d44.
<hyphen-minus>         ::= %d45.
<full stop>            ::= %d46.
<solidus>              ::= %d47.
<colon>                ::= %d58.
<less-than sign>       ::= %d60.
<equals sign>          ::= %d61.
<greater-than sign>    ::= %d62.
<left square bracket>  ::= %d91.
<reverse solidus>      ::= %d92.
<right square bracket> ::= %d93.
<low line>             ::= %d95.
<left curly bracket>   ::= %d123.
<right curly bracket>  ::= %d125.
<tilde>                ::= %d126.
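
The "%d" notation maps directly onto code point numbers. As a minimal sketch (Python is used here only for illustration; the names CODE_POINTS and character are hypothetical and not part of this specification), a few of the productions above can be written as a lookup table:

```python
# A few of the character productions above, keyed by production name.
# The dictionary name and the selection of entries are illustrative only.
CODE_POINTS = {
    "line feed": 10,
    "space": 32,
    "less-than sign": 60,
    "greater-than sign": 62,
    "left square bracket": 91,
    "reverse solidus": 92,
    "right square bracket": 93,
    "tilde": 126,
}


def character(name: str) -> str:
    """Return the character that "%d<number>" denotes for the named production."""
    return chr(CODE_POINTS[name])


# Prints the three characters "<", "~", and ">" side by side.
print(character("less-than sign") + character("tilde") + character("greater-than sign"))
```

Note that the upper bound of <any character>, %d2097151, equals 2^21 - 1, which lies above the highest Unicode code point (1114111), so a function such as Python's chr does not accept the full range.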

4. Token Grammar

4.1 White Space Tokens

<white space character> ::= <line feed> | <form feed> |
<carriage return> | <space>.

<white space> ::= <white space character>
{<white space character>}.

4.2 Free String Tokens

A sequence of at least one <free string character> is considered to be a free string if it is neither directly preceded nor directly followed by a free string character.

The upper-case names of the following productions refer to the Unicode character categories of the same name. A "<Unicode-Letter>", "<Unicode-Mark>", or "<Unicode-Number>" is a character whose two-letter Unicode character category name starts with "L", "M", or "N", respectively.

<free string character> ::=
<Unicode-Letter> |
<Unicode-Mark> |
<Unicode-Number> |
<plus sign> |
<comma> |
<hyphen-minus> |
<full stop> |
<solidus> |
<colon> |
<low line>.

<free string> ::=
<free string character> { <free string character> }.
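
The character test and the "maximal run" rule above can be sketched as follows. This is an illustration under stated assumptions, not a normative algorithm: the function names are hypothetical, Python's unicodedata module is assumed to be an acceptable source of Unicode categories, and bracketed strings are ignored entirely.

```python
import unicodedata

# Punctuation explicitly listed in <free string character>.
FREE_PUNCTUATION = set("+,-./:_")


def is_free_string_character(ch: str) -> bool:
    """True if ch is a Unicode letter, mark, or number, or listed punctuation."""
    return unicodedata.category(ch)[0] in "LMN" or ch in FREE_PUNCTUATION


def free_strings(text: str) -> list[str]:
    """Collect maximal runs of free string characters, scanning left to right."""
    runs, current = [], ""
    for ch in text:
        if is_free_string_character(ch):
            current += ch
        elif current:
            runs.append(current)
            current = ""
    if current:
        runs.append(current)
    return runs


print(free_strings("alpha-1 = x.y"))  # ['alpha-1', 'x.y']
```

Because a run only ends at a character that is not a free string character, each returned string is neither directly preceded nor directly followed by a free string character.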

4.3 Single Tokens

A single token is a character that cannot be part of a free string. So a single token is never merged with other characters to form a multi-character token, unless it is part of a bracketed string.

<single token> ::=
( <exclamation mark> |
  <number sign> |
  <percent sign> |
  <ampersand> |
  <left parenthesis> |
  <right parenthesis> |
  <less-than sign> |
  <equals sign> |
  <greater-than sign> |
  <left curly bracket> |
  <right curly bracket> |
  <tilde> ).

4.4 Bracketed String Tokens

The reverse solidi "\" (read: "except") in the following productions precede symbols to be excluded from the set being described by the production. For example, "<any character> \ <left square bracket> \ <right square bracket>" means any character except the left and the right square bracket.

A <bracketed string> is a <bracketed text> that is not contained in any other <bracketed text>.

The text value of a <bracketed string> is its <text core>, with the final vertical bar removed if the text core ends in a sequence of vertical bars that is directly preceded by a <reverse solidus>. This sequence of vertical bars needs to consist of at least one vertical bar.

<bracketed text character> ::=
<any character> \ <left square bracket> \ <right square bracket>
\ <end of file> \ <reverse solidus>.

<bracketed text escape> ::= <reverse solidus>
( <left square bracket> | <right square bracket> | <reverse solidus> ).

<plain pair> ::= <reverse solidus> <bracketed text character>.

<bracketed text entry> ::=
<bracketed text character> | <bracketed text escape> |
<bracketed text> | <plain pair>.

<text core> ::= {<bracketed text entry>}.

<bracketed text> ::= <left square bracket> <text core>
<right square bracket>.

<bracketed string> ::= <bracketed text>.
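
The nesting and escape rules of the productions above can be sketched as a small scanner. This is a hedged illustration only: the function name scan_bracketed_text is hypothetical, error handling via ValueError is an assumption, and the computation of the text value (the vertical bar rule) is deliberately left out.

```python
END_OF_FILE = chr(29)  # <end of file> as defined in section 3


def scan_bracketed_text(text: str, start: int) -> int:
    """Return the index just past the <bracketed text> that begins at start.

    Raises ValueError if no well-formed <bracketed text> begins there.
    """
    if start >= len(text) or text[start] != "[":
        raise ValueError("expected a left square bracket")
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # <bracketed text escape> or <plain pair>: the reverse solidus
            # and the character after it are consumed together.
            if i + 1 >= len(text) or text[i + 1] == END_OF_FILE:
                raise ValueError("reverse solidus without a following character")
            i += 2
            continue
        if ch == END_OF_FILE:
            raise ValueError("end of file inside bracketed text")
        if ch == "[":
            depth += 1  # a nested <bracketed text> opens
        elif ch == "]":
            depth -= 1
            if depth == 0:
                return i + 1  # the outermost bracket has closed
        i += 1
    raise ValueError("unterminated bracketed text")


sample = r"[a [b] \] c] tail"
print(sample[:scan_bracketed_text(sample, 0)])  # [a [b] \] c]
```

Calling the scanner only at the outermost left square bracket yields a <bracketed string>, since nested bracketed texts are consumed as part of the enclosing one.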

4.5 Tokens

Some of the token types of this section are not used in any other productions of this text, but are defined here for reference in other texts. The symbol <error token> denotes a token that is different from any other token but otherwise unspecified.

The reverse solidi in the following productions precede symbols to be excluded from the set being described by the production.

<text token> ::= <white space> | <single token> |
<free string> | <bracketed string>.

<base core token> ::= <text token> \ <less-than sign> \
<greater-than sign>.

<generalized token> ::= <text token> | <end of file>.

<raw token> ::= <generalized token> | <error token>.
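
Taken together, the token productions of this section suggest a tokenizer along the following lines. This is a sketch under stated assumptions: all names are hypothetical, and since this specification leaves <error token> unspecified, treating each otherwise unmatched character as a one-character error token is purely an assumption of this example.

```python
import unicodedata

WHITE_SPACE = set("\n\f\r ")
SINGLE_TOKENS = set("!#%&()<=>{}~")
FREE_PUNCTUATION = set("+,-./:_")
END_OF_FILE = chr(29)


def is_free(ch):
    """True for a <free string character>."""
    return unicodedata.category(ch)[0] in "LMN" or ch in FREE_PUNCTUATION


def bracketed_end(text, start):
    """Index just past the bracketed string starting at start, or -1."""
    depth, i = 0, start
    while i < len(text):
        if text[i] == END_OF_FILE:
            return -1
        if text[i] == "\\":
            i += 2  # escape or plain pair: skip the following character
            continue
        if text[i] == "[":
            depth += 1
        elif text[i] == "]":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return -1


def tokenize(text):
    """Split text into (kind, value) pairs, scanning left to right."""
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch in WHITE_SPACE:
            j = i
            while j < len(text) and text[j] in WHITE_SPACE:
                j += 1
            kind = "white space"
        elif is_free(ch):
            j = i
            while j < len(text) and is_free(text[j]):
                j += 1
            kind = "free string"
        elif ch in SINGLE_TOKENS:
            j, kind = i + 1, "single token"
        elif ch == END_OF_FILE:
            j, kind = i + 1, "end of file"
        elif ch == "[" and bracketed_end(text, i) != -1:
            j, kind = bracketed_end(text, i), "bracketed string"
        else:
            j, kind = i + 1, "error token"  # assumption, see lead-in
        tokens.append((kind, text[i:j]))
        i = j
    return tokens


for kind, value in tokenize("<a [b]~[c]>"):
    print(kind, repr(value))
```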

5. Base Structure

5.1 Base Strings

Starting with this section, additional white space may be inserted between all symbols on the right-hand side of a production; it may also be inserted in front of or after any such symbol. This white space is not shown explicitly in the productions.

White space must be used to separate two adjacent free strings, which otherwise would be regarded as one single free string. This white space is also not shown explicitly in the productions.

<base string> ::= <free string> | <bracketed string>.

5.2 Base Rooms

<base piece> ::= <base core token> | <base room>.

<base core> ::= {<base piece>}.

<base room> ::= <less-than sign> <base core> <greater-than sign>.
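
The recursive definition of <base room> lends itself to a recursive-descent reader. The following is a minimal sketch, not a reference implementation: the function name is hypothetical, the input is assumed to be a list of token strings from which white space tokens have already been dropped, and nested Python lists are just one possible representation of rooms.

```python
def parse_base_room(tokens, pos=0):
    """Parse a <base room> starting at tokens[pos].

    tokens is a list of token strings with white space already dropped.
    Returns (room, next_pos), where room is a nested list of pieces.
    """
    if pos >= len(tokens) or tokens[pos] != "<":
        raise ValueError("a base room must start with '<'")
    pos += 1
    pieces = []
    while pos < len(tokens):
        token = tokens[pos]
        if token == ">":
            return pieces, pos + 1
        if token == "<":
            # A nested <base room> is itself a <base piece>.
            inner, pos = parse_base_room(tokens, pos)
            pieces.append(inner)
        else:
            # Any other token is a <base core token>.
            pieces.append(token)
            pos += 1
    raise ValueError("missing '>' at end of base room")


room, end = parse_base_room(["<", "a", "<", "[b]", "=", "c", ">", ">"])
print(room)  # ['a', ['[b]', '=', 'c']]
```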

5.3 Base Expressions

<base expression> ::= <base room> | <base string>.

6. Extended Structure

The following sections add structure to the base structure.

For some applications, it might be appropriate to use the base structure only.

All of the following procedures are valid only directly within a room or outside of any room. That is, they are not valid within a bracketed string.

All of the following procedures are to be applied in the order in which they are given here.

6.1 Concatenation

A <base room> is searched from left to right for <concatenation> symbols. When a token is found that is a valid start of a concatenation, this concatenation must be extended as far as possible: if a <concatenation> is followed by tokens so that a longer <concatenation> can be built, this has to be done.

For example, the room "<[a]~[b]~[c]>" contains one concatenation with three bracketed strings, not one <concatenation> followed by a tilde "~" and a bracketed string "[c]".

After this process, the <concatenated room> is viewed as a room containing <concatenation entry> symbols.

<concatenation> ::= <bracketed string> <tilde> <bracketed string>.
<concatenation> ::= <concatenation> <tilde> <bracketed string>.

<concatenation entry> ::=
<base entry> | <concatenation>.

<concatenated core> ::= <concatenation entry> |
<concatenated core> <concatenation entry>.

<concatenated room> ::= <less-than sign> <concatenated core> <greater-than sign>.

For example, the <concatenated room> "<a [b] ~ [c] [d]>" contains three <concatenation entry> symbols, namely "a", "[b]~[c]", and "[d]". The <concatenated piece> "[b]~[c]" consists of three <concatenation entry> symbols itself, but has to be interpreted as only one <concatenation entry> within its room.
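
The "extend as far as possible" rule of this section can be sketched as a single left-to-right pass. This is an illustration under assumptions: the function names are hypothetical, the input is the list of entries directly within one room (white space dropped), and a concatenation is represented as a Python tuple of its bracketed strings.

```python
def is_bracketed(token):
    """True if token is a string that looks like a bracketed string."""
    return isinstance(token, str) and token.startswith("[") and token.endswith("]")


def group_concatenations(entries):
    """Group '[x] ~ [y] ~ ...' runs into tuples, extending each run as far as possible."""
    result, i = [], 0
    while i < len(entries):
        if is_bracketed(entries[i]):
            parts = [entries[i]]
            j = i + 1
            # Extend while a tilde is followed by another bracketed string.
            while (j + 1 < len(entries) and entries[j] == "~"
                   and is_bracketed(entries[j + 1])):
                parts.append(entries[j + 1])
                j += 2
            if len(parts) > 1:
                result.append(tuple(parts))
                i = j
                continue
        # Not the start of a concatenation: keep the entry unchanged.
        result.append(entries[i])
        i += 1
    return result


print(group_concatenations(["a", "[b]", "~", "[c]", "[d]"]))
# ['a', ('[b]', '[c]'), '[d]']
print(group_concatenations(["[a]", "~", "[b]", "~", "[c]"]))
# [('[a]', '[b]', '[c]')]
```

The second call corresponds to the room "<[a]~[b]~[c]>" discussed above: it yields one concatenation of three bracketed strings rather than a shorter concatenation followed by a tilde.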

6.2 Assignments Process

To identify assignments within a room, first, the leftmost token sequence that is an assignment has to be identified. The tokens of this assignment are then considered to be consumed for this assignment. Then, the next leftmost assignment has to be searched for, ignoring all tokens that were already consumed. This is repeated until no more assignments can be identified.

<assignment piece> ::= <concatenation expression>
<equals sign> <ampersand> <concatenation expression>.

<assignment piece> ::= <concatenation expression>
<equals sign> <concatenation expression>.

When analyzing Unotal text to identify an <assignment piece>, the interpretation as a <concatenation expression> must not be chosen if another interpretation according to the preceding productions for <assignment piece> is possible.

<assignment piece> ::= <concatenation expression>.
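
The leftmost-first procedure and the preference for the longer productions can be sketched as follows. This is only an illustration resting on loud assumptions: <concatenation expression> is not defined in this text, so the sketch assumes that any single entry other than a bare operator sign can stand for one; the function names and the tuple tags "=&" and "=" are hypothetical.

```python
OPERATOR_SIGNS = ("=", "&", "%", "!")


def is_operand(entry):
    # Assumption: any entry that is not a bare operator sign may serve as
    # a <concatenation expression>.
    return entry not in OPERATOR_SIGNS


def group_assignments(entries):
    """Consume assignments from left to right, trying the longest form first."""
    result, i = [], 0
    while i < len(entries):
        rest = entries[i:i + 4]
        if (len(rest) == 4 and is_operand(rest[0]) and rest[1] == "="
                and rest[2] == "&" and is_operand(rest[3])):
            # <concatenation expression> = & <concatenation expression>
            result.append(("=&", rest[0], rest[3]))
            i += 4
        elif (len(rest) >= 3 and is_operand(rest[0]) and rest[1] == "="
                and is_operand(rest[2])):
            # <concatenation expression> = <concatenation expression>
            result.append(("=", rest[0], rest[2]))
            i += 3
        else:
            # Fallback production: the entry stands alone.
            result.append(entries[i])
            i += 1
    return result


print(group_assignments(["a", "=", "b", "c", "=", "&", "d", "e"]))
# [('=', 'a', 'b'), ('=&', 'c', 'd'), 'e']
```

Tokens that have been grouped into a tuple are skipped by advancing the index, which mirrors the notion of tokens being consumed.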

6.3 Comments

To identify comments within a room, first, the leftmost token sequence that is a comment has to be identified. The tokens of this comment are then considered to be consumed for this comment. Then, the next leftmost comment has to be searched for, ignoring all tokens that were already consumed. This is repeated until no more comments can be identified.

<comment piece> ::= <percent sign> <assignment piece>.

When analyzing Unotal text to identify a <comment piece>, the interpretation as an <assignment piece> must not be chosen if another interpretation according to the preceding production for <comment piece> is possible.

<comment piece> ::= <assignment piece>.

6.4 Namespaces

<namespace piece> ::= <exclamation mark> <comment piece>.

When analyzing Unotal text to identify a <namespace piece>, the following interpretation as a <comment piece> must not be chosen if another interpretation according to the preceding production for <namespace piece> is possible.

<namespace piece> ::= <comment piece>.

6.5 Types

<type piece> ::= <ampersand> <namespace piece>.

<type piece> ::= <namespace piece>.
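
Sections 6.3 to 6.5 nest in a fixed order: a <type piece> may start with "&", the <namespace piece> inside it with "!", and the <comment piece> inside that with "%". The following sketch reads these optional prefixes in one function. It is a simplification under assumptions: this specification describes separate passes applied in order, whereas the sketch collapses them; the function name and the dictionary keys are hypothetical; and the <assignment piece> is assumed to have been grouped into a single entry already.

```python
def parse_type_piece(entries, pos=0):
    """Read optional '&', '!', '%' prefixes (in that order) and one following entry.

    Returns (piece, next_pos), where piece is a dict of prefix flags plus
    the entry that serves as the <assignment piece>.
    """
    piece = {}
    for name, sign in (("type", "&"), ("namespace", "!"), ("comment", "%")):
        present = pos < len(entries) and entries[pos] == sign
        piece[name] = present
        if present:
            pos += 1
    if pos >= len(entries):
        raise ValueError("prefix without a following assignment piece")
    piece["body"] = entries[pos]
    return piece, pos + 1


piece, end = parse_type_piece(["&", "%", "x"])
print(piece)
# {'type': True, 'namespace': False, 'comment': True, 'body': 'x'}
```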

6.6 Rooms

<core piece> ::= <type piece> | <room>.

<core> ::= {<core piece>}.

<room> ::= <less-than sign> <core> <greater-than sign>.

6.7 Expressions

<expression> ::= <concatenation expression> | <room>.

Author's Address

	Stefan L. Ram
EMail:	ram@zedat.fu-berlin.de
URI:	http://www.purl.org/stefan_ram/

Appendix A. Acknowledgements

The author gratefully acknowledges the use of RFC-2629 software that was written by M. Rose.
