EBNF (ISO/IEC 14977)
The @choo-choo/parser-ebnf package consumes ISO/IEC 14977 Extended Backus–Naur Form source and produces a ParsedGrammar — a list of named rules, each carrying a Diagram IR tree. This document is the spec the parser must implement; tests are written against it.
The parser intentionally supports a pragmatic subset of ISO 14977. Features excluded in 0.1 are listed at the end.
Lexical structure
The tokenizer is a regex-based lexer with ordered rules, first-match-wins. Whitespace and comments are recognised but discarded. Every produced token carries a SourceRange ({ start, end }) with offset, line, and column.
Token table
| Type | Pattern | Purpose |
|---|---|---|
whitespace | \s+ | Ignored. |
comment | \(\* … \*\) (non-greedy, . matches newline) | Ignored. Nested comments are not supported. |
= | = | Rule definition. |
(, ) | \(, \) | Grouping. |
{, } | \{, \} | Repetition (zero-or-more). |
[, ] | \[, \] | Optional. |
| | | | Alternation. |
, | , | Concatenation. |
; | ; | Rule terminator. |
special | \?[^?]+\? | User-defined special sequence (verbatim text). |
terminal | "[^"]+" or '[^']+' | Literal token (quotes stripped). |
identifier | [A-Za-z][A-Za-z0-9 ]*[A-Za-z0-9] or a single letter | Non-terminal name. Internal spaces allowed; first and last character must be alphanumeric. |
Notes:
- Terminals accept either
"…"or'…'. (The legacy project used backticks; we move to single quotes to match ISO 14977.) - Identifiers may contain interior spaces, matching ISO 14977 (
meta identifier = letter, { letter | decimal digit | space }) — this lets users writeright hand side = …. The regex disallows a trailing space. - Keywords are operators; there are no reserved word identifiers.
- The tokenizer is case-sensitive.
- An unmatched character raises a
GrammarSyntaxErrorwith its line and column.
Syntax (productions)
The parser is a handwritten recursive-descent parser with a one-token lookahead. The grammar it implements:
grammar = { rule } ;
rule = identifier , "=" , definitions-list , ";" ;
definitions-list = single-definition , { "|" , single-definition } ; (* alternation *)
single-definition = syntactic-term , { "," , syntactic-term } ; (* concatenation *)
syntactic-term = syntactic-primary ;
syntactic-primary = identifier | terminal | optional | repetition | grouped | special ;
optional = "[" , definitions-list , "]" ;
repetition = "{" , definitions-list , "}" ;
grouped = "(" , definitions-list , ")" ;
special = special-token ; (* the ?…? literal *)Precedence
Per ISO 14977, alternation (|) is the outer operator and concatenation (,) is the inner operator. So:
a , b | c , dis parsed as (a , b) | (c , d), not a , (b | c) , d. This is a deliberate departure from the legacy project, whose parser inverted the precedence.
Associativity
Both | and , are left-associative, mirroring the standard. The parser produces flat children arrays for these operators (i.e. a, b, c yields a single three-child sequence in the IR, not a left-leaning binary tree).
Rule terminator
Rules end with ;. The standard also permits . as a terminator; we do not accept . in 0.1 (trivially extensible later by adding a token rule).
Whitespace
Whitespace is free between tokens. There is no layout rule.
Mapping to IR
Every EBNF construct maps to IR nodes with source populated:
| EBNF construct | IR node |
|---|---|
"x" or 'x' | terminal with text = the quoted content (quotes stripped) |
identifier | nonterminal with name = the identifier text |
? text ? | special with text = the content, not trimmed |
A , B | sequence with children = [A, B] (flattened) |
A | B | choice with children = [A, B] (flattened) |
[ A ] | optional with child = A, skip = "top" |
{ A } | optional wrapping repetition — i.e. zeroOrMore(A) |
( A ) | whatever A produces (parentheses are structural only) |
The rule production produces a Diagram whose child is the rule’s right-hand side and whose start / end default to simple. The rule’s name and source live on the enclosing GrammarRule, not on the Diagram.
Notes on the mapping:
{ A }is zero-or-more in ISO EBNF. We preserve that semantics by desugaring tooptional(repetition(A))at parse time, matching the manual builder’szeroOrMoreoutput. There is no standalone “zero-or-more” IR kind.- Group
( A )is transparent. The spec’s informalgroupIR node is not produced by this parser — it is reserved for the manual builder’s labelled visual grouping. If we later add labelled groups to EBNF, they’d go through aspecialor a new token. - Single-child
sequence/choicenever appear in the output: a single-child group parses to its inner node verbatim, per the builder’s unwrapping rule.
Source ranges
Every IR node produced by the parser carries a source?: SourceRange:
- Leaves (
terminal,nonterminal,special): the range of the originating token (including quotes /?delimiters). - Composites (
sequence,choice,optional,repetition): the range spanning the first to the last token that contributed to the node (e.g. for{ A }the range covers{through}). Diagram(per rule): the range spanning the rule’sidentifierthrough the terminating;.GrammarRule.source: same as the Diagram’s source.
Errors
The parser throws GrammarSyntaxError (exported from @choo-choo/parser-utils) with a message and a position: SourcePosition. Cases:
-
Unexpected character (tokenizer) — no regex rule matched at the current position. Message:
unexpected character "X" at line L, column C. -
Unexpected end of input (parser) — lookahead is
nullwhen a token was expected. Message:unexpected end of input; expected <type>. -
Unexpected token (parser) — lookahead type doesn’t match the expected type. Message:
unexpected <actualType> at line L, column C; expected <expectedType>. -
Invalid right-hand side start (parser) — the token at the start of a
syntactic-primaryisn’t one of the accepted kinds. Message:unexpected <type> at line L, column C; expected an identifier, terminal, "[", "{", "(", or special.
The parser does not attempt recovery. The first error aborts parsing.
Examples
A single rule
zero = "0";Produces a ParsedGrammar with one rule "zero" whose diagram wraps terminal("0").
Alternation
binary digit = "0" | "1";One rule "binary digit", diagram child: choice(terminal("0"), terminal("1")).
Concatenation + alternation (precedence check)
expr = a , b | c , d;Parses as choice(sequence(nt("a"), nt("b")), sequence(nt("c"), nt("d"))) (ISO precedence — | outer, , inner).
Optional and repetition
arguments = argument , { "," , argument };call = identifier , "(" , [ arguments ] , ")";{ "," , argument }→zeroOrMore(sequence(terminal(","), nt("argument")))[ arguments ]→optional(nt("arguments"))
Comment
(* identifiers start with a letter *)name = letter , { letter | digit };The comment is skipped by the tokenizer; the rule’s Diagram.source covers only name = … ;.
Special
regex atom = ? any Perl-compatible regular expression ?;Produces a special node with text = " any Perl-compatible regular expression " (spaces preserved).