ANTLR4
The @choo-choo/parser-antlr package consumes ANTLR4 grammar source (.g4 files) and produces a ParsedGrammar — a flat list of named rules, each carrying a Diagram IR tree. This document is the spec the parser must implement; tests are written against it.
The parser targets a deliberately pragmatic, railroad-diagram-oriented subset of ANTLR4. The guiding principle: a real grammar pulled unchanged from antlr/grammars-v4 must tokenise and parse without errors, but constructs that have no visual meaning in a railroad diagram (actions, semantic predicates, lexer commands, options blocks, …) are recognised and silently skipped. Only syntactically broken source (unbalanced braces, unterminated strings, unknown operators) triggers an error.
Lexical structure
The tokenizer is a regex-based lexer with ordered rules, first-match-wins, plus a dedicated balanced-brace scanner that swallows any { … } block — actions, predicates, options/tokens/channels blocks, at-commands — without tokenising its contents. Whitespace and comments are recognised but discarded. Every produced token carries a SourceRange ({ start, end }) with offset, line, and column.
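The ordered, first-match-wins scheme can be sketched as below. This is a minimal illustration, not the package's implementation: the names RULES, tokenize, and advance are hypothetical, only a handful of token types are included, the brace scanner is left out, and 1-based lines and columns with an exclusive end position are assumptions. One consequence of first-match-wins is that a multi-character operator (.., +=, ::) has to be tried before the single-character operator that is its prefix.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the package's API.
interface SourcePosition { offset: number; line: number; column: number }
interface Token { type: string; value: string; start: SourcePosition; end: SourcePosition }

// Ordered rules; a null type means "recognise and discard".
const RULES: Array<[string | null, RegExp]> = [
  [null, /^\s+/],                    // whitespace
  [null, /^\/\/[^\r\n]*/],           // line comment
  [null, /^\/\*[\s\S]*?\*\//],       // block comment
  ["COLON_COLON", /^::/],            // before COLON
  ["COLON", /^:/],
  ["DOTDOT", /^\.\./],               // before DOT
  ["DOT", /^\./],
  ["PLUS_ASSIGN", /^\+=/],           // before PLUS
  ["PLUS", /^\+/],
  ["SEMI", /^;/],
  ["STRING", /^'(?:\\.|[^'\\])*'/],
  ["IDENTIFIER", /^[A-Za-z_][A-Za-z0-9_]*/],
];

// Move a position past `text`, tracking line breaks (assumed 1-based line/column).
function advance(from: SourcePosition, text: string): SourcePosition {
  let { offset, line, column } = from;
  for (let i = 0; i < text.length; i++) {
    offset++;
    if (text[i] === "\n") { line++; column = 1; } else { column++; }
  }
  return { offset, line, column };
}

function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  let pos: SourcePosition = { offset: 0, line: 1, column: 1 };
  while (pos.offset < input.length) {
    const rest = input.slice(pos.offset);
    let matched = false;
    for (const [type, pattern] of RULES) {
      const m = pattern.exec(rest);
      if (m === null || m[0].length === 0) continue;
      const end = advance(pos, m[0]);
      if (type !== null) tokens.push({ type, value: m[0], start: pos, end });
      pos = end;
      matched = true;
      break; // first match wins
    }
    if (!matched) {
      throw new Error(`Unexpected character at ${pos.line}:${pos.column}`);
    }
  }
  return tokens;
}
```

Slicing the input on every step keeps the sketch short at the cost of quadratic work; a real lexer would more likely match in place.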
Token table
| Type | Pattern | Purpose |
|---|---|---|
| whitespace | \s+ | Ignored. |
| line comment | //[^\r\n]* | Ignored. |
| block comment | /\*[\s\S]*?\*/ (includes /** … */) | Ignored. Nested comments are not supported. |
| COLON | : | Rule definition. |
| SEMI | ; | Rule terminator / header terminator / statement terminator. |
| PIPE | \| | Alternation. |
| LPAREN/RPAREN | ( / ) | Grouping. |
| QUESTION | ? | Zero-or-one (also non-greedy suffix). |
| STAR | * | Zero-or-more. |
| PLUS | + | One-or-more. |
| TILDE | ~ | Negated charset/atom. |
| DOTDOT | .. | Character range (between two string literals). |
| DOT | . | Wildcard. |
| ASSIGN | = | Element label (x=INT). |
| PLUS_ASSIGN | += | List-element label (xs+=INT). |
| ARROW | -> | Start of lexer command. |
| COMMA | , | Separator in lexer-command lists and in import A, B;. |
| HASH | # | Start of an alt label (# LabelName). |
| AT | @ | Start of an at-command (@header {…}). |
| COLON_COLON | :: | At-command scope (@lexer::header). |
| STRING | `'(?:\\.\|[^'\\])*'` | Single-quoted string literal. |
| CHARSET | `\[(?:\\.\|[^\]\\])*\]` | Character set. |
| IDENTIFIER | [A-Za-z_][A-Za-z0-9_]* | Rule/token reference and contextual keywords. |
Contextual keywords (i.e. IDENTIFIER values that the parser treats specially depending on position): grammar, parser, lexer, fragment, import, mode, options, tokens, channels.
Notes:
- Only '…' string literals. ANTLR4 has no double-quoted string.
- Character sets are tokenised as a single CHARSET token whose value is the literal source text, brackets included. No expansion into individual alternatives.
- Balanced-brace blocks (actions, predicates, at-command bodies, options/tokens/channels blocks) are consumed by a dedicated scanner that respects nested braces, string literals (single, double, and backtick), and comments. A trailing ? immediately after the closing } is consumed too (semantic predicates).
- The tokenizer is case-sensitive. ANTLR’s uppercase-vs-lowercase identifier convention is not enforced at lex time; it’s meaningful only in the spec’s documentation of parser-vs-lexer rules, and reflected visually by the choice of rule name when rendered.
- An unmatched character raises a GrammarSyntaxError with its line and column. An unterminated {…} block, string literal, or block comment likewise.
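The balanced-brace scanner described in the notes could look roughly like the following. It is a sketch under assumptions: skipBracedBlock and skipQuoted are hypothetical names, plain Error stands in for GrammarSyntaxError, and backslash is assumed to be the escape character inside all three quote styles.

```typescript
// Hypothetical sketch of a balanced-brace scanner; not the package's code.

// Skip a quoted literal starting at `start`; returns the offset just past the closing quote.
function skipQuoted(src: string, start: number): number {
  const quote = src[start];
  let i = start + 1;
  while (i < src.length) {
    if (src[i] === "\\") { i += 2; continue; } // assumed escape: skip the escaped character
    if (src[i] === quote) return i + 1;
    i++;
  }
  throw new Error("unterminated string literal");
}

// Skip a { … } block starting at `open`; returns the offset just past the
// closing brace, or past a '?' that immediately follows it.
function skipBracedBlock(src: string, open: number): number {
  if (src[open] !== "{") throw new Error("expected '{'");
  let depth = 0;
  let i = open;
  while (i < src.length) {
    const c = src[i];
    if (c === "'" || c === '"' || c === "`") {
      i = skipQuoted(src, i);                       // braces inside strings do not count
    } else if (c === "/" && src[i + 1] === "/") {
      while (i < src.length && src[i] !== "\n") i++; // line comment
    } else if (c === "/" && src[i + 1] === "*") {
      const close = src.indexOf("*/", i + 2);        // block comment
      if (close < 0) throw new Error("unterminated block comment");
      i = close + 2;
    } else {
      if (c === "{") depth++;
      if (c === "}") depth--;
      i++;
      if (depth === 0) return src[i] === "?" ? i + 1 : i;
    }
  }
  throw new Error("unterminated braced block");
}
```

Checking for comments and quotes before counting braces is what lets an action such as { s = "}"; } close at the right place.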
Syntax (productions)
The parser is a handwritten recursive-descent parser with one-token lookahead. The grammar it implements, expressed in EBNF notation:
grammar-file = [ grammar-header ] , { top-level-item } ;
grammar-header = [ "parser" | "lexer" ] , "grammar" , IDENTIFIER , ";" ;
top-level-item = block-decl | at-command | import-stmt | mode-stmt | rule ;
block-decl = ( "options" | "tokens" | "channels" ) , (* { … } body already consumed by tokenizer *) ;
at-command = "@" , IDENTIFIER , [ "::" , IDENTIFIER ] , (* { … } body already consumed *) ;
import-stmt = "import" , IDENTIFIER , { "," , IDENTIFIER } , ";" ;
mode-stmt = "mode" , IDENTIFIER , ";" ;
rule = [ "fragment" ] , IDENTIFIER , ":" , alt-list , ";" ;
alt-list = alt , { "|" , alt } ; (* alternation *)
alt = element , { element } , [ "#" , IDENTIFIER ] (* sequence + optional alt label *) | (* empty *) ;
element = [ IDENTIFIER , ( "=" | "+=" ) ] , atom , [ suffix ] | lexer-command ;
suffix = ( "?" | "*" | "+" ) , [ "?" ] ; (* greedy + non-greedy *)
atom = STRING | IDENTIFIER | CHARSET | "." | "~" , atom | STRING , ".." , STRING (* lexer char range *) | "(" , alt-list , ")" ; (* grouping, transparent *)
lexer-command = "->" , lexer-cmd , { "," , lexer-cmd } ;
lexer-cmd = IDENTIFIER , [ "(" , ( IDENTIFIER | INT ) , ")" ] ; (* fully skipped *)
Precedence and associativity
Alternation (|) is outer, concatenation (juxtaposition) is inner. a b | c d parses as (a b) | (c d). Both are left-associative; the parser produces flat children arrays for sequence and choice.
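A compact way to see the precedence rule is a two-level descent in which the alternation loop calls the sequence loop. The sketch below is illustrative only: it works on bare strings instead of real tokens, treats every non-| token as a nonterminal, ignores parentheses and suffixes, and the kind field and parseAltList name are hypothetical. Collapsing a single-child sequence or choice to the child itself is inferred from the Examples section, where a one-element rule body yields that element directly.

```typescript
// Hypothetical sketch: alternation is the outer loop, concatenation the inner one.
type IrNode =
  | { kind: "nonterminal"; name: string }
  | { kind: "sequence"; children: IrNode[] }
  | { kind: "choice"; children: IrNode[] };

function parseAltList(tokens: string[]): IrNode {
  let pos = 0;

  // alt = element , { element } : collect elements into one flat array.
  function parseAlt(): IrNode {
    const children: IrNode[] = [];
    while (pos < tokens.length && tokens[pos] !== "|") {
      children.push({ kind: "nonterminal", name: tokens[pos] });
      pos++;
    }
    return children.length === 1 ? children[0] : { kind: "sequence", children };
  }

  // alt-list = alt , { "|" , alt } : collect alternatives into one flat array.
  const alts: IrNode[] = [parseAlt()];
  while (tokens[pos] === "|") {
    pos++; // consume "|"
    alts.push(parseAlt());
  }
  return alts.length === 1 ? alts[0] : { kind: "choice", children: alts };
}

// "a b | c d" groups as (a b) | (c d).
const tree = parseAltList(["a", "b", "|", "c", "d"]);
console.log(JSON.stringify(tree));
```

Because each loop appends to a single array, a | b | c yields one choice with three children and no nested binary nodes.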
Grouping
Parentheses are transparent: ( alt-list ) produces whatever the inner alt-list produces, without wrapping it in an extra IR node.
Labels
- Alt label (… # LabelName): stripped silently. In ANTLR it names the context class and the listener/visitor methods generated for that alternative; it has no visual meaning.
- Element label (x=INT, xs+=INT): stripped silently. Only the RHS is rendered.
Actions and predicates
{ target-language-code } embedded in a rule body, and { predicate }? following an element, are stripped silently by the tokenizer’s balanced-brace scanner before the parser ever sees them. The scanner handles arbitrary nesting and respects string literals inside the braces.
Lexer commands
Everything from -> up to the following ; (including comma-separated -> skip, channel(HIDDEN) lists) is consumed without producing IR.
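Skipping a lexer command amounts to advancing the token cursor from the ARROW to the next SEMI. A minimal sketch, assuming a token array with a type field; skipLexerCommand and the Tok shape are hypothetical names, and plain Error stands in for GrammarSyntaxError:

```typescript
// Hypothetical sketch: skip "-> cmd, cmd(arg)" without producing IR.
interface Tok { type: string; value: string }

// Returns the index of the SEMI that terminates the rule.
function skipLexerCommand(tokens: Tok[], arrowIndex: number): number {
  if (tokens[arrowIndex].type !== "ARROW") throw new Error("expected '->'");
  let i = arrowIndex + 1;
  while (i < tokens.length && tokens[i].type !== "SEMI") i++;
  if (i === tokens.length) throw new Error("unexpected end of input in lexer command");
  return i;
}
```

The caller then consumes the SEMI as the ordinary rule terminator.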
Mapping to IR
| ANTLR construct | IR node |
|---|---|
| 'abc' (string literal, escapes decoded) | terminal with text = "abc" |
| IDENTIFIER (upper or lower) | nonterminal with name = identifier-text |
| EOF | nonterminal with name = "EOF" (treated as any other name) |
| . (wildcard) | special with text = "." |
| [a-zA-Z_] (charset) | special with text = "[a-zA-Z_]" (literal source text) |
| ~[abc] / ~'x' (negation) | special with text = "~[abc]" / "~'x'" |
| 'a'..'z' (char range) | special with text = "'a'..'z'" (verbatim) |
| a b | sequence with children = [a, b] (flattened) |
| a \| b | choice with children = [a, b] (flattened) |
| a? | optional with child = a, skip = "top" |
| a* | optional(repetition(a)), i.e. zero-or-more |
| a+ | repetition with child = a, i.e. one-or-more |
| a??, a*?, a+? | Same as greedy (non-greediness is a parse-time concern) |
| ( A ) | Whatever A produces (parentheses are transparent) |
| Alt label, element label, action, predicate, lexer command | (stripped — no IR) |
The rule production produces a Diagram whose child is the rule’s right-hand side. The rule’s name and source live on the enclosing GrammarRule, not on the Diagram.
Lexer tokens, parser rules, and fragment rules all land in ParsedGrammar.rules in source order. The three kinds are distinguished by convention (uppercase name = lexer, lowercase = parser, fragment prefix = fragment) but share the same IR shape. Consumers that want to render only parser rules can filter by name case themselves.
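The suffix rows of the mapping table can be read as a small function from an atom and its suffix to an IR node. The sketch below is an assumption-laden illustration: the kind discriminant, the IrNode union, and applySuffix are hypothetical names, and the skip = "top" value on the a* case is assumed by analogy with a? since the table does not state it.

```typescript
// Hypothetical IR shape; field names other than text, name, child, children, skip are assumptions.
type IrNode =
  | { kind: "terminal"; text: string }
  | { kind: "nonterminal"; name: string }
  | { kind: "special"; text: string }
  | { kind: "sequence"; children: IrNode[] }
  | { kind: "choice"; children: IrNode[] }
  | { kind: "optional"; child: IrNode; skip: "top" }
  | { kind: "repetition"; child: IrNode };

// suffix is "", "?", "*", "+", or one of those followed by the non-greedy "?".
function applySuffix(atom: IrNode, suffix: string): IrNode {
  // Only the first character matters: the non-greedy marker does not change the diagram.
  switch (suffix.charAt(0)) {
    case "?":
      return { kind: "optional", child: atom, skip: "top" };
    case "*":
      // zero-or-more = an optional one-or-more (skip value assumed)
      return { kind: "optional", child: { kind: "repetition", child: atom }, skip: "top" };
    case "+":
      return { kind: "repetition", child: atom };
    default:
      return atom; // no suffix
  }
}
```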
Source ranges
Every IR node produced by the parser carries a source?: SourceRange:
- Leaves (terminal, nonterminal, special): the range of the originating token (including the quotes / brackets / .).
- Composites (sequence, choice, optional, repetition): the range spanning the first to the last token that contributed to the node.
- Diagram (per rule): the range spanning the rule’s IDENTIFIER (or fragment keyword) through the terminating ;.
- GrammarRule.source: same as the Diagram’s source.
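For sequence and choice nodes, the composite rule reduces to taking the start of the first child range and the end of the last. A minimal sketch, assuming the SourceRange shape described under Lexical structure; spanOf is a hypothetical helper name:

```typescript
// Hypothetical sketch of deriving a composite node's range from its children.
interface SourcePosition { offset: number; line: number; column: number }
interface SourceRange { start: SourcePosition; end: SourcePosition }

function spanOf(children: Array<{ source?: SourceRange }>): SourceRange | undefined {
  let first: SourceRange | undefined;
  let last: SourceRange | undefined;
  for (const child of children) {
    if (child.source === undefined) continue; // source is optional on IR nodes
    if (first === undefined) first = child.source;
    last = child.source;
  }
  if (first === undefined || last === undefined) return undefined;
  return { start: first.start, end: last.end };
}
```

Children alone are not always enough: for a suffixed element such as a*, the suffix token also contributes to the node, so the parser would need to extend the end of the range to cover it.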
Errors
The parser throws GrammarSyntaxError (exported from @choo-choo/parser-utils) with a message and a position: SourcePosition. Cases:
- Unexpected character (tokenizer): no regex rule matched and it’s not a {.
- Unterminated braced block: a {…} is not closed before EOF.
- Unterminated string literal: a '…' runs to EOF without a closing quote.
- Unterminated block comment: a /* … */ runs to EOF.
- Unexpected end of input (parser): lookahead is null when a token was expected.
- Unexpected token (parser): lookahead type doesn’t match the expected type.
The parser does not attempt recovery. The first error aborts parsing.
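The two parser-side cases can be pictured as a single lookahead check. In the sketch below, the GrammarSyntaxError class is a local stand-in: the real one is exported from @choo-choo/parser-utils, and its constructor signature is an assumption here, as are the expectToken helper and the Tok shape.

```typescript
// Hypothetical stand-in for the exported error class; constructor shape is assumed.
interface SourcePosition { offset: number; line: number; column: number }

class GrammarSyntaxError extends Error {
  readonly position: SourcePosition;
  constructor(message: string, position: SourcePosition) {
    super(message);
    this.name = "GrammarSyntaxError";
    this.position = position;
  }
}

interface Tok { type: string; start: SourcePosition }

// Returns the lookahead token if it has the expected type; otherwise throws.
// `eof` is the position to report when the input has run out.
function expectToken(lookahead: Tok | null, type: string, eof: SourcePosition): Tok {
  if (lookahead === null) {
    throw new GrammarSyntaxError(`Unexpected end of input, expected ${type}`, eof);
  }
  if (lookahead.type !== type) {
    throw new GrammarSyntaxError(`Unexpected ${lookahead.type}, expected ${type}`, lookahead.start);
  }
  return lookahead;
}
```

Since nothing catches the exception inside the parser, the first failed check ends the parse, which matches the no-recovery rule.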
Examples
Smallest grammar
grammar Demo;
one : '1' ;
One rule "one"; its diagram’s child is terminal("1"). The grammar Demo; header is consumed and produces no IR.
Lexer + parser rules together
grammar Calc;
expr : INT ('+' INT)* ;
INT : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
Three rules: expr, INT, WS. WS’s -> skip command is stripped; its diagram is repetition(special("[ \t\r\n]")).
Fragments, labels, actions, predicates
grammar Kitchen;
stat : {doStuff();} lhs=ID '=' rhs=expr {x > 0}? ';' # AssignStat ;
expr : INT | ID ;
fragment DIGIT : [0-9] ;
ID : [a-zA-Z_]+ ;
stat’s diagram is sequence(nt("ID"), terminal("="), nt("expr"), terminal(";")), where nt(…) abbreviates nonterminal(…). The action, predicate, element labels (lhs=, rhs=), and alt label (# AssignStat) are all stripped. fragment DIGIT lands in ParsedGrammar.rules alongside the others.
Char ranges
grammar Chars;
LETTER : [a-zA-Z] ;
DIGIT : '0'..'9' ;
LETTER’s child is special("[a-zA-Z]"); DIGIT’s child is special("'0'..'9'").