ParserKt
Introduction
ParserKt is a naive one-pass recursive descent, scannerless parser framework for Kotlin (mainly JVM, compatible with JS).
A parser is a kind of program extracting data from input sequences:
NOTE: Using the REPL from the command line is not good practice; create a project with a dependency via JitPack, or use the IDEA Kotlin REPL instead.
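If you go the JitPack route, the Gradle (Kotlin DSL) setup is roughly the sketch below; the exact coordinates and version tag are assumptions, check jitpack.io for the real ones.
// build.gradle.kts (hypothetical coordinates, verify on jitpack.io)
repositories {
  mavenCentral()
  maven("https://jitpack.io")
}
dependencies {
  implementation("com.github.ParserKt:ParserKt:master-SNAPSHOT") // replace with a released tag
}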
git clone https://github.com/ParserKt/ParserKt.git && cd ParserKt
gradle shadowJar
kotlinc -cp build/libs/ParserKt-*.jar
Don't be scared! ParserKt is just a small library; even a full build won't take much time. (Or you can download pre-built jars here.)
// Pattern.read(String), CharInput.STDIN, ...
// item(x), elementIn('a'..'z'), satisfy<Int> { it % 2 == 0 }, ...
// asList(), asString(), ...
import org.parserkt.*
import org.parserkt.util.*
import org.parserkt.pat.*
Import parserkt, util, pat, and then we can implement our own combined pattern! (For a more in-depth sample, see the link in the header.)
val digits = Repeat(asList(), elementIn('0'..'9')) //{[0-9]}
digits.read("123") //[1, 2, 3]
digits.read("12a3") //[1, 2]
digits.read("a3") //null //(= notParsed)
A parser mainly recognizes, and transforms inputs into a simpler form (like an AST) for post-processing (prettify, evaluate, type-check, ...); it can be used to create (DSL) language tools / compilers / transpilers / interpreters.
// Generic parsing for [1, a, b] where (b > a)
val ints = Seq(::IntTuple, item(1), *Contextual(item<Int>()) { i -> satisfy<Int> { it > i } }.flatten().items() )
ints.rebuild(1,2,3) //[1, 2, 3]
ints.rebuild(1,2,1) //notParsed
ints.rebuild(1,0,1) //[1, 0, 1]
// Using Convert patterns
fun digitItem(digit: SatisfyPattern<Char>) = Convert(digit, {it-'0'}, {'0'+it})
val digit = digitItem(elementIn('0'..'9'))
val digitsInt = Repeat(JoinFold(0) { this*10 + it }, digit)
digitsInt.read("233") //233
(click here if you want to try out more)
ParserKt divides a complex grammar into simple subpart val definitions. Any public structure built from combinators like item, elementIn, Repeat can serve as an "entry rule", so it's very easy to debug and to reuse implemented syntax.
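For example (a small sketch that only reuses combinators already shown in this README; the rules themselves are made up):
// each subrule is a plain val and can be read (tested) on its own
val letter = elementIn('a'..'z', 'A'..'Z') or item('_')
val identifier = Repeat(asString(), letter)
identifier.read("snake_case") //snake_case
identifier.read("123") //null //(= notParsed)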
What means "one-pass"?
See FeedModel.kt:
interface Feed<out T> {
val peek: T; fun consume(): T
class End: NoSuchElementException("no more")
}
That's all one subparser can see: no mark/reset, not even a single character of lookahead.
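For intuition, a minimal Feed over a String could look like the sketch below (not a class from the library; the real input types are CharInput and friends):
// sketch only: one peeked item, consume(), End -- nothing else to see
class StringFeed(private val text: String): Feed<Char> {
  private var pos = 0
  override val peek: Char get() = if (pos < text.length) text[pos] else throw Feed.End()
  override fun consume(): Char = peek.also { pos++ }
}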
What means "recursive descent"?
Take a look at these expressions: a + b, a + (b + c)
sealed class PlusAst {
data class Plus(val left: PlusAst, val right: PlusAst): PlusAst()
data class IntLiteral(val n: Int): PlusAst()
}
(sealed classes are just classes whose possible subclasses are determinable, within their own name scope)
This data structure implies that every operand of a + b could be an integer (like 0, 9, 16, ... IntLiteral(n)), or another a + b (1 + 2, 9 + 0, ... Plus(left, right)).
PlusAst is recursive, so it's best to implement its parser as a recursive function; that's the "recursive" part.
Plus := (Plus '+' Plus) | Int ; left associative
Int := {[0-9]}
({a} denotes repetition, a|b denotes alternation)
When reading the input "1+2+3" by rule Plus, the result is Plus(Plus(IntLiteral(1), IntLiteral(2)), IntLiteral(3)). The parser keeps moving from reading Plus into the state of reading its substructure Int; going from rule Plus down to its subrule Int, that's the "descent".
——
ParserKt cannot make any lookahead, so it looks impossible to parse syntax like //, /*, or 0x / 0b (and to report an error for octal-style 0 literals). In fact, a "lookahead" can be stored in the call stack of Pattern.read by the pattern Contextual, so writing a parser for such syntax is possible (but much more complicated, so it's better to create a new Pattern subclass, fall back to the tokenizer-parser LexerFeed, or use TriePattern).
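A rough plain-Kotlin illustration of that idea (conceptual only, not the Contextual API): the already-consumed '0' simply lives on the call stack while the parser decides what kind of literal follows.
// conceptual sketch: no mark/reset needed, the consumed prefix is just a local variable
fun classifyNumber(feed: Iterator<Char>): String {
  val first = feed.next()                  // consume one char, no way to push it back
  if (first != '0') return "decimal starting with $first"
  return when (val tag = feed.next()) {    // 'first' is still in scope: our "lookahead"
    'x' -> "hex literal"
    'b' -> "binary literal"
    else -> "error: octal-style 0$tag is rejected"
  }
}
classifyNumber("0x1F".iterator()) //hex literal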
What is the difference between "parser combinator" and "parser compiler"?
Let me clarify first: I really don't know what "combinator" is :(
I think combinators must be related to "functional programming", or at least "declarative programming": define what it is, not how computers actually solve it.
Parser combinator is mainly about code reuse, parser compiler is mainly about pattern matching algorithms.
Parser compilers and parser combinators solve the same problem in different ways. A parser compiler may have better portability and better performance (I'm not sure). A parser combinator integrates better with the host language, and it's really easy to write/debug, since it's just a hand-written program.
For example, keywords about parser compilers: LL(k), LR(k), LALR, NFA, DFA, KMP, Aho–Corasick
Features
- Minimal boilerplate code & high reusability
- DSL style & fully object-oriented library design
- Rebuild parse result back to input (REPL friendly)
- Generic parsing input (Char, String, XXXToken, ...)
- One-pass input stream design, without hasNext
- ReaderFeed for implementing interactive shells (JVM only)
- The framework manages parsed data in a flexible and extensible way (Tuple2, Sized-Tuple, Fold)
- Supports complex contextual syntax structures that cannot be supported directly in parser compilers (LayoutPattern, NumUnitPattern, ...)
- Extensible error messages using clam {"message"}, clamWhile(pat, defaultValue) {"message"}
- The parser input stream Feed can have state storage: withState(T), stateAs<T>()
- <500K compiled JVM library, no extra runtime needed (except kotlin-stdlib; Kotlin sequences are optional)
- No magic: all code has been rewritten at least 9 times by the author, and ugly/ambiguous/disordered code has been removed
Great thanks to Kotlin, for its strong expressiveness and amazing type inference.
Provided combinators
Modules
- :parserkt for the Feed/Input model and basic (org.parserkt.pat) / complex (org.parserkt.pat.complex) patterns
- :parserkt-util Fundamental data types (Slice, Tuple, Trie, ...) and helper classes (Preety, AnyBy, ...) used/prepared for ParserKt and friends
- :parserkt-ext Extension library for real-world parsers, eliminating boilerplate code (LayoutPattern, LexicalBasics, ...)
Abbreviations
- POPCorn (Pattern, OptionalPattern, PatternWrapper, ConstantPattern)
- SURDIES (Seq, Until, Repeat, Decide, item, elementIn, satisfy)
- CCDP (Convert, Contextual, Deferred, Piped)
- SJIT (SurroundBy, JoinBy, InfixPattern, TriePattern)
More runnable REPL snippets
Note that for frequently-used pattern combinations, we have org.parserkt.pat.ext:
// Using pat.ext and LexicalBasics
import org.parserkt.pat.ext.*
val digitsInt = Repeat(asInt(), LexicalBasics.digitFor('0'..'9'))
digitsInt.read("180") //180
// Implementing complex pattern
import org.parserkt.pat.complex.*
// Patterns with constant values are OK to perform rebuild, even ignored in parse result
val letter = elementIn('A'..'Z', 'a'..'z', '0'..'9') or item('_')
val sep = elementIn(',',':').toConstant(',') named "sep"
val xsv = JoinBy(sep, Repeat(asString(), letter)) // try replace letter using !sep
xsv.read("monkey,banana,melon") //Tuple2(first=[monkey, banana, melon], second=[,, ,])
xsv.concatCharJoin().read("猴:雀:瓜") //Tuple2(first=[猴, 雀, 瓜], second=::)
val goodXsv = xsv.mergeConstantJoin()
goodXsv.read("she,her,herself") //[she, her, herself]
goodXsv.rebuild("she,her,herself") //she,her,herself
ParserKt provides mergeFirst/Second, discardFirst/Second, flatten for converting Tuple2 patterns:
import org.parserkt.*
import org.parserkt.util.*
import org.parserkt.pat.*
import org.parserkt.pat.complex.*
val dict = TriePattern<Char, String>().apply {
mergeStrings("hello" to "こんにちは")
mergeStrings("world" to "世界")
}
val noun = Repeat(asList(), dict)
noun.read("helloworld") //[こんにちは, 世界]
val phrase = JoinBy(Decide(elementIn('0'..'9'), elementIn(' ', '\t', '\n', '\r')), dict)
phrase.read("hello6world hello") //Tuple2(first=[こんにちは, 世界, こんにちは], second=[Tuple2(first=0, second=6), Tuple2(first=1, second= )])
It's OK (and also easy) to read recursive structures: just define a lateinit var and use Deferred{name} to reference it:
// Back converts (third argument for Convert) are optional
sealed class Sexp { data class Term(val name: String): Sexp(); data class Nest(val list: List<Sexp>): Sexp() }
lateinit var sexp: Pattern<Char, Sexp>
val str = Repeat(asString(), !elementIn(' ', '(', ')'))
val atom = Convert(str, { Sexp.Term(it) }, { it.name })
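// note: 'parens' (a '(' to ')' pair) and toCharPat() are assumed to be helpers available in scope (e.g. from the library's lexical helpers); they are not defined in this snippet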
val nestItems = SurroundBy(parens.toCharPat(), JoinBy(item(' '), Deferred{sexp}).mergeConstantJoin())
val nest = Convert(nestItems, { Sexp.Nest(it) }, { it.list })
sexp = Decide(nest, atom).mergeFirst { if (it is Sexp.Nest) 0 else 1 }
sexp.read("((monkey banana) (deep parser))") //Nest(list=[Nest(list=[Term(name=monkey), Term(name=banana)]), Nest(list=[Term(name=deep), Term(name=parser)])])
sexp.rebuild("((monkey banana) (deep parser))") //((monkey banana) (deep parser))
NOTE: when using the pattern Until, first think about whether it can be expressed by Repeat(..., !SatisfyPattern).
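For instance, reading everything before a comma can be written with combinators already shown above (a sketch; the stopping behaviour is assumed to match the digits example in the introduction):
val beforeComma = Repeat(asString(), !item(','))
beforeComma.read("monkey,banana") //monkey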
References
- GitHub: Simple calculator in Haskell
- Blog post: 看完这段 Kotlin 代码后我哭了 (I cried after reading this Kotlin code)
- Blog post: 简单 Kotlin 解析器 (A simple Kotlin parser)
- Article: Hello, declarative world
- Article: Parsing in JavaScript: Tools and Libraries
Historical references
- Origin for Feed stream (1)
- Origin for Feed stream (2)
- Origin for Feed stream (3) - JavaScript
- Origin for Feed stream (4) - Python
- Origin for "pattern" model
- Origin for NumOps, RangeMap