Treetop was one of the more exciting projects I saw at last year’s RubyConf. Nathan Sobo’s Treetop talk is available online and I urge you to watch it. Nathan did a great job of explaining the basics of syntactic analysis, and then got into the specifics of using Treetop’s implementation of parsing expression grammars to put the concepts to work.
Treetop appeared to gather all the concepts together into an understandable domain-specific language. All of the tokenization and node structure can go into a single file, and the interactive nature of Ruby makes for the perfect sandbox. I felt like I could get somewhere if I invested just an hour into this. I was happy to find that my impressions were correct.
After a short time I had caught on enough to start writing my own code. Once over the hump the rest was easy. I was able to write and test a Treetop grammar for parsing CSV files within a few hours. I chose CSV parsing because I was already familiar with the format, and I could compare my implementation to not just one but two existing Ruby libraries.
commaseparatedfile.treetop
grammar CommaSeparatedFile
  # Root rule: one line, then zero or more newline-prefixed lines.
  rule lines
    line more:(newline line)*
    {
      def values
        vals = []
        vals << line.values
        more.elements.each { |additional| vals << additional.line.values }
        vals
      end
    }
  end

  rule line
    valueline / emptyline
  end

  rule emptyline
    '' {
      def values
        []
      end
    }
  end

  # Zero or more "value separator" pairs, then one final value.
  rule valueline
    leading:(value separator)* trailing:value
    {
      def values
        list = []
        list += leading.elements.collect { |lead| lead.value.text_value } if leading
        list << trailing.text_value if trailing
        list
      end
    }
  end

  rule value
    quotedvalue / nakedvalue
  end

  # A quoted value may contain doubled quotes, which stand for one literal quote.
  rule quotedvalue
    '"' wrapped:( !'"' . / '""' )* '"' {
      def text_value
        wrapped.text_value.gsub(/""/,'"')
      end
    }
  end

  rule nakedvalue
    (!(separator / newline ) .)*
  end

  rule separator
    ','
  end

  rule newline
    [\n]
  end
end
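To make the quotedvalue rule a little less cryptic, here is a rough hand-written equivalent in plain Ruby using stdlib's StringScanner. It is only a sketch of how I read the rule: the method name scan_quoted_value is made up for illustration, and this is not the code Treetop generates.

```ruby
require 'strscan'

# Hand-rolled reading of:  '"' wrapped:( !'"' . / '""' )* '"'
# Returns the unescaped value, or nil (with the scanner position restored)
# when the input does not start with a well-formed quoted value.
def scan_quoted_value(scanner)
  start = scanner.pos
  return nil unless scanner.scan(/"/)   # opening '"'
  wrapped = String.new
  loop do
    if scanner.check(/[^"]/m)           # !'"' .  : any character that is not a quote
      wrapped << scanner.getch
    elsif scanner.scan(/""/)            # '""'    : a doubled quote, kept as one quote
      wrapped << '"'
    else
      break                             # neither alternative matched, so the repetition ends
    end
  end
  unless scanner.scan(/"/)              # closing '"'
    scanner.pos = start                 # give back the input, as a failed PEG match would
    return nil
  end
  wrapped
end

scanner = StringScanner.new('"say ""hi""",next')
puts scan_quoted_value(scanner)   # prints: say "hi"
puts scanner.rest                 # prints: ,next
```

The order of the two alternatives matters. The first one refuses to consume a quote, so a quote can only be taken in as part of a doubled pair, and a lone quote is left over to close the value.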
One concept that became clear to me is that the rules defined in the grammar become objects in the tree that is returned after input is parsed. Ruby code embedded in a rule's definition is added as methods to the object created for that rule.
In the simple harness below, the variable lines holds the object returned from the parse call. It is the object representation of the lines rule above, which is the root of the parse tree. The values method I call on it is the one defined in the lines rule.
require 'rubygems'
require 'treetop'
require 'commaseparatedfile'
parser = CommaSeparatedFileParser.new
lines = parser.parse("this,that,these,those")
puts lines.values.inspect #-> [["this", "that", "these", "those"]]
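The idea of methods being attached to individual node objects can be mimicked in plain Ruby. As far as I understand it, Treetop turns each inline block into a module and extends the matching node with it, but treat that as my assumption. SketchNode and ValueLineMethods below are made-up names, not part of Treetop.

```ruby
# Stand-in for a parse tree node. Treetop's real node class is richer;
# this one only carries the matched text and its child nodes.
class SketchNode
  attr_reader :text_value, :elements

  def initialize(text_value, elements = [])
    @text_value = text_value
    @elements = elements
  end
end

# Roughly what an inline { def values ... end } block could become: a module.
module ValueLineMethods
  def values
    elements.map(&:text_value)
  end
end

children = [SketchNode.new("this"), SketchNode.new("that")]
node = SketchNode.new("this,that", children)
node.extend(ValueLineMethods)   # only this one object gains #values

puts node.values.inspect                  # prints: ["this", "that"]
puts children.first.respond_to?(:values)  # prints: false
```

Because the module is mixed into a single object rather than the class, two nodes of the same class can answer to different methods depending on which rule produced them.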
I think it was much simpler for me to write this parsing as a grammar instead of trying to combine regular expressions with Ruby code. The grammar above handles multiple lines, quote-wrapped values, and escaped quotes.
It was actually fun to write as well, and I’m still learning a lot. Notice that my lines rule and my valueline rule both tackle a similar problem (a list of items joined by a delimiter) differently. For lines, the grammar grabs one line followed by 0 or more pairs of a newline and another line. For valueline, I grab 0 or more pairs of a value and a separator, and terminate on the last value. Is one approach better than the other? Are they both junk? I can’t wait to figure that out.
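Here is a rough plain-Ruby sketch of the two shapes side by side. It assumes a fixed comma separator and values without quotes, the method names are made up for illustration, and none of it is Treetop code.

```ruby
require 'strscan'

ITEM = /[^,]*/   # a naked value: anything up to the next comma
SEP  = /,/

# Shape of the lines rule: item (sep item)*
def head_then_rest(input)
  s = StringScanner.new(input)
  items = [s.scan(ITEM)]                   # the first item is mandatory
  items << s.scan(ITEM) while s.skip(SEP)  # then repeat: separator, item
  items
end

# Shape of the valueline rule: (item sep)* item
def leading_then_last(input)
  s = StringScanner.new(input)
  items = []
  loop do
    start = s.pos
    item = s.scan(ITEM)
    if s.skip(SEP)
      items << item       # matched a full "item sep" pair
    else
      s.pos = start       # separator missing: back up
      break
    end
  end
  items << s.scan(ITEM)   # match the last item a second time
  items
end

puts head_then_rest("this,that,these").inspect     # prints: ["this", "that", "these"]
puts leading_then_last("this,that,these").inspect  # prints: ["this", "that", "these"]
```

Both return the same list. The visible difference is that the second shape matches the final item, gives it up when no separator follows, and then matches it again. A parser that caches rule results can answer that second attempt from its cache, so I would not assume this alone explains any speed difference.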
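My poor parser is left in the dust by both Ruby’s stdlib csv library and the fastercsv gem. By wall-clock time it is roughly 14 times slower than stdlib csv and about 50 times slower than fastercsv. I ran these tests under the 1.8.5 version of Ruby. The dataset I used is 44k in total and has 961 records.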
                      user     system      total        real
stdlib_csv:       0.210000   0.000000   0.210000 (  0.217275)
faster_csv:       0.040000   0.000000   0.040000 (  0.060788)
treetop_grammar:  2.910000   0.010000   2.920000 (  3.126020)
I don’t know what sort of speed I should reasonably expect, actually. I’d hoped that my parser would be competitive with the standard library’s csv parsing. It’s possible that my grammar’s definition is inefficient. If anyone has any ideas let me know.
After playing around with Treetop I’ll definitely tinker some more. There are many other features I haven’t explored yet, like extending a base grammar to create a new grammar. PEGs have implementations in other languages so this isn’t a Ruby-only facility, either.
On a side note, I used stdlib’s benchmark to generate the numbers above. Pretty swank stuff.
require 'rubygems'
require 'treetop'
require 'commaseparatedfile'
require 'fastercsv'
require 'csv'
require 'benchmark'
def with_stdlib_csv(csv)
  parsed = CSV.parse(csv)
  parsed.size
end

def with_fastercsv(csv)
  rows = FasterCSV.parse(csv)
  rows.size
end

def with_treetop_parser(csv)
  parser = CommaSeparatedFileParser.new
  parsed = parser.parse(csv)
  parsed.values.size
end

csv = File.read('../kentucky.csv')
# The width argument should match the longest label so the columns line up.
Benchmark.bmbm("treetop_grammar:".length) do |x|
  x.report("stdlib_csv:") { with_stdlib_csv(csv) }
  x.report("faster_csv:") { with_fastercsv(csv) }
  x.report("treetop_grammar:") { with_treetop_parser(csv) }
end