Channel: Mission Data Blog » darrend

Treetop: Grammar’s Cool


Treetop was one of the more exciting projects I saw at last year’s RubyConf. Nathan Sobo’s Treetop talk is available online and I urge you to watch it. Nathan did a great job of explaining the basics of syntactic analysis, and then got into the specifics of using Treetop’s implementation of parsing expression grammars to put the concepts to work.

Treetop appeared to gather all the concepts together into an understandable domain-specific language. All of the tokenization and node structure can go into a single file, and the interactive nature of Ruby makes for the perfect sandbox. I felt like I could get somewhere if I invested just an hour into this. I was happy to find that my impressions were correct.

After a short time I had caught on enough to start writing my own code. Once over the hump the rest was easy. I was able to write and test a Treetop grammar for parsing CSV files within a few hours. I chose CSV parsing because I was already familiar with the format, and I could compare my implementation to not just one but two existing Ruby libraries.

commaseparatedfile.treetop
grammar CommaSeparatedFile
  rule lines
    line more:(newline line)*
    {
      def values
        vals = []
        vals << line.values
        more.elements.each { |additional| vals << additional.line.values }
        vals
      end
    }
  end

  rule line
    valueline / emptyline
  end

  rule emptyline
    '' {
      def values
        []
      end
    }
  end

  rule valueline
    leading:(value separator)* trailing:value
    {
      def values
        list = []
        list += leading.elements.collect { |lead| lead.value.text_value } if leading
        list << trailing.text_value if trailing
        list
      end
    }
  end

  rule value
    quotedvalue / nakedvalue
  end

  rule quotedvalue
    '"' wrapped:( !'"' . / '""' )* '"' {
      def text_value
        wrapped.text_value.gsub(/""/,'"')
      end
    }
  end

  rule nakedvalue
    (!(separator / newline ) .)*
  end

  rule separator
    ','
  end

  rule newline
    [\n]
  end
end

One concept that became clear to me is that the rules defined in the grammar become objects in the tree returned after input is parsed. Ruby code embedded in Treetop’s node definitions is added as methods to the object produced by that node.

In the simple harness below, the lines object returned from the parse call is the object representation of the rule lines node above, which is the root of the parse tree. The values method I call on it is defined above in the definition of the lines node.

require 'rubygems'
require 'treetop'
require 'commaseparatedfile' 

parser = CommaSeparatedFileParser.new
lines = parser.parse("this,that,these,those")
puts lines.values.inspect #-> [["this", "that", "these", "those"]]
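
One way to picture "methods added to the node object" is a plain-Ruby sketch that does not use Treetop at all. The SketchNode class and LineValues module below are hypothetical names invented for illustration; they only mimic the idea that a block attached to a rule ends up as methods on that rule's node and on no other node.

```ruby
# Hypothetical stand-in for a parse-tree node; this is not Treetop's class.
class SketchNode
  attr_reader :text_value, :elements

  def initialize(text_value, elements = [])
    @text_value = text_value
    @elements = elements
  end
end

# Plays the role of the Ruby block written inside a rule definition.
module LineValues
  def values
    elements.map(&:text_value)
  end
end

leaves = %w[this that].map { |text| SketchNode.new(text) }
line = SketchNode.new("this,that", leaves)
line.extend(LineValues) # only this one node gains #values

puts line.values.inspect                # the line node answers #values
puts leaves.first.respond_to?(:values)  # the leaf nodes do not
```

Extending a single object rather than the whole class is what lets every rule carry its own helper methods without the rules stepping on each other.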

I think it was much simpler for me to write this parser as a grammar instead of trying to combine regular expressions with Ruby code. The grammar above handles multiple lines, quote-wrapped values, and escaped quotes.
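
For a reference point when checking those three cases, the same kind of input can be run through the stdlib csv library. This is only a sketch of the expected result to compare against; the sample input is made up for illustration, and it says nothing about how the grammar itself treats a trailing newline.

```ruby
require 'csv'

# Two lines; the second has a quoted value containing a comma
# and a quoted value containing doubled (escaped) quotes.
input = %Q(this,that\n"quoted, value","say ""hi"""\n)

rows = CSV.parse(input)
rows.each { |row| puts row.inspect }
```

Each row comes back as an array of strings, with the wrapping quotes removed and each doubled quote collapsed to a single one.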

It was actually fun to write as well, and I’m still learning a lot. Notice that my lines node and my valueline node both approach a similar problem (a list of 0 or more items) differently. For lines, the grammar grabs one line followed by 0 or more newline-and-line pairs. For valueline, I grab 0 or more value-and-separator pairs and terminate on the last value. Is one approach better than the other? Are they both junk? I can’t wait to figure that out.
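
To make the two list shapes concrete, here is a hand-written sketch using StringScanner from the standard library. It is not Treetop and not generated parser code; the method names head_then_rest and leading_then_last are hypothetical, and values are simplified to "anything but a comma" so the shapes stay visible.

```ruby
require 'strscan'

# Shape used by the lines rule: item (sep item)*
def head_then_rest(input)
  scanner = StringScanner.new(input)
  items = [scanner.scan(/[^,]*/)]
  # Every separator found must be followed by another item.
  items << scanner.scan(/[^,]*/) while scanner.scan(/,/)
  items
end

# Shape used by the valueline rule: (item sep)* item
def leading_then_last(input)
  scanner = StringScanner.new(input)
  items = []
  # Consume item+separator pairs for as long as they match...
  while (pair = scanner.scan(/[^,]*,/))
    items << pair.chomp(',')
  end
  # ...then the final item, which has no separator after it.
  items << scanner.scan(/[^,]*/)
  items
end

puts head_then_rest("this,that,these").inspect
puts leading_then_last("this,that,these").inspect
```

Both shapes accept the same input and yield the same list in this sketch. The visible difference is that the second one tries the last item as an item-plus-separator pair, fails, and then matches it again on its own.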

My poor parser is left in the dust by both Ruby’s stdlib csv library, and the fastercsv gem. I ran these tests under the 1.8.5 version of Ruby. The dataset I used is 44k in total and has 961 records.

                       user     system      total        real
stdlib_csv:        0.210000   0.000000   0.210000 (  0.217275)
faster_csv:        0.040000   0.000000   0.040000 (  0.060788)
treetop_grammar:   2.910000   0.010000   2.920000 (  3.126020)

I don’t know what sort of speed I should reasonably expect, actually. I’d hoped that my parser would be competitive with the standard library’s csv parsing. It’s possible that my grammar’s definition is inefficient. If anyone has any ideas, let me know.

After playing around with Treetop I’ll definitely tinker some more. There are many other features I haven’t explored yet, like extending a base grammar to create a new grammar. PEGs have implementations in other languages so this isn’t a Ruby-only facility, either.

On a side note, I used stdlib’s benchmark to generate the numbers above. Pretty swank stuff.

require 'rubygems'
require 'treetop'
require 'commaseparatedfile'
require 'fastercsv'
require 'csv'
require 'benchmark'

def with_stdlib_csv(csv)
  parsed = CSV.parse(csv)
  parsed.size
end

def with_fastercsv(csv)
  rows = FasterCSV.parse(csv)
  rows.size
end

def with_treetop_parser(csv)
  parser = CommaSeparatedFileParser.new
  parsed = parser.parse(csv)
  parsed.values.size
end

csv = File.read('../kentucky.csv')

# Pad the label column to the width of the longest label actually reported.
Benchmark.bmbm("treetop_grammar:".length) do |x|
  x.report("stdlib_csv:") { with_stdlib_csv(csv) }
  x.report("faster_csv:") { with_fastercsv(csv) }
  x.report("treetop_grammar:") { with_treetop_parser(csv) }
end
