Parsing regular records with a stackless state machine

Posted on November 16, 2015 by Vincent Engelmann

My uncle won a project to implement White Space broadband in Thurman, NY. One of the many interesting challenges was to determine where the many “off the grid” residences were located. This would help him plan where to put his White Space repeaters.

We thought a bit about what information we had access to and determined that the tax rolls of the county would be useful. But we soon learned they were only available in semi-structured PDFs. This is how we parsed them:

import re

# ...
START   = 's'
CONTENT = 'c'

    def parseFile(self, filename):
        state = START
        recStart = re.compile(r'\*+')  # a record starts on a line beginning with asterisks
        self.addressLines = []
        with open(filename, 'r') as f:
            for line in f:
                isRecStart = recStart.match(line) is not None
                if state == START and isRecStart:
                    # The file opens with a separator: start the first record.
                    state = CONTENT
                    self.addressLines = [self.extractId(line), line]
                elif state == CONTENT and not isRecStart:
                    self.addressLines.append(line)
                elif state == CONTENT and isRecStart:
                    # A separator closes the current record and opens the next.
                    self.addressList.append(self.addressLines)
                    self.addressLines = [self.extractId(line), line]
                elif state == START and not isRecStart:
                    # Lines before the first separator are collected as content.
                    state = CONTENT
                    self.addressLines.append(line)
                else:
                    print("No valid state for given line.")
        # The final record has no closing separator, so store it here.
        if self.addressLines:
            self.addressList.append(self.addressLines)

    def extractId(self, line):
        return self.recIdRgx.match(line).group(1)
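
For anyone who wants to try the idea without the rest of the class, here is a minimal standalone sketch of the same two-state machine. It rests on a few assumptions: each record opens with a line that begins with asterisks, the `REC_ID` pattern is a made-up stand-in for the real ID regex (which is not shown in this post), and the sample lines are invented rather than taken from the actual tax rolls. This version also skips any text before the first separator.

```python
import re

START, CONTENT = 's', 'c'

REC_START = re.compile(r'\*+')
# Hypothetical ID pattern: asterisks, whitespace, then the first non-space token.
REC_ID = re.compile(r'\*+\s+(\S+)')


def parse_records(lines):
    """Group lines into records shaped like [id, separator line, content...]."""
    records = []
    current = []
    state = START
    for line in lines:
        is_start = REC_START.match(line) is not None
        if state == START:
            if not is_start:
                continue  # skip anything before the first separator
            state = CONTENT
        elif is_start:
            records.append(current)  # a separator closes the previous record
        if is_start:
            m = REC_ID.match(line)
            current = [m.group(1) if m else None, line]
        else:
            current.append(line)
    if current:
        records.append(current)  # flush the last record at end of input
    return records


# Invented sample input, for illustration only.
sample = [
    "PAGE 1\n",
    "**** 101.-1-5 ****\n",
    "12 Mountain Rd\n",
    "Thurman NY\n",
    "**** 101.-1-6 ****\n",
    "40 Valley Rd\n",
]
records = parse_records(sample)
for rec in records:
    print(rec[0], len(rec) - 2)  # record id and number of content lines
```

Because the only memory carried between lines is the current state and the record being built, the parser needs no stack, and it works on any iterable of lines, including an open file.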