
Simple Part of Speech Tagger: Some Design Decisions

I’m still working on the simple Part of Speech Tagger. It’s been a while since I posted anything about it, so I want to document some of the design decisions.

Programming Paradigms and Languages

The first decision was the programming paradigm and language to use. Theoretically, the programming language should have little impact on an application’s design, but in reality, the programming language’s paradigm and capabilities influence many design decisions.

The Prolog programming language is based on formal first-order logic. Its declarative approach allows a natural expression of rules and knowledge, which made it a popular choice for rule-based expert systems in the 1980s. Prolog also includes direct support for parsing through its Definite Clause Grammar rule notation. Early work in Natural Language Processing, such as SRI’s Core Language Engine, used the Prolog programming language. Logic programming and Prolog were also the choice for the Computational Semantics text Representation and Inference for Natural Language. While Prolog is still well supported by Free software and open source projects, it has fallen out of favor as Artificial Intelligence and Natural Language Processing have moved away from first-order logic approaches.

Traditionally, Artificial Intelligence work, including Natural Language Processing, has used the functional programming paradigm. Like logic programming, functional programming is based on a formal system, the Lambda Calculus. First-order logic on its own is not expressive enough to write useful programs, so Prolog needs to include additional control structures. Functional programming languages can implement the Lambda Calculus more purely. Proponents of functional programming feel that it expresses linguistic knowledge more intuitively. Current Natural Language projects using a functional programming approach favor the modern functional language Haskell. The text Computational Semantics with Functional Programming teaches Haskell programming while teaching Computational Semantics. The Grammatical Framework is a system for generating multilingual natural language parsers. It is written in Haskell and implements a grammar description language whose syntax borrows heavily from Haskell.

Finally, there are the programming language pragmatists, who are not so concerned with programming paradigms or theoretical underpinnings. Pragmatists are more concerned with good support and wide acceptance of a programming language, in part so that their work is accessible to the widest possible audience. Projects in this group include the Python NLTK (Natural Language Toolkit), the Java-based Apache OpenNLP project, and Apache UIMA (Unstructured Information Management Architecture). Python and Java are two of the most widely used programming languages and are flexible enough to support a number of programming paradigms.

In the end, my decision is a pragmatic one. I chose Object Oriented programming in C++ because I am fluent in it. Professionally, I program almost exclusively in C, but up until my current job I was writing most of my code in C++. I don’t want to lose that skill, since it takes years of experience to become proficient in object oriented programming in C++. On top of that, understanding all of the nuances of the C++ language and the standard library requires reading nearly 800 pages of Stroustrup’s The C++ Programming Language. In addition, the C++11 standard was finalized after I became a full-time C programmer, and I need a project that will let me try out the new capabilities. The C++11 standard library includes a number of new features useful for natural language processing, including regular expressions and Unicode support. Like Python, C++ is versatile enough to support multiparadigm programming. For example, the Castor C++ library provides logic programming support for C++. Finally, Prolog, Python and Haskell all provide mechanisms to integrate C/C++ code, so I can implement these interfaces later if I decide to take advantage of a higher level language.
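To give a feel for the C++11 regular expression support mentioned above, here is a minimal sketch of a word tokenizer built on std::regex. The function name tokenize and the pattern are my own illustrative assumptions, not code from the tagger. The pattern only handles ASCII letters and digits; std::regex has limited awareness of Unicode character classes, so real multilingual text would need more care.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of the tagger): split text into word tokens.
// Assumes ASCII input; the pattern below is illustrative only.
std::vector<std::string> tokenize(const std::string& text) {
    // A run of letters, optionally followed by an apostrophe and more
    // letters (e.g. "dog's"), or a run of digits.
    static const std::regex word_re("[A-Za-z]+(?:'[A-Za-z]+)?|[0-9]+");

    std::vector<std::string> tokens;
    // sregex_iterator walks over every non-overlapping match in the text.
    for (std::sregex_iterator it(text.begin(), text.end(), word_re), end;
         it != end; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

Calling tokenize on a sentence returns the words in order, with punctuation and whitespace dropped. A tagger would likely want to keep punctuation as tokens too, which this sketch deliberately leaves out.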

Storing Results

I have three requirements for the output format for my Part of Speech Tagger.

  1. The results must be in a format that can be saved for later use or evaluation.
  2. The results must be available in a human-readable format so that I can evaluate the results easily.
  3. The results must be available in an efficient, structured format that can be used easily by other language processing components.

To meet these requirements I’ll store the results using XML. I’m also using a technique called stand-off annotation. Traditional annotated text corpora embed the tags in the text being annotated. In stand-off annotation, the annotations are kept in a separate file and use character offsets to link each annotation to the text. The original text is never altered. This approach has several advantages. A single word or group of words may carry multiple tags, which can be difficult to interpret when the tags are embedded in the text. Keeping the original text clean is also necessary if the text is going to be analysed by successive processing steps. I was inspired to use XML stand-off annotation when experimenting with Apache UIMA. The book Natural Language Annotation for Machine Learning explores annotation techniques in detail and discusses the advantages of stand-off annotation.
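As a rough illustration of stand-off annotation, the sketch below keeps each tag as a pair of offsets into the unmodified text and writes the annotations out as XML. The struct, the function names, and the XML element and attribute names (annotations, pos, begin, end, tag) are all hypothetical, chosen for this example rather than taken from the tagger or from UIMA. It also assumes the offsets index into a std::string, so they are byte offsets, which only coincide with character offsets for ASCII text.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One stand-off annotation: a tag attached to the half-open range
// [begin, end) of the original, unmodified text.
struct Annotation {
    std::size_t begin;
    std::size_t end;
    std::string tag;
};

// Recover the annotated span from the original text using the offsets.
std::string covered_text(const std::string& text, const Annotation& a) {
    return text.substr(a.begin, a.end - a.begin);
}

// Serialize the annotations to a hypothetical XML layout. Tag values are
// written as-is, so this assumes they contain no XML special characters.
std::string to_xml(const std::vector<Annotation>& annotations) {
    std::ostringstream out;
    out << "<annotations>\n";
    for (const auto& a : annotations) {
        out << "  <pos begin=\"" << a.begin << "\" end=\"" << a.end
            << "\" tag=\"" << a.tag << "\"/>\n";
    }
    out << "</annotations>\n";
    return out.str();
}
```

For the text "The dog barks", an annotation with begin 4, end 7 and tag NN covers the word "dog". Because the text itself is never modified, a second component could add its own annotations over the same offsets without disturbing the first set.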

Next time I’ll describe the object model for the Part of Speech tagger. The design is a bit more complex than needed for this first experiment but I’m planning ahead, trying to anticipate an architecture that can evolve into a more general natural language processing framework.