The design for my simple Part of Speech tagger is a bit more complex than what is required for this simple task. However, I’m trying to create an architecture that I can extend in the future to encompass additional tasks. I am designing the C++ classes for this project so that they can be reused and extended, saving me time and effort in future projects. The class descriptions below also include my thoughts about how the classes may be enhanced to support more complex tasks. I’ll skip the usual UML class diagrams because I find they don’t contribute much to understanding the design.
PosAnnotator
The PosAnnotator class is a container for all of the component objects that make up the Part of Speech tagger. Its constructor creates all of the component classes that exist for the lifetime of the program. It also contains the main processing function for the annotator.
The PosAnnotator contains a TextReader object, a C++ vector of AnnotatorRule objects, and a C++ map used to hold PosAnnotation objects. A C++ map is like a dictionary. It is a sorted associative container: its elements are kept in key order, and any element can be looked up directly by its key. In this case the key used for ordering and access is the character position offset for the first character of the word associated with the annotation.
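As a rough illustration of the annotation map, here is a minimal sketch. The field names in PosAnnotation, the AnnotationMap alias, and the helper categoriesInTextOrder are assumptions made for this example, and a plain string stands in for the category.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified annotation. A string stands in for the category.
struct PosAnnotation {
    std::size_t begin;     // offset of the first character of the word
    std::size_t end;       // offset one past the last character of the word
    std::string category;  // e.g. "unknown", "determiner"
};

// Key: character offset of the first character of the annotated word.
using AnnotationMap = std::map<std::size_t, PosAnnotation>;

// std::map iterates in ascending key order, so this returns the categories
// in the order the words appear in the text, whatever the insertion order.
std::vector<std::string> categoriesInTextOrder(const AnnotationMap& annotations) {
    std::vector<std::string> result;
    for (const auto& entry : annotations) {
        result.push_back(entry.second.category);
    }
    return result;
}
```

Because the key is the word's starting offset, a rule can insert annotations in any order and still find the annotation for a given word, or walk to its neighbors, without a separate index.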
The PosAnnotator uses the TextReader to obtain a buffer of text. Each rule in the vector of rules is applied to the text buffer. The first rule is a default rule that tokenizes the text, identifying the character offsets for each word and classifying each word’s category as unknown. When a rule matches a word it creates a PosAnnotation object and places it in the map of annotations. After all of the rules have been applied, the annotator passes the annotation map to the AnnotationWriter to write the annotations to a file.
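The processing loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual code: the apply signature, the DefaultRule class name, the free function annotate, and the ASCII-only word pattern are all hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <regex>
#include <string>
#include <vector>

struct PosAnnotation {
    std::size_t begin;     // offset of the first character of the word
    std::size_t end;       // offset one past the last character of the word
    std::string category;
};
using AnnotationMap = std::map<std::size_t, PosAnnotation>;

// Assumed interface: every rule sees the text buffer and the shared map.
class AnnotatorRule {
public:
    virtual ~AnnotatorRule() = default;
    virtual void apply(const std::string& text, AnnotationMap& annotations) const = 0;
};

// Default rule: tokenizes the buffer and marks every word as "unknown".
// The pattern only covers ASCII letters and apostrophes (an assumption).
class DefaultRule : public AnnotatorRule {
public:
    void apply(const std::string& text, AnnotationMap& annotations) const override {
        static const std::regex word("[A-Za-z']+");
        for (std::sregex_iterator it(text.begin(), text.end(), word), last; it != last; ++it) {
            const std::size_t begin = static_cast<std::size_t>(it->position());
            const std::size_t end = begin + static_cast<std::size_t>(it->length());
            annotations[begin] = PosAnnotation{begin, end, "unknown"};
        }
    }
};

// Apply each rule, in order, to the same buffer and annotation map.
AnnotationMap annotate(const std::string& text,
                       const std::vector<std::unique_ptr<AnnotatorRule>>& rules) {
    AnnotationMap annotations;
    for (const auto& rule : rules) {
        rule->apply(text, annotations);
    }
    return annotations;
}
```

Putting the default rule first in the vector means every later rule can assume that each word already has an entry in the map.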
In a larger natural language processing application with several kinds of annotators, I would use an Annotator abstract base class and PosAnnotator would be a subclass.
TextReader
In this implementation TextReader simply reads a specified range of text from a file of raw UTF-8 text into a buffer. In future implementations TextReader would be a base class and the derived classes would provide more sophisticated services. For example, my goal is to process all text using wide characters to support languages that require them. One of the TextReader services would be to normalize all data to 16-bit Unicode. Other services would include accessing text from resources on the Internet and parsing HTML, XML, and PDF documents.
AnnotatorRule
The AnnotatorRule class is an abstract base class for the rules that classify words into part of speech categories. The rules do the real work of the annotator, processing the text and creating annotations. There are three kinds of rules based on my goal of applying the material in Carnie’s Syntax: morphological rules based on the suffixes of words, syntactic rules based on the categories of surrounding words, and closed class rules that match words in closed categories, which contain a small, fixed set of words such as conjunctions and determiners. For implementation purposes I have added a fourth kind, the default rule.
Each rule is a C++ class derived from AnnotatorRule. The morphological rules and the default rule are implemented using C++11 regular expressions (regex). The closed class rules are implemented using C++ maps. Each rule is given access to the annotation map. This allows the syntactic rules to access the categories for words surrounding the word being analyzed.
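To make the three kinds of rules concrete, here is a hedged sketch. The class names ClosedClassRule, SuffixRule, and AfterDeterminerRule, the apply signature, and the string categories are all assumptions for illustration. The "unknown word after a determiner is a noun" rule is a toy example, not a rule taken from Carnie. The sketch assumes a default rule has already filled the map with "unknown" entries.

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <regex>
#include <string>
#include <utility>

struct PosAnnotation {
    std::size_t begin;
    std::size_t end;
    std::string category;
};
using AnnotationMap = std::map<std::size_t, PosAnnotation>;

class AnnotatorRule {
public:
    virtual ~AnnotatorRule() = default;
    virtual void apply(const std::string& text, AnnotationMap& annotations) const = 0;
};

// Closed class rule: looks each word up in a small fixed word list.
class ClosedClassRule : public AnnotatorRule {
public:
    explicit ClosedClassRule(std::map<std::string, std::string> words)
        : words_(std::move(words)) {}
    void apply(const std::string& text, AnnotationMap& annotations) const override {
        for (auto& entry : annotations) {
            PosAnnotation& a = entry.second;
            const auto found = words_.find(text.substr(a.begin, a.end - a.begin));
            if (found != words_.end()) a.category = found->second;
        }
    }
private:
    std::map<std::string, std::string> words_;  // word -> category
};

// Morphological rule: classifies still-unknown words by a regex on the word.
class SuffixRule : public AnnotatorRule {
public:
    SuffixRule(const std::string& pattern, std::string category)
        : pattern_(pattern), category_(std::move(category)) {}
    void apply(const std::string& text, AnnotationMap& annotations) const override {
        for (auto& entry : annotations) {
            PosAnnotation& a = entry.second;
            const std::string word = text.substr(a.begin, a.end - a.begin);
            if (a.category == "unknown" && std::regex_match(word, pattern_)) {
                a.category = category_;
            }
        }
    }
private:
    std::regex pattern_;
    std::string category_;
};

// Syntactic rule (toy): an unknown word right after a determiner is a noun.
// It reads the neighboring annotation through the shared, ordered map.
class AfterDeterminerRule : public AnnotatorRule {
public:
    void apply(const std::string&, AnnotationMap& annotations) const override {
        if (annotations.empty()) return;
        for (auto it = std::next(annotations.begin()); it != annotations.end(); ++it) {
            if (it->second.category == "unknown" &&
                std::prev(it)->second.category == "determiner") {
                it->second.category = "noun";
            }
        }
    }
};
```

Since the map is ordered by character offset, the previous and next words are simply the adjacent map entries, which is what makes the syntactic rule a few lines long.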
PosAnnotation
The PosAnnotation class contains the results of the AnnotatorRule objects’ work. It consists of a span, which holds the character offsets for the start and end of the word, and the word’s category.
PosCategory
The PosCategory represents the part of speech category for each word. There is a derived class for each category. Currently the PosCategory is implemented using the Flyweight design pattern, but at this point there isn’t much functionality associated with the PosCategory classes, so I may just implement the categories as an enumeration.
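The enumeration alternative could look roughly like this. The list of enumerators and the toString helper are assumptions for illustration; the actual set of categories would come from the rules.

```cpp
#include <string>

// Hypothetical category list; the real tagger may use a different set.
enum class PosCategory {
    Unknown,
    Noun,
    Verb,
    Adjective,
    Adverb,
    Determiner,
    Conjunction
};

// Name of a category, e.g. for writing annotations to a file.
inline const char* toString(PosCategory category) {
    switch (category) {
        case PosCategory::Unknown:     return "unknown";
        case PosCategory::Noun:        return "noun";
        case PosCategory::Verb:        return "verb";
        case PosCategory::Adjective:   return "adjective";
        case PosCategory::Adverb:      return "adverb";
        case PosCategory::Determiner:  return "determiner";
        case PosCategory::Conjunction: return "conjunction";
    }
    return "unknown";  // unreachable for valid enumerators
}
```

An enumeration keeps each annotation small and cheap to compare, at the cost of having no place to hang per-category behavior if that is needed later.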
AnnotationWriter
The AnnotationWriter creates an XML document from a map of annotations along with information about the text document that was analyzed. The AnnotationWriter class will probably evolve into a base class, with derived classes designed for different types of annotations. Other derivations of the AnnotationWriter could generate the traditional embedded annotation in documents or more human readable HTML documents.
The actual code is mostly done. The last major part is the AnnotationWriter. I’m trying to decide the best way to generate the XML document. I’ve looked at the Boost Serialization Library, which can generate XML, but it only generates a very raw representation of an object’s data and is not very portable. I’m also evaluating the Apache Xerces-C++ library.
Next Steps
Once my simple Part of Speech Tagger is working I plan to test it against the Brown Corpus, available from the NLTK Corpus download page.
I also want to experiment with another Part of Speech tagger, the Brill Tagger (PDF). The Brill Tagger is interesting in that it combines training with a rule-based approach. There are two passes for training. In the first pass it records the part of speech category for each word, with a count of how many times the word is assigned each category. In the second pass, each word is assigned its most commonly used category. For the cases where this is not the correct category, the tagger creates a rule to reassign the word to the correct category.
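The counting and most-common-category steps described above can be sketched as follows. This is only an illustration of those two steps, not the Brill Tagger itself: it leaves out the rule learning entirely, and the names TaggedWord, TagCounts, countTags, and mostFrequentTags are assumptions.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// A (word, category) pair from a hand-tagged training corpus.
using TaggedWord = std::pair<std::string, std::string>;
// word -> (category -> number of times the word carried that category)
using TagCounts = std::map<std::string, std::map<std::string, int>>;

// First pass: count how often each word is assigned each category.
TagCounts countTags(const std::vector<TaggedWord>& corpus) {
    TagCounts counts;
    for (const auto& taggedWord : corpus) {
        ++counts[taggedWord.first][taggedWord.second];
    }
    return counts;
}

// Second pass: pick the most commonly used category for each word.
// On a tie, the category that sorts first alphabetically wins.
std::map<std::string, std::string> mostFrequentTags(const TagCounts& counts) {
    std::map<std::string, std::string> best;
    for (const auto& wordEntry : counts) {
        int bestCount = 0;
        for (const auto& tagEntry : wordEntry.second) {
            if (tagEntry.second > bestCount) {
                bestCount = tagEntry.second;
                best[wordEntry.first] = tagEntry.first;
            }
        }
    }
    return best;
}
```

Words whose most frequent category is wrong in a given context are the cases the learned rules are meant to correct.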
I have two goals in implementing the Brill Tagger. My first goal is to evaluate whether my software architecture is versatile and extensible by implementing a significantly different Part of Speech tagger. My second goal is to see what the final rules generated by the Brill Tagger’s training phase look like. I’m curious whether they tell us anything about how word categories are determined in natural language.