Embedded Systems Software Development Processes

Embedded systems software would be a large part of a robot brain’s software, carrying out the low level functions of controlling manipulators or processing sensory data. In fact, a design for a robot brain may include a network of specialized embedded systems processors. So sometimes I’ll depart from my primary line of enquiry, language, and talk about embedded systems software and projects for these lower level functions.

For most of my career, I’ve developed embedded systems software for telecommunications equipment, defense systems, and now network security. The management/user interface code for these systems is not always mission-critical, but the embedded software always is.

We don’t all write code with one shot at getting it right, like sending spacecraft to Mars, but embedded software errors can still cost money and even cause personal injury or death. I’ve seen misguided attempts to apply Agile Methods to embedded software, but they don’t support the level of quality and reliability required for mission-critical systems.

The article Mars Code, in the February 2014 issue of Communications of the ACM, describes the processes that JPL used to develop the software for the Curiosity Mars mission. NASA/JPL’s processes may be extreme for earth-bound systems, but there are some good ideas here that could be applied to other mission-critical embedded systems software.

Funnies: Robots and Language

My day job, protecting networks from evil-doers, has kept me away from working on Robot Brains this week so I’ll just post a couple of funnies.

I love Sarcastic Rover, the biggest baddest atomic robot on Mars.

 

XKCD about words from this week. Try this in your NLP application. ;-)

Simple Part of Speech Tagger: Object Model

The design for my simple Part of Speech tagger is a bit more complex than this simple task requires. However, I’m trying to create an architecture that I can extend in the future to encompass additional tasks. I am designing the C++ classes for this project so that they can be reused and extended, saving me time and effort in future projects. The class descriptions below also include my thoughts about how the classes may be enhanced to support more complex tasks. I’ll skip the usual UML class diagrams because I find they don’t contribute much to understanding the design.

PosAnnotator: The PosAnnotator class is a container for all of the component objects that make up the Part of Speech tagger. Its constructor creates all of the component objects, which exist for the lifetime of the program. It also contains the main processing function for the annotator.

The PosAnnotator contains a TextReader object, a C++ vector of AnnotatorRule objects, and a C++ map used to hold PosAnnotation objects. A C++ map is like a dictionary: an ordered container that can also be randomly accessed using a key. In this case, the key used for ordering and access is the character offset of the first character of the word associated with the annotation.

The PosAnnotator uses the TextReader to obtain a buffer of text. Each rule in the vector of rules is applied to the text buffer. The first rule is a default rule that tokenizes the text, identifying the character offsets for each word and classifying each word’s category as unknown. When a rule matches a word, it creates a PosAnnotation object and places it in the map of annotations. After all of the rules have been applied, the annotator passes the annotation map to the AnnotationWriter to write the annotations to a file.
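To make that structure concrete, here is a minimal sketch of the shape the PosAnnotator and its main processing loop might take. The stand-in types, method names, and signatures are my own placeholders for illustration, not the actual code.

```cpp
// Illustrative sketch of the PosAnnotator described above; the class and
// method names are placeholders, not the actual code.
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct PosAnnotation { std::size_t begin, end; std::string category; }; // stand-in

// Annotations keyed by the character offset of the word's first character.
using AnnotationMap = std::map<std::size_t, PosAnnotation>;

struct TextReader {                       // stand-in for the real TextReader
    std::string read(const std::string& /*path*/) const { return {}; }
};
struct AnnotatorRule {                    // stand-in for the rule base class
    virtual ~AnnotatorRule() {}
    virtual void apply(const std::string& text, AnnotationMap& annotations) = 0;
};
struct AnnotationWriter {                 // stand-in for the XML writer
    void write(const std::string& /*path*/, const AnnotationMap& /*annotations*/) const {}
};

class PosAnnotator {
public:
    // The constructor would build the reader, the rule vector (with the
    // default tokenizing rule first), and the writer; omitted here.
    void annotate(const std::string& inPath, const std::string& outPath) {
        const std::string text = reader.read(inPath);
        for (const auto& rule : rules)    // default rule first, then the others
            rule->apply(text, annotations);
        writer.write(outPath, annotations);
    }
private:
    TextReader reader;
    std::vector<std::unique_ptr<AnnotatorRule>> rules;
    AnnotationMap annotations;
    AnnotationWriter writer;
};
```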

In a larger natural language processing application with several kinds of annotators, I would use an Annotator abstract base class and PosAnnotator would be a subclass.

TextReader: In this implementation, TextReader simply reads a specified range of text from a file of raw UTF-8 text into a buffer. In future implementations, TextReader would be a base class and the derived classes would provide more sophisticated services. For example, my goal is to process all text using wide characters to support languages that require them. One of the TextReader services would be to normalize all data to 16-bit Unicode. Other services would include accessing text from resources on the Internet and parsing HTML, XML, and PDF documents.
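As a rough sketch of the current behavior, assuming a simple offset-and-length interface (my guess, not the actual API), the file reading might look something like this:

```cpp
// Minimal sketch of a TextReader that reads a byte range of a raw UTF-8 file.
// The interface (offset + length) is an assumption for illustration.
#include <fstream>
#include <stdexcept>
#include <string>

class TextReader {
public:
    // Read `length` bytes starting at byte `offset` of a raw UTF-8 file.
    std::string readRange(const std::string& path, std::streamoff offset,
                          std::size_t length) const {
        std::ifstream in(path, std::ios::binary);
        if (!in)
            throw std::runtime_error("cannot open " + path);
        in.seekg(offset, std::ios::beg);
        std::string buffer(length, '\0');
        in.read(&buffer[0], static_cast<std::streamsize>(length));
        buffer.resize(static_cast<std::size_t>(in.gcount())); // shrink on a short read
        return buffer;
    }
};
```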

AnnotatorRule: The AnnotatorRule class is an abstract base class for the rules that classify words into part of speech categories. The rules do the real work of the annotator, processing the text and creating annotations. There are three kinds of rules, based on my goal of applying the material in Carnie’s Syntax: morphological rules based on the suffixes of words, syntactic rules based on the categories of surrounding words, and rules that match words in closed categories, which contain a small number of unchanging words such as conjunctions and determiners. For implementation purposes, I have added a fourth kind: the default rule.

Each rule is a C++ class and is derived from AnnotatorRule. The morphological rules and the default rule are implemented using C++11 regular expressions (regex). The closed-class rules are implemented using C++ maps. Each rule is given access to the annotation map, which allows the syntactic rules to access the categories of the words surrounding the word being analyzed.
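A hedged sketch of what this hierarchy could look like: an abstract AnnotatorRule, a morphological rule built on a C++11 std::regex, and a closed-class rule built on a std::map. The names, patterns, and word lists are placeholders rather than the real rules.

```cpp
// Illustrative rule hierarchy; names, patterns, and word lists are placeholders.
#include <cstddef>
#include <map>
#include <regex>
#include <string>

struct PosAnnotation { std::size_t begin, end; std::string category; }; // stand-in
using AnnotationMap = std::map<std::size_t, PosAnnotation>;  // keyed by word offset

class AnnotatorRule {
public:
    virtual ~AnnotatorRule() {}
    // Every rule sees the full text and the shared annotation map, so
    // syntactic rules can inspect the categories of neighboring words.
    virtual void apply(const std::string& text, AnnotationMap& annotations) = 0;
};

// Morphological rule: classify a word by its suffix using a C++11 regex.
class SuffixRule : public AnnotatorRule {
public:
    SuffixRule(const std::string& pattern, const std::string& category)
        : re_(pattern), category_(category) {}
    void apply(const std::string& text, AnnotationMap& annotations) override {
        for (auto& entry : annotations) {
            PosAnnotation& a = entry.second;
            if (a.category != "Unknown") continue;          // already classified
            const std::string word = text.substr(a.begin, a.end - a.begin);
            if (std::regex_match(word, re_)) a.category = category_;
        }
    }
private:
    std::regex re_;
    std::string category_;
};

// Closed-class rule: classify by direct lookup in a small fixed word list.
class ClosedClassRule : public AnnotatorRule {
public:
    explicit ClosedClassRule(const std::map<std::string, std::string>& words)
        : words_(words) {}
    void apply(const std::string& text, AnnotationMap& annotations) override {
        for (auto& entry : annotations) {
            PosAnnotation& a = entry.second;
            auto hit = words_.find(text.substr(a.begin, a.end - a.begin));
            if (hit != words_.end()) a.category = hit->second;
        }
    }
private:
    std::map<std::string, std::string> words_;  // word -> category
};

// Example uses: SuffixRule(R"(\w+ment)", "Noun"); ClosedClassRule({{"the", "Determiner"}});
```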

PosAnnotation: The PosAnnotation class contains the results of the AnnotatorRule objects’ work. It contains a span, which is the pair of character offsets for the start and end of the word, and the word’s category.

PosCategory: The PosCategory class represents the part of speech category for each word. There is a derived class for each category. Currently, PosCategory is implemented using the Flyweight design pattern, but at this point there isn’t much functionality associated with the PosCategory classes, so I may just implement the categories as an enumeration.
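If I do fall back to an enumeration, the two data classes could be as small as this simplified sketch (the category list here just mirrors the open and closed classes mentioned in these posts, not a finished inventory):

```cpp
// Simplified sketch of the annotation data, using the enumeration alternative
// mentioned above instead of the Flyweight classes.
#include <cstddef>

enum class PosCategory {
    Unknown, Noun, Verb, Adjective, Adverb,
    Determiner, Preposition, Conjunction, TenseMarker, Negation, Complementizer
};

struct PosAnnotation {
    std::size_t begin;        // offset of the word's first character
    std::size_t end;          // offset one past the word's last character
    PosCategory category;     // the part of speech assigned by a rule
};
```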

AnnotationWriter: The AnnotationWriter creates an XML document from a map of annotations, along with information about the text document that was analyzed. The AnnotationWriter class will probably evolve into a base class, and derived classes will be designed for different types of annotations. Other derivations of the AnnotationWriter could generate traditional embedded annotations in documents or more human-readable HTML documents.
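Whichever XML library I settle on (more on that below), a hand-rolled sketch is enough to show the stand-off output shape I have in mind; the element and attribute names are placeholders, not a final schema.

```cpp
// Hand-rolled sketch of a stand-off annotation writer.
// Element and attribute names are placeholders, not a final schema.
#include <cstddef>
#include <fstream>
#include <map>
#include <string>

struct PosAnnotation { std::size_t begin, end; std::string category; }; // stand-in
using AnnotationMap = std::map<std::size_t, PosAnnotation>;

class AnnotationWriter {
public:
    void write(const std::string& path, const std::string& sourceDocument,
               const AnnotationMap& annotations) const {
        std::ofstream out(path);
        out << "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
        out << "<annotations source=\"" << sourceDocument << "\">\n";
        // One element per word; the offsets point back into the unmodified text.
        // A real writer would escape XML special characters; omitted for brevity.
        for (const auto& entry : annotations) {
            const PosAnnotation& a = entry.second;
            out << "  <pos begin=\"" << a.begin << "\" end=\"" << a.end
                << "\" category=\"" << a.category << "\"/>\n";
        }
        out << "</annotations>\n";
    }
};
```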

The actual code is mostly done. The last major piece is the AnnotationWriter. I’m trying to decide on the best way to generate the XML document. I’ve looked at the Boost Serialization Library, which can generate XML, but it only produces a very raw representation of an object’s data and is not very portable. I’m also evaluating the Apache Xerces-C++ library.

Next Steps: Once my simple Part of Speech Tagger is working I plan to test it against the Brown Corpus, available from the NLTK Corpus download page.

I also want to experiment with another Part of Speech tagger, the Brill Tagger (PDF). The Brill Tagger is interesting in that it combines training with a rule-based approach. There are two passes for training. In the first pass, it records the part of speech category for each word, with a count for each time the word is assigned a particular category. In the second pass, each word is assigned its most commonly used category. For the cases where this is not the correct category, the tagger creates a rule to reassign the word to the correct category.
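As I understand those two passes, a stripped-down version might look like the sketch below: the first pass counts how often each word receives each tag in the training corpus, and the second pass tags every word with its most frequent tag and records the mis-tagged cases that contextual rules would need to correct. The data structures and the single “previous tag” rule template are my own simplifications of Brill’s method.

```cpp
// Simplified sketch of the two Brill training passes described above.
// The "previous tag" template is only one of Brill's rule templates,
// and the data structures are my own simplifications.
#include <map>
#include <string>
#include <utility>
#include <vector>

using TaggedWord = std::pair<std::string, std::string>;   // (word, correct tag)
using Sentence   = std::vector<TaggedWord>;

struct CandidateRule {       // "change fromTag to toTag when preceded by prevTag"
    std::string fromTag, toTag, prevTag;
};

// Pass 1: count how often each word is assigned each tag in the corpus.
std::map<std::string, std::map<std::string, int>>
countTags(const std::vector<Sentence>& corpus) {
    std::map<std::string, std::map<std::string, int>> counts;
    for (const auto& sentence : corpus)
        for (const auto& tw : sentence)
            ++counts[tw.first][tw.second];
    return counts;
}

// Most frequent tag for a word (empty if the word was never seen).
std::string mostFrequentTag(const std::map<std::string, std::map<std::string, int>>& counts,
                            const std::string& word) {
    auto it = counts.find(word);
    if (it == counts.end()) return "";
    std::string best;
    int bestCount = 0;
    for (const auto& tc : it->second)
        if (tc.second > bestCount) { best = tc.first; bestCount = tc.second; }
    return best;
}

// Pass 2: tag each word with its most frequent tag and, where that tag is
// wrong, record a candidate contextual rule that would fix it.
std::vector<CandidateRule>
collectCandidates(const std::vector<Sentence>& corpus,
                  const std::map<std::string, std::map<std::string, int>>& counts) {
    std::vector<CandidateRule> candidates;
    for (const auto& sentence : corpus) {
        std::string prevTag = "<START>";
        for (const auto& tw : sentence) {
            const std::string guess = mostFrequentTag(counts, tw.first);
            if (!guess.empty() && guess != tw.second)
                candidates.push_back({guess, tw.second, prevTag});
            prevTag = guess.empty() ? tw.second : guess; // context is the tagger's own output
        }
    }
    // The real tagger then scores the candidates, keeps the rules that fix the
    // most errors, and iterates; that step is omitted here.
    return candidates;
}
```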

I have two goals in implementing the Brill Tagger. My first goal is to evaluate whether my software architecture is versatile and extensible by implementing a significantly different Part of Speech tagger. My second goal is to see what the final rules generated by the Brill Tagger’s training phase look like. I’m curious whether they tell us anything about how word categories are determined in natural language.

Simple Part of Speech Tagger: Some Design Decisions

I’m still working on the simple Part of Speech Tagger. It’s been a while since I posted anything about it so I wanted to document some of the design decisions.

Programming Paradigms and Languages

The first decision was the programming paradigm and language to use. Theoretically, the programming language should have little impact on an application’s design, but in reality, the programming language’s paradigm and capabilities influence many design decisions.

The Prolog programming language is based on formal first-order logic. Its declarative approach allows a natural expression of rules and knowledge, making it a popular choice for rule-based expert systems in the 1980s. Prolog also includes direct support for parsing through its Definite Clause Grammar rule notation. Early work in Natural Language Processing, such as SRI’s Core Language Engine (PDF), used the Prolog programming language. Logic programming and Prolog were also the choice for the Computational Semantics text, Representation and Inference for Natural Language. While Prolog is still well supported by Free software and open source projects, it has fallen out of favor as Artificial Intelligence and Natural Language Processing have moved away from first-order logic approaches.

Traditionally, Artificial Intelligence work, including Natural Language Processing, has used the functional programming paradigm. Like logic programming, functional programming is based on formal logic, in this case the Lambda Calculus. First-order logic on its own is not expressive enough to write useful programs, so Prolog has to include additional control structures; functional programming languages can implement the Lambda Calculus more purely. Proponents of functional programming feel that it expresses linguistic knowledge more intuitively. Current Natural Language Processing projects that take a functional programming approach favor the modern functional language Haskell. The text Computational Semantics with Functional Programming teaches Haskell programming while teaching Computational Semantics. The Grammatical Framework is a system for generating multilingual natural language parsers; it implements a grammar description language whose syntax borrows heavily from Haskell, and it is itself written in Haskell.

Finally, there are the programming language pragmatists, who are not so concerned with programming paradigms or theoretical underpinnings. Pragmatists are more concerned with good support and wide acceptance of a programming language, in part so that their work is accessible to the widest possible audience. Projects in this group include the Python-based NLTK (Natural Language Toolkit) and the Java-based Apache OpenNLP and Apache UIMA (Unstructured Information Management Architecture) projects. Python and Java are two of the most widely used programming languages and are flexible enough to support a number of programming paradigms.

In the end, my decision is a pragmatic one. I chose object-oriented programming in C++ because I am fluent in it. Professionally, I program almost exclusively in C, but up until my current job I was writing most of my code in C++. I don’t want to lose that skill, since it takes years of experience to become proficient in object-oriented programming in C++, and understanding all of the nuances of the C++ language and the standard library requires reading nearly 800 pages of Stroustrup’s The C++ Programming Language. In addition, the C++11 standard was finalized after I became a full-time C programmer, and I need a project that will let me try out the new capabilities. The C++11 standard library includes a number of new features useful for natural language processing, including regular expressions and Unicode support. Like Python, C++ is versatile enough to support multiparadigm programming. For example, the Castor C++ library provides logic programming support for C++. Finally, Prolog, Python, and Haskell all provide mechanisms to integrate with C/C++, so I can implement these interfaces later if I decide to take advantage of a higher-level language.

Storing Results

I have three requirements for the output format for my Part of Speech Tagger.

  1. The results must be in a format that can be saved for later use or evaluation.
  2. The results must be available in a human-readable format so that I can evaluate the results easily.
  3. The results must be available in an efficient, structured format that can be used easily by other language processing components.

To meet these requirements, I’ll store the results using XML. I’m also using a technique called stand-off annotation. Traditional annotated text corpora embed the tags in the text being annotated. In stand-off annotation, the annotations are kept in a separate file and use character offsets to link each annotation with the text. The original text is never altered. This approach has several advantages. There may be multiple tags for a single word or group of words in the text, which can be difficult to interpret when the tags are embedded in the text. Keeping the original text clean is also necessary if the text is going to be analysed by successive processing steps. I was inspired to use XML stand-off annotation when experimenting with the Apache UIMA. The book Natural Language Annotation for Machine Learning explores annotation techniques in detail and discusses the advantages of stand-off annotation.
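A tiny illustration of the idea, with made-up types and offsets: the original text is never touched, and each annotation is just a span of character offsets plus a label, so any later processing stage can recover the word by slicing the text.

```cpp
// Minimal demonstration of stand-off annotation: the text is untouched and
// each annotation only stores character offsets plus a label.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Annotation {
    std::size_t begin, end;   // character offsets into the original text
    std::string label;        // e.g. a part of speech category
};

int main() {
    const std::string text = "The cat sat.";
    // Annotations live outside the text and refer back to it by offset.
    const std::vector<Annotation> annotations = {
        {0, 3, "Determiner"}, {4, 7, "Noun"}, {8, 11, "Verb"}
    };
    for (const auto& a : annotations)
        std::cout << text.substr(a.begin, a.end - a.begin)
                  << " -> " << a.label << "\n";
    // Output:
    //   The -> Determiner
    //   cat -> Noun
    //   sat -> Verb
}
```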

Next time I’ll describe the object model for the Part of Speech tagger. The design is a bit more complex than needed for this first experiment but I’m planning ahead, trying to anticipate an architecture that can evolve into a more general natural language processing framework.

Time to Build: Developing a Part of Speech Tagger

Hackers generally learn best either by taking things apart or by building things. So after reading an introduction to parts of speech in Carnie’s Syntax, it’s time to build something: a toy Part of Speech tagger for English.

When we think of parts of speech, we usually use semantic roles. Nouns are places, people, or things. Verbs are actions. But when studying Generative Grammar, we assign parts of speech according to syntactic roles. Determining parts of speech can be a bit of a chicken-and-egg problem. On the one hand, we need to know the part of speech assigned to each word in order to parse a sentence. On the other hand, syntactic rules provide clues for identifying each word’s part of speech. In fact, syntactic role is the final determiner of part of speech, since many words take on different identities depending on their context.

Fortunately, there are other clues to a word’s part of speech. The first is the structure of the word itself, or its morphology. In linguistics, morphology is the study of how words are constructed from smaller components called morphemes. For example, words ending in -ment are nouns, such as basement. Plural nouns take the suffix -s or -es, as in deaths and taxes.

Most words belong to what are called open classes. Open classes are large and easily take on new words. There are also closed classes, which are relatively small and rarely take on new words. The closed classes include determiners, prepositions, conjunctions, tense markers, negation, and complementizers. We can assign a part of speech to these closed-class words using a direct match and then use some basic syntactic rules to assign parts of speech to nearby words.
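For example, a toy sketch of that two-step idea, with a made-up word list and a single made-up syntactic rule (a word right after a determiner is probably a noun), might look like this:

```cpp
// Toy sketch: tag closed-class words by direct lookup, then apply one simple
// syntactic rule. The word list and the rule are illustrative, not Carnie's.
#include <cstddef>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

int main() {
    const std::map<std::string, std::string> closedClass = {
        {"the", "Determiner"}, {"a", "Determiner"}, {"and", "Conjunction"},
        {"in", "Preposition"}, {"not", "Negation"}, {"that", "Complementizer"}
    };

    std::istringstream sentence("the robot read a book");
    std::vector<std::pair<std::string, std::string>> tagged;
    for (std::string word; sentence >> word; ) {
        auto hit = closedClass.find(word);
        tagged.emplace_back(word, hit != closedClass.end() ? hit->second : "Unknown");
    }

    // Syntactic clue: an unknown word immediately following a determiner is
    // most likely a noun (ignoring adjectives for this toy example).
    for (std::size_t i = 1; i < tagged.size(); ++i)
        if (tagged[i].second == "Unknown" && tagged[i - 1].second == "Determiner")
            tagged[i].second = "Noun";

    for (const auto& wt : tagged)
        std::cout << wt.first << "/" << wt.second << " ";
    std::cout << "\n";  // the/Determiner robot/Noun read/Unknown a/Determiner book/Noun
}
```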

The Part of Speech tagger will use a combination of closed classes, morphological analysis, and syntactic rules, as described in Carnie’s Syntax. An introductory chapter on word categories is by no means comprehensive, so a Part of Speech tagger based on this limited information will be a toy or, at best, a prototype. But building this application will reinforce what I’ve learned and perhaps provide a base that I can build on. Once I’ve completed the design and written the code, I’ll publish them here.

Noam Chomsky and Generative Grammar

You can read about Noam Chomsky and Generative Grammar on Wikipedia or in linguistics textbooks, but I want to explain why I’m studying Chomsky and why his Generative Grammar is important for my project.

In the late 1950s, Noam Chomsky revolutionized linguistics by applying formal languages, as defined in mathematics, to the study of natural languages. Chomsky’s initial theory, Transformational Grammar, posited two representations for sentences in a language: a deep structure and a surface structure. The deep structure corresponds to the semantic meaning of a sentence, the representation we use for reasoning. The surface structure corresponds to the phonological language that we actually speak or hear. The grammar defines a set of transformational rules that map sentences between these two forms. So theories of grammar are central to the understanding of how natural languages work.

In the 1960s and 1970s, Chomsky’s approach was questioned by many of his students and colleagues, leading to what has been called “The Linguistics Wars”. A number of alternative theories of grammar branched off, including Lexical Functional Grammar (LFG) and Head-Driven Phrase Structure Grammar (HPSG). Chomsky has refined his theory since then; his current approach is called the Minimalist Program or Minimalist Syntax. Chomsky’s theories came to be called Mainstream Generative Grammar. At this point I don’t know enough about Mainstream Generative Grammar to have an opinion about its validity, so my starting point has to be the study of Chomsky’s Mainstream Generative Grammar.

One alternative theory I find intriguing is The Parallel Architecture proposed by Ray Jackendoff. Jackendoff asserts that the Parallel Architecture preserves several important aspects of Chomsky’s Mainstream Generative Grammar but aligns better with recent discoveries in cognitive science. Jackendoff also rejects the complexity of Chomsky’s latest approach, Minimalist Syntax. The book Simpler Syntax, by Peter Culicover and Ray Jackendoff, examines the Parallel Architecture in detail. But Simpler Syntax contrasts The Parallel Architecture with Mainstream Generative Grammar so an understanding of Chomsky’s approach is still required.

I am starting my studies with the book Syntax: A Generative Introduction by Andrew Carnie. The book is an introductory textbook on syntactic theory aimed at upper-division undergraduates, and it is based primarily on Principles and Parameters, the predecessor to Chomsky’s Minimalist Program. Carnie also provides introductory chapters on Lexical Functional Grammar and Head-Driven Phrase Structure Grammar on his book’s website. In the preface, Carnie admits to glossing over the thornier issues of linguistic theory, so I will need to follow Carnie’s Syntax with a second book. Originally, I started reading Minimalist Syntax: Exploring the Structure of English by Andrew Radford but found it a bit too dense for a first book on Generative Grammar. Carnie actually recommends Radford’s book for further study, and I plan to return to it after completing Carnie’s Syntax.

Book Review: The Language Instinct by Steven Pinker

Reading The Language Instinct by Steven Pinker motivated me to delve deeper into the study of linguistics, beyond the introductory material presented in natural language processing textbooks. The Language Instinct is an accessible, entertaining, and at times humorous introduction to several important ideas in linguistics, written for a general audience. In the book, Pinker focuses on Universal Grammar, the linguistic theory developed by Noam Chomsky. In describing and defending Chomsky’s theories, Pinker uses anecdotes and examples from a wide variety of studies, including his own work on language acquisition in children. But the book is not limited to grammar. Pinker explores a wide range of subjects, including the history and development of modern languages, the areas of the brain that process speech (as implied by speech pathologies associated with injuries to specific brain areas), and speculation about how speech processing might be implemented from a connectionist view of neural networks.

The central theme of the book is that human beings have an innate talent, or instinct, for acquiring language. The language instinct is embodied in the brain as the language faculty or language organ. The abilities and limitations of the language faculty, and thus the structure of language, are determined by the genetic makeup of humans. Supporting the theory of an innate human language faculty is the idea of a Universal Grammar underlying all of the world’s languages. While languages differ, they differ in predictable ways that are constrained by the limits of the human language faculty. Delving deeper, Pinker asserts that there is an internal representation of ideas in the mind that does not depend on language for conceptualization. We have all had thoughts that we find difficult to put into words, so there must be a richer mental representation than written or spoken language.

Pinker has come under a bit of criticism within the field for defending Chomsky too strongly. His critics feel that The Language Instinct tells only half of the story of linguistic inquiry. Chomsky’s Universal Grammar is not the entire field of linguistics, nor even a consensus view of grammar. Within linguistics there is a historically deep division between those who study language as an innate faculty and those who study language as a cultural invention. The truth, and the most interesting area of study, probably lies in the messy middle ground between these two camps. While Pinker unabashedly defends Chomsky’s approach, he does tell you that there are other ways of looking at language and the mind. He tells you when he is saying something controversial and when he is speculating.

There are few books on linguistics written to explain the field to a wide audience. Even if Pinker’s approach is not comprehensive and emphasizes Universal Grammar over other approaches, you will still learn a great deal about language and linguistics by reading his book. If you have never studied linguistics, The Language Instinct is a good starting point to motivate further study.