Embedded Systems Software, Computer Networking and Geeky Fun

nerd1951.com

April 9, 2007

Cannonball Update: Packet Demultiplexing

Filed under: Projects, Cannonball — harvey.sugar @ 9:37 pm

I was reading about packet demultiplexing in Network Algorithmics during my flight to San Jose last week. The chapter discussed the advantages of “early demultiplexing” vs the traditional protocol layer-by-layer demultiplexing. In early demultiplexing the packet’s entire path through the protocol stack is determined in one operation as I discussed in Cannonball Update: A Radical TCP/IP Architecture.

Two high-performance packet demultiplexers were discussed in the chapter: Pathfinder and the Dynamic Packet Filter. Of the two implementations, the Dynamic Packet Filter is faster but requires run-time code generation and I feel that this is beyond the scope of my project.

Both of these implementations are based on a data structure called a trie (from retrieval). As shown in the diagram the data is laid out in a tree like structure but the nodes do not contain the keys. Rather a complete key is represented by the position of the node in the trie. In a demultiplexing scheme, each node would contain a header field, a mask, and a bit pattern to match. For example the root of the tree would decode the Ethernet protocol type field.

trie example

Evgeniy Polyakov is working on tries for the Linux TCP/IP stack as he discusses in his blog: Zbr’s days.

Right now I’m designing classes to implement a demultiplexer similar to Pathfinder then I plan to generalize the classes into template classes. These will be the first component in my Networking Protocol Library.

• • •
 

March 26, 2007

Cannonball Update: A Radical TCP/IP Architecture

Filed under: Projects, Cannonball — harvey.sugar @ 5:36 am

I’ve been looking at several papers about TCP/IP stack architectures similar to the one proposed by Van Jacobson (and others, see the x-kernel for example). The basic idea is that packets are de-multiplexed as they enter the system and passed directly to the target application. Then the application thread is responsible for performing all the protocol processing. The application accomplishes the protocol processing by calling library functions for each layer. When packets are transmitted, the application calls library functions to push the protocol headers on to the packet and then the application queues the packet for transmission.

Radical software architecture for TCP/IP

I really like this approach because it illustrates how C++ classes can be used to implement the concepts of a layered protocol while the execution model is completely separate for the layer concept. C++ classes would be used to implement the protocol processing libraries. The de-multiplexer could also use layer-based methods for parsing the packets. So, while I am writing the code I can think of the protocols in a natural way – as protocol layers. When packets are processed by the system, the layers are virtually eliminated from the execution path.

All of the queues in this model have a single source thread and a single destination thread so lock free queues can be used. This would result in a considerable improvement in performance which was Van Jacobson’s goal in proposing this architecture. I believe that C++ makes this architecture easier to implement since each application can instantiate its own objects for the lower level protocols. This allows a natural sharing of protocol code among application threads.

It is a little more complicated than the diagram shows. Several of the protocol layers will need to run their own timer threads and some packets are destine for lower protocol layers. I’ll work out these issues as I move into a detailed design.

• • •
 

March 21, 2007

Cannonball Update: Test Environment

Filed under: News, Projects, Cannonball — harvey.sugar @ 8:49 pm

In other news I have a target system set up.  It’s an old Dell Pentium III running at 800 MHz. with 256 Megabytes of RAM.  I’m running Ubuntu Linux since it is easy to configure a minimal system with Ubuntu.  It’s basically just a shell – no GUI, GCC and friends, FTP, SSH and the Kernel.  The shell and FTP are pretty snappy even on this old machine.  I think it’s a tribute to the performance of the Linux Kernel.

My next step is to set up a separate network for testing my TCP/IP stack without bringing down the rest of my computers and my Internet link.  I plan to put a second NIC card in my Fedora Linux desk-top and into the Ubuntu test machine and put them on their own 100Base-T switch.  I’ve done enough protocol software testing to know that you don’t want to let an experimental protocol stack talk directly on your production network or the Internet.

• • •
 

CannonBall Update: Initializing the Protocol Stack

Filed under: News, Projects, Cannonball — harvey.sugar @ 8:43 pm

I’m still working on the architecture for my TCP/IP stack.  Specifically I’m trying to define a common interface for a protocol layer.

One problem that’s been bugging me with the TCP/IP protocol stack is initialization.  I’ve often worked on embedded systems where initialization was nor well planned and was sort of done ad-hoc.  This leads to all kinds of problems down the road with race conditions or when you add a new feature and find that something you need to be initialized isn’t.

What makes this even more complex for a protocol stack is that each layer needs to know what services are available from the lower layers.  The lower layers don’t know what layers above them are using their services.  You don’t want to hard code this stuff because it makes it harder to add new protocols to the system.

In Linux, the TCP/IP stack is initialized within the Kernel.  The order of initialization is pretty much hard coded but network interfaces can be stopped or started while the system is running.  The problem that I have with Linux is that much of the configuration information that is required to initialize networking is scattered in multiple files in different directories under /etc.  Network applications are started after the TCP/IP stack is running.

What I want to accomplish is to have all the configuration information in some common database and a single overall controller that can start, stop and reconfigure components of the TCP/IP stack.  Once I figure this part out I can define the interface between the controller and a protocol.  Then I can define the interface between protocol layers.

• • •
 

March 17, 2007

Cannonball Update: Notes on Network Stack Architectures

Filed under: News, Projects, Cannonball — harvey.sugar @ 11:25 pm

I’ve started working on an architecture for my TCP/IP stack and I am deep into designing the abstract base class for the protocol layers.  Many network researchers advise against implementing protocols in the same layers as the protocols are specified.  The conventional wisdom is that this approach hurts performance.

I believe this stems from the original TCP/IP implementation done at Berkley in the early 1980s.  The Berkeley TCP/IP stack was widely distributed with BSD Unix and was the reference implementation for TCP/IP.  One major performance issue in the original BSD Unix TCP/IP stack was that each protocol layer was implemented as a separate process.  This approach burned up a lot of CPU cycles and as a result the BSD TCP/IP implementation was very inefficient.

The problem with the Berkley approach was that they used an operating system construct to do conceptual partitioning.  Conceptual partitioning is the method you use to divide a complex problem into manageable parts with each part representing a concept.  We do this all the time in software engineering, starting with dividing a system into subsystems, then modules or classes and finally individual functions.  When networking protocols were being designed, protocol layers provided a conceptual framework for partitioning networking functions.  This is the origin of the ISO seven layer model for networking protocols or the five layers used in TCP/IP.

The performance problem is not from partitioning the software into same protocol layers.  The issue is the mechanism you use for dividing the software into layers.  Processes or threads are not designed to be used for conceptual partitioning.

C++ classes on the other hand provide an ideal mechanism for conceptual partitioning.  This is what classes and objects were designed to do.  I can design an abstract protocol layer base class and assign networking functions to each concrete layer class just as the protocol layers are specified.  At the same time I can design the threading model for the networking protocol stack to obtain maximum performance and efficiency.

• • •
 

March 13, 2007

Cannonball Update: Moving Beyond Benchmarks

Filed under: News, Cannonball — harvey.sugar @ 7:29 pm

Now I remember why I don’t like benchmarks much. Benchmark results are often hard to interpret as I showed with the state machine measurements. Measuring one small feature or execution path out of the context of an application does not provide meaningful results. The proper way to optimize a system is to start with a solid design, build the system, instrument the system to find the bottle necks and optimize those areas. Then you measure again to determine if the optimization was effective. So I’m changing direction yet again and starting to design my TCP/IP stack using modern C++ concepts.

Designing and implementing a TCP/IP stack is a big job for one person to take on part time as I’ve noted in the past.  Here are the tactics I’ll use to make this feasible:

  • Use modern C++ design techniques including Generic Programming and Design patterns.  One of the major goals of this exercise is to show how using modern C++ programming makes the programmer or software engineer more productive without sacrificing performance.
  • Narrow the functionality.  I’ll only implement the minimum that I feel is essential for providing realistic performance results: IP, ICMP, UDP and a TFTP application that stores data in RAM.
  • Use Linux as both the development environment and the execution environment.  Linux provides a very productive development environment that includes the C++ Standard Library, the GNU C++ compiler GCC, and the Eclipse IDE.  Linux also provides tools for profiling and measuring code performance and tools for measuring network performance.  Finally, I can measure my TCP/IP stack against the Linux TCP/IP stack so I can do an almost apples-to-apples performance comparison.

The socket interface in Linux provides access to raw layer 2 (Ethernet) packets.  This is done by creating a packet socket:

packet_socket = socket(PF_PACKET, int socket_type, int protocol);

by specifying the protocol family to be PF_PACKET and the socket_type to be SOCK_RAW (see packet(7) - Linux man page for incomplete details). This socket option is what is used by sniffer programs to gain access to raw packets.  This also allows a TCP/IP stack to be implemented as a user space application.  This approach is the mainstream implementation method for research TCP/IP stacks.

I don’t plan to stop there with yet another research TCP/IP stack.  I already have some plans for the release beyond Cannonball.  My plan includes porting the TCP/IP stack to two or three embedded processors, ARM, PowerPC, and ColdFire.  I also plan to fill out the stack in the next release by implementing TCP and some additional protocols.

• • •
 

March 7, 2007

Cannonball Update: Some State Pattern Results (Abstraction Costs)

Filed under: News, Cannonball — harvey.sugar @ 10:22 pm

I’ve finished a very simple comparison of two state machine implementations; the State Design Pattern as described in the “Gang of Four” Design Patterns book and a C style state machine using switch statements.  I’ve taken a minimalist approach because I’m busy with studying C++ Templates and getting ready for the Embedded Systems Conference.  If you want to see the code you can download it from here:

StatePattern.tar.gz
StatePattern.zip
StateSwitch.tar.gz
StateSwitch.zip

To be brief, I’ve stripped away almost all the code except the state transitions themselves.  My goal is to measure the cost of using C++ virtual functions.  I had each implementation process one billion random events each of which cause one state transition.  The results were that the State Pattern implementation averaged about 45 nanoseconds per event and the C style implementation averaged about 23 nanoseconds on a 2 GHz. Pentium 4 running Fedora Core Linux.

There are two ways of interpreting these results.  I want to be careful here because there is so much mythology about the cost of virtual functions in the embedded systems community.  Some embedded systems programmers will totally avoid using virtual functions or any kind of class hierarchies because of the overhead incurred.

So you could look at this and say “Look, using virtual functions took TWICE as long as using the C style implementation!” But on the other hand, we are talking about 22 nanoseconds.  Let’s keep that in perspective.  Even if we scale that for the clock rate of a typical embedded processor running at say 60 MHz. the difference is still under half a microsecond and this ignores the fact that most embedded processors have more registers than an x86 processor so they could gain some additional performance in handling virtual functions. 

There are plenty of better places in a program to pick up performance gains without resorting to ignoring major language features.  As Knuth said “Early optimization is the root of all (programming) evil.”  Note also that I’m not talking about all embedded systems here.  I wouldn’t suggest using C++ and all of it’s features in a small embedded system based on a PIC or an Atmel AVR processor for example.  C++ is designed for larger more complex tasks and there are plenty of embedded systems that fall into that catagory.

I’ve ordered my NetBurner MOD5270LC Development Kit which is based on a 100 MHz. Coldfire Processor.  I’ll be able to get a more concrete result for a typical embedded system when I run the same tests on that.

• • •
 

February 27, 2007

Quick Cannonball Update

Filed under: News, Projects, Cannonball — harvey.sugar @ 2:42 pm

I’ve been running benchmarks that compare the run time of state machines implemented as the State Design Pattern vs. a switch statement implementation and a function table implementation. So far the results have been inconsistant and I am debugging the benchmark code and trying to interpret the results. The final results and the bench mark code should be posted in the next few days.

• • •
 

February 13, 2007

Cannonball Update: Behind the scenes

Filed under: News, Cannonball — harvey.sugar @ 11:59 am

It’s been very quiet around here on nerd1951.com but that doesn’t mean that nothing is going on. I’ve been hard at work behind the scenes; coding performance tests for the various software components that I will need to implement Cannonball.

I started working with the FIFO queues but I still have a lot of work to do before I’ll have any data to report. There are so many implementation options that performance testing the FIFO queues is a pretty big job. In the mean time I’ve started testing the “abstraction costs” of using the State Pattern (see Design Patterns) vs. the traditional state machine implementations in C using case statements or arrays of functions. I started on this test because I already have source code for these implementations. I just needed to strip out some extra stuff so that I’m only measuring the difference in the state machine implementations and not any extra code.

So what is “abstractions costs?” Abstraction costs are the additional resources, memory space and instruction cycles, required to implement a piece of software at a higher level of abstraction. C++ is designed to allow us to consciously make the tradeoff between abstraction and performance but we need to understand what the abstraction costs are in order to make an informed decision.

The other thing slowing down the FIFO Queue evaluation is that like most embedded systems C++ programmers I haven’t paid much attention to templates up until recently.  I found that the coverage of templates in Stroustrup’s The C++ Programming Language did not include all the techniques used by Alexandrescu in Modern C++ Design so I just picked up a copy of C++ Templates: The Complete Guide to dig deeper into programming with C++ templates.

Templates have gotten a bad rap in the embedded systems community. Most embedded system programmers perceive C++ Templates as an unnecessary complexity. The other problem is that until recently many cross-compilers, used by embedded systems programmers, have not done a good job of supporting templates or the C++ standard library. Now days a lot of embedded systems development is done using the GNU Gcc compiler which does a good job of supporting templates.

• • •
 

January 27, 2007

Cannonball Update: Library classes and test code

Filed under: News, Cannonball — harvey.sugar @ 8:34 am

I’ve started a list of classes from the Standard C++ library and the Boost C++ Library that will be useful in implementing a network protocol stack. The list is on the C++ library classes page under Design Notes. In some cases I may use the policy classes from the Loki C++ library for more flexibility or write my own classes that implement interfaces defined in the Standard C++ library.

In other news, I’ve started writing the test code for various queue class implementations. The purpose is to assess if I can obtain any performance gains by writing specialized container class implementations using policy classes for specific locking policies: lock free, locked reader, locked writer, and double locked, as described on the FIFO Queue design notes page. My plan is to implement a specialized container class so that I can use the queue class interface specified in the Standard C++ library.

• • •
 
Next Page »