We embedded systems programmers have some special challenges in our work. One of the major problems we face is that the hardware we work with is often not fully exercised until our software exercises it. So, it is tempting to blame the hardware when things don’t work right and the cause is not obvious. I’ve done my share of blaming the hardware, sometimes to the point of embarrassment when it turned out I was wrong.
Over the years I’ve learned to work with the hardware engineers to solve a problem rather than point my finger. I’ve learned the hard way to devise some low-level tests to isolate a suspected hardware problem before I go bother the hardware designer. Sometimes a ’scope or a logic analyzer is the best software debugging tool an embedded systems programmer can have. But sometimes you run across a bizarre software problem that masquerades as an obvious hardware problem. A couple of these kinds of bugs will change your approach to debugging forever.
Will Rogers once said, “There are three kinds of men. The one that learns by reading. The few who learn by observation. The rest of them have to pee on the electric fence for themselves.” For those who learn by reading or observation, here are a couple of war stories about obvious hardware problems that weren’t.
The first story involves a very successful piece of test equipment. If you ever worked with T1s and I told you the model number you would probably recognize it. I worked on this product toward the end of its life cycle. I did some of the last feature upgrades and became responsible for software maintenance on it. This test set used a popular Intel counter-timer chip. The same chip was used in the original IBM PC and there are incarnations of it in the system chips of personal computers to this day. NEC was a second source for this part but for some reason this test set would not work with the NEC version of the timer chip. There was even a note on the BOM (Bill of Materials) that only the Intel part could be used.
Well, one day, something fell through the cracks and a batch of these units was built with the NEC part. I got a call from production test that the software was failing in these units. So, I went over to the factory and picked up one of the failing units to look at with an ICE. As soon as I took the unit apart, I saw the NEC timer and knew that was the problem, but when I called the production manager, he insisted that I should take a second look at the software. We used the same NEC timer in several other products with no problems, and it was getting harder to get the Intel part. I was certain that it was not a software problem. There were literally hundreds of these test sets in the field with this software, working just fine. I couldn’t very well say no, so I set up a couple of breakpoints and ran the unit through its paces. As I single-stepped through the ISR for the timer for about the tenth time, I noticed that it pushed one more register on the stack than it popped off at the end. There it was in x86 assembly code, a function that should have crashed every time no matter what timer chip was used. I fixed the routine and the test set worked just fine with the NEC part from then on. To this day I have no idea why the problem never showed up with the Intel timer chip, but if it weren’t for the persistence of that production manager, that bug might have shown up at another time in another way.
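A push/pop mismatch like this is easy to miss by eye and easy to catch mechanically. The original ISR is not reproduced here, so the listing below is purely hypothetical, and the checker is only a minimal sketch: it assumes one instruction per line, lowercase mnemonics, no labels, and it counts only plain `push` and `pop` (not `pusha`, `pushf`, or stack adjustments made with `add sp`).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns nonzero if the first token on the line is exactly `mnemonic`. */
static int mnemonic_is(const char *line, const char *mnemonic)
{
    size_t len = strlen(mnemonic);

    while (*line != '\0' && isspace((unsigned char)*line)) {
        line++;                         /* skip leading indentation */
    }
    if (strncmp(line, mnemonic, len) != 0) {
        return 0;
    }
    /* Reject longer mnemonics such as "pusha" or "popf". */
    return line[len] == '\0' || isspace((unsigned char)line[len]);
}

/* Returns (number of push lines) - (number of pop lines).
 * Zero means the routine leaves the stack as it found it. */
int stack_imbalance(const char *const *lines, size_t count)
{
    int depth = 0;

    assert(lines != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (mnemonic_is(lines[i], "push")) {
            depth++;
        } else if (mnemonic_is(lines[i], "pop")) {
            depth--;
        }
    }
    return depth;
}

/* Hypothetical timer ISR, NOT the real one: three pushes, two pops. */
const char *const timer_isr[] = {
    "push ax",
    "push bx",
    "push dx",
    "mov al, 20h",
    "out 20h, al",
    "pop bx",       /* "pop dx" is missing before this line */
    "pop ax",
    "iret",
};
```

Calling `stack_imbalance(timer_isr, sizeof timer_isr / sizeof timer_isr[0])` on this made-up listing returns 1, meaning one register is left on the stack when `iret` runs. A check like this is crude, but it is the kind of low-level test that is worth running before blaming the chip.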
The second story is about another very popular product from the same company. I never worked on that product but a coworker told me about this one. Test equipment tends to have a very long product lifetime. This product had undergone so many upgrades that they decided that it was time for a major software rewrite. The project went well until system testing. It seemed that the units in the lab worked just fine but when they were buttoned up with the covers on, the software crashed. How could the presence of the cover affect the software? It seemed like a classic hardware problem. Perhaps it was a problem with noise or temperature. The mystery was that the previous software release worked just fine, with or without the covers on. Debugging this was a nightmare. You couldn’t hook up an ICE with the cover on the unit, so they had to write test code, install it, and put the cover back on. There didn’t seem to be any rhyme or reason to the software crashes. Meanwhile they were also pursuing another seemingly unrelated problem. If two of the pins on the RS-232 interface were shorted together (like RTS to CTS), the software would crash as soon as the test set was powered up. It turned out that in the startup code in the new version, someone enabled interrupts before the rest of the hardware was initialized. Shorting the RS-232 pins together or putting the cover on the unit added just enough noise on a floating interrupt line to cause it to trigger an interrupt. Since the interrupt vector table had not been initialized at this point, the interrupt vectored to an invalid address and crashed the system. Yep, another classic hardware problem that turned out to be software.
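The ordering problem in that startup code can be shown with a small host-side simulation. This is a sketch under stated assumptions, not the product's actual code: `cpu_t`, `raise_interrupt`, `startup_buggy`, and `startup_fixed` are all invented names, the "vector table" is just an array of function pointers, and a noise pulse that arrives while interrupts are disabled is simply dropped (a real interrupt controller may latch it and deliver it later).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_VECTORS 8

typedef void (*isr_t)(void);
typedef enum { IRQ_IGNORED, IRQ_HANDLED, IRQ_CRASH } irq_result_t;

/* Hypothetical model of the CPU state that matters here. */
typedef struct {
    isr_t vectors[NUM_VECTORS];
    bool interrupts_enabled;
} cpu_t;

unsigned spurious_count = 0;

/* Catch-all handler: count the unexpected interrupt and return. */
static void default_isr(void) { spurious_count++; }

static void cpu_reset(cpu_t *cpu)
{
    assert(cpu != NULL);
    for (size_t i = 0; i < NUM_VECTORS; i++) {
        cpu->vectors[i] = NULL;         /* table is garbage after reset */
    }
    cpu->interrupts_enabled = false;
}

static void install_default_handlers(cpu_t *cpu)
{
    assert(cpu != NULL);
    for (size_t i = 0; i < NUM_VECTORS; i++) {
        cpu->vectors[i] = default_isr;
    }
}

/* Simulates a (possibly noise-induced) interrupt request. */
irq_result_t raise_interrupt(cpu_t *cpu, size_t irq)
{
    assert(cpu != NULL);
    if (!cpu->interrupts_enabled || irq >= NUM_VECTORS) {
        return IRQ_IGNORED;
    }
    if (cpu->vectors[irq] == NULL) {
        return IRQ_CRASH;               /* would jump to an invalid address */
    }
    cpu->vectors[irq]();
    return IRQ_HANDLED;
}

/* Order described in the story: interrupts on before the table exists. */
irq_result_t startup_buggy(cpu_t *cpu, size_t noisy_irq)
{
    cpu_reset(cpu);
    cpu->interrupts_enabled = true;                     /* too early */
    irq_result_t r = raise_interrupt(cpu, noisy_irq);   /* noise during init */
    install_default_handlers(cpu);
    return r;
}

/* Safer order: fill every vector first, enable interrupts last. */
irq_result_t startup_fixed(cpu_t *cpu, size_t noisy_irq)
{
    cpu_reset(cpu);
    install_default_handlers(cpu);
    irq_result_t r = raise_interrupt(cpu, noisy_irq);   /* still disabled */
    cpu->interrupts_enabled = true;
    return r;
}
```

In this model, `startup_buggy` reports `IRQ_CRASH` when noise arrives during initialization, while `startup_fixed` reports `IRQ_IGNORED`, and any later stray interrupt lands in `default_isr` instead of at an invalid address. Installing a catch-all handler on every vector, including the ones you think are unused, is what makes a floating interrupt line harmless.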
So what should you take away from these stories? Never assume the problem is hardware. If it seems like it is hardware, work with the hardware engineer to solve the problem. Don’t just point your finger. Use a ’scope or a logic analyzer if you know how, and if you don’t, get an EE to drive for you. Some very bizarre software bugs can disguise themselves as hardware problems.