December 20, 2022

BSD TCP/IP for Kyu - It just hangs!

I have a python script that connects to port 114 on the board running Kyu and then requests a "big" data transfer. I am testing with a 200,000 byte transfer (which takes about 1 second). The script repeats the request over and over as fast as possible. After a lot of transfer, everything stops, the Kyu console is unresponsive, and the only recovery is to hit the reset button. Two recent test hung after the following number of transfers:
4654 tests (200K * 4654 = 930M)
1234 tests (200K * 1234 = 247M)

What can we observe?

As an experiment I changed the size of the default thread stack from 4096 to 16384 bytes. This value is in kyu.h and it is necessary to do a "make clean ; make" to rebuild Kyu after such a change. This did not solve the problem, the test hung after 1224 transfers. Looking at wireshark (which I have continually recording all traffic to the ethernet address of my test system) I see that the hang happened in mid transfer. I now have generate the data for the transfer so that the start and end blocks are clearly evident (they are filled with the strings "START" or "END--" repeated to fill a 500 byte block. Blocks in the middle are filled with "PK168" or some such, where the number counts which block of the transfer it is. This makes it sort of easy to diagnose where in the game we are when things do hang.

I do writes of 2000 bytes in Kyu, but wireshark tells me that all of the traffic is packets with 500 byte payloads, so I have taken care that the "data marking" is done with this quantization in mind.

Could we just ignore this?

It typically takes about 30 minutes of full tilt data transfer to see this bug. So, if we have done 1200 transfers, each has 200K of data, so 100 writes of 2K, which become 400 network packets. So 1200 * 400 = 480,000 packets outgoing. Certainly an infrequent event, whatever it is.

A person might ask whether we are just rolling the dice for something like a race condition or whether we have something like a memory leak. It is hard to say with just a few data points on how frequent it is. But at any event, it would be sloppy work without excuse to ignore this.

How do we debug this?

Since this seems to happen at unpredictable times, and once we are hung we have no options to learn more, our debugging becomes difficult. Here are some ideas:

Perform other tests using the network, find out what does or does not cause the system to hang. Try transfers with UDP (which has nothing to do with the new Kyu code). Think of diagnostics that will keep Kyu busy for days on end without having anything to do with the network. Try back and forth TCP testing with short packets (which is a common actual use case).

Get this running on other hardware (the BBB or even the Zynq) and see if the problem reveals itself in a different way.

Find out if Kyu is responding to interrupts when hung. We could use a GPIO interrupt (i.e. a button) or a serial interrupt. Currently kyu sits polling the uart when it is idle. If we implement a serial console, Kyu will sit in the idle loop waiting for interrupts.

How does Kyu act if we go into a useless loop in some thread running at higher priority than the console? We would expect the console to just "go away" as it will never get scheduled. Timer interrupts should still run. We could have a thread running at a higher priority than our "hog" thread and perhaps do something like "say hi" every 60 seconds. If this works, we could run such a thing while we do a long network test and see if during a hang we are still getting timer interrupts and scheduling the "hi" thread. Actually, we might as well avoid the complication of the "hi" thread and just send a greeting directly from the interrupt code, although both would be instructive.

Try something -- and it works!!

So I decided to try one of the easiest tests. I ran a thread at priority 10 that simply prints "Alive and kicking" every minute. Then I started the usual server on port 114, ran my Python script, and found something to do. After 1780 tests, sure enough it locked up. But to my amazement, the "alive and kicking" messages continue! This is very good news. The processor has not gone into the weeds, but somehow the TCP thread (or some related network thread) has found a way to be a hog and not let anything else run. For review, here is the scheme with thread priorities:
  Thread:       name (  &tp   )    state      sp     pri
  Thread:        net (400a75a8)     SEM J 40578000   10
  Thread:        dog (400a7---)     SEM J 4057----   10
  Thread:   net_slow (400a74b8)  REPEAT C 4057c000   11
  Thread:    tcp-bsd (400a7788)     SEM J 40570000   12
  Thread:  tcp-timer (400a7698)  REPEAT C 40574000   13
  Thread: wangdoodle (400a72d8)     SEM J 40584000   30
* Thread:      shell (400a7968)   READY I 40568000   40
  Thread:       idle (400a7878)   READY C 4056c000 1234
  Thread  Cur : (400a7968)
The "dog" thread at priority 10 is what sends out the "alive and kicking" message. Our test server ("wangdoodle") at priority 30 is what doles out data in large quantities to our python script. The Kyu shell runs at priority 40.

As currently coded, the shell must be the lowest priority thread in the mix because it polls the uart endlessly waiting for characters to be typed. So, if wangdoodle finds some way to go into an infinite loop (ar any of the other tcp and network related threads), then the shell will be starved and the system will appear to be hung.

The good news is that nothing appears to be wrong with Kyu itself. What is tempting to do at this point is to recode the getchar routine that the shell uses so that it is interrupt driven. Then it could be run at a higher priority (and block on some semaphore related to characters being typed). This would allow the shell to be run at a higher priority than any network threads and remain responsive.

What I do next is to overhaul the orange pi serial driver so it can use interrupts. I have to do a bit of special work so that the console can come up initially using polling, then switch to using interrupts when Kyu is fully initialized. This is because I need Kyu semaphores for the driver, but it is also a good thing in general to be robust during initialization.

Once this is done, I can run the shell at priority 11 without it hogging all the CPU resources to do polling, and I launch the TCP test. Now if and when it hangs I get get some information.
The threads and priorities now look like this:

  Thread:       name (  &tp   )    state      sp     pri
* Thread:      shell (400a7984)   READY J 40568000   11
  Thread:        net (400a75c4)     SEM J 40578000   12
  Thread:   net_slow (400a74d4)  REPEAT C 4057c000   13
  Thread:    tcp-bsd (400a77a4)     SEM J 40570000   14
  Thread:  tcp-timer (400a76b4)   READY C 40574000   15
  Thread: wangdoodle (400a7204)     SEM J 40588000   30
  Thread:   wang_thr (400a73e4)   READY I 40580000   31
  Thread:       idle (400a7894)   READY C 4056c000 1234

What about JTAG

Well, we may not need this after all, but ...

I make this a topic of its own because I expect it to be a substantial can of works.

First of all, we need to get JTAG hardware connected to our board. It is not entirely clear how to do this with the Nanopi Neo board I am now working with (unless it can be done by repurposing GPIO pins, it simply isn't possible). A bigger Orange Pi board might do better, but even then it may or may not be possible.

Both the BBB and my Zynq based EBAZ boards have clearly labeled and designated connectors for JTAG. On one hand, the BBB uses an odd little surface mount connector. I have the connector and spent good money for it, expecting just such a day like this. On the other hand, the EBAZ has a nice connector ready to go with standard pins on 0.1 inch spacing.

Shifting my work to the EBAZ would have another benefit. I have not done a Kyu port to the Zynq, but intend to do so. It would provide a bit of a break from all this work on TCP, but the biggest part of the job would be writing an ethernet driver. This would be new hardware (albeit still a 32 bit ARM) which could expose the hang in a different way. I would like to learn how to use JTAG with the EBAZ for other reasons (like injecting a Kyu image via JTAG into NAND flash on board). This is tempting.


Have any comments? Questions? Drop me a line!

Kyu / tom@mmto.org