December 29, 2022

BSD TCP/IP for Kyu - The Kyu hang revisited

I ended up setting this aside over the Christmas holidays. Then I resumed, and instead of tackling this bug, I attacked (and fixed) the unexpected shutdown bug (which was a bug in my port of the BSD code). Now I am ready to focus again on this bug, but I need to refresh my mind on what the problem is, what I know, and what I have done thus far.

I reproduce a hang by rebooting Kyu, starting the "wangdoodle" server on port 114 using the "t 5" command and then running the python "try" script. Here is what the threads look like after starting the server, but before starting the try script:

  Thread:       name (  &tp   )    state      pc       sp     pri
* Thread:      shell (400a7998)   READY I 4000460c 40568000   11
  Thread:        net (400a75d8)     SEM I 400108fc 40578000   12
  Thread:   net_slow (400a74e8)  REPEAT C 40015a64 4057c000   13
  Thread:    tcp-bsd (400a77b8)     SEM J 40010494 40570000   14
  Thread:  tcp-timer (400a76c8)   READY I 400108f8 40574000   15
  Thread: wangdoodle (400a7308)     SEM I 400108f8 40584000   30
  Thread:       idle (400a78a8)   READY C 40010950 4056c000 1234
  Thread  Cur : (400a7998)

After only 244 transfers, we are hung. An odd thing to be pleased with. I know we are hung because the python script is stuck waiting for a transfer to complete. Because I now run the shell at priority 11, it is not being starved by the odd loop that "wang_thr" is in, and I am able to examine the system:

Kyu, ready> l
  Thread:       name (  &tp   )    state      pc       sp     pri
* Thread:      shell (400a7998)   READY J 40010494 40568000   11
  Thread:        net (400a75d8)     SEM J 40010494 40578000   12
  Thread:   net_slow (400a74e8)  REPEAT C 40015a64 4057c000   13
  Thread:    tcp-bsd (400a77b8)     SEM I 400108f8 40570000   14
  Thread:  tcp-timer (400a76c8)   READY C 40023c34 40574000   15
  Thread: wangdoodle (400a7308)     SEM J 40010494 40584000   30
  Thread:   wang_thr (400a73f8)   READY I 400104f0 40580000   31
  Thread:       idle (400a78a8)   READY C 40010950 4056c000 1234
I can investigate what the "wang_thr" is doing via the "u" command:
Kyu, ready> u wang_thr
Thread wang_thr, dump (-)
  Thread:   wang_thr (400a73f8)   READY I 400104f0 40580000   31
 state:   READY
 stack: 40580000
 mode: INT
CPU: 0
pc : [<400104f0>]	   lr : [<40010494>]
sp : 40583ea0  ip : 00000001	 fp : 40583ebc
r10: 00000002  r9 : 5bf42ee8	 r8 : 5ffb4b70
r7 : 5bf452a0  r6 : 40000000	 r5 : 00000002  r4 : 4005c100
r3 : 40037d64  r2 : 00000000	 r1 : 00000000  r0 : 00000000
cpsr: 60000013  Flags: nZCv  IRQs on  FIQs on  Mode SVC_32
PC = 400104f0 ( change_thread+1f0 )
Called from thr_unblock+178 -- 4000fd44
Called from sem_unblock+70 -- 40010c0c
Called from user_unlock+1c -- 40023a28
Called from tcp_send+88 -- 4002ee5c
Called from big_wangdoodle+88 -- 40032190
Called from one_wangdoodle+fc -- 400322e8
Called from run_wangdoodle+78 -- 40032388
Called from thr_exit -- 4000f7a0
Remember that the "wang_thr" is launched anew for every connection. Aso note that the traceback ends with thr_exit(). This is true for every thread (and not particularly important here). It is because Kyu arranges things so that if a thread ever returns, it will exit.

A side note

I am already reminded of a couple of enhancements I would like to make to the above display. First of all, the last line that gives the value of the "cur_thread" pointer is sort of redundant given the "*" marker on the left side that shows the current thread. More importantly, this marker and value are all but useless given that the shell is always the current thread when we print this display. We would like to somehow show what thread was running before the shell preempted it. This would require having a "prior_thread" pointer and it should be a simple matter to keep it up to date in the routine change_thread().

I will note that developing improved debugging tools is most definitely worth the time and effort. So, I will take the time right now to do this. This involves making changes in thread.c, particularly to thr_show().

This was a simple change, but it does make the "l" command more interesting. Before I start the test, I see the prior thread is always tcp-timer, but when the test is running, I see alternately "net" and "wang_thr" (mostly wang_thr) and occasionally "tcp-bsd". This all seems as it should be. When things are running happily, I see the following (which is just a random sample while the test is running).:

Kyu, ready> l
  Thread:       name (  &tp   )    state      pc       sp     pri
  Thread:      shell (400a799c)   READY J 400104ac 40568000   11
  Thread:        net (400a75dc)     SEM J 400104ac 40578000   12
  Thread:   net_slow (400a74ec)  REPEAT C 40015a94 4057c000   13
  Thread:    tcp-bsd (400a77bc)     SEM J 400104ac 40570000   14
  Thread:  tcp-timer (400a76cc)  REPEAT C 40023c64 40574000   15
  Thread: wangdoodle (400a730c)     SEM J 400104ac 40584000   30
* Thread:   wang_thr (400a73fc)   READY I 40006218 40580000   31
  Thread:       idle (400a78ac)   READY C 40010980 4056c000 1234

An important rule to follow (and I just broke it) is to not make code changes while the long test is running. The reason is that if I start investigating code addresses in detail, I want to be able to use the disassembly in "kyu.dump". So I restarted the test with the new build and will restrain my fingers in the meanwhile.

This time I get a hang after only 99 transfers and the thread display is as follows:

Kyu, ready> l
  Thread:       name (  &tp   )    state      pc       sp     pri
  Thread:      shell (400a799c)   READY J 40010464 40568000   11
  Thread:        net (400a75dc)     SEM J 40010464 40578000   12
  Thread:   net_slow (400a74ec)  REPEAT C 40015a4c 4057c000   13
* Thread:    tcp-bsd (400a77bc)     SEM I 400108e4 40570000   14
  Thread:  tcp-timer (400a76cc)   READY C 40023c1c 40574000   15
  Thread: wangdoodle (400a730c)     SEM J 40010464 40584000   30
  Thread:   wang_thr (400a73fc)   READY I 400104c8 40580000   31
  Thread:       idle (400a78ac)   READY C 40010938 4056c000 1234
What I find interesting and surprising here is that the "prior" thread pointer is now pointing to "tcp-bsd". Previously I didn't have that information and looked at the above listing and saw the "wang_thr" as the only ready thread. This list has other surprises as well, namely that "tcp_timer" is also ready to run. The above is just one random sample. Another gives the following:
Kyu, ready> l
  Thread:       name (  &tp   )    state      pc       sp     pri
  Thread:      shell (400a799c)   READY J 40010464 40568000   11
  Thread:        net (400a75dc)     SEM J 40010464 40578000   12
  Thread:   net_slow (400a74ec)  REPEAT C 40015a4c 4057c000   13
* Thread:    tcp-bsd (400a77bc)     SEM I 400108e4 40570000   14
  Thread:  tcp-timer (400a76cc)  REPEAT C 40023c1c 40574000   15
  Thread: wangdoodle (400a730c)     SEM J 40010464 40584000   30
  Thread:   wang_thr (400a73fc)   READY I 400104c8 40580000   31
  Thread:       idle (400a78ac)   READY C 40010938 4056c000 1234
Here we see "tcp-timer" has changed state to REPEAT, which indicates that it does run or something is happening.

Here is what "t 1" shows us with respect to TCP data structures:

INPCB: 40168780     Listen -- local, foreign: 0.0.0.0 111 .. 0.0.0.0 0
INPCB: 40168348          5 -- local, foreign: 192.168.0.73 114 .. 192.168.0.5 34346
INPCB: 401680a8     Listen -- local, foreign: 0.0.0.0 114 .. 0.0.0.0 0

Socket: 40168018 ACTIVE pcb = 401680a8, state = 4000
Socket: 401682b8 ACTIVE pcb = 40168348, state = 4022 CONNECTED CANTRCV
Socket: 401686f0 ACTIVE pcb = 40168780, state = 4000
Socket: 40168880 -inactive- pcb = 00000000, state = 0000
Status 5 is "close wait", also note the CANTRCV bit set in the socket, but this has been sitting for a while and perhaps the linux side has timed out and taken some action to close the connection. It is time to get a "fresh hang" to see. And indeed, a fresh hang shows the same thing, hung in close-wait and CANTRCV. This time the thread display looks like this:
  Thread:       name (  &tp   )    state      pc       sp     pri
  Thread:      shell (400a799c)   READY J 40010464 40568000   11
  Thread:        net (400a75dc)     SEM J 40010464 40578000   12
  Thread:   net_slow (400a74ec)  REPEAT C 40015a4c 4057c000   13
  Thread:    tcp-bsd (400a77bc)     SEM J 40010464 40570000   14
* Thread:  tcp-timer (400a76cc)   READY I 4000f7bc 40574000   15
  Thread: wangdoodle (400a730c)     SEM J 40010464 40584000   30
  Thread:   wang_thr (400a73fc)    DEAD I 400117a4 40580000   31
  Thread:   wang_thr (400a721c)   READY I 400104d4 40588000   31
  Thread:       idle (400a78ac)   READY C 40010938 4056c000 1234
The odd thing here is the wang_thr marked "DEAD", I have not seen that before. A thread gets marked "DEAD" in thr_exit() and should just get cleaned up and removed. Why didn't this one?

Just how hung are we?

I am ready to toss out most of what I thought I knew and start fresh.

I try "t 7" which makes a client connection from Kyu to a time server on my linux machine.
This works.

I try "t 2" which starts a server on port 111, then I use telnet to connect to it. It responds (as it should) with "Dead men tell no tales", as it should.
This works.

In other words, the bulk of Kyu (scheduling, semaphores, timers) is working just fine. And in fact the bulk of TCP code seems to be working just fine.

I use Ctrl-C to kill the hung python script, then run it again -- and it works! There is a hung connection handling thread, but that doesn't get in the way of a new connection being made and transactions being handled.

More debug information

I add some code to the "wangdoodle" test server to post information in a couple of static variables. I also add a "t 9" to show the values of those variables. This will let me find out just where the hung server is running. You might think I could examine the value of the PC register, and maybe you are right. But with this extra information available, I run the test again. I get a hang after 650 transfers. No "DEAD" thread this time. The "t 1" and "t 9" commands show:
Kyu, ready> t 1
Kyu, ready> INPCB: 40168348      Close-Wait -- local, foreign: 192.168.0.73 114 .. 192.168.0.5 58760
INPCB: 401680a8          Listen -- local, foreign: 0.0.0.0 114 .. 0.0.0.0 0
locker count: 0
Input thread: idle
Timer thread: idle
User lock: busy
  mbuf: alloc =  1, free =  12, max = 13
mbufcl: alloc =  0, free = 256, max =  9
  sock: alloc =  2, free =   9, max = 11
 inpcb: alloc =  2, free =   0, max =  2
Socket: 40168018 ACTIVE pcb = 401680a8, state = 4000
Socket: 401682b8 ACTIVE pcb = 40168348, state = 4022 CONNECTED CANTRCV
t 9
Kyu, ready> Wang state, count = 12, 44
I now have wireshark running, and there is nothing remarkable. The Kyu side has sent data packet "PK167", and the linux side has sent the expected ACK. The info from the "t 9" command indicates that the test is in the midst of the transfer, having sent piece 44 of 98. Nothing special about that, and it corresponds to the markings in the wireshark packet. The pieces are 2000 bytes in size, but for whatever reason they go out in 500 byte pieces. 44*4 is 176, which is in the ballpark of 167.

The big question is why the connection is marked as "CLOSE_WAIT" as well as CANTRCV. It is as though the tcp protocol thinks that the linux side has sent a FIN, but it hasn't.

Using "u wang_thr" to see what the thread is doing, I see:

PC = 40010498 ( change_thread+1c8 )
Called from thr_unblock+178 -- 4000fd14
Called from sem_unblock+70 -- 40010bf4
Called from user_unlock+1c -- 40023a10
Called from tcp_send+88 -- 4002ee44
This is exactly where I always see it (stuck in thr_unblock()) which itself is odd. This would be where it is releasing the lock. This is the last line of code in tcp_send and should never get stuck.

So which came first? The transition to close/wait or the hang? Oddly enough, I see the socket in this state all of the time, even when the test is running as it should. However, when I make a telnet connection and peek at it, I see:

Kyu, ready> t 1
INPCB: 40168348      Establshed -- local, foreign: 192.168.0.73 114 .. 192.168.0.5 43324
INPCB: 40168130          Listen -- local, foreign: 0.0.0.0 114 .. 0.0.0.0 0
Socket: 401680a0 ACTIVE pcb = 40168130, state = 4000
Socket: 401682b8 ACTIVE pcb = 40168348, state = 4002 CONNECTED

The above is exactly what one would expect. So why do I always see the "half closed" thing when I peek at it while doing a big transfer?

Is the thread just being starved somehow? It doesn't look like it with no thread at higher priority marked as READY, but an experiment would confirm this. I can run some other thread at a priority just above the thread in question and find a way to monitor it. I do this and that thread (at priority 32) runs just fine Oddly enough, I see wang_thr in a different state now:

PC = 400108e4 ( resched+288 )
Called from thr_block+7c -- 4000fa40
Called from sem_block_cpu+4c -- 40010e10
Called from sem_block+24 -- 40010e3c
Called from sbwait+38 -- 40030f78
Called from sosend+310 -- 4002f164
Called from tcp_send+80 -- 4002ee3c
Called from big_wangdoodle+d4 -- 40032218
The above was a "one time" event, but it has me wondering. Do I ever block in sosend? And do I handle it properly when I do. I have it print a message when it does block and then run the test again. It hums along for 50 or so tests without ever indicating that it needed to block in sosend(). In fact it runs all the way to a lockup without ever indicating that it needs to block.

I find this surprising, but maybe it really is fast enough to clear the buffer and keep things moving. The socket buffers themselves are only 8K (8*1024) bytes in size. What provides flow control for all of this?

I added a switch to let me examine what change_thread was doing after a lockup. It looks like this:

Switch wang_thr --> tcp-bsd
Switch tcp-bsd --> tcp-timer
Switch tcp-timer --> wang_thr
Switch wang_thr --> tcp-bsd
Switch tcp-bsd --> tcp-timer
Switch tcp-timer --> wang_thr

Now I am focusing on the business of sosend() calling sbwait(). This seems to rarely happen and would be the occasion where tcp_send would block. I added a printf message when it does decide to block, then printout of which connection is currently being processed on the Kyu side. Just having the message that it blocked is not enough, I need to know that it blocked on the latest connection and that led to the hang.

The fact that I always see the connection in close-wait makes it look like the entire 200K of data has been processed and buffered and that we have shut down the protocol from our side and are waiting for buffered data to be processed. I can't see how this could possibly be the case, but that is the appearance things have.

It actually looks like there are two situations that can yield a hang. One is the hang in sbwait() and in thr_block(). This looks we we decided to block due to lack of space in the socket buffer and never get awakened.

The second situation is the one where we are in user_unlock() and thr_unblock(). Here we did not choose to block and it is unclear what is going on. This is where we see this cycle endlessly repeating:

Switch wang_thr --> tcp-bsd
Switch tcp-bsd --> tcp-timer
Switch tcp-timer --> wang_thr
What is particularly unclear is what "wang_thr" is doing when control returns to it. Some preliminary study of this made me think we had a Kyu bug of some kind -- and maybe we do.

What is going on inside thr_unblock() ?

Whenever I type the "l" command during a hang, I see the PC stuck at 400104c8. The command "u wang_thr" shows the same:
PC = 400104c8 ( change_thread+1f8 )
Called from thr_unblock+178 -- 4000fd14
Called from sem_unblock+70 -- 40010c0c
The code here is:
400104c4:       e3a02000        mov     r2, #0
400104c8:       e58320ec        str     r2, [r3, #236]  ; 0xec
400104cc:       e10f0000        mrs     r0, CPSR
400104d0:       e38000c0        orr     r0, r0, #192    ; 0xc0
400104d4:       e129f000        msr     CPSR_fc, r0
In the C code this is:
	/* Cancel any idea of a thread that yielded
         * to us.
         */
        cur_thread->yield = (struct thread *) 0;

        /*
         * Remember, tp->mode tells how the thread should be resumed,
         *  not how it should be saved.
         */

        /* We need to lock before setting cur_thread or
         * we may take an interrupt in the wrong thread
         * structure.
         */
        INT_lock;
        wang_hook ( cur_thread, new_tp );
        prior_thread = cur_thread;      /* new 12-29-2022 */
        cur_thread = new_tp;

With the Kyu console responsive, we can dump memory using a command like:

d 40583e00 32
The count is a count in words.
This is in change_thread and it is performing a switch to "tcp_bsd" and seems to always be doing so.

Sampling Bias

Several things perplex me. Why do I always see this "half connected" state for the connection socket? Why do I always see the stuck "wang_thr" thread at exactly the same PC? Perhaps the shell, even though running at the highest priority, is actually being synchronized in some way that I never expected. Some experiments might serve to confirm this.

My first experiment was to put some delay into the routine that grinds out the "big" data. This made no difference and I still see close-wait.

Interestingly, if I telnet to port 114 and type "big", it works fine (all the data is ASCII and telnet does not seem to mind at all). When I use a "t 1" command I see "Established" for the socket state for this connection. So I am still confused. What is different between using telnet and using netcat? (My Python script is just a supervisor that runs netcat to do all the network IO).

Aha! I know what is going on with netcat. Netcat has sent a FIN, even though lots of data has yet to be received. It sent the 4 characters "big\n" and since that is all it had to send, it concluded that with a FIN. All this is entirely proper. The TCP protocol knows that has happened and when tcp_close() is called, it will shut down the whole game. Note that the flag CANTRCV has nothing at all to do with all the ACK packets that linux needs to send to deal with the data coming its way. As has been pointed out, a TCP connection consists of two independent data streams moving in opposite directions. The CANTRCV applies to data flowing from the linux side to the Kyu side. The ACK packets are part of the stream in the other direction.


Have any comments? Questions? Drop me a line!

Kyu / tom@mmto.org