January 5, 2023

BSD TCP/IP for Kyu - Data aborts - continued

A new day has begun with new debugging features in place. I am no longer getting actual data aborts now since I have added my own checks (such as testing pointers with "valid_ram_address()". Now I get panics with my own debug output.

I made a run and after 1526 exchanges I get a bad checksum in tcp_input().

tcp_input 2, bad cksum = f993
m, tlen, len = 40168668, 12336, 12356
mbuf_show (bad cksum): 40168668
 next:    40168448
 nextpkt: 00000000
 len: 500
 data: 401643f4 (-17032)
 type: 1 (MT_DATA)
 flags: 00000001 EXT
40168668  48 84 16 40 00 00 00 00 f4 01 00 00 f4 43 16 40 
40168678  01 00 01 00 34 00 00 00 00 00 00 00 18 3e 16 40 
40168688  ab ab ab ab 00 08 00 00 ab ab ab ab 45 00 20 00 
40168698  d3 92 00 40 40 06 26 52 c0 a8 00 05 c0 a8 00 49 
401686a8  86 c4 00 72 42 b5 6e b2 10 62 73 4e 80 10 0a 2b 
401686b8  e7 ac 00 00 01 01 08 0a 87 c7 b4 60 00 00 0a d0 
401686c8  62 69 67 0a 87 c7 ab 88 34 30 50 4b ab ab ab ab 
401686d8  ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab 
BSD panic: tcp input bad checksum
Current thread is tcp-input
 SP = 4056df08,  FP = 4056df1c
Called from bsd_panic+24 -- 40023f80
Called from tcp_input+20c -- 40026e68
Called from tcp_bsd_process+194 -- 400248d0
Called from tcp_thread+180 -- 40024644
Called from thr_exit -- 4000f8dc
INPCB: 40168348               8 -- local, foreign: 192.168.0.73 114 .. 192.168.0.5 34500
INPCB: 40168130          Listen -- local, foreign: 0.0.0.0 114 .. 0.0.0.0 0
locker count: 0
Input thread: busy
Timer thread: idle
User lock: busy
  mbuf: alloc =  3, free =  17, max = 20
mbufcl: alloc =  1, free = 255, max = 16
  sock: alloc =  2, free =  16, max = 18
 inpcb: alloc =  2, free =   2, max =  4
Socket: 401680a0   ACTIVE   pcb = 40168130, state = 4000, rcv =     0, snd =     0
Socket: 401682b8   ACTIVE   pcb = 40168348, state = 403b, rcv =     0, snd =  1000 NOFDREF CONNECTED CANTSEND CANTRCV
Socket: 40169848 -inactive- pcb = 00000000, state = 0000, rcv =     0, snd =     0
...
Kyu input queue size: 0
TCP input queue size: 0
Kyu output queue size: 0
Netbuf head: 406b5990
510 netbuf available
510 netbuf on free list
512 netbuf configured
Clock: 1384
PANIC: BSD tcp ----- end
Everything looks basically OK, but the data contained in the input mbuf is corrupted. And indeed the errors recently reported have all involved corrupt data in some mbuf that yield either cksum errors in tcp_input() or a panic in m_copym() that takes place during output. I already have a suspicion, but want to study the above mbuf to see if I can get idea about what is wrong with it.

It has a 500 byte length (exactly the expected payload for an output), but why am I seeing a 500 byte packet on input (that is a major clue in itself). Also look at the value for "len" in the first line -- it is 12356 bytes. This seems highly unlikely for a TCP packet, and does not correspond to the 500 byte value in the mbuf header.

I am not going to spend more time on mbuf details. My suspicion is that code in tcp/kyu_main.c in tcp_bsd_process() needs to be within the "big" tcp lock. In particular the code that allocates mbufs and then copies from a Kyu netbuf into a mbuf chain using mb_devget(). I used to perform this copy without a lock, then wrap just the call to tcp_input() in the lock. It only makes sense that the call to mb_devget() needs to be locked, and so it now is.

Away with another run. Still going after almost 2 hours and 7000 transfers. Watching this with Wireshark, sometimes I see Kyu sending single 500 byte packets, and other times it sends two back to back before getting an ACK. Linux is generally advertising a 64K window. All this keeps Wireshark using 100 percent of the CPU (and I hear the fan running in my linux machine). Every now and then I see a message saying "Send blocked", then almost immediately a "W wakeup" followed by "Send unblocked" -- just as it should be.

It is 9 PM. The test has been going non-stop for 6.4 hours. 25,799 transfers. I make the bold move of killing Wireshark. Nothing changes in the transfer timing. I really don't know how Wireshark kept up with all the traffic, translating packets into ascii and posting lines, but it did, albeit pushing the CPU to 99 percent the whole time. Now with wireshark gone, the linux system is almost quiescent. Each transfer still takes 1.14 seconds.

Wangdoodle connection 73405 .. finished (73405)
Wangdoodle connection 73406 ..Send blocked, 65879 seconds
W wakeup: tcp_input data 1
Send unblocked, 65879 seconds
 finished (73406)
Wangdoodle connection 73407 .. finished (73407)

It is noon the next day. Have gone almost 24 hours without a problem. On rare occasion I see the above (which is entirely proper).

And it is now 3PM the next day. I have gone over 24 hours without a problem. 84,000 transfers in 75379 seconds. One last thing does worry me, but I will deal with that in the next page.


Have any comments? Questions? Drop me a line!

Kyu / tom@mmto.org