I made a run and after 1526 exchanges I get a bad checksum in tcp_input().
tcp_input 2, bad cksum = f993 m, tlen, len = 40168668, 12336, 12356 mbuf_show (bad cksum): 40168668 next: 40168448 nextpkt: 00000000 len: 500 data: 401643f4 (-17032) type: 1 (MT_DATA) flags: 00000001 EXT 40168668 48 84 16 40 00 00 00 00 f4 01 00 00 f4 43 16 40 40168678 01 00 01 00 34 00 00 00 00 00 00 00 18 3e 16 40 40168688 ab ab ab ab 00 08 00 00 ab ab ab ab 45 00 20 00 40168698 d3 92 00 40 40 06 26 52 c0 a8 00 05 c0 a8 00 49 401686a8 86 c4 00 72 42 b5 6e b2 10 62 73 4e 80 10 0a 2b 401686b8 e7 ac 00 00 01 01 08 0a 87 c7 b4 60 00 00 0a d0 401686c8 62 69 67 0a 87 c7 ab 88 34 30 50 4b ab ab ab ab 401686d8 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab BSD panic: tcp input bad checksum Current thread is tcp-input SP = 4056df08, FP = 4056df1c Called from bsd_panic+24 -- 40023f80 Called from tcp_input+20c -- 40026e68 Called from tcp_bsd_process+194 -- 400248d0 Called from tcp_thread+180 -- 40024644 Called from thr_exit -- 4000f8dc INPCB: 40168348 8 -- local, foreign: 192.168.0.73 114 .. 192.168.0.5 34500 INPCB: 40168130 Listen -- local, foreign: 0.0.0.0 114 .. 0.0.0.0 0 locker count: 0 Input thread: busy Timer thread: idle User lock: busy mbuf: alloc = 3, free = 17, max = 20 mbufcl: alloc = 1, free = 255, max = 16 sock: alloc = 2, free = 16, max = 18 inpcb: alloc = 2, free = 2, max = 4 Socket: 401680a0 ACTIVE pcb = 40168130, state = 4000, rcv = 0, snd = 0 Socket: 401682b8 ACTIVE pcb = 40168348, state = 403b, rcv = 0, snd = 1000 NOFDREF CONNECTED CANTSEND CANTRCV Socket: 40169848 -inactive- pcb = 00000000, state = 0000, rcv = 0, snd = 0 ... Kyu input queue size: 0 TCP input queue size: 0 Kyu output queue size: 0 Netbuf head: 406b5990 510 netbuf available 510 netbuf on free list 512 netbuf configured Clock: 1384 PANIC: BSD tcp ----- endEverything looks basically OK, but the data contained in the input mbuf is corrupted. And indeed the errors recently reported have all involved corrupt data in some mbuf that yield either cksum errors in tcp_input() or a panic in m_copym() that takes place during output. I already have a suspicion, but want to study the above mbuf to see if I can get idea about what is wrong with it.
It has a 500 byte length (exactly the expected payload for an output), but why am I seeing a 500 byte packet on input (that is a major clue in itself). Also look at the value for "len" in the first line -- it is 12356 bytes. This seems highly unlikely for a TCP packet, and does not correspond to the 500 byte value in the mbuf header.
I am not going to spend more time on mbuf details. My suspicion is that code in tcp/kyu_main.c in tcp_bsd_process() needs to be within the "big" tcp lock. In particular the code that allocates mbufs and then copies from a Kyu netbuf into a mbuf chain using mb_devget(). I used to perform this copy without a lock, then wrap just the call to tcp_input() in the lock. It only makes sense that the call to mb_devget() needs to be locked, and so it now is.
Away with another run. Still going after almost 2 hours and 7000 transfers. Watching this with Wireshark, sometimes I see Kyu sending single 500 byte packets, and other times it sends two back to back before getting an ACK. Linux is generally advertising a 64K window. All this keeps Wireshark using 100 percent of the CPU (and I hear the fan running in my linux machine). Every now and then I see a message saying "Send blocked", then almost immediately a "W wakeup" followed by "Send unblocked" -- just as it should be.
It is 9 PM. The test has been going non-stop for 6.4 hours. 25,799 transfers. I make the bold move of killing Wireshark. Nothing changes in the transfer timing. I really don't know how Wireshark kept up with all the traffic, translating packets into ascii and posting lines, but it did, albeit pushing the CPU to 99 percent the whole time. Now with wireshark gone, the linux system is almost quiescent. Each transfer still takes 1.14 seconds.
Wangdoodle connection 73405 .. finished (73405) Wangdoodle connection 73406 ..Send blocked, 65879 seconds W wakeup: tcp_input data 1 Send unblocked, 65879 seconds finished (73406) Wangdoodle connection 73407 .. finished (73407)
It is noon the next day. Have gone almost 24 hours without a problem. On rare occasion I see the above (which is entirely proper).
And it is now 3PM the next day. I have gone over 24 hours without a problem. 84,000 transfers in 75379 seconds. One last thing does worry me, but I will deal with that in the next page.
Kyu / tom@mmto.org