ncat -v -v -l -p 2022 > levi.test < /dev/null ncat -v -v -l -p 2022 < /dev/null | wc -cOn the Kyu side, a loop like this was used:
while ( total > 0 ) { if ( total < LOTS_CHUNK ) { tcp_send ( so, buf, total ); break; } else { tcp_send ( so, buf, LOTS_CHUNK ); total -= LOTS_CHUNK; buf += LOTS_CHUNK; } }This worked fine with a chunk size of 2048, but failed with a chunk size of 8000. Further experimenting showed that 5000 works, but 6500 does not. Somewhere between 5000 and 6500 bytes exists some kind of threshold.
I expected this to fail in a different way than it did.
I expected that I would ask for 8000 bytes to be written
and that less would be actually be written and I would
need to arrange for tcp_send() to return the actual number
written. However what happens is that I get a panic!
The panic message is:
PANIC: m_copym
I may well do just this, but for now I want to dig a bit further.
Current thread is tcp-lots SP = 4056be70, FP = 4056be84 Called from bsd_panic+14 -- 400235bc Called from m_copym+dc -- 4002ceac Called from tcp_output+8fc -- 40029840 Called from sosend+688 -- 4002eedc Called from tcp_send+70 -- 4002e844 Called from lots_thread+b8 -- 40031e1c Called from thr_exit -- 4000f554 BSD panic: m_copym 3 PANIC: BSD tcpIt is important to note that various amounts of data have been successfully sent. I have even seen one test with the chunk size set to 6500 that worked, while most do not. I am not entirely sure if this is a bug I have caused or an ancient bug (perhaps it should be called limitation) in the BSD code.
Looking at wireshark, I see data payloads that are typically multiples of 100 (such as 300, 400, 100, 500). The value of 500 seems typical. Usually a 48 byte payload starts the transfer for some reason.
READ SUCCESS for EID 1730 [192.168.0.73:114] (500 bytes) READ SUCCESS for EID 1738 [192.168.0.73:114] (1500 bytes) READ SUCCESS for EID 1746 [192.168.0.73:114] (48 bytes)This indeeds totals to 2048 bytes, but wireshark shows only packets with 500 byte payloads.
But never mind the details of packet fragmentation, the real problem here is that I fail sometimes (maybe 1 out of 10 times) with a 2048 byte chunk size. When it does fail, I see:
Current thread is wang_thr SP = 4056ad98, FP = 4056adac Called from bsd_panic+14 -- 400235bc Called from m_copym+dc -- 4002ceac Called from tcp_output+8fc -- 40029840 Called from sosend+688 -- 4002eedc Called from tcp_send+70 -- 4002e844 Called from big_wangdoodle+48 -- 4003164c Called from one_wangdoodle+fc -- 40031790 Called from run_wangdoodle+88 -- 40031840 Called from thr_exit -- 4000f554 BSD panic: m_copym 3 PANIC: BSD tcpI added some code to display how I am handling the mutual exclusion lock and it seems clear that it is not being held when this panic occurs.
So I wrapped my call to sosend() with net_lock and net_unlock and the bug persists.
What I have done is to identify the basic "entries" into the TCP code from the outside world as follows:
* -- tcp_input() * -- timer calls * -- user calls * --- send * --- receive * --- bind * --- accept * --- connect * --- closeI use one mutex semaphore and wrap each of the above calls with it. I also have to take note of places in send, receive, accept, and connect that block waiting and unlock during the wait. In general the wait is for something that will be driven by tcp_input (i.e. packet arrival) and not unlocking and letting tcp_input to run will yield immediate deadlock.
I make all of this conditional on the macro "BIG_LOCKS".
It seems to work.
*** *** *** tcp_input 2, bad cksum = ffffThere are still serious errors. It mostly just locks up. I have to do several transfers of about 100K before it does, but then it locks up.
I think this is fixed. The clue is that the hang always happened when the entire transfer was finished, i.e. during close. And an examination of close revealed that I had a deadlock by not releasing the lock while I waited for the connection to initiate the disconnect. A simple oversight, but it led to hours of debugging.
To aid the debugging, I changed my test to generate recognizable data packets. For whatever reason, even though I do 2000 byte writes, the packets have 500 byte payloads. I don't particularly care, but I took some pains that the final 500 bytes would be recognizable as the end of the data.
I am using a python script to pound away over and over doing 200K transfers.
4654 tests (200K * 4654 = 930M) 1234 tests (200K * 1234 = 247M)Once a hang occurs, Kyu is entirely crashed, the console is unresponsve, so there is no way to have a clue about what is going on.
As an experiment I changed the size of the default thread stack from 4096 to 16384 bytes. This value is in kyu.h and it is necessary to do a "make clean ; make" to rebuild Kyu after such a change.
Kyu / tom@mmto.org