Kyu is "all locked up".
The traceback from the first abort looks like:
data abort in thread tcp-input pc : [<4001ce38>] lr : [<400245d0>] sp : 40573e60 ip : 0000d8a6 fp : 40573e7c r10: 00000002 r9 : 5bf42ee8 r8 : 5ffb4b70 r7 : 5bf452a0 r6 : 40000000 r5 : 00000002 r4 : 4005c100 r3 : 60000000 r2 : 00000050 r1 : 60000001 r0 : 5ffffff6 cpsr: 20000013 Flags: nzCv IRQs on FIQs on Mode SVC_32 40573e60 01c82000 00000001 40163600 5ffffff6 40573e70 4016360b 60000001 40573eac 40573eac PC = 4001ce38 ( memcpy+48 ) Leaf routine Called from tcp_output+116c -- 4002a970 Called from tcp_input+1c54 -- 4002842c Called from tcp_bsd_process+194 -- 4002451c Called from tcp_thread+180 -- 40024258 Called from thr_exit -- 4000f89cThe traceback shows us in memcpy and neither r2 nor r3 look like acceptable pointers. Also, there is no direct call to memcpy in tcp_output. Just above the address given is a call to ip_output which does call memcpy (to copy into a Kyu netbuf).
The assembly code shows a store to [r3] where r3 (from above) is 0x60000000, which would certainly cause a data abort.
The traceback from the second abort looks like:
Send blocked, 9030 seconds *** unexpected emac Rx int status: 40000300 Unusual desc status: 314b504b data abort in thread tcp-timer pc : [<400166dc>] lr : [<4000550c>] sp : 40577f68 ip : 0000001c fp : 40577f74 r10: 00000002 r9 : 5bf42ee8 r8 : 5ffb4b70 r7 : 5bf452a0 r6 : 40000000 r5 : 00000002 r4 : 4005c100 r3 : 314b504b r2 : 65000000 r1 : 503167a8 r0 : 400aba68 cpsr: 200001d3 Flags: nzCv IRQs off FIQs off Mode SVC_32 40577f60 40577f74 400003b4 00003148 314b504b 40577f70 40577f94 4000550c 40577f8c 40000300 40577f80 50313631 00002500 ffc102d9 00000020 40577f90 40577fb4 40005734 40577fa4 00000000 PC = 400166dc ( netbuf_alloc_i+48 ) Called from rx_handler+d4 -- 4000550c Called from emac_handler+bc -- 40005734 Called from do_irq+ac -- 40001164 Called from irq+34 -- 40000434 Called from thr_block_q+58 -- 4000fc60 Called from thr_exit+60 -- 4000f8fc Called from sem_unblock+70 -- 40010e24Notice the messages from the emac driver before we get the data abort. In particular notice the odd value 314b504b which pops up several places.
A look through the code suggests that the netbuf free list has gotten corrupted.
An important point again. Resist every temptation to make code changes while a long test like this is running (or save kyu.dump if you do).
#define BOARD_RAM_SIZE 0 /* Force probing */ #define BOARD_RAM_MAX 0x80000000 /* 2G */ #define BOARD_RAM_START 0x40000000It turns out I have been down this road before and have the following routine in armv7/ram.c --
int valid_ram_address ( unsigned long addr );
So I just need to sprinkle calls to this in sensible places -- such as netbuf_alloc_i, netbuf_alloc, and in ip_output before the call to memcpy. Note that this call in ip_output is using a just allocated netbuf, the suspicion starts to center on the netbuf allocator.
User lock: idle mbuf: alloc = 1, free = 7, max = 8 mbufcl: alloc = 0, free = 256, max = 4 sock: alloc = 1, free = 1, max = 2 inpcb: alloc = 1, free = 1, max = 2 Kyu input queue size: 0 TCP input queue size: 0 Netbuf head: 406b4cb0 512 netbuf available 512 netbuf on free list 512 netbuf configuredNo netbuf leak like I sort of expected. So, we will let it run until it finds trouble. And it doesn't take long. I get:
connection 2078 ..cksum: out of data *** *** *** tcp_input 2, bad cksum = 1015 m, tlen, len = 40168778, 12336, 12356 finished (2078) Wangdoodle connection 2079 ..netbuf_alloc rv, free = 406b4cb0, free PANIC: netbuf_alloc_i -- bad next addressI add some more debug and try again. This leads to an "amusing" debug loop spewing out traceback forever and ever. I see this once I pull the power cord:
PC = 4002b5a8 ( tcp_fasttimo+4c ) Called from tcp_timer_func+6c -- 4002410c Called from thr_exit -- 4000f89c data abort in thread tcp-timer CPU: 0 pc : [<4002b5a8>] lr : [<4002b56c>] sp : 40577fe0 ip : 400240a0 fp : 40577fec r10: 00000002 r9 : 5bf42ee8 r8 : 5ffb4b70 r7 : 5bf452a0 r6 : 40000000 r5 : 00000002 r4 : 4005c100 r3 : abababab r2 : 40168560 r1 : 60000113 r0 : 60000113 cpsr: a0000013 Flags: NzCv IRQs on FIQs on Mode SVC_32 40577fe0 abababab 40168560 40577ffc 4002410c 40577ff0 40577ffc 00000000 00000000 4000f89cI find it sort of odd that I am getting these entirely new sort of things that look like memory corruption. But probably what I ought to do is to buckle down and code up that output queue that I am sure that I need. If nothing else, I can get contention between the tcp-timer thread and the tcp-input thread.
My impression of things is that the path is reentrant all the way to calling net_send() in net/net_main.c. This routine is simple and is the place to set up the output queue. At this time it just calls board_net_send(), which is by no means guaranteed to be reentrant.
So, I coded this up (which was really no big deal). Now the big surprise is that things are worse! Unless I made some stupid coding error (entirely possible) this has somehow exacerbated some timing issue. Now I cannot even go 10 transfers before some kind of panic takes place.
Interestingly, wireshark now shows the Kyu side sending 2 packets back to back
before the linux side sends an ACK. Why it doesn't just send one big packet
is still a mystery (linux advertises a 6418 byte window according to wireshark.)
My guess is that with an output buffer it can quickly send the first packet on
the way and have time to send a second before linux can respond with an ACK.
All that is great, but why do I get "BSD panic: m_copym 2" now?
I do see this in tcp.h:
#define TCP_MSS 512But then why am I seeing 500 bytes instead of 512?
At any rate, this MSS business involves options. And as I remember, I bypassed handling TCP options for one reason or other. Typically (I am reading) a linux system will advertise an MSS of 1460 bytes. "The book" says that almost every system they investigates used a 1024 byte MSS size. So the 500 (or 512) value is something to investigate someday, but a rabbit hole distraction as thing time
Kyu / tom@mmto.org