My current (new) thought is that although it now takes 5 full seconds to transfer a 1M file on the Orange Pi, I really may have the network in 100 Mbit mode. This is what the lights on my network switch say after all. I tried 3 experiments timing the 1M transfer
The new idea is that I have my PHY and emac initialization correct and something else is producing the slow transfer rates. 5 seconds is even long for 10 mbit. I would expect 10 mbit on the BBB to be maybe 1.7 seconds (i.e. 10 times longer than the 100 mbit time). 5 seconds is out of the question regardless.
It seems unlikely to be the TCP code itself, since this is exactly the same on both the BBB and the Orange Pi. I don't know where this big delay is coming from, but it is unique to the Orange Pi.
Wireshark shows that the transfer takes place 500 bytes at a time. With 1M byte, there are 2000 of these "chunks", each consisting of a SYN packet with a 500 byte payload and an ACK packet. 5/2000 = 2.5 milliseconds per "chunk". So we are looking to find out where just over 2 milliseconds is being wasted. Wireshark timings may be helpful.
Now consider the BBB where the 1M transfer happens in 0.17 seconds. Take 0.17/2000 = .085 milliseconds per "chunk".
And consider time on the wire for a 500 byte (2000 bit) message. Clearly 2 milliseconds is crazy.
I wrote some code to use the CCNT hardware counter to timestamp different events as the TCP transfer is taking place.
stcp send tx rx rtcp 0 880 56 3368 39 1979 880 0 0 0 167 880 8 0 0 167 929 55 70 186 5131 880 0 0 0 167 932 8 95 0 168 880 8 0 0 167 895 55 105 179 5124 880 0 0 0 167 940 15 87 0
The times shown are in Mhz ticks since the previous "event". So the "send" column shows the time from the "stcp" event (which is TCP putting the packet on the output queue) to the actual "send" event in the network driver that hands the packet to the hardware. The "tx" event is the interrupt that says that the packet has been sent and hardware is ready for another.
It is alarming that the "send" column shows a 0.88 millisecond delay after
TCP decided to send a packet. This is systematic and significant.
And here are the results from the BBB --
stcp send tx rx rtcp 0 -1470440 0 0 0 8 12 0 0 0 7 12 0 0 0 7 12 71 0 0 34 11 0 0 0 7 12 0 0 0 7 12 0 0 0 7 12 71 538 11 38 11 0 0 0 8 12 0 0 0The differences are quite striking. My attention is particularly captured by the "send" column in the Orange Pi data, but the occasional 5 and 2 millisecond delays in the "stcp" column for the Orange Pi are certainly of concern. The events are:
stcp - TCP code sends a packet send - network driver sends the packet tx - network driver gets interrupt after sending rx - network driver gets receive interrupt rtcp - TCP code gets packet to process
It is highly suggestive that the counts in the "send" column are so close to 1 ms, which is the system timer rate. There is no reason why packets should "stall" in the output queue for 1 clock tick, but they seem to. And they do not seem to on the BBB. Consider that if each "chunk" suffers a 1 ms delay of this sort and there are 2000 "chunks" in a 1M transfer, this delay totals to 2 seconds.
So who are the players who deal with the network output queue? The tcp-input thread running at priority 24, or the user thread at priority 31 making a call to tcp_output (not as likely) is putting the packet on the queue. The net-out thread running at priority 21 should be removing the packet and calling emac_send(). (That is it's sole purpose for existing, so it runs at the high priority of 21). It is hard to see why a 1 second delay should be going on here (and/or why it is not on the BBB).
Here is the thread mix on the BBB (the Orange Pi is exactly the same):
Thread: name ( &tp ) state sem pc sp pri Thread: shell (400a7a20) READY I 40004620 40568000 11 Thread: net-in (400a7650) SEM I net-inq 40011170 40570000 20 Thread: net-out (400a755c) SEM J net-outq 40010cd0 40572000 21 Thread: net-timer (400a7468) REPEAT I 40011170 40574000 22 Thread: tcp-input (400a7838) SEM I tcp-inq 4001116c 4056c000 24 * Thread: tcp-timer (400a7744) READY I 40011170 4056e000 25 Thread: wangdoodle (400a7280) SEM J socket 40010cd0 40578000 31 Thread: idle (400a792c) READY C 400111c4 4056a000 1234I should note here that I am giving thought to getting rid of the net-out thread. The device driver itself could pull packets from the output queue directly and that could be handled in the Tx interrupt routine. This is pretty much what goes on in NetBSD. We certainly need an output queue, but perhaps not a task dedicated to emptying it. Nonetheless, I want to leave things as they are and find out what is different between the BBB and the OrangePi with the current setup.
Here is the result. I am surprised. I expected the delay to appear in the B column and to be something to do with one task signaling another and some other task hogging the processor. What we see though is the 0.888 millisecond delay now fairly evenly split between the "A" and the "send" column.
stcp A B send tx rx rtcp 0 413 9 459 56 3369 39 1979 420 9 451 - - - 167 421 8 451 8 - - 167 496 8 451 56 67 215 5148 413 9 459 - - - 167 479 9 459 8 81 - 168 413 8 458 8 - - 167 460 9 451 56 81 178 5139 413 9 451 - - - 167 476 9 459 15 84 -I'll note that I did do an experiment yesterday to simulate the business of one task signaling another with a semaphore and got exactly the same kind of delays of around 8 or 9 Mhz units. This result confirms that experiment testing the right thing.
We could start spinning theories, or we could start moving the A and B sample points in the code to pin down where the delay is taking place. What I do next is in net_main.c -- I put "A" just before locking interrupts and then adding the packet to the queue, I put "B" where A used to be (after the packet has been added to the queue, interrupts have been unlocked, and just before firing the signal with the semaphore. Now we get this:
thread stcp A B send tx rx rtcp wang_thr - 419 1 459 56 3376 39 wang_thr 1980 413 1 467 - - - wang_thr 168 411 1 467 8 - - wang_thr 167 496 1 467 55 37 180 wang_thr 5124 419 1 459 - - - wang_thr 168 522 1 459 8 45 - wang_thr 168 412 1 459 15 - - wang_thr 168 507 1 467 56 26 179 wang_thr 5131 412 1 467 - - - wang_thr 167 513 1 459 8 54 -Here I have prettied up the output a bit and added a "thread" column. This shows us which thread is actually calling tcp_ouput to send the packet, and it turns out to be the user thread calling "tcp_send(). To my surprise (again) the delay is not between points A and B where we lock interrupts and put the packet on the queue, it is some time before.
Now I move A and B in ip_arp_send() and rerun the above.
thread stcp A B send tx rx rtcp wang_thr - 404 8 461 56 3369 46 wang_thr 1980 403 8 468 - - - wang_thr 168 404 8 468 8 - - wang_thr 167 497 8 460 56 34 178 wang_thr 5125 412 8 461 - - - wang_thr 168 477 8 461 8 82 - wang_thr 167 403 7 469 15 - - wang_thr 168 495 7 469 56 29 179 wang_thr 5131 412 8 461 - - - wang_thr 168 490 8 461 8 67 -We get the above, showing that the delay is prior to calling ip_arp_send(). Now we position A and B in net_ip.c putting A at the very start of ip_send() and B right before the call to ip_arp_send(). We get this:
thread stcp A B send tx rx rtcp wang_thr - 390 14 476 55 3369 39 wang_thr 1979 390 14 476 - - - wang_thr 167 398 14 468 9 - - wang_thr 167 440 14 468 55 78 179 wang_thr 5138 390 14 477 - - - wang_thr 167 481 14 477 8 56 - wang_thr 168 390 14 476 8 - - wang_thr 168 431 14 476 56 79 179 wang_thr 5124 397 14 469 - - - wang_thr 168 464 14 469 8 80 -Now the delay is moving back into TCP code. Now we are in the routine ip_output() where the "stcp" sample point is.
thread stcp A B send tx rx rtcp wang_thr - 5 385 490 55 3370 39 wang_thr 1980 6 414 483 - - - wang_thr 167 6 392 483 8 - - wang_thr 167 5 453 490 55 52 208 wang_thr 5147 6 385 491 - - - wang_thr 167 6 466 483 8 72 - wang_thr 168 5 392 483 8 - - wang_thr 168 6 438 483 56 75 186 wang_thr 5130 6 384 491 - - - wang_thr 167 5 447 490 8 85 -Now we have located the delay between points A and B (within ip_output()). I move A down and B up in the routine so they now surround a loop that copies mbuf data to a Kyu netbuf. I get this, which shows that the delay is indeed in that loop:
thread stcp A B send tx rx rtcp wang_thr - 6 379 496 55 3369 39 wang_thr 1980 6 385 489 - - - wang_thr 167 6 386 489 8 - - wang_thr 167 5 454 488 55 52 186 wang_thr 5132 6 378 496 - - - wang_thr 167 6 479 497 8 46 - wang_thr 168 6 378 496 8 - - wang_thr 168 6 451 488 56 54 179 wang_thr 5137 6 378 489 - - - wang_thr 167 6 460 497 16 64 -The next experiment gets tricky since I don't know for sure how many mbuf the loop will process, and having multiple calls to A and B will lead to some ambiguity -- but I just go ahead and position A and B around what it really the only interesting call in that loop, which is memcpy().
thread stcp A B send tx rx rtcp wang_thr - 2 338 498 55 3369 39 wang_thr 1980 2 346 490 - - - wang_thr 167 2 345 490 8 - - wang_thr 167 2 433 490 55 32 179 wang_thr 5130 3 337 498 - - - wang_thr 167 3 468 497 8 16 - wang_thr 168 3 345 489 8 - - wang_thr 167 3 423 489 56 42 178 wang_thr 5137 3 337 497 - - - wang_thr 168 3 462 498 8 22 -Well -- who would have thought? How can memcpy be our culprit?
We will take that story to a whole new page.
Kyu / tom@mmto.org