Thread: name ( &tp ) state sp pri * Thread: shell (400a7984) READY J 40568000 11 Thread: net (400a75c4) SEM J 40578000 12 Thread: net_slow (400a74d4) REPEAT C 4057c000 13 Thread: tcp-bsd (400a77a4) SEM I 40570000 14 Thread: tcp-timer (400a76b4) READY C 40574000 15 Thread: wangdoodle (400a72f4) SEM J 40584000 30 Thread: wang_thr (400a73e4) DEAD I 40580000 31 Thread: wang_thr (400a7204) READY I 40580000 31 Thread: idle (400a7894) READY C 4056c000 1234Notice that I have two threads with the name "wang_thr" and one is marked "DEAD". This should not happen. The DEAD state indicates a thread that is exiting and should be destroyed and removed from the thread list in short order. A quick look at this suggests that a lock is needed during the thr_exit() call. This is sort of cosmetic, but may hint of deeper problems.
In truth, any access to the thread list ought to be locked. When I first wrote Kyu I made it my aim to minimize locking, introducing it only when the need for it was clearly demonstrated (by things not working). A different and more proper approach would be to lock all accesses to shared resources (such as the thread list).
My first pair of tests with the shell remaining useful during a TCP hang also revealed some shortcomings in some of my diagnostic tools, notably the 'u' command in the shell. I am going to fix those, as well as add locking to thr_exit() and then perform more testing.
Note that the kyu "shell" is the function "shell_main()" in the file tests.c. This is launched by user_init() in user.c -- but end users are expected to replace user_init() with a version of their own.
PC = 40010430 ( change_thread+1c8 ) Called from thr_unblock+178 -- 4000fcac Called from sem_unblock+70 -- 40010b74 Called from user_unlock+1c -- 40023990 Called from tcp_send+88 -- 4002edd0 Called from big_wangdoodle+88 -- 40032010 Called from one_wangdoodle+fc -- 40032168 Called from run_wangdoodle+78 -- 40032208 Called from thr_exit -- 4000f72c
An odd thing watching the above multiple times is that I never see the PC value changing, yet this is the READY thread and should be running all the time (except when it gets preempted by the shell when characters arrive. I added a column to the thread list display to show the PC, and tested this by watching a busy thread. Indeed I see the PC value changing as the thread runs, so all of this seems to be correct.
Thread: name ( &tp ) state pc sp pri * Thread: shell (400a7998) READY J 40010494 40568000 11 Thread: net (400a75d8) READY I 40010890 40578000 12 Thread: net_slow (400a74e8) REPEAT C 40015a64 4057c000 13 Thread: tcp-bsd (400a77b8) SEM J 40010494 40570000 14 Thread: tcp-timer (400a76c8) READY C 40023c34 40574000 15 Thread: wangdoodle (400a7308) SEM J 40010494 40584000 30 Thread: wang_thr (400a73f8) READY I 40016378 40580000 31 Thread: idle (400a78a8) READY C 40010950 4056c000 1234So I run a test again. It is critical not to hack and rebuild code while the test is running, because I rely on addresses in the file "kyu.dump" when I try to sort out what PC values mean.
I find it a bit odd that tcp-timer always seems to be READY. This is a thread run on a REPEAT, so I would expect it to be waiting most of the time. The PC address shown (at this time) is the address that a continuation would launch at, not necessarily the address it is currently running at.
The test locks up after 388 transfers. Now we see the tcp-timer thread showing "REPEAT". And when I check it repeatedly it is alternately either READY or REPEAT, which is just right. The unusual thing now is that "wang_thr" always shows the same PC value, namely 40010488. If it was running in some kind of loop, I would expect some changes.
Thread: name ( &tp ) state pc sp pri * Thread: shell (400a7998) READY J 40010494 40568000 11 Thread: net (400a75d8) SEM J 40010494 40578000 12 Thread: net_slow (400a74e8) REPEAT C 40015a64 4057c000 13 Thread: tcp-bsd (400a77b8) SEM I 400108f8 40570000 14 Thread: tcp-timer (400a76c8) REPEAT C 40023c34 40574000 15 Thread: wangdoodle (400a7308) SEM J 40010494 40584000 30 Thread: wang_thr (400a73f8) READY I 40010488 40580000 31 Thread: idle (400a78a8) READY C 40010950 4056c000 1234I use the "u" command to learn all I can about "wang_thr" --
Kyu, ready> u wang_thr Thread wang_thr, dump (-) Thread: wang_thr (400a73f8) READY I 40010488 40580000 31 state: READY stack: 40580000 mode: INT CPU: 0 pc : [<40010488>] lr : [<4000fd44>] sp : 40583ea0 ip : 40032228 fp : 40583ebc r10: 00000002 r9 : 5bf42ee8 r8 : 5ffb4b70 r7 : 5bf452a0 r6 : 40000000 r5 : 00000002 r4 : 4005c100 r3 : 400a73f8 r2 : 00000000 r1 : 00000000 r0 : 400a77b8 cpsr: 60000013 Flags: nZCv IRQs on FIQs on Mode SVC_32 PC = 40010488 ( change_thread+188 ) Called from thr_unblock+178 -- 4000fd44 Called from sem_unblock+70 -- 40010c0c Called from user_unlock+1c -- 40023a28 Called from tcp_send+88 -- 4002ee68 Called from big_wangdoodle+88 -- 400320a8 Called from one_wangdoodle+fc -- 40032200 Called from run_wangdoodle+78 -- 400322a0 Called from thr_exit -- 4000f7a0Here is the code where "wang_thr" is running. This is in the routine "change_thread()" which is in thread.c and is exactly where I have seen this several times before during a hang.
4001047c: e3073c58 movw r3, #31832 ; 0x7c58 40010480: e3443003 movt r3, #16387 ; 0x4003 40010484: e5933000 ldr r3, [r3] 40010488: e2833044 add r3, r3, #68 ; 0x44 4001048c: e1a00003 mov r0, r3 40010490: ebffbf7e bl 40000290There is only one call to "save_j()" in change_thread, which makes it easy to find the C code, which is:40010494: e1a03000 mov r3, r0 40010498: e3530000 cmp r3, #0 4001049c: 0a000008 beq 400104c4
if ( ! (options & RSF_INTER) && (options & RSF_CONT) == 0 ) { cur_thread->mode = JMP; if ( save_j ( &cur_thread->jregs ) ) { if ( thread_debug ) printf ( "Change_thread: emerged from save_j\n" ); return; } }How can this yield a hang or a loop? We are saving context and should then return. Even if the return yielded some kind of loop back to where we came from (i.e. a thread changing to itself) we would expect some jitter in the PC.
The code in tcp_send() is finished with a call to sosend(), is releasing the lock, and will get busy preparing another buffer to send. It may simply be releasing the lock, or (more likely) it is switching to a different thread to handle an incoming packet.
thr_unblock: shell (INT) change_thread: tcp-timer -> shell (via J) 0004 change_thread_resume_j Change_thread: emerged from save_j thr_block(2) for shell resched: current thread: shell SEM (0000) resched chose: tcp-timer READY 15 change_thread: shell -> tcp-timer (via I) 0000 change_thread_resume_i exit of tcp-timer resched: current thread: tcp-timer REPEAT (0002) resched chose: wang_thr READY 31 change_thread: tcp-timer -> wang_thr (via I) 0002 change_thread_resume_i change_thread_resume_i thr_unblock: tcp-timer (INT) change_thread: tcp-bsd -> tcp-timer (via C) 0004 change_thread_resume_c: 400a76c8 exit of tcp-timer resched: current thread: tcp-timer REPEAT (0002) resched chose: wang_thr READY 31 change_thread: tcp-timer -> wang_thr (via I) 0002 change_thread_resume_i change_thread_resume_i thr_unblock: tcp-timer (INT) change_thread: tcp-bsd -> tcp-timer (via C) 0004 change_thread_resume_c: 400a76c8 exit of tcp-timer resched: current thread: tcp-timer REPEAT (0002) resched chose: wang_thr READY 31 change_thread: tcp-timer -> wang_thr (via I) 0002 change_thread_resume_i thr_unblock: net (INT) change_thread: wang_thr -> net (via J) 0004 change_thread_resume_j Change_thread: emerged from save_j thr_unblock: shell (INT) change_thread: net -> shell (via J) 0004 change_thread_resume_j Change_thread: emerged from save_j 0thr_block(2) for shell resched: current thread: shell SEM (0000) resched chose: net READY 12 change_thread: shell -> net (via I) 0000 change_thread_resume_i thr_block(2) for net resched: current thread: net SEM (0000) resched chose: wang_thr READY 31 change_thread: net -> wang_thr (via I) 0000 change_thread_resume_i change_thread_resume_i thr_unblock: tcp-timer (INT) change_thread: tcp-bsd -> tcp-timer (via C) 0004 change_thread_resume_c: 400a76c8 exit of tcp-timer resched: current thread: tcp-timer REPEAT (0002) resched chose: wang_thr READY 31 change_thread: tcp-timer -> wang_thr (via I) 0002 change_thread_resume_i change_thread_resume_i thr_unblock: tcp-timer (INT) change_thread: tcp-bsd -> tcp-timer (via C) 0004 change_thread_resume_c: 400a76c8 exit of tcp-timer resched: current thread: tcp-timer REPEAT (0002) resched chose: wang_thr READY 31 change_thread: tcp-timer -> wang_thr (via I) 0002 change_thread_resume_i change_thread_resume_i thr_unblock: shell (INT) change_thread: tcp-bsd -> shell (via J) 0004 change_thread_resume_j Change_thread: emerged from save_jIt looks as though it is resuming "wang_thr" at priority 31, but nothing seems to happen. And why do we see "change_thread_resume_i" twice after it is selected?
A look at the code and some thought indicates that when we resume wang_thr, it resumes into an interrupted run of change_thread. Now this is a clue! If this is true, it will resume, then try to resume again (ending up in the same spot) ad infinitum. Interesting also is that the resumption point is just before locking interrupts in change_thread as if to prevent such nonsense.
The traceback indicates that all this happened when we were calling thr_unblock (as part of unblocking a locking semaphore). Perhaps locking back in thr_unblock() and staying locked would make sense. And the same also for thr_block().
Kyu / tom@mmto.org