December 12, 2022

BSD TCP/IP for Kyu - Race conditions, SPL, and semaphores

As I give this more consideration (on 1-6-2023), now that this is all working, I see there is another way, other than "big locks", that could have been used to handle this. The solution I settled on is what I call "big locks", which you can read about on another page. The original BSD code used the venerable "spl" set of calls to manipulate the processor interrupt level and block interrupts. One thing about these is that they nest nicely. One critical section within another works just fine, whereas my blocking scheme uses Kyu semaphores and has different semantics. After spending some time trying to just add a single call at every location where I found one of the spl() calls in the BSD code, I gave up -- made all of those no-ops, and added big locks as discussed elsewhere. This works fine, but more fine-grained locking might yield better performance. Yet another project.
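To make the "nests nicely" point concrete, here is a toy model of the spl idiom in plain C. It is only a sketch: the level numbers are made up, nothing here touches real interrupt hardware, and it assumes raise-only semantics (an spl call never lowers the current level). The helper names spl_raise() and nested_sections() are mine, not BSD's.

```c
#include <assert.h>

/* Toy model of BSD-style spl calls: one global "current level".
 * The numeric levels are illustrative, not real VAX or 68020 values. */
enum { SPL_NONE = 0, SPL_NET = 1, SPL_IMP = 2 };

static int cur_level = SPL_NONE;

/* Raise to at least `level` and return the previous level for splx().
 * Raise-only behavior is an assumption of this sketch. */
static int spl_raise(int level)
{
    int old = cur_level;
    if (level > cur_level)
        cur_level = level;
    return old;
}

static int splnet(void) { return spl_raise(SPL_NET); }
static int splimp(void) { return spl_raise(SPL_IMP); }

/* Restore whatever level the matching spl call saved. */
static void splx(int s) { cur_level = s; }

/* One critical section nested inside another. */
static int nested_sections(void)
{
    int outer = splimp();        /* outer section raises to SPL_IMP */
    int inner = splnet();        /* inner call leaves the level alone */
    int level_inside = cur_level;

    splx(inner);                 /* still at SPL_IMP, still protected */
    assert(cur_level == SPL_IMP);
    splx(outer);                 /* back to SPL_NONE */
    return level_inside;
}
```

Because each call saves the old level and splx() restores it, the inner section cannot accidentally unprotect the outer one. A plain mutex semaphore has no such saved state, which is where the different semantics come from.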

The play by play as I study the problem

I just uncovered a race condition when closing a socket. I call soclose(), which triggers network activity (sending a FIN and receiving an ACK). That network activity also decides it needs to close the socket and also calls soclose(). More or less.

I could "fix" the problem by sticking a delay in soclose(), which caused the call made by my application to stall, so the call generated by network activity would complete first. The reason this works is that the Kyu tcp thread (that runs tcp_input() and hence the bulk of the TCP processing) runs at priority 13 in Kyu and my application thread runs at priority 20. But this is hardly a solution. It is just an aid in diagnosing the problem.

The symptom of the problem was an ARM data abort. This was happening because I was deallocating a semaphore structure twice. The first deallocation put it back on the free list (and corrupted it). The second deallocation tried to use corrupted data.
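As an illustration of why a double deallocation is so destructive, here is a toy free list in C. This is not the Kyu semaphore code. The struct and function names are hypothetical, and it shows just one way a second free can damage a list: the node ends up linked to itself.

```c
#include <stddef.h>

/* Hypothetical stand-in for a semaphore structure kept on a free list. */
struct toy_sem {
    struct toy_sem *next;
    int value;
};

static struct toy_sem *free_list = NULL;

/* Push a structure onto the head of the free list. */
static void toy_sem_free(struct toy_sem *sp)
{
    sp->next = free_list;
    free_list = sp;
}

/* Pop a structure off the head of the free list (NULL if empty). */
static struct toy_sem *toy_sem_alloc(void)
{
    struct toy_sem *sp = free_list;
    if (sp != NULL)
        free_list = sp->next;
    return sp;
}

/* Free the same structure twice, then allocate twice.
 * Returns 1 if both allocations hand back the same structure. */
int double_free_aliases(void)
{
    static struct toy_sem a;
    struct toy_sem *first, *second;

    free_list = NULL;
    toy_sem_free(&a);
    toy_sem_free(&a);            /* a.next now points at a itself */

    first = toy_sem_alloc();
    second = toy_sem_alloc();
    return first == second;
}
```

In this toy, two later allocations receive the same structure, so two unrelated users would scribble on each other. In a real kernel the damage shows up wherever the corrupted pointer is next followed.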

One of course asks, "why is the proven BSD code suddenly plagued with race conditions?". The answer is that I stubbed out the calls to splnet, splimp, and splx that fiddle with processor priorities and that BSD used to lock critical sections. The question becomes how to lock critical sections in Kyu for this code (semaphores of course). Actually manipulating ARM processor interrupt priorities in a way that mimics what a VAX used to do is nothing I want anything to do with. It is worth noting that for BSD4.4 on the MC68020 processor, that is exactly what they did.

What I am doing right now (and what has fixed this specific bug) is to replace splnet() with a routine I call "net_lock()" and the corresponding splx() with a routine I call "net_unlock()". I add these to soclose() (which did lock a critical section with splnet/splx) and I am half way there.
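A minimal sketch of the idea, using a POSIX mutex as a stand-in for the Kyu mutex semaphore (the Kyu calls themselves are not shown here). The toy_socket structure, its open flag, and toy_soclose() are hypothetical, standing in for whatever state the real soclose() examines inside its critical section.

```c
#include <pthread.h>

/* One "big lock" for the network code.  A POSIX mutex stands in for
 * the Kyu mutex semaphore in this sketch. */
static pthread_mutex_t net_mutex = PTHREAD_MUTEX_INITIALIZER;

void net_lock(void)   { pthread_mutex_lock(&net_mutex); }
void net_unlock(void) { pthread_mutex_unlock(&net_mutex); }

/* Hypothetical, heavily simplified socket state. */
struct toy_socket {
    int open;          /* nonzero while the socket is still live */
    int close_count;   /* how many times teardown actually ran */
};

/* Where BSD had  s = splnet(); ... splx(s);  this sketch has
 * net_lock(); ... net_unlock();  around the same region.
 * Returns 1 if this call performed the teardown, 0 otherwise. */
int toy_soclose(struct toy_socket *so)
{
    int did_close = 0;

    net_lock();
    if (so->open) {
        so->open = 0;
        so->close_count++;     /* teardown happens exactly once */
        did_close = 1;
    }
    net_unlock();
    return did_close;
}
```

With the lock held across the check and the teardown, a second caller arriving from the network side sees the socket already closed instead of tearing it down again.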

The other half is realizing that the routine "tcp_input()" is called by the interrupt code on BSD and just runs at splnet the entire time when performing TCP processing. So I wrap my call to tcp_input() with my net_lock/net_unlock routines (which are implemented by a mutex semaphore) and I am in business. My bug is gone and everything works. Well, probably not everything because there are other critical sections wrapped in splnet/splx that I need to look at.
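The wrapping can be sketched like this. Again a POSIX mutex stands in for the Kyu semaphore, and toy_tcp_input() and tcp_thread_deliver() are hypothetical names for illustration. The net_locked flag is only bookkeeping so the sketch can check that the input routine is never entered without the lock.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t net_mutex = PTHREAD_MUTEX_INITIALIZER;
static int net_locked = 0;      /* only modified while holding net_mutex */

static void net_lock(void)
{
    pthread_mutex_lock(&net_mutex);
    net_locked = 1;
}

static void net_unlock(void)
{
    net_locked = 0;
    pthread_mutex_unlock(&net_mutex);
}

static int packets_processed = 0;

/* Stand-in for tcp_input().  On BSD it runs at splnet for its whole
 * duration, so here it insists the big lock is already held. */
static void toy_tcp_input(int len)
{
    (void)len;                  /* packet contents are not modeled */
    assert(net_locked);
    packets_processed++;
}

/* What the tcp thread does for each packet: take the lock, run the
 * input routine to completion, release the lock. */
void tcp_thread_deliver(int len)
{
    net_lock();
    toy_tcp_input(len);
    net_unlock();
}
```

The point is that the lock is taken by the caller, once per packet, and not sprinkled inside the input routine itself.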

There is another more tricky issue. The BSD code uses two special interrupt priority levels for the network code: splnet and splimp. It uses splimp to really tighten things up and avoid contention even with code running at interrupt level. Most of the TCP code uses splnet, which is a software interrupt level that does not preempt any hardware interrupts. The vital thing is that when the function tcp_input() is called with a new packet to process, it runs at splnet. The only time splimp is used is when pulling packets out of a queue, which has packets loaded into it by interrupt code.

The tricky aspect of this is that if I simply replace both splimp and splnet with calls to net_lock I am likely to see deadlock. In particular, if I am already locked as if at splimp level, and then try to lock again to go to splnet level, I will deadlock.
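The deadlock can be demonstrated safely with a POSIX error-checking mutex, which reports the self-deadlock instead of hanging. This is only an illustration of the semantics. A plain mutex (or, I assume, a plain mutex semaphore) would simply block forever at the second lock. The function name relock_result() is made up for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Take a lock, then try to take it again from the same thread, as
 * would happen if both splimp() and splnet() mapped to one lock.
 * Returns the result of the second lock attempt. */
int relock_result(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int rc;

    /* An error-checking mutex detects relocking by the owner. */
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&m);        /* as if entering "splimp" */
    rc = pthread_mutex_lock(&m);   /* as if entering "splnet" inside it */

    pthread_mutex_unlock(&m);      /* release the one lock we hold */
    pthread_mutex_destroy(&m);
    return rc;
}
```

The second attempt fails with EDEADLK. That is the contrast with spl calls, where raising the level a second time is harmless.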

It turns out, though, that in practice this is not a big concern. This is because I see only one call to splimp in the TCP code and I am not confronted with significant issues eliminating it.

The way to think about this is to consider two things:

In BSD, a "mega" lock is used for all network processing. Also in BSD, the different threads are the top and bottom halves of the kernel. The top half is code run on behalf of a particular process, and the bottom half is code run in response to an event (such as a device interrupt).

Large locks like this can lead to inefficient code. A proper way to do things is to have different locks for every resource where there might be contention. This would require a lot of careful study of the code. For now I am willing to do something that closely emulates what was done by BSD (but without the splimp/splnet distinction) and just get things working.

What I am looking to do is to ensure that there are locks in code that would have been part of the "top half" in the BSD kernel (as indeed there are) and to replace each of them with my semaphore-based locking scheme.

Correct is a first goal. Fast can come later (if ever).


Have any comments? Questions? Drop me a line!

Kyu / tom@mmto.org