May 5, 2026

Allwinner H5 -- fun with CMO

Look here! A new acronym.

CMO popped up in some U-boot sources and is jargon for "cache maintenance operations". We only care about two, namely flush and invalidate. The ARM literature doesn't talk about flush; instead it talks about clean.

ARM aarch64 has both a dc and ic instruction. DC for the data cache and IC for the instruction cache. It turns out you don't get to flush the instruction cache, which makes sense when you think about it. The instruction cache is read-only as instructions are fetched. You may need to invalidate it, in particular when you have loaded new code from disk or the network.

We use two instructions in dealing with the data cache:

	dc      civac, x0       /* clean and invalidate data or unified cache */
	dc      ivac, x0		/* invalidate data or unified cache */
I worried at one point about needing to separately deal with the L2 cache, but these instructions take care of both L1 and L2 all at once. They deal with a cache line (which on our hardware is 64 bytes) and get the address of the line in the x0 register.
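
Since each dc instruction handles a single 64 byte line, covering a whole buffer means walking it one line at a time. The C below is a minimal sketch of that arithmetic, not the code Kyu or U-boot actually uses. The names line_base, line_count and dcache_flush_range are made up for illustration, and the inline assembly part only builds for aarch64 and has to run at a privileged exception level.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64ULL   /* line size quoted in the text for this hardware */

/* Round an address down to the start of its cache line. */
uint64_t line_base(uint64_t addr)
{
    return addr & ~(CACHE_LINE - 1);
}

/* Number of cache lines touched by the byte range [start, start + len). */
uint64_t line_count(uint64_t start, uint64_t len)
{
    if (len == 0)
        return 0;
    return (line_base(start + len - 1) - line_base(start)) / CACHE_LINE + 1;
}

#if defined(__aarch64__)
/* Clean and invalidate every line in the range, then wait for completion. */
void dcache_flush_range(uint64_t start, uint64_t len)
{
    uint64_t addr = line_base(start);
    uint64_t n;

    for (n = line_count(start, len); n > 0; n--, addr += CACHE_LINE)
        __asm__ volatile("dc civac, %0" : : "r"(addr) : "memory");
    __asm__ volatile("dsb sy" : : : "memory");
}
#endif
```

A 64 byte buffer that starts at 0x70000010 straddles two lines, so line_count reports 2 and the loop issues two dc instructions. An invalidate-only version would look the same with "dc ivac" in place of "dc civac".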

Back to the h5 network driver

I became suspicious in dealing with the network driver for Kyu that something was just not right with the cache. Blocks would be read, but then be corrupted at least partially. And the corruption seemed to involve getting overwritten with information from a previous read.

After getting an understanding of the aarch64 MMU, I decided to do an experiment. The MMU setup I get from U-boot uses 4 page table entries, each dealing with a 1G range of virtual addresses. The U-boot map sets up the first 1G for device registers, and the next 3G for ram. The last 2G are useless, as we only have 1G of ram.

What I did was to change the last 2 mappings. I make them uncached and I let each one point to the same physical memory at 0x40000000 as the first ram map from U-boot. This yields the following VA map:

0x00000000 - 0x3fffffff - IO registers uncached.
0x40000000 - 0x7fffffff - ram cached (1G)
0x80000000 - 0xbfffffff - ram uncached (1G)
0xc0000000 - 0xffffffff - ram uncached (1G)
The last mapping was needless -- I could have set the PTE to 0 and made this invalid.
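
With this map, getting from a cached address to its uncached twin is just an offset. Here is a small C sketch of that arithmetic. The helper names are hypothetical, and it assumes the cached window is mapped one to one onto physical ram at 0x40000000, which is how I read the U-boot setup.

```c
#include <assert.h>
#include <stdint.h>

#define RAM_CACHED_BASE   0x40000000ULL  /* cached window, also the physical base */
#define RAM_UNCACHED_BASE 0x80000000ULL  /* first uncached alias */
#define RAM_WINDOW_SIZE   0x40000000ULL  /* each window spans 1G */

/* Uncached alias of an address in the cached ram window. */
uint64_t to_uncached(uint64_t cached_va)
{
    return cached_va - RAM_CACHED_BASE + RAM_UNCACHED_BASE;
}

/* Physical address behind any of the three ram windows. */
uint64_t to_phys(uint64_t va)
{
    return RAM_CACHED_BASE + (va - RAM_CACHED_BASE) % RAM_WINDOW_SIZE;
}

/* True when the address lies in one of the uncached ram windows. */
int is_uncached_ram(uint64_t va)
{
    return va >= RAM_UNCACHED_BASE;
}
```

So to_uncached(0x70000000) gives 0xb0000000, and 0x70000000, 0xb0000000 and 0xf0000000 all come out of to_phys as the same physical address.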

This setup allows me both cached and uncached access to the same physical memory. For no particular reason, I used the 32 bit word at 0x70000000 (and 0xb0000000) for my experiment. Here is what I did:

I wrote the value 0xdeadbeef to both cached and uncached addresses, then examined the result. As you might expect I see:

 Peek at 0000000070000000 = deadbeef
 Peek at 00000000b0000000 = deadbeef
 Peek at 00000000f0000000 = deadbeef

Then I wrote the value 0x0001abcd to an uncached address. This places it into physical memory. Now I see the following. The read from the first address is picking up a stale and incorrect value from the cache, which is what we expect.

 Peek at 0000000070000000 = deadbeef
 Peek at 00000000b0000000 = 0001abcd
 Peek at 00000000f0000000 = 0001abcd

Next I do a write of 0x1111abcd to the cached address. The cache is apparently not write-through, so this value never makes it to physical memory, and we see the following. All this is expected cache behavior.

 Peek at 0000000070000000 = 1111abcd
 Peek at 00000000b0000000 = 0001abcd
 Peek at 00000000f0000000 = 0001abcd
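
The three steps so far are what a plain write-back cache would do, and a toy model in C reproduces them. This is only an illustration with one word of memory and one cache line. It says nothing about how the real cache is organized, and every name in it is invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One word of "physical memory" plus one write-back cache line over it. */
struct toy_cache {
    uint32_t mem;    /* what the uncached aliases (and DMA) see */
    uint32_t line;   /* the cached copy */
    bool valid;      /* line holds a copy of the word */
    bool dirty;      /* copy was modified and not yet written back */
};

/* Write through the cached address: only the cache line changes. */
void write_cached(struct toy_cache *t, uint32_t v)
{
    t->line = v;
    t->valid = true;
    t->dirty = true;
}

/* Write through an uncached address: goes straight to memory. */
void write_uncached(struct toy_cache *t, uint32_t v)
{
    t->mem = v;
}

/* Read through the cached address: fill the line on a miss. */
uint32_t peek_cached(struct toy_cache *t)
{
    if (!t->valid) {
        t->line = t->mem;
        t->valid = true;
        t->dirty = false;
    }
    return t->line;
}

/* Read through an uncached address: always memory. */
uint32_t peek_uncached(const struct toy_cache *t)
{
    return t->mem;
}
```

Replaying the steps against the model (write 0xdeadbeef both ways, write 0x0001abcd uncached, write 0x1111abcd cached) gives the same peeks as the hardware did: the cached view keeps its own value until something forces the two to agree.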

The next thing is to introduce some CMO. To simulate what happens when we want to write to a network buffer, we write to the cached address, then perform a flush (clean and invalidate). Now we see this, which is correct.

 # - before the flush
 Peek at 0000000070000000 = aaaabbbb
 Peek at 00000000f0000000 = eeeeffff
 # - after the flush
 Peek at 0000000070000000 = aaaabbbb
 Peek at 00000000f0000000 = aaaabbbb

Next we want to simulate what happens when we read from a network buffer. In this case DMA will place data into physical memory, an interrupt will tell us data has arrived, and we will want to invalidate before we try to read the data. We write 0xccccdddd via an uncached address, then check what we read from our cached address. The following result is correct.

 # - before the invalidate
 Peek at 0000000070000000 = aaaabbbb
 Peek at 00000000f0000000 = ccccdddd
 # - after the invalidate
 Peek at 0000000070000000 = ccccdddd
 Peek at 00000000f0000000 = ccccdddd

Now we try a twist on the above. We first do a write to our cached address. This leaves data in the cache which is "dirty", meaning the system knows it has not yet been written to physical memory. An invalidate should just discard this data, but that is not what happens.

 # - before the invalidate
 Peek at 0000000070000000 = 0000aaaa
 Peek at 00000000f0000000 = ccccdddd
 # - after the invalidate
 Peek at 0000000070000000 = 0000aaaa
 Peek at 00000000f0000000 = 0000aaaa
We now have our finger on the problem. The invalidate is actually doing a flush and pushes stale data from the cache into physical memory. The case above did not do this because the last thing done to the cached address before we began the test was a flush, so the cache was clean in the eyes of the system.

We can "fix" this by doing an explicit flush before a simulated DMA write to the buffer. As we just said, this is what we did "by accident" in the first write test above that worked.

We could modify our network driver to always do a flush to the buffer before making it available to receive data. This would be a workaround for not having a proper invalidate operation. I would rather dig deeper and try to find out why invalidate is acting like flush.
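
To convince myself the workaround holds together, here is a toy C model of the receive path. It is a sketch under assumptions, not driver code: one word stands in for the buffer, the names are invented, and the flag ivac_cleans switches between an invalidate that discards dirty data and one that behaves like the flush I observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_buf {
    uint32_t mem;       /* physical memory, where DMA writes land */
    uint32_t line;      /* cached copy */
    bool valid, dirty;
    bool ivac_cleans;   /* true: invalidate acts like flush, as observed */
};

void toy_write_cached(struct toy_buf *t, uint32_t v)
{
    t->line = v;
    t->valid = true;
    t->dirty = true;
}

uint32_t toy_read_cached(struct toy_buf *t)
{
    if (!t->valid) {            /* miss: refill from memory */
        t->line = t->mem;
        t->valid = true;
        t->dirty = false;
    }
    return t->line;
}

/* Flush = clean (write back if dirty), then invalidate. */
void toy_flush(struct toy_buf *t)
{
    if (t->valid && t->dirty)
        t->mem = t->line;
    t->valid = false;
    t->dirty = false;
}

void toy_invalidate(struct toy_buf *t)
{
    if (t->ivac_cleans) {       /* the behaviour seen in the experiment */
        toy_flush(t);
        return;
    }
    t->valid = false;           /* a pure invalidate just drops the line */
    t->dirty = false;
}

/* Dirty the buffer, let "DMA" deliver a packet, invalidate, then read. */
uint32_t toy_receive(struct toy_buf *t, uint32_t stale, uint32_t packet,
                     bool flush_first)
{
    toy_write_cached(t, stale);
    if (flush_first)
        toy_flush(t);           /* the proposed workaround */
    t->mem = packet;            /* DMA write bypasses the cache */
    toy_invalidate(t);
    return toy_read_cached(t);
}
```

With ivac_cleans set and no flush first, toy_receive hands back the stale value and memory is clobbered, which matches the last test above. With the flush done first, the packet comes through no matter which kind of invalidate is in play.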

Analysis

For some reason invalidate is the same as flush. The ARM instructions involved are:
    dc      civac, x0       /* clean and invalidate data or unified cache */
    dc      ivac, x0        /* invalidate data or unified cache */
So, "dc ivac" is acting the same as "dc civac".

A bit exists in the HCR_EL2 register which forces this behavior, but only for code running at EL1 -- so this does not explain our case (we run at EL2). The fact that ARM provided the ability to force this behavior in this case makes me wonder if there is some bit someplace else that forces this behavior for EL2. I have not found such a bit.

I did try reading the HCR_EL2 register, and this bit is set! Even more curious is the discovery that I cannot write 0 to this bit. This is suspicious and suggestive, but I am not sure what to make of it.

Without a doubt, having "ivac" work the same as "civac" explains why my network driver does not work. U-boot has a working network driver for the H5 (it used it to load my code via tftp). What is it doing differently?

Turn off the D cache entirely

What would happen if I turn off the D cache altogether? Things will of course run more slowly, but if this "bug" is the only thing causing my network driver to fail, it ought to work with the D cache turned off.

Turning off the D cache is a bit more involved than just flipping the bit in the SCTLR register. It is also necessary to flush the D cache. Otherwise software will suddenly find itself running with stale data. In particular the stack will be hopeless. I have tried this and can testify that nothing good comes out of it.
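
For reference, the bit in question is the C bit (bit 2) of SCTLR, and since we run at EL2 the register is SCTLR_EL2. Here is a minimal C sketch of just the bit flipping. The helper names are hypothetical, the accessors only build for aarch64 and only work at EL2, and the full D cache flush that has to go with it is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define SCTLR_C (1ULL << 2)   /* data and unified cache enable */

/* Value to write back to SCTLR with the D cache turned off. */
uint64_t sctlr_dcache_disabled(uint64_t sctlr)
{
    return sctlr & ~SCTLR_C;
}

/* Nonzero when the given SCTLR value has the D cache enabled. */
int sctlr_dcache_is_on(uint64_t sctlr)
{
    return (sctlr & SCTLR_C) != 0;
}

#if defined(__aarch64__)
/* These only work when running at EL2. */
uint64_t read_sctlr_el2(void)
{
    uint64_t v;

    __asm__ volatile("mrs %0, sctlr_el2" : "=r"(v));
    return v;
}

void write_sctlr_el2(uint64_t v)
{
    /* isb makes the new setting take effect for following instructions */
    __asm__ volatile("msr sctlr_el2, %0\n\tisb" : : "r"(v) : "memory");
}
#endif
```

Only bit 2 changes, so a value such as 0x1005 (MMU, D cache and I cache enabled) becomes 0x1001 and the MMU and I cache stay on.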

However, when I do flush and do things properly it works! My code runs a lot slower (some loops that scan through all 1G of ram really take a long time now). But the network works!

At first I put the call to turn off the D cache as more or less the very first instruction after Kyu boots. After this worked, I moved it towards the tail of the initialization code. This also works and lets the large scale memory probing run at full speed.

The Kyu network test menu provides:

n 1 - display statistics
n 2 - ARP ("ping" a host with an ARP request)
n 5 - ICMP (pings my boot host)
n 6 - DNS (a variety of queries)
n 4 - DHCP (send a request)
n 8 - TFTP (fetches tftp.test)
I run "n 8" multiple times (it fetches a 40648 byte file named "tftp.test").

So in a sense, I have achieved my goal. I have a working network driver for the h5.

I have a new puzzle to solve -- this business of why a dcache invalidate works like a flush. This is an ARM puzzle, not specific to the H5.


Have any comments? Questions? Drop me a line!

Tom's electronics pages / tom@mmto.org