L1 Cache Sim: part 1

I am making my way through Understanding Software Dynamics by Richard Sites. I highly recommend its opening chapters as a great introduction to L1 caches, particularly the details around associativity. Shoutout to @eatonphil for reminding me about the book after seeing it mentioned elsewhere recently.

In chapter 4 there is a worked example of matrix multiplication in an 8-way associative cache; I thought it would make for a nice interactive visualisation (I have just rewritten my blog in SvelteKit and it makes this kind of interactive page really quite straightforward).

Quick reminder: the L1 cache is the smallest and fastest of a series of caches within a CPU that boost performance by retaining recently accessed data for quick re-access. With latencies a couple of orders of magnitude lower than a main memory access, you can get huge performance gains by accessing data in a way that is sympathetic to the cache.

Sites describes three types of caches:

Briefly, a fully associative L1 cache can store (a cache line's worth of) data at a particular address in any of its cache lines. That would be great, but requires circuitry to support 512 parallel tag lookups (one per cache line in our example cache).

On the other end is the direct-mapped cache, where a given main memory address can only be stored in one particular cache line (based upon the lowest bits of the address). In this case the lookup is simpler: those low bits take us directly to the single cache line that could hold the data, and whether we have a cache hit or miss is just whether the tag stored there matches the main memory address being looked up. The downside of this type of cache is that if you happen to be alternating (in a tight loop, say) between two pieces of data that share the same lower bits, you will get a cache miss on every access, constantly evicting the data you are about to access again.

A compromise between the two is the set-associative cache with N-way sets, where low bits again choose a set of cache lines. Within each set there are N cache lines available to house data at a given main memory address, and the N tags are then checked for a match.
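To make the difference concrete, here's a little sketch (my own, not from the book; the function names are made up and the numbers match the example cache described below): a direct-mapped cache computes a single line index from the low bits of the address, while a set-associative cache computes a set index and then checks the N tags held in that set.

const std = @import("std");

const line_size = 64; // bytes per cache line

// Direct-mapped: the low bits of the line address fix the ONE line the data can occupy.
fn directMappedIndex(addr: usize, num_lines: usize) usize {
    return (addr / line_size) % num_lines;
}

// Set-associative: the line may sit in any of the set's ways, so a hit means
// one of the N tags (here, line addresses) in that set matches.
fn setLookup(tags: []const usize, line_addr: usize) ?usize {
    for (tags, 0..) |tag, way| {
        if (tag == line_addr) return way;
    }
    return null;
}

pub fn main() void {
    // Two addresses 32 KiB apart collide in a 512-line direct-mapped cache...
    std.debug.print("{} vs {}\n", .{
        directMappedIndex(4096, 512),
        directMappedIndex(4096 + 32 * 1024, 512),
    });

    // ...but an 8-way set can hold both (0 stands in for an empty way here).
    const set = [_]usize{ 4096 / line_size, (4096 + 32 * 1024) / line_size, 0, 0, 0, 0, 0, 0 };
    std.debug.print("{any} {any}\n", .{
        setLookup(&set, 4096 / line_size),
        setLookup(&set, (4096 + 32 * 1024) / line_size),
    });
}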

In Understanding Software Dynamics, the example L1 cache has this spec: 32 KiB of data held in 64-byte cache lines, organised as an 8-way set-associative cache.

The total number of cache lines is the total size of the cache / cache line size: 32 KiB / 64 B = 512 lines. A given memory address can only be cached in one particular set (1 of 64), and within that set there are 8 lines, any of which may hold it.
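If you prefer that arithmetic as code (the constants are just the spec above, nothing machine-specific):

const std = @import("std");

pub fn main() void {
    const cache_size = 32 * 1024; // 32 KiB of data in total
    const line_size = 64; // 64-byte cache lines
    const ways = 8; // 8-way set-associative

    const total_lines = cache_size / line_size; // 512 cache lines
    const sets = total_lines / ways; // 64 sets of 8 lines

    std.debug.print("{} lines, {} sets of {} ways\n", .{ total_lines, sets, ways });
}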

Sets will be drawn vertically and “ways” will be drawn horizontally. Rather than drawing the data held in the cache lines (which would be too wide to draw, and we don’t really care what the data is, only its access pattern), we draw the tag (the main memory address of the first byte of data in the cache line). With nothing in the cache we will just render ________.

Let’s start with a simple example program. The program will look something like this:

const data: []u8 = ...

var i: usize = 0;
while (i < count) : (i += 1) {
    const datum = data[i * stride];
    ...
}

The program simply accesses count items from the array data with a given stride. For sequential access of bytes, stride = 1. We don’t really care where in memory this array lives…in an actual program this may be an address on the stack or on the heap. We’ll place the array at address 4096. For starters we’ll read 129 bytes of data (two cache lines’ worth, plus one). We default to running the loop once.

These parameters (Array address, Count, Stride and Loops) can all be changed, but initially try just hitting the Run button.

With the default values, what do we observe? First, we display the number of cache hits, the number of cache misses and the miss rate. Obviously the number of hits + misses will equal 129. You will also see that 3 lines end up in the cache. When we first populate a cache line we always take a cache miss (assuming no prefetching), so it makes sense that we end up with 3 misses, leaving 129 - 3 = 126 cache hits.
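If you'd like to reproduce those numbers outside the page, here's a minimal sketch of the bookkeeping the simulation does. To be clear, this is my own approximation with made-up names and an assumed LRU replacement policy, not the code running behind this page:

const std = @import("std");

const num_sets = 64; // chosen by 6 low bits of the line address
const num_ways = 8; // 8-way set-associative
const line_size = 64; // bytes per cache line

pub fn main() void {
    // For each set/way: the line address cached there and when it was last used.
    // last_used == 0 means the way is empty; otherwise it drives the (assumed) LRU policy.
    var lines = std.mem.zeroes([num_sets][num_ways]usize);
    var last_used = std.mem.zeroes([num_sets][num_ways]usize);

    const array_address: usize = 4096;
    const count: usize = 129;
    const stride: usize = 1;
    const loops: usize = 1;

    var hits: usize = 0;
    var misses: usize = 0;
    var tick: usize = 0;

    var loop: usize = 0;
    while (loop < loops) : (loop += 1) {
        var i: usize = 0;
        while (i < count) : (i += 1) {
            const addr = array_address + i * stride;
            const line_addr = addr / line_size; // which 64-byte line this byte lives in
            const set = line_addr % num_sets; // low 6 bits of the line address pick the set
            tick += 1;

            // Is the line already in one of the set's 8 ways?
            var hit = false;
            var way: usize = 0;
            while (way < num_ways) : (way += 1) {
                if (last_used[set][way] != 0 and lines[set][way] == line_addr) {
                    hits += 1;
                    last_used[set][way] = tick;
                    hit = true;
                    break;
                }
            }

            if (!hit) {
                // Miss: fill the least recently used way (empty ways have last_used == 0).
                misses += 1;
                var victim: usize = 0;
                way = 0;
                while (way < num_ways) : (way += 1) {
                    if (last_used[set][way] < last_used[set][victim]) victim = way;
                }
                lines[set][victim] = line_addr;
                last_used[set][victim] = tick;
            }
        }
    }

    // With the defaults above this prints: hits: 126, misses: 3
    std.debug.print("hits: {}, misses: {}\n", .{ hits, misses });
}

Changing array_address, count, stride and loops at the top mirrors changing the parameters above.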

What I want you to do now is play around with the stride value.

Hopefully this shows, even in this simple example, how some access patterns are terrible. Despite accessing the same number of bytes, the difference in performance between stride = 1 and stride = 4096 might be 10x (because in the latter case we always have to read from L2, if not slower caches / main memory, and L2 is roughly an order of magnitude slower than L1).
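To see why stride = 4096 is so bad with this geometry: every address in that stream shares the same low 12 bits, so every access maps to the same set, and only 8 of those 129 lines can ever be resident at once (which is also why re-running the loop doesn't help much). A quick illustrative sketch:

const std = @import("std");

pub fn main() void {
    // Addresses 4096, 8192, 12288, ... all share their low 12 bits,
    // so they all land in set 0 and fight over its 8 ways.
    var i: usize = 0;
    while (i < 8) : (i += 1) {
        const addr: usize = 4096 + i * 4096;
        const set = (addr >> 6) & 63; // always 0
        std.debug.print("address {} -> set {}\n", .{ addr, set });
    }
}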

Wait…I never explained how a particular set is chosen.

As stated earlier, our cache lines are all 64 bytes. Given a main memory address for a single byte, we need to look up the cache line (assuming it is cached) and pick out the particular byte in question (1 of 64 possible bytes). So we can select the byte in question using the lowest 6 bits of the address. Why 6 bits? Because 2^6 = 64.

For reasons I won’t go into here (I recommend buying the book!), the maximum cache size is related (maybe unexpectedly) to the page size of the machine, which is typically 4096 bytes. With a page size of 4096, the lowest 12 bits of a virtual address are unchanged by the mapping to physical addresses, and it is these 12 bits we are going to use for set and byte selection.

We are already using the lowest 6 bits to select which of the 64 bytes in a cache line we want. This leaves us another 6 bits to choose the set, and hence our cache has 64 sets, because 2^6 = 64.

Below we show a memory address in binary, truncated to only show the lowest 15 bits:

…010010010000101

The yellow bits (the lowest six, 000101) show we are indexing byte 5 (of 64 bytes) and the purple bits (the next six, 010010) show that this address (and any other address that shares those purple bits) will be cached in set 18.
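In code, picking those two fields out of an address is just a pair of masks and a shift; this tiny sketch uses the example address above:

const std = @import("std");

pub fn main() void {
    const addr: usize = 0b010010010000101; // the example address above (low 15 bits)

    const byte_in_line = addr & 0b111111; // lowest 6 bits -> byte 5 of the 64-byte line
    const set = (addr >> 6) & 0b111111; // next 6 bits -> set 18 of the 64 sets

    std.debug.print("byte {} of set {}\n", .{ byte_in_line, set });
}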

I think that’s enough for one post. I hope to follow up with a second part where I work through the matrix multiplication example from the book.