Cache

The purpose of this document is to help people have a more complete understanding of what memory cache is and how it works. I discuss the implementation and comparative advantages of direct mapped cache, N-way set associative cache, and fully-associative cache. Also included are details about cache design. I try to give a complete treatment of the more important aspects of cache.

Overview
Direct Mapped Cache
- Working in the Fields
- Example
  - Initially
  - Step 1
  - Step 2
  - Step 3
  - Step 4
- Practice Problems
  - Questions
  - Answers
N-Way Set Associative Cache
Fully Associative Cache
Unifying Theory
Cache Design and Other Details
Cache Aware Programming

This is a supplement for the notes and text, not a replacement. Although this text can be read on its own without reference to the book or the notes, this is more text heavy, while the notes provide more valuable visual aides and a far more concise overview than I provide here. Also, there are many aspects of cache described in the notes that I do not describe in this version of the text, like sub-blocks.

The reason you should care is because a programmer can design software to take advantage of cache. If a programmer is aware of cache and its trappings, it is a simple matter to arrange memory accesses to take advantage of cache and have faster run times. A great proportion of computing is accessing memory, and anything that can be done to speed memory accesses can only be a good thing. Because of this, I consider cache to be one of the more important topics of computer hardware.

Because cache is such a multifaceted subject, and because learning about cache requires the introduction of a lot of new ideas, this document is necessarily long. As with all of these notes, I tried to make each section portable, so that if you only need to firm up your understanding on one subject, you may skip to the relevant section.

Overview

The purpose of cache is to speed up memory accesses by storing recently used chunks of memory. Now, main memory (RAM) is nice and cheap. Not as cheap as disk memory, but cheap nevertheless. The disadvantage of this sort of memory is that it is a bit slow. An access from main memory into a register may take 20-50 clock cycles. However, there are other sorts of memory that are very fast, that can be accessed within a single clock cycle. This fast memory is what we call cache.

However, this fast memory is prohibitively expensive, so we can’t compose our main memory entirely of it. However, we can have a small amount, say 8K to 128K on average, in which we store some of our data. But what data do we choose? The way it happens, we choose data that has been most recently used. This is a policy that takes advantage of something called spatial and temporal locality.

Spatial and temporal locality of memory accesses refers to the idea that if you access a certain chunk of memory, in the near future you’re likely to access that memory, or memory near it, again. For example, you execute consecutive instructions in memory, so the next dozen of so instructions that you access are probably contained in a given contiguous block of data, so the next instruction would be accessed close by. Similarly, if you’re accessing elements of an array, they will be located consecutively, so it makes sense to store a block of data that has recently been used. Separate variables called in a function are usually close by as well. That is what cache does.

When we want to access memory at a certain address, we look in the cache to see if it is there. If it is there, we get it from cache instead of going all the way to memory; that situation is called a “cache hit.” If the data is not in cache, then we bring the data from memory to cache, which takes longer since the data must be brought in from the smaller memory; that situation is called a “cache miss,” and that delay that we endure because that memory must be brought from main memory is called a “miss penalty.”

Direct Mapped Cache

Direct mapped cache works like this. Picture cache as an array with elements. These elements are called “cache blocks.” Each cache block holds a “valid bit” that tells us if anything is contained in the line of this cache block or if the cache block has not yet had any memory put into it. There is also a block of data called a “line” which stores a recently used block of data. Perhaps most importantly, there is a number called a “tag” composed of more significant bits of the memory address from which the line’s data came from.

Valid	Tag	Line	Index
			N
			N+1
			N+2
			N+3
			N+4

You can think about the direct mapped cache this way. Each row in the table to the left represents a cache block. We have our valid bit which tells us if this cache block currently holds data. Next is our tag, a number that tell us where in memory this bit is from. After that, we have our line, which is the data that we have stored in cache.

The number to the right is just the cache index. Cache blocks are indexed like elements in an array, and there is a cache index associated with each cache block. As you see, I show blocks N through N+4, where N is some integer such that these are valid cache blocks for this particular cache.

As a nota bene, you must know that for dirct mapped cache, cache sizes and line sizes (and thus the number of cache blocks) are made in powers of two. It works out better for the hardware design if this is the case. Therefore, we’re not going to see direct mapped cache’s with 15 kilobytes, or 700 bytes, or what have you. We’re working only in powers of two for direct mapped cache.

For a direct mapped cache, suppose we have a cache size of 2^M bytes with a line size of 2^L bytes. That means that there are 2^M-L cache blocks. Now, we use the address of the data we’re trying to access to determine which of these cache blocks we should look at to determine if the data we want is already in memory.

Working in the Fields

Which Bits	Field
32-M bits	Tag
M-L bits	Index
L bits	Byte Select

We use the address in this way. Certain bits in our 32 bit memory address have special significance. We group the bits into three distinct groups we call fields. In this discussion, M and L are defined just as they were a paragraph ago.

The rightmost L bits of the address is the byte select field. If data is found in a cache block, since addresses are byte addressed, we use this field to determine which of the bytes in a line we’re trying to access.

The M-L bits just to the left of the byte select field is called the index field. As we said before, we have 2^M-L cache blocks in our cache. These cache blocks are ordered, just as in an array. There is a first one (with index 0) and a last one (with index M-L-1). If we’re given an address, the value represented by the bits of the index field tell us which of these cache blocks we should examine.

The leftmost 32-M bits of the address, just to the left of the index field, is called the tag field. This field is put into the tag part of the cache block, and that uniquely identifies from where in main memory the data in the line of this cache block came from. If we’re looking at a cache block, we presumably know what index of this cache is, and therefore in conjunction with the cache tag we can compose the addresses from which the data in the line came from. When we store a line of data into a cache block, we store the value represented by the tag field into the tag part of the cache block.

As a side note, it is not written in stone that the index field must be composed of the bits immediately to the left of the byte select field. However, the way that memory accesses tend to work, it happens that if the index field is in the bits immediately to the left of the byte select field, when we look for data, it is in cache rather than in main memory a larger proportion of the time. While there might be some special cases in which having the index field elsewhere would yield a higher hit ratio, in general it is best to have the index field immediately to the left of the byte select field.

When we’re checking cache to determine if the memory at the address we’re looking for is actually in cache, we take the address, and extract the index field. We find the corresponding cache block. We then check to see if this cache block is valid. If it’s not valid, then nothing is stored in this cache block, so we obviously must reference main memory. However, if the valid bit is set, then we compare the tag in the cache block to the tag field of our address. If the tags are the same, then we know that data we’re looking for and the data in the cache block are the same. Therefore, we may access this data from cache. At this point we take the value in the byte select field of the address. Suppose this value is N. We are then looking for byte N in this line (hence the name byte select).

Example

However, these things are always more clear when we talk about a specific example. Let’s talk about my first computer, a Macintosh IIcx. Suppose we have a direct mapped cache whose cache lines are 32 bytes (2⁵ bytes) in length, and suppose that the cache size is 512 bytes (2⁹ bytes). Since there are 512 bytes in the cache, and 32 bytes per cache block, that means that there are 512/32 = 16 cache blocks.

In reality, a 512 byte cache is a very small cache; you’d never find a cache that small in today’s computers. However, it makes the example that follows manageable.

Now, suppose we have addresses that are thirty-two bits in length. In this example, taking M and L to have the same definitions as they did at the beginning of our discussion on direct mapped cache, M = 9 and L = 5. So, our byte select field is the rightmost 5 bits. Our index field is the next M-L = 4 bits. Our tag field is the leftmost 32-M = 23 bits.

This actually makes sense. Since we have 16 cache blocks, we can uniquely represent the 16 possibilities of cache blocks we may address by 4 bits, which is the size of our index field. Similarly, we have 32 bytes that we can access in the data potion of our cache block since line size is 32 bytes, and we may uniquely address these 32 bytes with 5 bits, and indeed that is the size of our byte select field.

To illustrate the working of this, I’ll go step by step through a series of memory accesses. In my table representation of cache I omitted the cache line to save space, because it really doesn’t matter what the line holds. The data in cache is the data in cache, and that’s all there is to it. We’re interested in what memory is accessed where, because that is what affects the workings of the cache. That changes the state of the valid bits and the tag numbers of our cache blocks.

For this example, I provide a series of tables, which represents the cache blocks, and their corresponding valid bits and tags. The first number in a row in the table is the index of this particular cache block. The next number is the valid bit. The rightmost number is the tag, which tells us where the data in the line (which I do not show) is from.

Initially

The table for the initial conditions shows that the valid bits are all set to zero, because nothing has yet been stored in cache.

Step 1

For step one, suppose we access the memory at address 0x0023AF7C. The looking at that in binary, that is 00000000001000111010111101111100₂. If we separate these bits into the lengths of the fields we have determined with dashes between each field, that is more easily read as 00000000001000111010111-1011-11100. So, our index is 1011₂=11₁₀. We look at index 11. Is there anything in there yet? We see there is not. Therefore, we load the data contained at memory addresses 0x0023AF60 through 0x0023AF7F into the 32 byte line of the cache block with index 11.

Note that the memory addresses between those two addresses are the memory addresses that would have the same tag and index fields if we broke the addresses into their respective fields.

Now that the data is loaded into the line, we set the tag for this cache block to be the same as the tag of our address, since this cache block will now hold a line of data from the addresses with the same index and tag. This tag is 1000111010111₂=4567₁₀. We also must set our valid bit to true since this cache block now holds a block of data. These changes are reflected in the table for step 1.

After these changes and the memory loads, we know that the data we want is in cache block 11. Our byte select field is 11100₂ = 28₁₀, so we get the 29^th byte of our line.

Since 0 would address the first byte, 28 addresses the 29^th byte. It’s exactly as though the cache line is an array of chars (a byte length quantity), and the byte select is the index of the particular element of that array that we’re looking at right now.

Now, this is a cache miss, but there are different sorts of cache misses. Notice that no data has been loaded into cache yet. That is because no memory accesses have taken place. Because we’re just starting out, this is called a “compulsory miss.” We cannot help but not have data in this cache block yet, because we just started.

Step 2

After step 1, suppose we have another memory access, this time at memory address 0x0023AE4F. Ignoring the intervening binary operations, this works out to have a byte select field of 15, an index field of 2, and a tag field of 4567.

If we look at cache block 2, we see that it is not valid. Similar to last time, we load the 32 bytes of memory from addresses 0x0023AE40 to 0x0023AE5F into the line of this cache block, change the tag to 4567, and set the valid bit to true.

The tag field is the same as in the previous operation; this was intentional on my part. As we see, the fact that the tag is the same here as it was last time is of no significance, since the two are part of completely different indexed cache blocks; they are unrelated insofar as cache is concerned.

Similar to last time, the data we want is in the cache line. Our byte select field is 15, so we access the 16^th byte (since 0 would address the first byte) of the line in cache block 2.

Even though we’ve already had a memory access in step two, this miss is also called a compulsory miss because the cache block that we’re loading into hasn’t yet had any data loaded into it.

Step 3

Suppose now we have a memory access at address 0x148FE054 (in binary, 00010100100011111110000 0010 10100). This works out to have a byte select field of 10100₂=20₁₀, index field of 0010₂=2₁₀, and a tag field of 00010100100011111110000₂=673776₁₀.

We look at cache block 2. We see that it is valid. We then compare the tags. 673776 is not equal to 4567, which is the tag of cache block 2. This means that, even though there is data in this cache block, the data we want is not in here. We must load the data, and we do so just as if this block had not been valid.

We load the data at addresses 0x147FE040 to 0x147FE05F into the line at cache block 2. We set the tag to 673776. We don’t need to set the valid bit to one since it already is one. Now this cache block holds a new portion of data.

As you see, we had two pieces of data that wanted to be in the same cache block: there was the data we loaded in the previous step, and the data we just loaded in this step. Now, this is not a conflict miss. What would be a conflict miss is if we tried to load the data that used to be in the cache (the data that we just replaced) back into cache. When this sort of miss occurs, that is called a “conflict miss,” or rather a “collision.”

We also see one of the primary weaknesses of the direct mapped cache. At the end of step two we had two blocks used up, which means that we’re using 64 bytes of our 512 byte cache. However, in this step, step three, because of a conflict miss we overwrote one of those blocks even though we had 448 bytes unused in our cache. Though this example is completely contrived, this sort of cache inefficiency is a common problem with direct mapped cache.

Once we load the data into this cache and set the tag field appropriately, we can find the byte that we want using the byte select field, at index 2 (the 3^rd byte).

Step 4

Now suppose we try to access memory address 0x0023AF64. This works out to have byte select field of 00100₂=4₁₀, index field of 1011₂=11₁₀, and tag field of 1000111010111₂=4567₁₀.

We look at cache block 11, and it is valid, so we compare the tags. The tags are the same, it turns out, so we finally have a cache hit! The data that we want is in this cache block. Since our byte select field is 4, we access the fifth byte of this block’s cache line. We don’t modify cache at all; tags and valid bits remain the same, and we needn’t pull any data from main main memory since the memory we want has already been loaded into cache.

This block was loaded with data in step one. The memory access for step one was 0x0023AF7C. Now, this step’s memory access, 0x0023AF64, is obviously a different address than that, but 0x0023AF64 and 0x0023AF7C are pretty close. Since in the memory access for 0x0023AF7C we loaded that cache block’s line with memory from 0x0023AF60 to 0x0023AF7F (32 bytes in all), the data for 0x0023AF64 could be found there. So, two memory accesses may access the same data in cache even though they are not exactly the same.

Initially
Idx	Val	Tag
0	0	0
1	0	0
2	0	0
3	0	0
4	0	0
5	0	0
6	0	0
7	0	0
8	0	0
9	0	0
10	0	0
11	0	0
12	0	0
13	0	0
14	0	0
15	0	0

Step 1
Idx	Val	Tag
0	0	0
1	0	0
2	0	0
3	0	0
4	0	0
5	0	0
6	0	0
7	0	0
8	0	0
9	0	0
10	0	0
11	1	4567
12	0	0
13	0	0
14	0	0
15	0	0

Step 2
Idx	Val	Tag
0	0	0
1	0	0
2	1	4567
3	0	0
4	0	0
5	0	0
6	0	0
7	0	0
8	0	0
9	0	0
10	0	0
11	1	4567
12	0	0
13	0	0
14	0	0
15	0	0

Step 3
Idx	Val	Tag
0	0	0
1	0	0
2	1	673776
3	0	0
4	0	0
5	0	0
6	0	0
7	0	0
8	0	0
9	0	0
10	0	0
11	1	4567
12	0	0
13	0	0
14	0	0
15	0	0

Step 4
Idx	Val	Tag
0	0	0
1	0	0
2	1	673776
3	0	0
4	0	0
5	0	0
6	0	0
7	0	0
8	0	0
9	0	0
10	0	0
11	1	4567
12	0	0
13	0	0
14	0	0
15	0	0

This set of tables trace the 16 cache blocks of the cache example through each of the memory accesses described above. Each "step" table reflects the state of cache at the end of each step.

Wow. That just kept going, didn’t it?

Practice Problems

These are just a couple of questions that demonstrate the most basic concepts of direct mapped cache. The answers are provided after all the questions. These aren’t terribly difficult, but they should get you thinking about cache. All of these questions pertain to direct mapped cache, in a computer with 32-bit addressing.

Questions

If we have a cache size of 4KB with a 16 byte line size, what will be the sizes of the three fields (tag, line, and byte select) which we divide our address into?
Suppose our fields are arranged as described above (first tag field, then index field, then byte select field) in a cache of 16KB with 32 byte line sizes. What would be the values of each of the three fields for the following addresses?
1. 0x00B248AC
2. 0x5002AEF3
3. 0x10203000
4. 0x0023AF7C

Answers

Our cache size is 4KB = 2¹² bytes, with a line size of 16 bytes = 2⁴ bytes. Therefore, our byte select field is 4 bits, our index field is 12-4 = 8 bits, which leaves 20 bits for our tag field.
This is actually two problems in one. We must first determine the sizes of each of the fields. Our cache size is 16KB = 2¹⁴ bytes, with a line size of 32 bytes = 2⁵ bytes. So, our byte select field is 5 bits, our index field is 14-5 = 9 bits, which leaves 18 bits for our tag field.
1. For 0x00B248AC, tag is 0x2C9, index is 0x45, and byte select is 0xC.
2. For 0x5002AEF3, tag is 0x1400A, index is 0x177, and byte select is 0x13.
3. For 0x10203000, tag is 0x4080, index is 0x180, and byte select is 0x0.
4. For 0x0023AF7C, tag is 0x8E, index is 0x17B, and byte select is 0x1C.

N-Way Set Associative Cache

Another sort of cache is the N-way set associative cache. This sort of cache is similar to direct mapped cache, in that we use the address to map the address to a certain cache block. The important difference is that instead of mapping to a single cache block, an address will map to several cache blocks. For example, in a 2-way set associative cache, it will map to two cache blocks. In a 5-way set associative cache, it will map to five cache blocks.

In this cache there may be several cache blocks per index. This group of cache blocks is referred to collectively as an “index set.” In our direct mapped cache, there was one cache block per index set. In an N-way set associative cache, there are N cache blocks per index set.

So, we have an address, and like in direct mapped cache each address maps to a certain index set. Once the index set is found, the tags of all the cache blocks in this index set are checked in parallel. If one of those cache blocks holds the data that we’re looking for, then the data for that one block is extracted.

If one of those blocks does not hold the data we want, then like before we have a cache miss, and the data must be loaded. If one of the cache blocks in this index sets is free, then we can simply load it in there. If, however, all of the blocks are already taken, then one of those blocks must be overridden. We choose to free up the one that has been least recently used, and write the new data in that block. This is a policy called Least Recently Used (LRU) replacement.

There are any number of ways (some better than others) in implementing a system that traces the order in which cache blocks of an index set have been used, so that we may appropriately choose which cache block to replace in if we must replace a block. It is not very important to know exactly how this is done, but you should think about ways of implementating this feature.

Sizes of Fields for N-Way Set Associative

Like direct mapped cache, when memory accesses occur the address is broken into three fields: tag, index, and byte select. The sizes of fields for N-Way Set Associative are very similar to those of direct mapped cache.

For an N-way set associative cache, suppose we have a cache size of N·2^M bytes with a line size of 2^L bytes. That means that there are N·2^M-L cache blocks, and at N cache blocks per index set, there are 2^M-L index sets.

For example, suppose we have a three-way set associative cache of size 12KB, with line size of 16 bytes. Our line size may be expressed as 2⁴ bytes, so L=4. Since N=3, and 12KB = 3·4KB = 3·2¹² bytes, M=12. So, in this example, our byte select field would be 4 bits long, our index field would be 12-4=8 bits long, and our tag field would be 32-12=20 bits long.

Example : 2-Way Set Associative Cache

Suppose we have a series of memory accesses. For the sake of the example, suppose that the memory accesses all map to the same index set, but that the addresses’ tags differ. Which index is unimportant, since they’re all to the same index set. What is important is the tag of each memory access.

For each step I provide a table that shows the tags of the single index set that we’re considering. The cell on the right is that structure which tells us the order in which things were last accessed. I have it initially set so that the system will think that the first block was the least recently used block, so the first block (leftmost) will be the one first used. The table is shown at the end of a step to show what the index set looks like at that point.

Block 1 Tag	Block 2 Tag	Access Order
No tag	No tag	Second, then first.

Now, suppose we have a series of these memory accesses to the same index set as described, with the following tags for each memory access: 0x00010, 0x8A02F, 0x00010, 0xF1EEE.

For the first access (tag 0x00010), we’re assuming that this index set is empty. So, we load the data with tag 0x00010 into the first of our two cache blocks.

Block 1 Tag	Block 2 Tag	Access Order
0x00010	No tag	First, then second.

It is important to remember that in N-way set associative caches there is a structure for each index set that tells us the order in which the index set’s cache blocks were last accessed. For our two-way cache, the possible states for the order of access are “first, then second” or “second, then first.” For a three-way cache, the possible states are “first, then second, then third,” “first, then third, then second,” etc. (There are six states in all for a three-way cache.) This tells our hardware just which block to replace if we must replace a block, so that our more recently used blocks are not disposed of before their time.

For the second access (tag 0x8A02F), we compare the tags. 0x00010 does not equal 0x8A02F, but we have the second cache block unused (that is, invalid), so we stick the data with tag 0x8A02F into the second of the two cache blocks. At this point we have another structure in this index set that lets the hardware know that the second cache block has been used most recently.

Block 1 Tag	Block 2 Tag	Access Order
0x00010	0x8A02F	Second, then first.

For the third access (tag 0x00010), we compare the tags. The tags for the first of our two cache blocks match. Our tags don’t change and we don’t load any more data, but we change our structure that describes the order in which the blocks were last accessed to reflect that the first block was more recently accessed.

Block 1 Tag	Block 2 Tag	Access Order
0x00010	0x8A02F	First, then second.

For the fourth and last access (tag 0xF1EEE), we compare the tags. There is no match, and there are no blocks free for this index set. So, what happens? We replace the block that has been least recently used. We have stored the information that, for this index set, the first cache block was more recently used. Therefore, we replace the second block. We keep the data with tag 0x00010, and discard the data with tag 0x8A02F and load the data with tag 0xF1EEE in its place into the second block. Now we change the order of access to reflect that the second block was more recently accessed than the first block.

Block 1 Tag	Block 2 Tag	Access Order
0x00010	0xF1EEE	Second, then first.

Example : 3-Way Set Associative Cache

I’m not going to go tremendously in depth here. I’m just going to show a 3-way cache, with each block’s tag shown (or the text “no tag” if the block is invalid). After each index set, there will be a cell that tells us the order of “last accessed” for this index set.

Notice that we have this “order of access” field for each index set. Also notice that it doesn’t suffice to say “the third was the block least recently used,” because if we were to access that third block, then we have no way of telling which block was the one least recently used.

While we can have up to three memory locations that map to the same index in cache at the same time, we can’t have four. We therefore run into problems similar to those discussed in direct mapped cache (conflict cache misses), even though we can store more cache blocks per index set. The problem of conflict misses has been alleviated, but not solved.

Index	Block 1	Block 2	Block 3	Access Order
0	0x49206	0xC6F76	0x65204	Third, then first, then second.
1	0x36172	0x6F6C2	No tag	First, then second, then third.
2	0x04272	0x61756	0xE2C20	Second, then first, then third.
3	No tag	No tag	No tag	Third, then second, then first.
4	0x62757	0x42073	0x68652	First, then second, then third.
5	0x0646F	0x65732	0x06E6F	Second, then third, then first.
6	0x74206	0xC6F76	No tag	Second, then first, then third.
7	0x65206	No tag	No tag	First, then second, then third.
8	0xD652E	0x0B3F9	0x636FC	Third, then first, then second.
9	0x4D67A	0x7AB7E	No tag	Second, then first, then third.
10	0x78A7F	No tag	No tag	First, then second, then third.
11	0x22448	0x37CAD	0x3DA1E	First, then third, then second.
12	0x33C1A	0x14E66	0x5D56C	Second, then third, then first.
13	0x3414D	0x2585F	0x202E1	First, then third, then second.
14	0x295D0	0x0BC9F	No tag	Second, then first, then third.
15	0x632E0	0x255C9	0x82823	Third, then second, then first.

Set Associative vs. Direct Mapped

When we’re dealing with an N-way set associative cache, when we’re looking at a particular index set we have the added burden of comapring multiple tags to the tag of our address to see if there is a match. This is done in parallel, so we must have as many comparators as there are blocks in an index set to check the tags of the blocks to the address tag. We also have to deal with the access histories of each index set, another thing we didn’t have to do with direct mapped cache. So, N-way set associative cache is considerably more difficult to design and to produce, and is therefore more expensive. For the same money, an N-way set associative cache would be smaller than a direct mapped cache, though the advantages of set associative with regard to conflict misses and hit ratios often justifies the trade.

Fully Associative Cache

Fully associative cache is perhaps the easiest cache to understand and explain because it acts intuitively as one would expect a cache to work. We load cache blocks similar to before, and when we fill up the cache and have to replace a cache block, we dispose of the block that has been least recently accessed. It’s exactly as though we take the N-way set associative cache, and set N such that there was only one index set. So, we’re able to ignore all the garbage about the index field. We only have a tag field, and a byte select field.

This sounds good as an idea, since fully associative cache acts exactly as a cache “should”; we don’t run into the problem where a more recently used block is discarded and a less recently used block is kept simply because the index set where that recently used block happened to be stored was more busy. Clearly fully associative cache has its attractive aspect.

However, from a hardware designer’s point of view fully associative cache is extraordinarily complex. First of all, when you are checking the cache to see if a tag is there, you must check all the tags of the cache in parallel. That means that there must be as many comparators as there are cache blocks in the entire cache. In direct mapped cache we only needed one comparator, and in N-way set associative cache we needed N comparators to check the tags of cache blocks in parallel.

Also, we must deal with the fact that we must trace the order in which blocks were accessed. When we were dealing with two or three blocks per index set, this wasn’t a huge problem. However, as an example, if we have to trace the access history all the blocks of the entire cache, supposing we had C cache blocks to keep track of, then we have C! possible states for access histories. As a concrete example, if we have something as small as a 1K cache with 32 byte line sizes, that’s (1024/32)! = 32! = 2.6313·10³⁶ combinations for access histories. For our CPS120 graduates, those states may be represented in 118 bits (possibly less after optimizations, but don’t expect me to figure that out). The point is, no matter how we implement this cache, the circuitry for handling these combinations of access histories (especially considering that the circuitry would have to do its job in at most one or two clock cycles) would be extraordinary for larger sized caches.

Because there is so much hardware involved, fully associative caches are necessarily very expensive. When fully associative caches are even used its cache size is necessarily tiny compared to the direct mapped and N-way set associative caches. However, for very small caches with very few blocks fully associative is often a logical choice.

Unifying Theory

If this all seems too complex, the important thing to realize about all these cache types is that they are all really the same cache type: N-way set associative.

The direct mapped cache is actually the special case of one-way set associative. For each “index set” there is only one cache block. You may notice that there is no structure in direct mapped cache that keeps track of the access history of the blocks of each index set, but that is because there is only one possibility; that the one block we have was accessed least recently (or most recently depending on how you like to look at it).

Fully associative cache is equivalent to the special case of N-way set associative cache where N is chosen such that N equals the total number of blocks in the cache, i.e., such that there is only one index set. It is true that N-way set associative cache has an index field and fully associative cache does not, but in this case the index field works out to have a length of 0 bits, i.e. there is no index field. This is consistent with our discussion of fully associative cache.

Cache Design and Other Details

Line Size

In our examples, we often discuss line sizes (alternatively called block sizes) of 32 bytes, or 16 bytes, or what have you. But there is no real reason why you can’t have line sizes of 8 bytes, or maybe 128 bytes, or 512 bytes.

There is some appeal to having a larger line size, especially when dealing with many memory accesses with sequential addresses. Suppose, for example, you’re printing a string of many characters, or accessing elements of a large array of integers for computation, or executing a large number of sequential instructions.

If we’re accessing integers in an array sequentially for computation, what we’re really doing is accessing four byte quantities at a time. If our line size is 32 bytes, when we first load an integer we have a cache miss, and that line is loaded into memory. The next seven accesses to the array will access that line as the integers are read sequentially, so those will be cache hits. Then an integer is read that’s just beyond the line we just loaded, so we must load the next line. This is a cache miss, but again the next seven will be hits since the integers will be in that line we just loaded. This trend continues. One miss, seven hits. One miss, seven hits. Et cetera, et cetera. All in all, we have a miss rate of one eighth. However, if our line size is 64 bytes, we’ll have one miss, fifteen hits, one miss, fifteen hits, a miss rate of one sixteenth. Larger line sizes takes advantage of spatial locality.

Larger line sizes aren’t all good, however. The miss penalty, or the time we are delayed when we have a cache miss, is increased since when we bring the data into the cache we must copy more data, which takes more time.

The other thing is that as we increase our line size, the number of our blocks will decrease. While this needn’t be extremely bad, examine the extreme case where line size equals the cache size, so there is only one cache block in our entire cache. Suppose that our cache size is 8K, so our block size is 8K. So long as we access memory in the same contiguous 8K of space, we’ll be great, but if the program ever strays from that memory space we’ll necessarily have a miss. If we have a program that jumps around between two distinct memory locations, accessing one area, then accessing the other, back and forth, we’ll necessarily have a miss every time, whereas if we had two cache blocks, after the data is initially loaded we’ll have a hit every time.

The point is that line sizes are best taken in moderation.

Types of Misses

A cache miss occurs when the data we’re looking for is not in cache, and so we must load it from main memory. In the example for direct mapped cache, we alluded to the fact that there are different types of cache misses. We mentioned two, but there are four, and how a system designer goes about minimizing misses depends greatly on the type of misses involved.

In this explanation I provide an analogous situation. Suppose we’re writing a paper in Perkins. We can take books (data) from stacks (main memory) and put them on our desk for easy reference (cache).

Compulsory Miss

A compulsory miss occurs at the first access of that memory location, so memory location has not yet been loaded into cache. These are also called cold start misses, or first reference misses. There isn’t much that can be done about this problem. However, since a compulsory miss only occurs when data is first loaded, in the long run with a program with many memory accesses, compulsory misses are not a problem.

This situation is analogous to when you’re just starting your research in Perkins. You desk is empty, so you must first get some of the books that you initially want. You can’t help this, since the books won’t be at the desk waiting for you when you arrive.

Conflict Miss (Collision)

A conflict miss occurs when there are many memory accesses mapped to the same index set in a cache, so data in a block of in this index set may be discarded even though another block in another relatively unused index set might be even older. If this said block is discarded and later retrieved, that is called a conflict miss.

This doesn’t have a clean analogy to my Perkin’s library situation, but suppose that there is a policy in the library that you may only have a certain number of books of the same color at any one time. For example, you may have a maximum of three blue books, and three red books, and three yellow books, and three fuchsia books. You may have space on the desk for more, but since you have three fuchsia books, you’ll have to put a book back into stacks before you get another.

In this analogy, the index sets are like the different colors, and the amount of books is like the set associativity. In this example we had a maximum of three books of a certain color at one time, so this was like a 3-way set associative cache.

We can alleviate this problem if we increase the number of books of a certain color we can have at once. That is, if we increase the set associativity.

We can also increase the size of the cache. All other things being equal, this has the effect of increasing the amount of index sets in the cache. In our analogy that would be the same as not only enlarging the des, but also refining the spectrum of colors. For example, at first we may only consider three colors: red, yellow, and blue. However, we may include other colors, and instead our spectrum may be red, orange, brown, yellow, aquamarine, and blue. Some colors that were considered red may now be considered orange, some colors that used to be considered yellow we may now call brown, et cetera. We increase the number of books we may have on our desk in this sneaky way. Looking outside of the analogy for a second, if we double the cache size, our index field grows by one bit. Instead of an index of 10010, for example, we could have indexes of 010010 OR 110010; our “spectrum” of indexes has increased.

As you may see, a fully associative cache would not have this problem because we ignore the nonsense of index sets entirely. We only have one “index set” (if you can call it that) and you can store as many cache blocks in that index set as there is space in the cache.

Capacity Miss

A capacity miss occurs when a memory location is accessed once, but later because the cache fills up, that data is discarded, and then when we get a miss when accessing that memory location again because that data is no longer in memory.

The analogy for this is simple. Your desk fills up, so you must send back a book to make room for new books on the desk, even though you may need the book you’re sending back later.

The solution to this is to increase the cache size; that is, get a bigger desk.

Invalidation Miss

What happens if main memory whose data is also loaded in cache is modified? In that case the cache data has been invalidated, and if the computer looks for it again it will have to be reloaded from main memory to get the up-to-date information. You might say, “that could never happen; all memory accesses are piped through the memory system, so cache would catch it.” That’s not always the case. I/O devices often write directly to memory (especially in older machines), so it is possible that we may have to reload data from main memory even though it may be in cache. This is called an invalidation miss.

The analogy for this is that a new edition of a book you have on your desk comes out, and it has new or corrected information. Regardless of how recently you brought this now out-of-date book to your desk, you must get the new book rather than refer to dated material.

Writing to Memory

I alluded to the potential problems of memory writing in my brief discussion of the invalidation miss, but for the most part throughout this text I have begged the issue of what happens to cache in the event of a memory write.

Memory writes are difficult to implement because something in memory changes when you do a memory write, and it is difficult to keep cache and main memory consistent with whatever change was made by the program. Memory reads, on the other hand, are no problem. If all a program does is read memory, memory cannot change, so there is no need to make sure that cache and memory are consistent with each other.

Before we go about making cache and memory consistent, however, the primary question is where our data will go; will writes first go in cache, in which case we’d have to bring main memory up to date later (called a “write back” policy), or write directly to main memory (called a “write through” policy)? The answer is that we can choose from either of these policies.

Write Back

Write back is the policy of writing changes to cache. This is simple enough; our cache blocks stay valid since they have the latest data. The problem with that is that if modified blocks ever were overwritten in the cache, we’d have to write the now discarded data back to main memory or changes will be lost.

One inefficient solution would be to simply write all discarded data back to main memory, but this is undesirable; if every cache miss involved not only a read from main memory but in addition a write to main memory, there would be a dramatic increase in the miss penalty. A more efficient solution is to assign a “dirty bit” to each cache block. In the course of a program’s execution if a block of data is modified we set the dirty bit, and write the data back to main memory if the dirty bit for the block we’re discarding has been set.

The advantage of a write back policy is that instead of writing to slower memory, we’re writing to cache. Just as we’re more likely to read from the same location in memory in a given chunk of time, the same can be said for memory writes. Instead of writing to memory every time we have a memory write, we only write back once we’re done with that block of data.

One disadvantage of this is that its control is somewhat complicated compared to write through.

Write Through

Write through is the policy of writing directly to cache, but also writing to memory. The problem is that writing data back to RAM slows things down. The solution is to have something called a write buffer. We stick our changed data in the write buffer, and whenever we have a DRAM write cycle the write buffer is emptied and the data is changed.

A problem arises here. A write buffer exists so a computer can continue to execute instructions faster than the time it takes to write new data through to main memory. It is therefore conceivable that a whole bunch of write instructions could fill the buffer faster than the buffer’s capacity to get rid of data already in the buffer.

This is not an insurmountable problem. We can install a level two cache to hold writes. This also has the effect of speeding up the process of writing back to memory, since this L2 cache will be faster than DRAM. By appropriately regulating the flow of data between DRAM and the L2 cache, write overflow may be overcome.

Sub-blocks

This revision does not cover sub-blocks. Sorry. If it’s any comfort, I don’t even remember them being mentioned in lecture, and we never did anything with them. I’m writing this on vacation, so I’m working from memory; I can’t remember much about them aside from the fact that sub-blocks reduce the miss penalty.

Cache Aware Programming

In the previous section we talked about ways of reducing misses by changing the cache size, set associativity, etc. The problem is that we as programmers don’t deal with the machine at that level. We already have a machine, and the cache of that machine is the cache of that machine, and we must deal with that. The only way we are to minimize these misses would be to modify our program to take advantage of cache. If we have two different programs that do the same thing, it is likely that one program may use cache more efficiently than the other. There are plenty of examples of equivalent cache efficient and cache inefficient programs in the notes. Depending on the sort of program you’re working on, it may also be helpful to understand the basics of the memory system of the particular computer your program will run on.

Another source of inefficiency not mentioned (at least in my version of the notes) arises from one of the few weaknesses of object oriented programming. When you’re working with objects, you call a lot of class object functions. The result is that we’re accessing instructions (stored in memory) all over the place. This is cache inefficient, so if you have a C program and a C++ program that do the same thing, the C program will tend to run a bit faster, especially if the C++ program takes advantage of its object oriented nature.