Cache Basics
by Gene Cooperman

Copyright (c) 2003, Gene Cooperman; Rights to copy for non-commercial purposes are freely granted as long as this copyright notice remains.

Basic Concepts

The cache holds a copy of data that resides in RAM, but it can be accessed much faster than RAM. For simplicity, we assume a single cache (L1 only) below.

We begin by describing a direct-mapped cache (1-way set associative). A cache is divided into cache blocks (also known as cache lines). To fully specify a cache, you should specify

  1. the size of the cache,
  2. the size of a cache block,
  3. the set associativity,
  4. the cache write policy (write-through vs. write-back),
  5. and the cache set replacement policy (if the cache is set associative).

We will say that the n-th block of a cache is at index n. (Some texts call this the group number or the set number.) A CPU address can be decoded into the fields used by a cache. The fields of a CPU address are (from highest-order bits to lowest):
  1. tag
  2. index
  3. offset (within a cache block)

A cache address can be specified simply by index and offset. The tag is kept to allow the cache to translate from a cache address (tag, index, and offset) to a unique CPU address. A cache hit means that the CPU tried to access an address, and a matching cache block (index, offset, and matching tag) was available in cache. So, the cache did not need to access RAM. In a cache miss, the CPU tries to access an address, and there is no matching cache block. So, the cache is forced to access RAM.
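The decoding of a CPU address into tag, index, and offset fields can be sketched with simple bit arithmetic. The cache geometry here (64-byte blocks, 128 blocks, direct-mapped) is chosen only for illustration:

```python
# Decode a 32-bit CPU address into (tag, index, offset) fields.
# Geometry is illustrative: 64-byte blocks (6 offset bits) and
# 128 blocks in a direct-mapped cache (7 index bits).

OFFSET_BITS = 6    # 2^6 = 64-byte cache blocks
INDEX_BITS = 7     # 2^7 = 128 cache blocks (direct-mapped)

def decode(address):
    offset = address & ((1 << OFFSET_BITS) - 1)
    index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(decode(0x12345678))   # -> (37282, 89, 56)
```

Note that the original CPU address can always be reconstructed as tag*2^13 + index*2^6 + offset, which is why the cache must store the tag alongside each block.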

Example

The original Pentium 4 had a 4-way set associative L1 data cache of size 8 KB with 64 byte cache blocks. Hence, there are 8KB/64 = 128 cache blocks. If it's 4-way set associative, this implies 128/4=32 sets (and hence 2^5 = 32 different indices). There are 64=2^6 possible offsets. Since the CPU address is 32 bits, this implies 32=21+5+6, and hence 21 bits of tag field.

The original Pentium 4 also had an 8-way set associative L2 integrated cache of size 256 KB with 128 byte cache blocks. This implies 32=17+8+7, and hence 17 bits of tag field.
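The field-width arithmetic in both Pentium 4 examples follows one formula, which can be checked with a short sketch (function name and structure are this sketch's own):

```python
from math import log2

def field_widths(cache_size, block_size, ways, address_bits=32):
    """Return (tag, index, offset) bit widths for a given cache geometry."""
    blocks = cache_size // block_size     # total cache blocks
    sets = blocks // ways                 # one index per set
    offset_bits = int(log2(block_size))
    index_bits = int(log2(sets))
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

print(field_widths(8 * 1024, 64, 4))      # L1: (21, 5, 6)
print(field_widths(256 * 1024, 128, 8))   # L2: (17, 8, 7)
```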

State Transitions (write-back, write-allocate, direct-mapped cache)

Every cache block has associated with it at least the Modify and Valid bits, and a tag address. The Valid bit says if the cache block is used (has valid data) or is unused. The Modify bit makes sense only if the Valid bit is set. The Modify bit says whether the data in the cache block is different from RAM (modified) or is the same as RAM.

A tag is said to be a matching tag if, after decoding a CPU address from a pending read or write, the tag field of the CPU address matches the tag associated with the cache block at the index given by the CPU address.

Note that an optimized cache will take shortcuts, as compared to the state transitions below, but it will arrive at the same final state.


V = 0, read/write pending
Read from RAM into cache block;
Set Tag according to RAM address; Set V = 1; Set M = 0;
Go to V=1, M=0, tag matching, either read or write pending

V = 1, M = 0, tag matching, read pending
Read from portion of cache block to CPU; Done
V = 1, M = 0, tag matching, write pending
Write from CPU to portion of cache block; Set M = 1; Done
V = 1, M = 0, tag not matching, read/write pending
Set V = 0; Go to V = 0, read/write pending

V = 1, M = 1, tag matching, read pending
Read from portion of cache block to CPU; Done
V = 1, M = 1, tag matching, write pending
Write from CPU to portion of cache block; Done
V = 1, M = 1, tag not matching, read/write pending
Write cache block to RAM;
Set V = 0; Go to V = 0, read/write pending
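The state transitions above can be sketched as a small direct-mapped, write-back, write-allocate cache model. RAM traffic is modeled only as counters, and the class and field names are this sketch's own:

```python
# Minimal direct-mapped, write-back, write-allocate cache model following
# the state transitions above.  Data movement is modeled only as counters.

class CacheBlock:
    def __init__(self):
        self.valid = False      # V bit
        self.modified = False   # M bit
        self.tag = None

class DirectMappedCache:
    def __init__(self, num_blocks=128, block_size=64):
        self.blocks = [CacheBlock() for _ in range(num_blocks)]
        self.block_size = block_size
        self.num_blocks = num_blocks
        self.ram_reads = self.ram_writes = 0

    def access(self, address, is_write):
        index = (address // self.block_size) % self.num_blocks
        tag = address // (self.block_size * self.num_blocks)
        blk = self.blocks[index]
        if blk.valid and blk.tag != tag:   # V = 1, tag not matching
            if blk.modified:               # M = 1: write cache block to RAM
                self.ram_writes += 1
            blk.valid = False              # go to V = 0 state
        if not blk.valid:                  # V = 0: read block from RAM
            self.ram_reads += 1
            blk.tag, blk.valid, blk.modified = tag, True, False
        if is_write:                       # V = 1, tag matching, write pending
            blk.modified = True            # set M = 1
        # on a read, data simply goes from the cache block to the CPU
```

For example, with 16-byte blocks and 4 blocks, a write to address 0 followed by a read of address 64 (same index, different tag) forces a write-back of the modified block before the new block is loaded.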

Set Associativity

If the cache is n-way set associative, then a cache address (index, offset) specifies not just one cache block, but a set of n cache blocks. On a read from or write to cache, if any cache block in the set has a matching tag, then it is a cache hit, and that cache block is used. If no cache block in the set has a matching tag, then it is a cache miss, and one of the cache blocks of the set is chosen for replacement. If the chosen cache block is modified, its data is written back to RAM to make room for the new data, exactly as in the case of tag not matching in the state transitions.

Cache Policies

A write-through cache writes data through to RAM immediately on each CPU write to a cache block. A write-back cache writes the cache block back to RAM later, when the block is replaced. Most caches are write-allocate caches. This means that if the CPU attempts to write, and the cache contains no matching cache block, then the cache will first allocate (load) the matching cache block from RAM. Only after that can the CPU write to the cache. Most caches today are also non-blocking or lockup-free. This means that in case of a cache miss on write, the cache will buffer the CPU request to write to cache, and allow the CPU to continue.

If the cache is set associative and if a cache miss occurs, then the cache set replacement policy determines which cache block is chosen for replacement. A common choice is some approximation of LRU (Least Recently Used).
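One set of an n-way set associative cache with true LRU replacement can be sketched using an ordered mapping from tags to blocks; recency order is updated on each hit, and the least recently used tag is evicted on a miss. The class name and structure are this sketch's own (hardware typically uses a cheaper approximation of LRU):

```python
from collections import OrderedDict

# One set of an n-way set associative cache with true LRU replacement.
# OrderedDict keeps tags in recency order: move_to_end on a hit,
# popitem(last=False) evicts the least recently used tag on a miss.

class LRUSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.tags = OrderedDict()   # tag -> block data (omitted here)

    def access(self, tag):
        """Return True on a cache hit, False on a miss (with replacement)."""
        if tag in self.tags:
            self.tags.move_to_end(tag)     # now most recently used
            return True
        if len(self.tags) >= self.ways:
            self.tags.popitem(last=False)  # evict the LRU victim
        self.tags[tag] = None
        return False

s = LRUSet(ways=2)
print([s.access(t) for t in (1, 2, 1, 3, 2)])
# -> [False, False, True, False, False]
```

In the example, accessing tag 3 evicts tag 2 (tag 1 was more recently used), so the final access to tag 2 misses again.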

Other Concepts

CPUs today often have L1 (level one), L2, and possibly L3 caches. They are typically organized as inclusive caches. In this case, the L1 cache holds a fast copy of data for L2 cache, the L2 cache for L3, and the L3 cache holds a fast copy of data for RAM. If the cache sizes are L1: 32 KB, L2: 512 KB, L3: 4 MB, then the caches hold 4 MB of data from RAM.

In an exclusive cache, no two caches hold the same data from RAM. If the cache sizes are L1: 32 KB, L2: 512 KB, L3: 4 MB, then the caches hold 4.53 MB of data from RAM. As of this writing (2003), the AMD processors use an exclusive cache, while the Intel processors use an inclusive cache.
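The effective capacities of the two hierarchies follow directly from the cache sizes in the text:

```python
# Effective capacity of an inclusive vs. exclusive cache hierarchy,
# using the sizes from the text: L1 32 KB, L2 512 KB, L3 4 MB.

KB, MB = 1024, 1024 * 1024
sizes = [32 * KB, 512 * KB, 4 * MB]

inclusive = max(sizes)   # smaller levels only duplicate data in larger ones
exclusive = sum(sizes)   # no level duplicates another

print(inclusive / MB)             # -> 4.0
print(round(exclusive / MB, 2))   # -> 4.53
```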