CS with Seongyeol

Fast at Computing, Slow at Remembering

In the previous post, we saw how a CPU processes instructions through the Fetch-Decode-Execute cycle. In the simulator, memory access happened instantly, but that’s not how it works in reality. Main memory is hundreds of times slower than the CPU, so no matter how fast the CPU is, waiting for memory becomes the bottleneck. Making memory faster shrinks its capacity, and increasing its capacity makes it slower. You can’t have memory that’s both fast and large.

How can we solve this problem?

Desk, Drawer, Storage Room

We already solve a similar problem in everyday life. Imagine you’re studying right now. You keep the textbook you’re reading open on your desk, put occasionally referenced materials in a drawer, and store books you won’t use this semester in a storage room.

Why do we do this? If you kept all books in storage, you’d waste time fetching them every time. But you can’t put all books on your desk either — the desk is too small. So we naturally use the strategy of keeping frequently used things closer.

Could we use the same strategy in computers? That depends: does “frequently used data” really exist?

Locality

When we observe the patterns of how programs access data, we discover interesting regularities.

Temporal locality: Data that has been accessed once is likely to be accessed again soon. A typical example is reading and writing the same variable in a loop.

[Interactive simulator: memory addresses 0–7, showing repeated accesses to the same addresses]
Like a countdown program that repeatedly executes the same instructions, addresses 0-3 keep being accessed again and again.
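The countdown loop mentioned above can be sketched in a few lines. The point is that every pass re-executes the same handful of instructions and reads and writes the same variable, which is exactly temporal locality (a toy sketch, not real machine code):

```python
# Toy sketch: a countdown re-runs the same few instructions each pass,
# and touches the same variable n every iteration (temporal locality).
def countdown(n):
    while n > 0:
        n -= 1   # the same location holding n is accessed again and again
    return n

print(countdown(5))  # 0
```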

Spatial locality: Data near a recently accessed location is likely to be accessed soon. This happens when reading consecutive data from beginning to end.

[Interactive simulator: memory addresses 0–7, accessed one after another in order]

When reading sequential data, adjacent addresses are accessed one after another.
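A sequential scan over an array is the classic case of spatial locality. A minimal sketch (in a language with contiguous arrays like C, `arr[0]`, `arr[1]`, … sit next to each other in memory; a Python list is used here only to illustrate the access pattern):

```python
# Summing an array front to back accesses adjacent elements in order
# (spatial locality); the running total is also reused every iteration.
def sum_sequential(arr):
    total = 0
    for i in range(len(arr)):
        total += arr[i]   # each access is right next to the previous one
    return total

print(sum_sequential([1, 2, 3, 4, 5]))  # 15
```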

Programs don’t use all of memory evenly. They tend to access specific regions intensively. So what if we created a small, fast storage close to the CPU for frequently used data?

Cache

That storage is the cache. A cache is a small, high-speed memory located between the CPU and RAM. When the CPU requests data, it checks the cache first.

  • Cache hit: If the data is in the cache, it’s fetched immediately. Fast.
  • Cache miss: If the data isn’t in the cache, it’s fetched from RAM and copied into the cache. Slow.
[Interactive simulator: memory addresses feeding a 4-slot cache in front of RAM, which holds all addresses]

Try the following steps:

  1. Click a memory address. Since the cache is empty at first, a miss occurs.
  2. Click the same address again. This time it’s in the cache, so a hit occurs.
  3. Click 5 or more different addresses. With only 4 cache slots, the least recently accessed data gets evicted.

When the cache is full, the least recently used data is evicted to make room for new data. The proportion of hits among all accesses is called the hit rate. A higher hit rate means less time the CPU spends waiting for RAM.
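The hit, miss, and eviction behavior described above can be modeled with a small least-recently-used (LRU) cache. This is a toy sketch, not how hardware is built; the class name and the 4-slot default are illustrative choices matching the simulator:

```python
from collections import OrderedDict

class LRUCache:
    """Toy cache: on a miss, copy from RAM and evict the least recently used."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.slots = OrderedDict()   # address -> data, oldest entry first
        self.hits = self.misses = 0

    def access(self, address):
        if address in self.slots:
            self.hits += 1
            self.slots.move_to_end(address)      # mark as most recently used
            return "hit"
        self.misses += 1
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)       # evict least recently used
        self.slots[address] = f"data@{address}"  # copy from RAM into the cache
        return "miss"

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

cache = LRUCache(capacity=4)
print(cache.access(0))     # miss: the cache starts empty
print(cache.access(0))     # hit: same address again
for addr in [1, 2, 3, 4]:  # a fifth distinct address evicts address 0
    cache.access(addr)
print(cache.access(0))     # miss: 0 was the least recently used and got evicted
```

This mirrors the three steps above: first access misses, a repeat hits, and touching more distinct addresses than there are slots evicts the oldest entry.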

So how different are the hit rates between patterns with and without locality?

Locality and Hit Rate

Let’s compare two patterns with a cache size of 4.

[Interactive simulator: 16 memory accesses, each marked hit or miss, with the running hit rate]
  1. Press the “Sequential” button. The pattern repeats 0-3 twice, then 4-7 twice. Hits start occurring from the second repetition, giving a high hit rate.
  2. Press the “Random” button. Same number of accesses, but to random addresses. The hit rate drops significantly. Try pressing it multiple times to compare.

Caches are effective because of locality. Since programs access data with locality, even a small cache can achieve a high hit rate. Without locality, as with random access, the cache loses its benefit.
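The sequential-versus-random comparison is easy to reproduce with the same kind of LRU model. A sketch, assuming the simulator's 4-slot cache; the 64-address range for the random pattern is an arbitrary choice to keep repeats rare:

```python
import random
from collections import OrderedDict

def hit_rate(accesses, capacity=4):
    """Fraction of accesses served by a small LRU cache."""
    cache, hits = OrderedDict(), 0
    for addr in accesses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[addr] = True
    return hits / len(accesses)

# The "Sequential" pattern: 0-3 twice, then 4-7 twice (16 accesses).
sequential = ([0, 1, 2, 3] * 2) + ([4, 5, 6, 7] * 2)
# The "Random" pattern: 16 accesses to random addresses.
random_trace = [random.randrange(64) for _ in range(16)]

print(hit_rate(sequential))    # 0.5: the second pass over each group hits
print(hit_rate(random_trace))  # usually far lower, often 0
```

In the sequential trace, the first pass over each group of four misses and loads the cache, and the second pass hits on all four, so exactly half the accesses hit.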

But making a cache larger makes it slower. Since we can’t satisfy both size and speed, what if we split it into stages?

Memory Hierarchy

Instead of making one large cache, what if we stacked multiple levels — small and fast caches alongside larger and slower ones? Thanks to locality, most accesses end at the smallest, fastest level. Only when the data isn’t found there do we go down to the next level. This makes the average access speed faster than a single large cache.

Real computers use this approach. The levels closer to the CPU are small and fast, while those farther away are large and slow, connected in stages. This is called the memory hierarchy.

Level      Capacity               Speed
Registers  Tens of entries        1 cycle
L1 Cache   Tens of KB             ~4 cycles
L2 Cache   Hundreds of KB         ~10 cycles
L3 Cache   Tens of MB             ~40 cycles
RAM        Several to tens of GB  ~200 cycles

Higher levels are faster and smaller; lower levels are slower and larger. Each level acts as a cache for the level below it. When the CPU requests data, it checks L1 first, then L2, then L3, and finally RAM.
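That lookup order can be sketched as a running sum of latencies: each level checked adds its cost before the data is found. The cycle counts below are the rough figures from the table, not measurements:

```python
# Rough per-level latencies in cycles, loosely matching the table above.
LEVELS = [("L1", 4), ("L2", 10), ("L3", 40)]
RAM_LATENCY = 200

def access_time(found_in):
    """Total cycles to reach data residing at a given level:
    every level checked on the way down adds its latency."""
    cycles = 0
    for name, latency in LEVELS:
        cycles += latency          # pay for checking this level
        if name == found_in:
            return cycles
    return cycles + RAM_LATENCY    # missed every cache, go to RAM

print(access_time("L1"))   # 4
print(access_time("L3"))   # 4 + 10 + 40 = 54
print(access_time("RAM"))  # 4 + 10 + 40 + 200 = 254
```

This is why a high L1 hit rate matters so much: every miss at an upper level adds the next level's latency on top.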

[Interactive simulator: L1 (2 slots, ~4 cycles) → L2 (4 slots, ~10 cycles) → L3 (8 slots, ~40 cycles) → RAM (all addresses, ~200 cycles)]
  1. Click the same address repeatedly. It hits in L1 immediately.
  2. Click many different addresses. Data gets evicted from upper caches, requiring access to lower levels.

The simulator above has only 2 slots in L1, so data gets evicted quickly, but real L1 caches are tens of KB — most accesses end there. This allows the CPU to operate as if it had large, fast memory.

Below RAM are secondary storage devices like SSDs and HDDs. When access reaches this level, the latency increases to tens of thousands to hundreds of thousands of cycles.

Conclusion

In this series, starting from 0s and 1s in the first post, we learned logic gates, made selections with MUX, performed calculations with adders, created memory with latches, established computer design principles with Turing machines and von Neumann architecture, executed instructions with the CPU, and solved the speed problem with caches and the memory hierarchy. We’ve assembled a working computer from 0s and 1s.

Now that we understand computer architecture, we need ways to handle data efficiently. Performance varies depending on how you organize data and in what order you process it. The next part covers data structures and algorithms.