# Computer Architecture

## Week 12: Cache

Fenerbahçe Üniversitesi

#### Professor & TAs

- Prof: Dr. Vecdi Emre Levent, Office: 311, Email: emre.levent@fbu.edu.tr
- TA: Research Assistant Uğur Özbalkan, Office: 311, Email: ugur.ozbalkan@fbu.edu.tr

## Course Plan

- Cache

## Programs 101

#### Load/Store Architectures:

- Read data from memory (put in registers)
- Manipulate it
- Store it back to memory

#### C Code

```
#include <stdio.h>

extern int n;   /* global defined elsewhere; the assembly below loads it from memory */

int main (int argc, char* argv[]) {
    int i;
    int m = n;
    int sum = 0;

    for (i = 1; i <= m; i++) {
        sum += i;
    }
    printf ("...", n, sum);
}
```

#### RISC-V Assembly

```
main: addi sp, sp, -48
      sw   x1, 44(sp)
      sw   fp, 40(sp)
      mv   fp, sp
      sw   x10, -36(fp)
      sw   x11, -40(fp)
      la   x15, n
      lw   x15, 0(x15)
      sw   x15, -28(fp)
      sw   x0, -24(fp)
      li   x15, 1
      sw   x15, -20(fp)
L2:   lw   x14, -20(fp)
      lw   x15, -28(fp)
      blt  x15, x14, L3
```

Instructions that read from or write to memory (the `lw` and `sw` lines above) have to reach main memory, which is:

- \+ big
- − slow
- − far away from the CPU

## The Need for Speed

#### Instruction speeds:

- add, sub, shift: 1 cycle
- mult: 3 cycles
- load/store: 100 cycles (2 GHz processor → 0.5 ns clock, off-chip access 50 ns)

## What's the solution?
## **Locality Locality**

A program that asks for something is likely to ask for:

- the same thing again soon → Temporal Locality
- something near that thing, soon → Spatial Locality

```
total = 0;
for (i = 0; i < n; i++)
    total += a[i];
return total;
```

## The Memory Hierarchy

## Some Terminology

#### Cache hit

- data is in the cache
- t<sub>hit</sub>: time it takes to access the cache
- Hit rate (%hit): # cache hits / # cache accesses

#### Cache miss

- data is **not** in the cache
- t<sub>miss</sub>: time it takes to get the data from below the \$ (cache)
- Miss rate (%miss): # cache misses / # cache accesses

#### Cacheline or cacheblock or simply line or block

- Minimum unit of info that is present (or not) in the cache

## Single Core Memory Hierarchy

[Figure: single-core memory hierarchy]

## Multi-Core Memory Hierarchy

[Figure: multi-core memory hierarchy]

## Memory Hierarchy by the Numbers

CPU clock rates: ~0.33 ns to 2 ns (3 GHz to 500 MHz)

| Memory technology | Transistor count | Access time | Access time in cycles | \$ per GB in 2021 | Capacity |
|-------------------|------------------|-------------|-----------------------|-------------------|----------|
| SRAM (on chip) | 6-8 transistors | 0.5-2.5 ns | 1-3 cycles | \$4k | 256 KB |
| SRAM (off chip) | | 1.5-30 ns | 5-15 cycles | \$4k | 32 MB |
| DRAM | 1 transistor (needs refresh) | 50-70 ns | 150-200 cycles | \$10-\$20 | 8 GB |
| SSD (Flash) | | 5k-50k ns | Tens of thousands | \$0.75-\$1 | 512 GB |
| Disk | | 5M-20M ns | Millions | \$0.05-\$0.1 | 4 TB |

## Basic Cache Design

**Direct Mapped Caches**

## 16 Byte Memory

load 1100 → r1

- Byte-addressable memory
- 4 address bits → 16 bytes total
- b addr bits → 2<sup>b</sup> bytes in memory

#### **MEMORY**

This 16-byte memory is used in every simulation that follows.

| addr | data |
|------|------|
| 0000 | A |
| 0001 | B |
| 0010 | C |
| 0011 | D |
| 0100 | E |
| 0101 | F |
| 0110 | G |
| 0111 | H |
| 1000 | J |
| 1001 | K |
| 1010 | L |
| 1011 | M |
| 1100 | N |
| 1101 | O |
| 1110 | P |
| 1111 | Q |

## 4-Byte, Direct Mapped Cache

Address: XXXX, of which the low 2 bits are the index.

#### **CACHE**

| index | data |
|-------|------|
| 00 | A |
| 01 | B |
| 10 | C |
| 11 | D |

- Cache entry = row = (cache) line = (cache) block
- Block size: 1 byte

#### **Direct mapped:**

- Each address maps to 1 cache block
- 4 entries → 2 index bits (2<sup>n</sup> entries → n bits)

## 4-Byte, Direct Mapped Cache: adding a tag

Address: XXXX = tag | index

#### **CACHE**

| index | tag | data |
|-------|-----|------|
| 00 | 00 | A |
| 01 | 00 | B |
| 10 | 00 | C |
| 11 | 00 | D |

- Tag: minimalist label/address
- address = tag + index

## 4-Byte, Direct Mapped Cache: adding a valid bit

One last tweak: valid bit.

#### **CACHE**

| index | V | tag | data |
|-------|---|-----|------|
| 00 | 0 | 00 | X |
| 01 | 0 | 00 | X |
| 10 | 0 | 00 | X |
| 11 | 0 | 00 | X |

## Simulation #1 of a 4-byte, DM Cache

Address: tag (XX) | index (XX)

#### Lookup:

- Index into \$
- Check tag
- Check valid bit

| access | result |
|--------|--------|
| load 1100 | Miss |
| load 1100 | Hit. Awesome! |

#### **CACHE** (after the first load)

| index | V | tag | data |
|-------|---|-----|------|
| 00 | 1 | 11 | N |
| 01 | 0 | XX | X |
| 10 | 0 | XX | X |
| 11 | 0 | XX | X |

## Block Diagram 4-entry, direct mapped Cache

[Figure: block diagram of a 4-entry, direct mapped cache]

## Simulation #2: 4-byte, DM Cache

Address: tag (XX) | index (XX). The cache starts empty (all valid bits 0), and each access follows the same lookup steps: index into \$, check tag, check valid bit.

| access | index | tag | result | cache line after the access |
|--------|-------|-----|--------|-----------------------------|
| load 1100 | 00 | 11 | Miss | line 00: V = 1, tag 11, data N |
| load 1101 | 01 | 11 | Miss | line 01: V = 1, tag 11, data O |
| load 0100 | 00 | 01 | Miss | line 00: V = 1, tag 01, data E (N is evicted) |
| load 1100 | 00 | 11 | Miss | line 00: V = 1, tag 11, data N (E is evicted) |

Four accesses, four misses. Disappointed.

#### **CACHE** (final state)

| index | V | tag | data |
|-------|---|-----|------|
| 00 | 1 | 11 | N |
| 01 | 1 | 11 | O |
| 10 | 0 | 11 | X |
| 11 | 0 | 11 | X |

# Reducing Misses by Increasing Block Size

Leveraging Spatial Locality

## **Increasing Block Size**

Address: XXXX, of which the least significant bit is the block offset.

#### **CACHE**

| V | tag | data |
|---|-----|-------|
| 0 | X | A \| B |
| 0 | X | C \| D |
| 0 | X | E \| F |
| 0 | X | G \| H |

- Block Size: 2 bytes
- Block Offset: least significant bits indicate where you live in the block
- Which bits are the index? tag?
# Simulation #3: 8-byte, DM Cache

Same 16-byte memory as before. The cache now has 4 lines of 2 bytes each, so the address splits into tag (1 bit) | index (2 bits) | offset (1 bit).

### Lookup:

- Index into \$
- Check tag

| access | tag | index | offset | result |
|--------|-----|-------|--------|--------|
| load 1100 | 1 | 10 | 0 | Miss (the block N \| O is brought in) |
| load 1101 | 1 | 10 | 1 | Hit (same block, other byte) |
| load 0100 | 0 | 10 | 0 | Miss (same index, different tag: E \| F replaces N \| O) |
| load 1100 | 1 | 10 | 0 | Miss (N \| O was just evicted) |

### **CACHE** (after the first two loads)

| index | V | tag | data |
|-------|---|-----|-------|
| 00 | 0 | X | X \| X |
| 01 | 0 | X | X \| X |
| 10 | 1 | 1 | N \| O |
| 11 | 0 | X | X \| X |
Dr. V. E. Levent Computer Architecture

# Removing Conflict Misses with Fully-Associative Caches

# 8 byte, fully-associative Cache

| line | V | tag | data |
|------|---|-----|-------|
| 0 | 0 | XXX | X \| X |
| 1 | 0 | XXX | X \| X |
| 2 | 0 | XXX | X \| X |
| 3 | 0 | XXX | X \| X |

- What should the **offset** be?
- What should the **index** be?
- What should the **tag** be?

# Simulation #4: 8-byte, FA Cache

Address: XXXX = tag | offset

### Lookup:

- Index into \$ (not applicable here: the address has no index bits)
- Check tags
- Check valid bits

| access | result |
|--------|--------|
| load 1100 | Miss |
| load 1101 | Hit! |
| load 0100 | Miss |
| load 1100 | Hit |

### **CACHE** (final state)

| line | V | tag | data |
|------|---|-----|-------|
| 0 | 1 | 110 | N \| O |
| 1 | 1 | 010 | E \| F |
| 2 | 0 | XXX | X \| X |
| 3 | 0 | XXX | X \| X |

# Pros and Cons of Full Associativity

- \+ No more conflicts!
- \+ Excellent utilization!

But either:

- **Parallel Reads**: lots of reading!
- **Serial Reads**: lots of waiting

# Pros & Cons

| | Direct Mapped | Fully Associative |
|----------------------|---------------|-------------------|
| Tag Size | Smaller | Larger |
| SRAM Overhead | Less | More |
| Controller Logic | Less | More |
| Speed | Faster | Slower |
| Price | Less | More |
| Scalability | Very | Not Very |
| # of conflict misses | Lots | Zero |
| Hit Rate | Low | High |

# Reducing Conflict Misses with Set-Associative Caches

Not too conflict. Not too slow. ... Just Right!
# 8 byte, 2-way set associative Cache

| V | tag | data |
|---|-----|-------|
| 0 | XX | E \| F |
| 0 | XX | C \| D |

| V | tag | data |
|---|-----|-------|
| 0 | XX | N \| O |
| 0 | XX | P \| Q |

- What should the **offset** be?
- What should the **index** be?
- What should the **tag** be?

# 24 byte, 3-way set associative Cache

- 5 bit address
- 2 byte block size
- 24 byte, 3-way set associative cache

| V | tag | data | V | tag | data | V | tag | data |
|---|-----|-------|---|-----|---------|---|-----|-----------|
| 0 | ? | X \| Y | 0 | ? | X' \| Y' | 0 | ? | X'' \| Y'' |
| 0 | ? | X \| Y | 0 | ? | X' \| Y' | 0 | ? | X'' \| Y'' |
| 0 | ? | X \| Y | 0 | ? | X' \| Y' | 0 | ? | X'' \| Y'' |
| 0 | ? | X \| Y | 0 | ? | X' \| Y' | 0 | ? | X'' \| Y'' |

How many tag bits?

- A) 0
- B) 1
- C) 2
- D) 3
- E) 4

Answer: 2 tag bits. The cache has 24 / 2 = 12 lines in 3 ways, so 4 sets → 2 index bits; 2-byte blocks → 1 offset bit; 5 − 2 − 1 = 2 tag bits.

### Eviction Policies

Which cache line should be evicted from the cache to make room for a new line?
- Direct-mapped: no choice, must evict the line selected by the index
- Associative caches:
  - Random: select one of the lines at random
  - Round-Robin: similar to random
  - FIFO: replace the oldest line
  - LRU: replace the line that has not been used for the longest time

### Misses: the Three C's

- Cold (compulsory) Miss: never seen this address before
- Conflict Miss: cache associativity is too low
- Capacity Miss: cache is too small

## Miss Rate vs. Block Size

[Figure: miss rate vs. block size]

### Block Size Tradeoffs

- For a given total cache size, larger block sizes mean...
  - fewer lines
  - so fewer tags, less overhead
  - and fewer cold misses (within-block "prefetching")
- But also...
  - fewer blocks available (for scattered accesses!)
  - so more conflicts
  - can decrease performance if the working set can't fit in \$
  - and a larger miss penalty (time to fetch a block)

# Miss Rate vs. Associativity

[Figure: miss rate vs. associativity]

# Which caches get what properties?

[Figure: which caches get what properties]

# 2-Way Set Associative Cache (Reading)

# 3-Way Set Associative Cache (Reading)

# Performance Calculation with \$ Hierarchy

### Parameters

$$t_{avg} = t_{hit} + \%_{miss} * t_{miss}$$

- Reference stream: all loads
- D\$: $t_{hit} = 1 \text{ns}$, $\%_{miss} = 5\%$
- L2: $t_{hit} = 10 \text{ns}$, $\%_{miss} = 20\%$ (local miss rate)
- Main memory: $t_{hit} = 50 \text{ns}$

## What is t<sub>avgD\$</sub> without an L2?

- $t_{missD\$} = t_{hitM}$
- $t_{avgD\$} = t_{hitD\$} + \%_{missD\$} * t_{hitM} = 1 \text{ns} + (0.05 * 50 \text{ns}) = 3.5 \text{ns}$

## What is t<sub>avgD\$</sub> with an L2?
- $t_{missD\$} = t_{avgL2}$
- $t_{avgL2} = t_{hitL2} + \%_{missL2} * t_{hitM} = 10 \text{ns} + (0.2 * 50 \text{ns}) = 20 \text{ns}$
- $t_{avgD\$} = t_{hitD\$} + \%_{missD\$} * t_{avgL2} = 1 \text{ns} + (0.05 * 20 \text{ns}) = 2 \text{ns}$

# **Performance Summary**

### Average memory access time (AMAT) depends on:

- Cache architecture and size
- Hit and miss rates
- Access times and miss penalty

### Cache design is a very complex problem:

- Cache size, block size (aka line size)
- Number of ways of set-associativity (1, N, ∞)
- Eviction policy
- Number of levels of caching, parameters for each
- Separate I-cache from D-cache, or unified cache
- Prefetching policies / instructions
- Write policy

# Takeaway

- Direct Mapped → fast, but low hit rate
- Fully Associative → higher hit cost, higher hit rate
- Set Associative → middle ground

Cache performance is measured by the average memory access time (AMAT), which depends on the cache architecture and size, but also on the hit access time, the miss penalty, and the hit rate.

### What about Stores?

We want to write to the cache.

If the data is not in the cache? Bring it in. (Write-allocate policy)

Should we also update memory?

- Yes: write-through policy
- No: write-back policy

### Write-Through Example

- 16 byte, byte-addressed memory
- 4 byte, fully-associative cache: 2-byte blocks, write-allocate
- 4 bit addresses: 3 bit tag, 1 bit offset
- Initial state: both cache lines invalid (V = 0); Misses: 0, Hits: 0, Reads: 0, Writes: 0
- Initial memory, M[0] to M[15]: 78, 29, 120, 123, 71, 150, 162, 173, 18, 21, 33, 28, 19, 200, 210, 225

The counters are cumulative; Reads and Writes count bytes moved to or from memory.

| REF | Instruction | Result | Misses | Hits | Reads | Writes |
|-----|-------------|--------|--------|------|-------|--------|
| 1 | LB x1 ← M[1] | Miss | 1 | 0 | 2 | 0 |
| 2 | LB x2 ← M[7] | Miss | 2 | 0 | 4 | 0 |
| 3 | SB x2 → M[0] | Hit | 2 | 1 | 4 | 1 |
| 4 | SB x1 → M[5] | Miss | 3 | 1 | 6 | 2 |
| 5 | LB x2 ← M[10] | Miss | 4 | 1 | 8 | 2 |
| 6 | SB x1 → M[5] | Hit | 4 | 2 | 8 | 3 |
| 7 | SB x1 → M[10] | Hit | 4 | 3 | 8 | 4 |

Register file: x1 = 29 after REF 1; x2 = 173 after REF 2 and 33 after REF 5.

## Summary: Write Through

Write-through policy with write allocate

- Cache miss: read entire block from memory
- Write: write only updated item to memory
- Eviction: no need to write to memory

# Next Goal: Write-Through vs. Write-Back

#### What if we DON'T write stores immediately to memory?

- Keep the current copy in cache, and update memory when data is evicted (write-back policy)
- Write back all evicted lines?
  - No, only written-to blocks

## Write-Back Meta-Data (Valid, Dirty Bits)

| V | D | Tag | Byte 1 | Byte 2 | … | Byte N |
|---|---|-----|--------|--------|---|--------|
| | | | | | | |

- V = 1 means the line has valid data
- D = 1 means the bytes are newer than main memory
- When allocating line:
  - Set V = 1, D = 0, fill in Tag and Data
- When writing line:
  - Set D = 1
- When evicting line:
  - If D = 0: just set V = 0
  - If D = 1: write-back Data, then set D = 0, V = 0

### Write-back Example

- Example: How does a write-back cache work?
- Assume write-allocate
- 16 byte, byte-addressed memory
- 4 byte, fully-associative cache: 2-byte blocks, write-allocate
- 4 bit addresses: 3 bit tag, 1 bit offset
- Initial state: both cache lines invalid; Misses: 0, Hits: 0, Reads: 0, Writes: 0

#### Memory

| addr | value | addr | value |
|------|-------|------|-------|
| 0 | 78 | 8 | 18 |
| 1 | 29 | 9 | 21 |
| 2 | 120 | 10 | 33 |
| 3 | 123 | 11 | 28 |
| 4 | 71 | 12 | 19 |
| 5 | 150 | 13 | 200 |
| 6 | 162 | 14 | 210 |
| 7 | 173 | 15 | 225 |

#### Write-Back (REF 1 to REF 7)

The counters are cumulative; Reads and Writes count bytes moved to or from memory.

| REF | Instruction | Result | Misses | Hits | Reads | Writes | Note |
|-----|-------------|--------|--------|------|-------|--------|------|
| 1 | LB x1 ← M[1] | Miss | 1 | 0 | 2 | 0 | x1 = 29 |
| 2 | LB x2 ← M[7] | Miss | 2 | 0 | 4 | 0 | x2 = 173 |
| 3 | SB x2 → M[0] | Hit | 2 | 1 | 4 | 0 | line becomes dirty |
| 4 | SB x1 → M[5] | Miss | 3 | 1 | 6 | 0 | evicts a clean line, no write-back |
| 5 | LB x2 ← M[10] | Miss | 4 | 1 | 8 | 2 | evicts a dirty line, 2 bytes written back |
| 6 | SB x1 → M[5] | Hit | 4 | 2 | 8 | 2 | |
| 7 | SB x1 → M[10] | Hit | 4 | 3 | 8 | 2 | line becomes dirty |

Further stores to M[5] and M[10] keep hitting in the cache without touching memory. Cheap subsequent updates!

#### Cache (final state)

| lru | V | d | tag | data |
|-----|---|---|-----|------|
| 0 | 1 | 1 | 101 | 29 \| 28 |
| 1 | 1 | 1 | 010 | 71 \| 29 |

### How Many Memory References?

#### Write-back performance

- How many reads?
  - Each miss (read or write) reads a block from mem
  - 4 misses → 8 mem reads
- How many writes?
  - Some evictions write a block to mem
  - 1 dirty eviction → 2 mem writes

## Write-back vs. Write-through Example

Assume: large associative cache, 16-byte lines, n 4-byte words

Case 1 (all writes land in one cache line):

- Write-thru: n reads (n/4 cache lines), n writes
- Write-back: n reads (n/4 cache lines), 4 writes (one cache line)

Case 2 (writes spread over n/4 cache lines):

- Write-thru: n reads (n/4 cache lines), n writes
- Write-back: n reads (n/4 cache lines), n writes (n/4 cache lines)

### So is write back just better?

Short Answer: Yes (fewer writes is a good thing)

Long Answer: It's complicated.

- Evictions require the entire line to be written back to memory (vs. just the data that was written)
- Write-back can lead to incoherent caches on multi-core processors

## Optimization: Write Buffering

- Q: Writes to main memory are slow!
- A: Use a write-back buffer
  - A small queue holding dirty lines
  - Add to end upon eviction
  - Remove from front upon completion
- Q: When does it help?
- A: short bursts of writes (but not sustained writes)
- A: fast eviction reduces miss penalty

#### Write-through vs. Write-back

- Write-through is slower
  - But simpler (memory always consistent)
- Write-back is almost always faster
  - write-back buffer hides large eviction cost
  - But what about multiple cores with separate caches but sharing memory?
- Write-back requires a cache coherency protocol
  - Inconsistent views of memory
  - Need to "snoop" in each other's caches
  - Extremely complex protocols, very hard to get right

### Cache-coherency

Q: Multiple readers and writers?

A: Potentially inconsistent views of memory

#### Cache coherency protocol

- May need to **snoop** on other CPU's cache activity
- **Invalidate** cache line when other CPU writes
- **Flush** write-back caches before other CPU reads
- Or the reverse: Before writing/reading...
- Extremely complex protocols, very hard to get right

### Takeaway

- Write-through policy with write allocate
  - Cache miss: read entire block from memory
  - Write: write only updated item to memory
  - Eviction: no need to write to memory
  - Slower, but cleaner
- Write-back policy with write allocate
  - Cache miss: read entire block from memory
    - **But may need to write dirty cacheline first**
  - Write: nothing to memory
  - Eviction: have to write the entire cacheline to memory because we don't know what is dirty (only 1 dirty bit)
  - Faster, but more complicated, especially with multicore

```
// H = 6, W = 10
int A[H][W];

for(x=0; x < W; x++)
    for(y=0; y < H; y++)
        sum += A[y][x];
```

Every access a cache miss!
(unless entire matrix fits in cache)

### **Cache Conscious Programming**

When the access order follows the memory layout (elements 1, 2, 3, ... in consecutive locations), each fetched block serves several accesses:

- Block size = 4 → 75% hit rate
- Block size = 8 → 87.5% hit rate
- Block size = 16 → 93.75% hit rate

And you can easily prefetch to warm the cache.

## A Real Example

- Dual 32K L1 Instruction caches
  - 8-way set associative
  - 64 sets
  - 64 byte line size
- Dual 32K L1 Data caches
  - Same as above
- Single 6M L2 Unified cache
  - 24-way set associative (!!!)
  - 4096 sets
  - 64 byte line size
- 4GB Main memory
- 1TB Disk