This is the 3rd part of my investigation into local LLM inference speed. Here are the 1st and 2nd parts.
The speed of LLM inference is memory-bound. But what exactly does this mean? Is there a difference between standard JEDEC 4800MT/s and faster 6000MT/s XMP DDR5 sticks? Let's find out.
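One common way to read "memory-bound" is as a rule of thumb: to generate one token, the weights have to be streamed from RAM roughly once, so tokens per second cannot exceed memory bandwidth divided by model size. A minimal sketch of that estimate, using the 6000MT/s AIDA read speed and the model sizes reported further down. The function name is hypothetical, and the unit conventions (MB = 10^6 bytes, GB = 10^9 bytes) are assumptions, as is the simplification that the KV cache and activations add no traffic.

```python
def bandwidth_bound_tps(read_mb_per_s: float, model_gb: float) -> float:
    """Rough upper bound on tokens/s if each token streams all weights from RAM once."""
    bytes_per_second = read_mb_per_s * 1e6  # assumption: AIDA "MB" means 10^6 bytes
    model_bytes = model_gb * 1e9            # assumption: "GB" means 10^9 bytes
    return bytes_per_second / model_bytes


READ_6000 = 87342.67  # AIDA read speed at 6000MT/s, MB/s (from the results tables)

# (model, size in GB, measured TPS at 6000MT/s), all taken from this post
runs = [
    ("Mistral 7B Q6_K", 5.94, 11.34),
    ("Llama 3.1 8B F16", 16.07, 4.74),
]

for name, size_gb, measured in runs:
    bound = bandwidth_bound_tps(READ_6000, size_gb)
    print(f"{name}: bound {bound:.2f} TPS, measured {measured:.2f} TPS "
          f"({measured / bound:.0%} of the bound)")
```

Under these assumptions both models land below the bound, at roughly 77% and 87% of it, which is consistent with generation speed being limited mostly by how fast the weights can be read.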
Test Environment
OS | Windows 11 23H2 (22631.4371) |
LLM Inference | LM Studio 0.3.4 (Build 3); 12 threads were used when testing 100% CPU off-load, Flash Attention was enabled when testing 100% GPU off-load |
CPU | Intel Core i5 13600KF overclocked (performance core multipliers 57x, 56x, 54x, 53x and 2 cores at 54x vs stock multipliers of 51x) |
RAM | DDR5 G.Skill 6000MT/s 36-36-36-96, 2x32GB and 2x16GB* |
Motherboard | Z790 PG Lightning |
GPU | RTX 4090 24GB VRAM, overclocked (+1440MHz mem frequency, +150MHz core) and power limited to 84% (~390W) |
*I ran a few tests with 2x16GB plus 2x32GB installed (96GB total). Due to CPU/MB limitations, XMP frequencies could not be reached with all 4 slots occupied; the max stable frequency was 4800MT/s with timings 29-30-30-76. Most of the tests used the 2x32GB config.
Models
- Mistral 7B: 6 bit Q6_K, 5.94GB, mistral-7b-finetuned-orca-dpo-v2-Mistral-7B-Instruct-v0.2-slerp-GGUF, used with 32K context
- Llama 3.1 8B: 16 bit, 16.07GB, meta-llama-3.1-8b-instruct.f16.gguf, used with 32K context (instead of the supported 128K) to avoid VRAM overflows when measuring GPU for comparison
Results
Bumping DDR5 speed from 4800MT/s to 6000MT/s brought a +20.3% and +23.0% generation speedup (Mistral and Llama respectively).
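The speedup is just the ratio of tokens per second between two rows of the tables below. A small sketch of that arithmetic, using the rounded TPS values from the tables; since the text does not say which 4800MT/s row is the baseline, both are shown (the helper name is hypothetical, and rounding in the tables means the output may differ slightly from the quoted figures).

```python
def speedup_pct(baseline_tps: float, new_tps: float) -> float:
    """Percentage gain of new_tps over baseline_tps."""
    return (new_tps / baseline_tps - 1.0) * 100.0


# Rounded TPS values copied from the results tables
tps = {
    "Mistral 7B": {"4800 (4 sticks)": 9.42, "4800 (2 sticks)": 9.66, "6000 (2 sticks)": 11.34},
    "Llama 3.1 8B": {"4800 (4 sticks)": 3.86, "4800 (2 sticks)": 4.00, "6000 (2 sticks)": 4.74},
}

for model, rows in tps.items():
    new = rows["6000 (2 sticks)"]
    for baseline in ("4800 (4 sticks)", "4800 (2 sticks)"):
        gain = speedup_pct(rows[baseline], new)
        print(f"{model}: {baseline} -> 6000 (2 sticks): +{gain:.1f}%")
```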
Mistral 7B
DDR5 | TTFT (Cold), s | TTFT (Warm), s | TPS | READ, MB/s | WRITE, MB/s | COPY, MB/s | Latency, ns |
---|---|---|---|---|---|---|---|
4800 (4 sticks, 96GB) | 0,89 | 0,11 | 9,42 | 69019,00 | 68482,00 | 69815,67 | 76,93 |
4800 (2 sticks, 64GB) | 0,88 | 0,11 | 9,66 | 71032,67 | 71582,67 | 72058,00 | 77,70 |
6000 (2 sticks, 64GB) | 0,66 | 0,09 | 11,34 | 87342,67 | 85591,00 | 85535,33 | 70,43 |
6200 (2 sticks, 64GB) | 0,84 | 0,09 | 11,93 | 90268,00 | 88714,00 | 88178,67 | 68,57 |
Correl (vs TPS) | | | | 0,99600 | 0,99640 | 0,99644 | -0,98861 |
R^2 | | | | 0,99202 | 0,99282 | 0,99290 | 0,97734 |
Llama 3.1
DDR5 | TTFT (Cold), s | TTFT (Warm), s | TPS | READ, MB/s | WRITE, MB/s | COPY, MB/s | Latency, ns |
---|---|---|---|---|---|---|---|
4800 (4 sticks, 96GB) | 2,46 | 0,30 | 3,86 | 69019,00 | 68482,00 | 69815,67 | 76,93 |
4800 (2 sticks, 64GB) | 2,38 | 0,26 | 4,00 | 71032,67 | 71582,67 | 72058,00 | 77,70 |
6000 (2 sticks, 64GB) | 2,78 | 0,22 | 4,74 | 87342,67 | 85591,00 | 85535,33 | 70,43 |
6200 (2 sticks, 64GB) | 2,73 | 0,21 | 4,87 | 90268,00 | 88714,00 | 88178,67 | 68,57 |
Correl (vs TPS) | | | | 0,99924 | 0,99969 | 0,99983 | -0,98161 |
R^2 | | | | 0,99849 | 0,99939 | 0,99966 | 0,96356 |
- Faster DDR5 means faster generation speed
- There's a STRONG linear correlation between tokens per second and AIDA-reported memory speeds (in my case read, write, and copy speeds were also correlated with each other, so the data can't say whether any particular metric is more important)
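The Correl and R^2 rows can be reproduced approximately with a plain Pearson correlation between the TPS column and each AIDA column. A minimal sketch for the Mistral 7B table, using the rounded values shown there (so the last digits may differ from the table); the `pearson` helper is written out by hand here rather than taken from any particular tool. Keep in mind that there are only four data points per model.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Mistral 7B rows: 4800 (4 sticks), 4800 (2 sticks), 6000, 6200
tps = [9.42, 9.66, 11.34, 11.93]
aida = {
    "READ, MB/s": [69019.00, 71032.67, 87342.67, 90268.00],
    "WRITE, MB/s": [68482.00, 71582.67, 85591.00, 88714.00],
    "COPY, MB/s": [69815.67, 72058.00, 85535.33, 88178.67],
    "Latency, ns": [76.93, 77.70, 70.43, 68.57],
}

for metric, values in aida.items():
    r = pearson(tps, values)
    print(f"{metric}: r = {r:.5f}, R^2 = {r * r:.5f}")
```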
Do Cores/Threads Matter?
Not that much. You might be better off with fewer/slower cores yet faster memory:
Threads | TPS | % of 6-thread TPS |
---|---|---|
1 | 3,18 | |
2 | 5,46 | |
3 | 7,70 | 73,0% |
4 | 9,42 | |
5 | 10,3 | |
6 | 10,55 | 100% |
8 | 10,83 | |
10 | 11,04 | |
12 | 11,35 | 107,58% |
3 cores/threads delivered 73% of the 6 cores/threads result. 12 threads (which relied on Hyper-Threading rather than on more physical cores) brought an additional 7.6% boost over the 6-core baseline.
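The percentages above are each row's TPS divided by the 6-thread TPS. The sketch below recomputes them for every row and also adds a per-thread scaling efficiency (TPS divided by threads times the single-thread TPS), which is an extra metric introduced here for illustration, not something measured in the post.

```python
# TPS per thread count, copied from the table above
tps_by_threads = {1: 3.18, 2: 5.46, 3: 7.70, 4: 9.42, 5: 10.3,
                  6: 10.55, 8: 10.83, 10: 11.04, 12: 11.35}

BASELINE_THREADS = 6  # the row the percentages in the table refer to


def relative_to_baseline(threads: int) -> float:
    """TPS at the given thread count as a percentage of the 6-thread TPS."""
    return tps_by_threads[threads] / tps_by_threads[BASELINE_THREADS] * 100.0


def scaling_efficiency(threads: int) -> float:
    """Fraction of ideal linear scaling from 1 thread (illustrative extra metric)."""
    return tps_by_threads[threads] / (threads * tps_by_threads[1])


for threads in tps_by_threads:
    print(f"{threads:>2} threads: {relative_to_baseline(threads):6.2f}% of 6-thread TPS, "
          f"scaling efficiency {scaling_efficiency(threads):.0%}")
```

The efficiency column makes the flattening visible: each extra thread adds less once the memory bus is already busy.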
CPU vs GPU
For reference, here's a comparison of the 6200MT/s CPU results with the RTX 4090 GPU:
Model | CPU TPS | GPU TPS |
---|---|---|
Mistral 7B | 11,93 | 112,23 |
Llama 3.1 8B | 4,87 | 55,46 |
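To put the gap into a single number, the GPU-to-CPU ratio can be computed directly from the table. A tiny sketch (the helper name is hypothetical):

```python
# (CPU TPS at 6200MT/s, GPU TPS on the RTX 4090), copied from the table above
results = {
    "Mistral 7B": (11.93, 112.23),
    "Llama 3.1 8B": (4.87, 55.46),
}


def gpu_speedup(cpu_tps: float, gpu_tps: float) -> float:
    """How many times faster the GPU generates tokens than the CPU."""
    return gpu_tps / cpu_tps


for model, (cpu_tps, gpu_tps) in results.items():
    print(f"{model}: GPU is {gpu_speedup(cpu_tps, gpu_tps):.1f}x faster than CPU")
```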
Approach, Notes
- After changing the memory config I ran AIDA Memory Tests 3 times and averaged them in the final table
- For each model I used the same dialog, regenerating the last message ("Tell me about Mars") every time
- Recorded 4 results for each model and averaged them
- TTFT Cold - time to first token during the first generation right after the model was loaded
- TTFT Warm - time to the first token in subsequent generations
- I actually did only 2 measurements of Llama 3.1 at 6200 and got exhausted waiting for the results; anyway, they barely fluctuated
- The 4-stick configuration is slower than the 2-stick configuration even with the same speed and timings. Additionally, on consumer hardware, you are unlikely to get any speeds above 4800MT/s with 4 sticks due to MB and CPU memory controller limitations. Always try using 2 slots.
- 6200 was an unstable OC; it failed the OCCT stress test