This is the 3rd part of my investigation into local LLM inference speed. Here are the 1st and 2nd parts.

The speed of LLM inference is memory-bound. But what exactly does this mean? Is there a difference between standard JEDEC 4800MT/s and faster 6000MT/s XMP DDR5 sticks? Let's find out.

Test Environment


OS	Windows 11 23H2 (22631.4371)
LLM Inference	LM Studio 0.3.4 (Build 3). For 100% CPU offload tests 12 threads were used; for 100% GPU offload tests Flash Attention was enabled
CPU	Intel Core i5 13600KF overclocked (performance core multipliers 57x, 56x, 54x, 53x and 2 cores at 54x vs stock multipliers of 51x)
RAM	DDR5 G.Skill 6000MT/s 36-36-36-96, 2x32GB and 2x16GB*
Motherboard	Z790 PG Lightning
GPU	RTX 4090 24GB VRAM, overclocked (+1440MHz mem frequency, +150MHz core) and power limited to 84% (~390W)

*A few tests used all four sticks (2x16GB plus 2x32GB, 96GB total). Due to CPU/MB limitations, XMP frequencies were not achievable with all 4 slots occupied; the max stable frequency was 4800MT/s with timings 29-30-30-76. Most of the tests used the 2x32GB config.

Models

Mistral 7B: 6 bit Q6_K, 5.94GB, mistral-7b-finetuned-orca-dpo-v2-Mistral-7B-Instruct-v0.2-slerp-GGUF, used with 32K context
Llama 3.1 8B: 16 bit, 16.07GB, meta-llama-3.1-8b-instruct.f16.gguf, used with 32K context (instead of supported 128K) to avoid VRAM overflows when measuring GPU for comparison

Results

Bumping DDR5 speed from 4800MT/s to 6000MT/s brought +20.3% and +23.0% generation speedup (Mistral and Llama respectively).

Mistral 7B

DDR5	TTFT (Cold), s	TTFT (Warm), s	TPS	READ, MB/s	WRITE, MB/s	COPY, MB/s	Latency, ns
4800 (4 sticks, 96GB)	0,89	0,11	9,42	69019,00	68482,00	69815,67	76,93
4800 (2 sticks, 64GB)	0,88	0,11	9,66	71032,67	71582,67	72058,00	77,70
6000 (2 sticks, 64GB)	0,66	0,09	11,34	87342,67	85591,00	85535,33	70,43
6200 (2 sticks, 64GB)	0,84	0,09	11,93	90268,00	88714,00	88178,67	68,57

			Correl	0,99600	0,99640	0,99644	-0,98861
			R^2	0,99202	0,99282	0,99290	0,97734

Llama 3.1

DDR5	TTFT (Cold), s	TTFT (Warm), s	TPS	READ, MB/s	WRITE, MB/s	COPY, MB/s	Latency, ns
4800 (4 sticks, 96GB)	2,46	0,30	3,86	69019,00	68482,00	69815,67	76,93
4800 (2 sticks, 64GB)	2,38	0,26	4,00	71032,67	71582,67	72058,00	77,70
6000 (2 sticks, 64GB)	2,78	0,22	4,74	87342,67	85591,00	85535,33	70,43
6200 (2 sticks, 64GB)	2,73	0,21	4,87	90268,00	88714,00	88178,67	68,57

			Correl	0,99924	0,99969	0,99983	-0,98161
			R^2	0,99849	0,99939	0,99966	0,96356
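
For anyone who wants to re-check the Correl and R^2 rows, here is a minimal sketch that computes the Pearson correlation between TPS and two of the AIDA columns for the Mistral 7B rows. The variable and function names are mine, and the inputs are the rounded values from the table, so the last digits may differ slightly from the rows above.

```python
from math import sqrt

# Mistral 7B rows from the table above (decimal commas converted to points).
tps = [9.42, 9.66, 11.34, 11.93]
read_mb_s = [69019.00, 71032.67, 87342.67, 90268.00]
latency_ns = [76.93, 77.70, 70.43, 68.57]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r_read = pearson(tps, read_mb_s)
r_latency = pearson(tps, latency_ns)
print(f"TPS vs READ:    r = {r_read:.5f}, R^2 = {r_read ** 2:.5f}")
print(f"TPS vs latency: r = {r_latency:.5f}, R^2 = {r_latency ** 2:.5f}")
```

With only four data points per model, the coefficients describe this particular setup and should not be read as a general law.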

Faster DDR5 means faster generation speed
There's a STRONG linear correlation between tokens per second and AIDA-reported memory speeds. In my case read, write, and copy speeds were also correlated with each other, so the data can't say whether any one of them matters more.
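
One rough way to see what "memory-bound" means here: if generating one token requires streaming all model weights from RAM once, then tokens per second cannot exceed read bandwidth divided by model size. The sketch below applies that rule of thumb to the 6000MT/s rows. It is my own back-of-the-envelope assumption, not something I measured: it ignores KV-cache traffic and compute time, and treats 1 GB as 1000 MB.

```python
# Assumption: each generated token reads every weight from RAM exactly once,
# so TPS <= read bandwidth / model size. KV cache and compute are ignored.
MODEL_SIZE_GB = {"Mistral 7B Q6_K": 5.94, "Llama 3.1 8B F16": 16.07}

# AIDA read speed and measured TPS at 6000MT/s, copied from the tables above.
READ_MB_S = 87342.67
MEASURED_TPS = {"Mistral 7B Q6_K": 11.34, "Llama 3.1 8B F16": 4.74}


def tps_ceiling(read_mb_s, model_size_gb):
    """Upper bound on tokens/s under the one-pass-per-token assumption."""
    return read_mb_s / (model_size_gb * 1000)  # assumes 1 GB = 1000 MB


for name, size_gb in MODEL_SIZE_GB.items():
    ceiling = tps_ceiling(READ_MB_S, size_gb)
    measured = MEASURED_TPS[name]
    print(f"{name}: ceiling {ceiling:.2f} TPS, measured {measured:.2f} TPS "
          f"({measured / ceiling:.0%} of ceiling)")
```

Under these assumptions the measured speeds sit below the ceiling for both models, which fits with generation speed scaling with memory bandwidth.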

Do Cores/Threads Matter

Not that much. You might be better off with fewer/slower cores yet faster memory:

Threads	TPS
1	3,18
2	5,46
3	7,70	73,0%
4	9,42
5	10,3
6	10,55
8	10,83
10	11,04
12	11,35	107,58%

3 cores/threads delivered 73% of the 6 cores/threads speed. 12 threads (the extra ones relied on hyper-threading rather than on more physical cores) brought an additional 7.6% boost over the 6-core baseline.
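
The percentages in the table are each row's TPS relative to the 6-thread result. A small sketch of that arithmetic, using the values from the table (names are mine):

```python
# Threads -> TPS, copied from the table above.
TPS_BY_THREADS = {1: 3.18, 2: 5.46, 3: 7.70, 4: 9.42, 5: 10.3,
                  6: 10.55, 8: 10.83, 10: 11.04, 12: 11.35}
BASELINE_THREADS = 6

# Each configuration's speed as a percentage of the 6-thread baseline.
relative_pct = {
    threads: tps / TPS_BY_THREADS[BASELINE_THREADS] * 100
    for threads, tps in TPS_BY_THREADS.items()
}

for threads, pct in relative_pct.items():
    print(f"{threads:>2} threads: {TPS_BY_THREADS[threads]:5.2f} TPS, "
          f"{pct:6.2f}% of {BASELINE_THREADS}-thread speed")
```

Going from 6 to 12 threads doubles the thread count for a gain of under 8%, which is why memory speed looks like the better place to spend money.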

CPU vs GPU

For reference here's the comparison of 6200MT/s CPU results to RTX 4090 GPU:

	CPU TPS	GPU TPS
Mistral 7B	11,93	112,23
Llama 3.1 8B	4,87	55,46
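
To put the gap into a single number per model, here is a quick sketch that divides GPU TPS by CPU TPS using the table above (names are mine):

```python
# Model -> (CPU TPS at 6200MT/s, GPU TPS on the RTX 4090), from the table above.
RESULTS = {"Mistral 7B": (11.93, 112.23), "Llama 3.1 8B": (4.87, 55.46)}

# How many times faster the GPU generates tokens than the CPU.
gpu_speedup = {model: gpu / cpu for model, (cpu, gpu) in RESULTS.items()}

for model, ratio in gpu_speedup.items():
    print(f"{model}: GPU is about {ratio:.1f}x faster than CPU")
```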

Approach, Notes

After changing the memory config I ran AIDA Memory Tests 3 times and averaged them in the final table
For each model I used the same dialog every time regenerating the last message "Tell me about Mars"
Recorded 4 results for each model and averaged them
- TTFT Cold - time to first token during the first generation right after the model was loaded
- TTFT Warm - time to the first token in subsequent generations
- I actually did only 2 measurements of Llama 3.1 at 6200 because I got exhausted waiting for the results; they barely fluctuated anyway
The 4-stick configuration is slower than the 2-stick configuration even with the same speed and timings. Additionally, on consumer hardware, you are unlikely to get any speeds above 4800MT/s with 4 sticks due to MB and CPU memory controller limitations. Always try using 2 slots.
6200 was an unstable OC and failed the OCCT stress test

DDR5 Speed, CPU and LLM Inference