Last year I spent probably two weeks trying to get a decent Llama 3 70B setup running on a machine that, on paper, looked fine. Decent RAM, mid-range desktop CPU, no GPU. The result? About 1.5 tokens per second. You could watch it think. Not in a cool way.

So I went down the rabbit hole on what actually matters for local AI inference, and the answer surprised me a bit. It’s not really about CPU cores or clock speed the way you’d expect. The whole conversation is about memory: how much of it you have, and how fast the CPU can reach it. Everything else is secondary.
This guide is for people running Ollama, LM Studio, or llama.cpp directly on their own hardware. Doesn’t matter if you’re a developer, a homelab person, or just someone who doesn’t want to pay $40/month in API costs anymore. The goal is to tell you which CPUs actually move the needle and which ones are just fast on paper.
Do LLMs Even Need a Powerful CPU?
Short answer: it depends what you’re trying to do with it.
If you have a good GPU with enough VRAM to hold the whole model, say an RTX 4090 with 24GB or an RTX 5090 with 32GB, your CPU barely matters during inference. Ollama and llama.cpp will push everything to the GPU, and the CPU is mostly sitting there handling the operating system and the HTTP server. A mid-range Ryzen 5 or even an older i7 is completely fine in that situation.
The problem is that 24GB runs out fast. A Llama 3 70B model at Q4 quantization is around 40GB. A DeepSeek R1 32B is somewhere around 20GB, so it fits. But you start pushing into 70B territory or try running a big Qwen3 MoE model, and suddenly you’re hitting the VRAM wall. That’s when the CPU has to step in and handle the layers that don’t fit.
This is called CPU-GPU hybrid inference, and llama.cpp does it automatically. You tell it how many layers to offload to the GPU; the rest run on CPU. The speed you get depends on both.
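Picking the layer count is mostly arithmetic. Here is a rough sketch, assuming every layer is about the same size and that you keep a couple of GB of VRAM free for the KV cache and scratch buffers. The function name and the 2 GB reserve are illustrative assumptions, not something llama.cpp exposes.

```python
def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 2.0) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumes all layers are roughly equal in size (a simplification) and
    keeps reserve_gb free for the KV cache and scratch buffers.
    """
    if n_layers <= 0 or model_gb <= 0:
        raise ValueError("model_gb and n_layers must be positive")
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# A ~40 GB 70B model with 80 layers on a 24 GB card
gpu_layers = layers_that_fit(model_gb=40, n_layers=80, vram_gb=24)
print(gpu_layers)  # 44
```

The result is the kind of number you would pass to llama.cpp's -ngl / --n-gpu-layers option. Treat it as a starting point and adjust downward if you hit out-of-memory errors.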
For pure CPU inference, no GPU at all, performance is mostly about how fast your CPU can read model weights from RAM. Every single token generated requires reading those weights from memory. So a CPU with 90 GB/s memory bandwidth will give you roughly half the tokens per second of a CPU with 180 GB/s bandwidth, even if the slower one has more cores. I tested this directly; the numbers basically track the bandwidth.
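The bandwidth argument can be turned into a back-of-the-envelope ceiling. This is a minimal sketch, assuming each generated token reads every weight exactly once and nothing else limits speed; the function name is my own.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on generation speed when every token reads all weights once."""
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return bandwidth_gb_s / model_gb


model_gb = 40  # roughly a 70B model at Q4
for bandwidth in (90, 180):
    ceiling = max_tokens_per_second(bandwidth, model_gb)
    print(f"{bandwidth} GB/s: at most {ceiling:.2f} tok/s")
# 90 GB/s: at most 2.25 tok/s
# 180 GB/s: at most 4.50 tok/s
```

Real numbers land below this ceiling, but the ratio between two machines tends to follow the ratio of their bandwidths.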
Does Ollama use CPU? Yes, it does. Ollama falls back to CPU automatically if you don’t have a supported GPU, and even with a GPU it uses the CPU for anything that doesn’t fit. Open the activity monitor while running a 70B model and you’ll see both CPU and GPU at 100%.
What Makes a CPU Good for LLMs?
Memory bandwidth is the thing. Like I said above, token generation is a memory-read-heavy workload. Every token you generate means reading a big chunk of weights from RAM into the CPU’s processing units. If you have a Ryzen 9 9950X with DDR5-6000 running in dual channel, you’re getting roughly 90 GB/s. If you have the Ryzen AI Max+ 395 with its LPDDR5X-8000 running in quad channel, you’re getting 256 GB/s. That gap is why the Phoronix benchmarks from last year showed the Ryzen AI Max+ delivering more than 2x the LLM throughput of the 9950X in pure CPU inference: the memory bandwidth difference is nearly that big.
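Those bandwidth figures fall out of the memory spec. A quick sketch of the theoretical peak, assuming a 128-bit bus for dual-channel DDR5 (two 64-bit channels) and a 256-bit bus for the Ryzen AI Max+ 395; the bus widths are my assumptions, so check them against the spec sheet for your platform.

```python
def peak_bandwidth_gb_s(transfer_rate_mt_s: int, bus_width_bits: int) -> float:
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mt_s * 1e6 * bytes_per_transfer / 1e9


# Dual-channel DDR5-6000: two 64-bit channels = 128-bit bus
print(peak_bandwidth_gb_s(6000, 128))  # 96.0
# LPDDR5X-8000 on a 256-bit bus (assumed width for the Ryzen AI Max+ 395)
print(peak_bandwidth_gb_s(8000, 256))  # 256.0
```

The desktop figure of roughly 90 GB/s sits a little under the 96 GB/s theoretical peak, which is normal: measured bandwidth never quite reaches the spec number.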
RAM capacity matters a lot too, and for a simple reason: the whole model has to fit somewhere. If it doesn’t fit in VRAM, it goes in system RAM. If it doesn’t fit in system RAM, you can’t run it at all (well, you can with disk-backed memory, but the speed is effectively zero for practical use). This table is roughly what you need:
+------------+----------------------------+
| Model Size | Minimum RAM (Q4 Quantized) |
+------------+----------------------------+
| 7B         | 8–10 GB                    |
| 14B        | 10–14 GB                   |
| 32B        | 20–24 GB                   |
| 70B        | 38–45 GB                   |
| 100B+      | 70–90 GB                   |
| 235B MoE   | 100–130 GB                 |
+------------+----------------------------+
The “minimum” numbers above assume a quantized model and nothing else running. In practice you want more headroom. Context windows eat into it: a 70B model with 8192 context can push the requirement up by 8–16GB depending on KV cache size.
Core count helps, but in a different way than you’d think. More cores don’t speed up token generation much because that’s memory-bandwidth bound, not compute bound. What more cores actually help with is prompt processing (how fast the first token appears) and running multiple requests at once. If you’re building something that handles concurrent users, cores matter more. For solo daily use with a single chat session, the extra cores from a Threadripper over a 9950X might be barely noticeable.
Quantization is how you fit big models into smaller RAM. Q4_K_M is the sweet spot: it cuts a 70B model from ~140GB (full BF16) down to around 40GB with minimal quality loss. Q8 is higher quality but uses double the memory. Q2 fits everything but the model quality gets janky at that point.
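The size figures come from simple arithmetic: parameters times bits per weight, divided by eight. A rough sketch follows; the bits-per-weight values are approximations I'm assuming (K-quants mix block formats, so the effective figure for Q4_K_M is taken here as about 4.8), and real file sizes vary a little by model.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB: parameters x bits / 8."""
    return params_billions * bits_per_weight / 8


# Assumed effective bits per weight for each format (approximate)
for label, bits in [("BF16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    print(f"70B at {label}: ~{model_size_gb(70, bits):.0f} GB")
```

For a 70B model this lands at 140 GB for BF16 and in the low 40s for Q4_K_M, which is where the "around 40GB" figure comes from.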
And look, the NPU situation is worth mentioning because the marketing is a bit misleading. The Ryzen AI Max+ 395 has a 50 TOPS XDNA 2 NPU. AMD has talked it up a lot. But as of mid-2026, Ollama, llama.cpp, and LM Studio all route LLM workloads to the GPU, not the NPU. The NPU does video upscaling and image classification. For actual LLM inference, ignore the TOPS number entirely.
Best CPUs for Running LLMs in 2026
+------------------------+------------------+----------------+--------------------------+
| CPU                    | Memory Bandwidth | Max RAM        | Best For                 |
+------------------------+------------------+----------------+--------------------------+
| AMD Ryzen AI Max+ 395  | 256 GB/s         | 128 GB unified | Large models, all-in-one |
| Apple M5 Max           | 614 GB/s         | 128 GB unified | macOS, efficiency        |
| AMD Ryzen 9 9950X      | 90 GB/s          | 256 GB system  | GPU-paired desktop       |
| AMD Threadripper 9970X | ~180 GB/s        | 2 TB+          | Huge models, workstation |
| Intel Xeon W9-3595X    | ~300 GB/s        | 2 TB+          | Enterprise inference     |
+------------------------+------------------+----------------+--------------------------+
AMD Ryzen AI Max+ 395, Best for Local AI Overall
This thing changed the conversation. The Ryzen AI Max+ 395 (AMD calls the architecture “Strix Halo”) packs a 40-core RDNA 3.5 integrated GPU and up to 128GB of unified memory into a chip that runs in 65W mini PCs. The GMKtec EVO-X2, which uses this chip, costs under $1,800 in the 128GB config and runs Qwen3-235B at around 11 tokens per second. A 235B model. On a lunchbox-sized machine.
The reason it can do this is unified memory. There’s no separate GPU VRAM; the iGPU and CPU share the same 128GB pool at 256 GB/s. So you can give most of that pool to the GPU (up to 96GB on Windows, 120GB on Linux with some kernel tweaks) and load models that would never fit on any consumer discrete GPU.
Real benchmark from AMD’s ROCm blog (May 2026, ROCm 7.2.1, Ollama 0.20.x, Ubuntu 24.04): the Qwen3.5 122B MoE model runs at 8.59 tokens/second with a 61% GPU / 39% CPU split. That’s a 122 billion parameter model running without any cloud subscription.
The bad news: ROCm support on gfx1151 (the GPU architecture in this chip) has had some rough patches. There’s a known SIGSEGV crash in older Ollama versions on Linux. The Lemonade SDK from lemonade-sdk/llamacpp-rocm provides fixed nightly builds that target gfx1151 specifically and actually run faster than stock Ollama. You’ll probably need to use that rather than the default Ollama binary if you’re on Linux.
On Windows it’s simpler. AMD’s Variable Graphics Memory gives you up to 96GB for the GPU and Ollama just works.
Pros: Runs 70B models easily, fits 120B+ MoE models, desktop and mini PC options, no VRAM ceiling
Cons: ROCm support still needs work on Linux, the NPU marketing is misleading, bandwidth lower than discrete GPUs for small models
Price range: $1,800–$4,000 depending on form factor (GMKtec EVO-X2 vs AMD Halo desktop)
Apple M5 Max, Best If You’re on macOS
The M5 Max debuted in MacBook Pros on March 11, 2026. The Mac Studio M5 variant is still delayed to around October 2026 because of RAM supply issues, and the same goes for the Mac mini with M5 Pro.
Memory bandwidth is 614 GB/s on the 40-core GPU variant. That is more than double the Ryzen AI Max+ 395. And on small-to-medium models, you feel it. Community benchmarks on llmcheck.net show Llama 3 8B Q4 at 82 tokens/second on M5 Max, Qwen 3.6-35B MoE at 55 tokens/second, and Llama 3 70B Q4 at 18 tokens/second.
The TTFT (time to first token) improvement is where M5 really stands out. Apple added Neural Accelerators directly into every GPU core. Prompt processing is compute-bound, not bandwidth-bound, so a 14B model that took 81 seconds to process on M4 Max takes 18 seconds on M5 Max. For interactive use that feels like a completely different machine.
But. The 614 GB/s is LPDDR5X bandwidth. An RTX 4090 has 1,008 GB/s via GDDR6X, and an RTX 5090 is faster still on GDDR7. So on small models that fit easily in a GPU’s VRAM, a discrete NVIDIA card is faster per token. The M5 Max wins on models that need a lot of memory. It also uses 60–90W under load, versus 350W+ for an RTX 4090 system.
One annoying thing: Ollama doesn’t fully use the M5’s Neural Accelerators yet. MLX does. LM Studio switched to MLX backend support in 2026 and is probably the best way to get full M5 performance. If you’re still running stock Ollama on Apple Silicon and haven’t tried MLX, you’re leaving 20–30% speed on the table.
Pros: Best bandwidth on the market, great efficiency, 128GB config handles 70B comfortably, TTFT is fast
Cons: macOS only, expensive (128GB MacBook Pro starts around $4,499), not upgradeable after purchase, Mac Studio with M5 still delayed
Price range: $3,499–$5,000+ (a lot of money)
AMD Ryzen 9 9950X, Best Desktop CPU for GPU-Paired Builds?
If you have a GPU or plan to get one, this is the right desktop CPU for a local LLM workstation. 16 cores, 32 threads, AVX-512 support, DDR5, up to 256GB system RAM, PCIe 5.0. It sits on AM5, which means you can upgrade to Zen 6 later without changing the motherboard.
For pure CPU inference, the 9950X does about 11–12 tokens/second on an 8B model according to the LocalScore benchmark. That’s fine for small models. It’s not fast enough to make a 70B model pleasant. With a GPU doing most of the work, though, the CPU’s job is just handling whatever spills past VRAM, and at 90 GB/s it can manage that decently.
The 9950X3D2 (launched April 22, 2026) is worth knowing about if you’re doing CPU-heavy inference without a GPU. It has 128MB+ of L3 cache (versus 64MB on the standard 9950X) by stacking 3D V-Cache on both chiplets. For small models, fitting more weights in L3 cache rather than fetching from RAM can improve token throughput by 40–60% on cached 7B inference. For 70B models the memory bandwidth bottleneck dominates and the cache advantage shrinks. It costs roughly $699–$799 versus $549 for the standard 9950X.
Pros: Great platform, widely available, PCIe 5.0, AM5 upgrade path, AVX-512
Cons: Only 90 GB/s memory bandwidth hurts CPU-only inference, dual-channel DDR5 is the ceiling
Price range: $549–$799
AMD Threadripper 9970X, Best for Very Large Models, Also Heavy on Your Pocket
If you need to run 70B+ models on CPU, or want to build a small inference server that handles multiple users, this is the direction to go. The Threadripper 9970X has 32 cores, quad-channel DDR5 with up to 2TB RAM support, and roughly 180 GB/s memory bandwidth, double what the 9950X can do.
That 2TB ceiling matters. Enterprise use cases, long context windows, running multiple models simultaneously: it adds up. The platform is expensive (the CPU alone is over $2,500 and the motherboards are $500+) but if you’re comparing it to cloud inference costs over 12 months, the math sometimes works out.
Pros: Quad-channel bandwidth, massive RAM support, 32 cores for multi-user setups
Cons: Very expensive, platform cost is high, overkill for single-user local use
Price range: $2,500+ CPU, total build easily $5,000+
Intel Xeon W9-3595X, For Enterprise, Mostly
Intel’s Xeon W9-3595X has 8 memory channels and supports huge amounts of RAM. For multi-user inference servers where you need consistent throughput and ECC memory, it makes sense. For home or dev use, it’s overkill and the cost is significant.
Intel’s general position on local LLM inference in 2026 is not great. AMD has the memory bandwidth advantage at every tier, and Apple has Apple Silicon. Intel’s NPU story is compelling for light tasks (noise cancellation, Copilot features in Windows) but doesn’t help with serious model inference. The Core Ultra 200 series is fine for everyday tasks but doesn’t compete with the Ryzen AI Max+ for on-device AI.
Best CPU for Ollama
Ollama is where most people start, so let’s break this down by budget.
Budget ($500–$1,000 total build): A Ryzen 9 9950X or Ryzen 9 9900X paired with an RTX 5070 Ti (16GB VRAM) covers 7B-14B models really well. For 7B models with GPU inference, you’ll be at 60–90 tokens/second. The CPU is mostly overhead; the GPU does the work. Don’t go below 32GB system RAM.
Mid-range ($1,500–$2,500): Either a Ryzen AI Max+ 395 mini PC (the GMKtec EVO-X2 at under $1,800 in 128GB) or a desktop build with a Ryzen 9 9950X plus an RTX 4090. The mini PC wins for portability and not needing a separate GPU. The desktop wins for raw speed on small models. Ollama handles both fine.
High-end ($3,000+): Apple M5 Max MacBook Pro or Mac Studio (when it ships), or a Ryzen AI Max+ 395 system if you want Windows/Linux. The M5 Max 128GB handles 70B models at 18 tok/s in Ollama. Not blazing but comfortable for daily use.
One thing I’ve noticed with Ollama specifically: the auto-detection for GPU layers doesn’t always get the balance right on the Ryzen AI Max+ 395. If you’re on Linux and it feels slower than expected, check that ROCm is actually being used with ollama ps and make sure you've allocated GPU memory correctly in BIOS. I wasted two days running everything on CPU before I realized ROCm wasn't finding the iGPU. (Should have found it sooner…)
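If you would rather script that check than eyeball it, here is a minimal sketch that asks a local Ollama server which models are loaded and how much of each sits in GPU memory. It assumes the server listens on the default port 11434 and that the /api/ps response carries size and size_vram fields per model, which is how I understand the API; verify against the version you run. The helper names are my own.

```python
import json
import urllib.request


def gpu_share(model: dict) -> float:
    """Fraction of a loaded model's memory that sits in GPU memory (0.0 to 1.0)."""
    size = model.get("size", 0)
    return model.get("size_vram", 0) / size if size else 0.0


def running_models(host: str = "http://localhost:11434") -> list:
    """Fetch loaded models from a local Ollama server (assumed default port)."""
    with urllib.request.urlopen(f"{host}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])


def report(host: str = "http://localhost:11434") -> None:
    """Print one line per loaded model with its GPU share."""
    for m in running_models(host):
        print(f"{m.get('name', '?')}: {gpu_share(m):.0%} on GPU")


# Call report() while a model is loaded and the server is running.
```

A share of 0% on a machine with a working GPU is the symptom I hit: everything was quietly running on CPU.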
Best CPU for DeepSeek R1
DeepSeek R1 comes in several sizes and each has pretty different hardware needs.
DeepSeek R1 7B: This fits in the VRAM of basically any modern discrete GPU (RTX 4060 or newer). CPU barely matters. Even a mid-range Ryzen 5 with an RTX 3060 runs this at 30+ tok/s. If you’re only running 7B models, don’t overthink the CPU.
DeepSeek R1 14B: Needs 14–16GB of space. Fits on an RTX 4090 (24GB) easily, or you can run it fully on CPU with 32GB RAM at around 5–8 tok/s depending on your CPU and memory speed. The 9950X handles it reasonably. The Ryzen AI Max+ 395 with GPU offload is faster.
DeepSeek R1 32B: 20–24GB at Q4. Doesn’t fit on most GPUs. You either need something like an RTX 5090 (32GB VRAM), or you run it on CPU/hybrid. The Ryzen AI Max+ 395 with 64GB GPU allocation handles it well. M5 Max with 64GB handles it at a decent speed.
DeepSeek R1 70B: 40+ GB at Q4. Needs unified memory or Threadripper with 128GB+ RAM. The Ryzen AI Max+ 395 128GB is the consumer pick here. Expect around 10–15 tok/s on the iGPU.
DeepSeek R1 671B (the full version): Basically not runnable on any single consumer machine. You’d need 4 Ryzen AI Max+ 395 nodes using llama.cpp’s RPC clustering mode, which AMD has actually written guides for. The Threadripper with 512GB+ RAM could theoretically do it too but it would be very slow.
AMD vs Intel vs Apple for Local AI

+---------------------+---------------------------------------+---------------------+-------------------+
| Feature             | AMD                                   | Intel               | Apple             |
+---------------------+---------------------------------------+---------------------+-------------------+
| Memory Bandwidth    | 256 GB/s (Max+ 395) / 90 GB/s (9950X) | ~300 GB/s (Xeon W9) | 614 GB/s (M5 Max) |
| Power Efficiency    | Good to excellent                     | Okay                | Excellent         |
| Price               | $549–$2,500+                          | $1,500+             | $3,499+           |
| Upgradeability      | High                                  | High                | None              |
| Large Model Support | Up to 128 GB unified                  | Up to 2 TB (Xeon)   | Up to 128 GB      |
| Ecosystem           | Windows/Linux                         | Windows/Linux       | macOS only        |
+---------------------+---------------------------------------+---------------------+-------------------+
The way I see it: if you’re on macOS already and want the cleanest experience with the best hardware for the money, M5 Max is hard to argue with. If you want Windows or Linux and care about running 70B+ models without building a huge workstation, the Ryzen AI Max+ 395 is the most interesting thing that happened in local AI hardware in the last year. And if you’re building a GPU-paired desktop, AMD Ryzen 9 9950X is the practical default.
Intel doesn’t have a compelling consumer answer right now. The NPU stuff is nice for light tasks but it’s not what drives LLM performance. Their Xeon line matters for enterprise servers. For homelab and dev use, it’s an AMD/Apple story.
Running LLMs Without a GPU
It works, but you need realistic expectations. Token generation on CPU is memory-bandwidth bound, so you get maybe 10–20 tok/s on an 8B model with a good desktop CPU, and 3–8 tok/s on a 70B model even with a Threadripper. That’s readable-speed output but it’s not fast.
Models that run well on CPU:
- Qwen3 7B or 8B: small enough that the big L3 cache on the 9950X3D2 helps, decent speed
- Gemma 3 4B / 12B: lightweight and efficient
- Phi-4 Mini: 3.8B, runs really fast on basically anything
- Mistral 7B: the classic, still fine for everyday tasks
Models that struggle on CPU:
- Anything 70B+ in full BF16, you need either massive RAM or you’re waiting
- Large MoE models with huge total parameter counts, the active parameters are small but every expert still has to sit in memory, and each token pulls in a different set of expert weights
Honestly, for pure CPU inference in 2026, the Ryzen AI Max+ 395 is actually the best desktop-class CPU for this even though it’s technically an APU. Quad-channel LPDDR5X gives it 256 GB/s bandwidth vs 90 GB/s on the 9950X. The iGPU makes it even better. It’s just the most capable silicon for on-device inference right now outside of Apple Silicon.
How Much RAM Do You Actually Need?
The table I shared earlier is the minimum. In practice you want more headroom than minimum, especially if you’re doing long context generations.
A 70B model with a 32K context window can push memory use up significantly. Bigger context means bigger KV cache, and the KV cache sits in the same memory pool as the model weights. If you’re using Open WebUI with multiple conversations active, you can hit RAM limits faster than you’d expect.
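To put a number on "significantly", the KV cache can be estimated from the model's shape. This is a sketch under stated assumptions: a Llama-3-70B-style layout with 80 layers, 8 KV heads (grouped-query attention), a head dimension of 128, and 16-bit cache values. Those shape values are my assumptions for illustration; read the real ones from your model's metadata.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB: keys and values for every layer and token."""
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 2**30


# Assumed Llama-3-70B-style shape: 80 layers, 8 KV heads, head dim 128
print(f"32K context: {kv_cache_gib(80, 8, 128, 32768):.1f} GiB")  # 10.0 GiB
```

That is for a single conversation. Models without grouped-query attention keep many more KV heads and need proportionally more, and every extra active conversation adds its own cache on top.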
My rough guide:
- 7B daily use: 16GB is enough, 32GB is comfortable
- 14B-32B daily use: 32GB minimum, 64GB preferred
- 70B: 64GB at minimum (you’ll be tight), 128GB to be relaxed
- Running multiple models or long contexts: double whatever the model alone needs
And quantization changes everything. A 70B model at Q2_K fits in 24GB but the quality is noticeably worse: it hallucinates more and misses nuance on complex tasks. Q4_K_M at 40GB is the standard for good quality. Q8 at 70GB+ is nice but the quality jump from Q4 is smaller than the memory jump. Unless you’re doing something where exact recall matters a lot, Q4_K_M is the right default.
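The trade-off above can be written down as a tiny selection rule. A sketch, using the approximate 70B footprints quoted in this section and an assumed 8 GB of headroom for the OS and KV cache; the headroom value and the function name are hypothetical.

```python
# Approximate 70B footprints taken from the figures in this section (GB)
QUANT_SIZES_GB = {"Q2_K": 24, "Q4_K_M": 40, "Q8": 70}


def best_quant(available_gb: float, headroom_gb: float = 8.0):
    """Return the largest quantization that fits with headroom, or None."""
    budget = available_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items()
               if size <= budget]
    return max(fitting)[1] if fitting else None


for ram in (32, 64, 128):
    print(f"{ram} GB -> {best_quant(ram)}")
# 32 GB -> Q2_K
# 64 GB -> Q4_K_M
# 128 GB -> Q8
```

This only tells you what fits, not what is worth running: with 128GB you may still prefer Q4_K_M and spend the spare memory on context.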
FAQ
Is AMD or Intel better for LLMs?
AMD, by a clear margin for local inference in 2026. Memory bandwidth advantage at every price tier, better AVX-512 support in llama.cpp, and the Ryzen AI Max+ 395 is in a class of its own for unified memory inference.
Is Apple Silicon good for AI?
It’s excellent. M5 Max has 614 GB/s memory bandwidth which is higher than any other consumer option. The limitation is macOS only and no upgradeability. For someone already using Mac for development, it’s probably the cleanest local AI experience right now.
Can a Ryzen 7 run Llama?
Yes. A Ryzen 7 9700X can run Llama 3 8B on CPU at maybe 8–12 tok/s, or much faster with a GPU. For 70B you need more RAM than most Ryzen 7 builds support and the speed will be frustrating without a GPU assist.
Can I run DeepSeek locally?
Yes, the smaller variants easily. DeepSeek R1 7B runs on basically any modern machine. DeepSeek R1 70B needs 64GB+ of RAM or unified memory. The full 671B is a multi-machine project.
What CPU do I need for a 70B model?
Ryzen AI Max+ 395 with 128GB unified memory is the consumer-friendly answer. Apple M5 Max 128GB works well too. For a traditional desktop, pair a Threadripper and 128GB+ of system RAM with a large GPU.
Is memory bandwidth more important than core count?
For token generation: yes, definitely. Core count helps more with prompt processing and running multiple users simultaneously.
Can I run Ollama on a laptop?
Yes. Any modern laptop runs 7B models fine. For 70B, you want a MacBook Pro M5 Max or a laptop with the Ryzen AI Max+ 395 (ASUS ProArt Studiobook, ASUS ROG Flow Z13 variants use this chip). A standard laptop with 16GB RAM and integrated graphics is going to struggle past 7B.
Do CPUs matter if I already have a GPU?
Less than people think, but yes in some cases. If your GPU has enough VRAM for the whole model, the CPU is mostly overhead. If you’re doing hybrid inference (model too large for VRAM), CPU bandwidth matters. Platform choice matters for how much RAM you can install.
Is unified memory better for AI?
For large models: significantly better. There’s no VRAM ceiling, and the GPU can access all system RAM at full speed. For small models that fit in discrete VRAM, discrete GPU wins on raw bandwidth (1,000+ GB/s for high-end NVIDIA vs 256–614 GB/s for unified memory).
Where Things Are Headed
The GMKtec EVO-X2 at under $1,800 running 235B MoE models is genuinely a new thing. Six months ago you needed a $10,000 workstation for that. AMD’s own Ryzen AI Halo desktop opened pre-orders in June 2026 at $3,999 with the same chip in a proper desktop chassis.
The M5 Ultra Mac Studio is delayed to October 2026 due to RAM supply issues. When it ships it’ll have 256GB unified memory and bandwidth around 1,200 GB/s, probably the single most capable consumer device for local LLM inference in history. The Mac mini with M5 Pro is also worth watching.
If you’re building or buying right now: for the most capable CPU-centric local AI system, the Ryzen AI Max+ 395 is it. For macOS users, M5 Max. For a GPU-paired desktop on a budget, Ryzen 9 9950X and spend the saved money on a better GPU.
The bottleneck is always memory bandwidth. Buy more of it than you think you need, which will cost more than you think in this RAMpocalypse.