So I spent about three weeks running local LLMs on my machine, and I have some things to say.
This started because I got tired of paying for API credits every time I wanted to test something small. I figured, okay, I have a decent enough machine, let me just run these models locally and see what happens. What I did not expect was how much of a mess the whole thing would be. Some models refused to install properly. One of them kept spitting out garbage tokens after about four responses. And at least two of them were so slow on my setup that I genuinely thought my system had frozen.
I am not a researcher. I don’t have a 4090 or some fancy cloud setup. Just a mid-range PC with 32GB RAM and an RTX 3060 with 12GB VRAM. So everything you read here is what actually happened on that kind of hardware, not some benchmark lab result.

Why Local LLMs Even Matter
First, a quick thing. If you’ve never tried running a model locally, you might wonder why anyone bothers. You have ChatGPT, Claude, Gemini. They all work fine. Why run something on your own machine?
There are actually a few good reasons. Privacy is the big one. If you’re dealing with client data, internal documents, code that your company hasn’t open-sourced, you probably don’t want that going through some third-party API. Running locally means nothing leaves your machine. The other reason is cost. Once you’re past the setup, running local models is basically free. No per-token charges. No rate limits. And there’s a niche but real use case for offline access when you’re traveling or your internet is patchy.
The problems are speed, RAM, and quality. You trade a lot of convenience for those benefits. Whether the trade is worth it depends on what you’re actually trying to do.
The Models I Tested
I ran ten models total. Mostly through Ollama, which is the easiest way to get started with local LLMs if you haven’t tried it. A few I ran through LM Studio because Ollama was being weird with some quantizations. The models were: Llama 3.1 8B, Llama 3.1 70B (4-bit), Mistral 7B, Mistral Nemo 12B, Phi-3.5 Mini, Phi-3 Medium, Gemma 2 9B, Gemma 2 27B (4-bit), Qwen2.5 7B, and DeepSeek-R1 8B.
I tested each one on the same set of tasks. Coding (Python mostly), summarization, creative writing, instruction following, and basic reasoning. I also just chatted with them for a while to see how they felt to talk to. That last part is harder to quantify but it matters when you’re actually using these things day to day.
For coding I used the same five prompts across all models: write a Python script to parse a CSV and flag missing values, debug a broken FastAPI route I pasted in, explain what a given piece of async code is doing, refactor a messy function, and write a simple regex for Indian phone number formats. For summarization I pasted the same 2,000-word article about cloud pricing into each one. For reasoning I used a few multi-step word problems and one logic puzzle that most 7B models fail on.
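If you want to repeat this kind of side-by-side test, here is a minimal sketch of how it could be scripted against Ollama's local REST API. It assumes Ollama is serving on its default port (11434). The model tags and the helper names (build_payload, ask, run_all) are placeholders for illustration, not the exact setup behind the results in this post.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

# Placeholder tags; check `ollama list` for what is actually installed.
MODELS = ["llama3.1:8b", "mistral-nemo", "qwen2.5:7b"]


def build_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming generate request for one model."""
    return {"model": model, "prompt": prompt, "stream": False}


def ask(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send one prompt to a locally running Ollama server and return the text."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


def run_all(prompt: str, models=MODELS, ask_fn=ask) -> dict:
    """Run the same prompt against every model; record errors instead of stopping."""
    results = {}
    for model in models:
        try:
            results[model] = ask_fn(model, prompt)
        except Exception as exc:  # a model that fails to load should not kill the run
            results[model] = f"ERROR: {exc}"
    return results
```

Calling run_all("Write a regex for Indian phone numbers") returns a dict keyed by model tag, which makes it easy to read the answers next to each other. Passing ask_fn as a parameter also lets you dry-run the loop without a server.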
I want to be upfront about something. These are not scientific benchmarks. I didn’t run 50 prompts per model and average the scores. This is just what I observed over roughly 200 total interactions spread across three weeks of using these models for real tasks, not synthetic tests. My impressions could be off for edge cases. But I think that’s actually more useful than a leaderboard number, because I was using them the way a normal person would.
The Good Ones
Llama 3.1 8B surprised me. This thing is fast, it fits easily in VRAM, and the quality is honestly better than I expected for its size. For coding tasks it was decent, maybe 70% of the time giving me something actually useful. Where it struggled was with anything requiring longer context or more complex reasoning. Ask it to debug a 200-line script and it starts losing track around the middle. But for quick code snippets, answering straightforward questions, or just brainstorming? It works.
Mistral Nemo 12B is probably my personal favorite from this test. It’s a bit heavier than the 7B models but still runs okay on 12GB VRAM. The instruction following on this one is noticeably better. I gave it some multi-step tasks and it actually kept track of what it was supposed to do, which not all of them managed. The writing quality also felt more natural than some of the others.
Qwen2.5 7B was a surprise. I almost didn’t include it because I assumed the Chinese-origin models would be weaker in English. That was wrong. Qwen2.5 7B handles code really well, probably the best among the 7B-range models I tested. It was also fast. I am still not sure why it doesn’t get more attention in local LLM communities, but it should.
Gemma 2 9B is solid for its size. Google put real work into this one. The summarization quality was above average. Reasoning tasks were okay. Nothing blew me away but nothing broke either.
The Disappointing Ones
Phi-3.5 Mini gets a lot of hype because Microsoft keeps promoting it as this super-efficient small model. And look, for a 3.8B model it’s fine. But people run it expecting it to compete with 7B models and it doesn’t. The responses are shorter than they should be for complex questions, and the instruction following is noticeably weaker. It’s good for a phone or a very low-RAM machine but if you have 12GB VRAM you can do better.
Phi-3 Medium (14B) is actually better than Mini by a meaningful amount, and I should separate them because they’re often discussed together. The Medium version handled multi-step instructions properly and its Python output was clean. The problem is size. At 14B it eats most of my VRAM and leaves little room for a long context. It also had one specific failure I couldn’t ignore: for any prompt that involved real-world current events or recent data, it would sometimes confidently make stuff up with zero indication it wasn’t sure. Llama and Mistral at least say “I’m not certain about this.” Phi-3 Medium just answers like it knows.
DeepSeek-R1 8B is weird. The full DeepSeek-R1 is supposed to be great at reasoning and it probably is at larger sizes. But the 8B version I ran kept producing these long chains of internal “thinking” that you could see in the output, which sometimes was interesting and sometimes was just noise before a mediocre answer. The actual final answer quality did not feel better than Llama 3.1 8B for my tasks. I wanted to like this one more.
Gemma 2 27B (4-bit) was a nightmare to load on my setup. It technically ran once some of the layers were offloaded to the CPU and system RAM, but the speed was so bad it was basically unusable. Four tokens per second. I sat there watching it generate one word at a time like it was 2019 again. If you have a stronger machine this might be fine, but on a 3060 it was painful.
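If you want to put a number on "slow" yourself, Ollama's non-streaming API responses include timing fields you can turn into tokens per second. This is a small sketch that assumes the eval_count and eval_duration fields described in Ollama's API documentation; the sample values are made up for illustration, not measurements from this test.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama /api/generate response.

    eval_count is the number of generated tokens and eval_duration is the
    generation time in nanoseconds, as described in Ollama's API docs.
    """
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / (duration_ns / 1e9)


# Made-up numbers for illustration: 120 tokens generated in 30 seconds.
sample = {"eval_count": 120, "eval_duration": 30_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")  # 4.0 tokens/s
```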
The Big Boy Problem
I also tried Llama 3.1 70B at 4-bit quantization. This was the one I was most curious about because 70B models at 4-bit supposedly get close to the quality of closed models. And the quality was good, actually. Better than everything else I tested. The problem is it needed both my GPU VRAM and a chunk of system RAM to run, and even then it was maybe 3–5 tokens per second. For casual chatting that’s kind of okay. For anything where you’re waiting for a long code output, it’s frustrating.
This is sort of the core tradeoff with local LLMs. The models that are actually good enough to replace a paid API need hardware that most people don’t have. And the models that run well on normal hardware are good but not great.
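The arithmetic behind this tradeoff is simple enough to sketch. The estimate below only counts the weights at a flat 4 bits each and uses the rounded parameter counts from the model names. Both are simplifying assumptions: real 4-bit files are somewhat larger because most quantization schemes average more than 4 bits per weight, and the context cache and runtime overhead need memory on top of that.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


VRAM_GB = 12  # RTX 3060

for name, params in [("Llama 3.1 8B", 8), ("Gemma 2 27B", 27), ("Llama 3.1 70B", 70)]:
    size = weight_size_gb(params, 4)
    verdict = "fits" if size < VRAM_GB else "spills into system RAM"
    print(f"{name}: ~{size:.1f} GB at 4 bits -> {verdict}")
```

Even with this optimistic lower bound, the 8B model comes out around 4 GB, the 27B around 13.5 GB, and the 70B around 35 GB, which is why the two larger models end up split between the GPU and system RAM on a 12GB card.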
Ranking Them Honestly
Here is where I’ll just say what I actually think. If you have a 12GB VRAM card and 32GB system RAM, this is the order I’d suggest:
Mistral Nemo 12B for general use and instruction following. Qwen2.5 7B for coding tasks specifically. Llama 3.1 8B as a fast fallback when you want something quick. Gemma 2 9B if you do a lot of summarization.
Skip Phi-3.5 Mini unless your hardware is very limited. Skip DeepSeek-R1 8B until the 14B or 32B versions become more accessible. Don’t bother with the 27B models unless you have 24GB VRAM minimum.
Mistral 7B I should mention separately because it’s the old classic. People still use it. It still works. But honestly the newer 7B-range models have passed it. Qwen2.5 7B and Llama 3.1 8B both beat it on most tasks I ran. You can still use it, especially if you’re familiar with it already, but for new setups there are better choices now.
How These Models Respond to Prompts Is Very Different
This is something that doesn’t show up in benchmarks but matters a lot in practice. Different models have different “personalities” in how they respond to the same prompt, and some of them are way easier to work with than others.
Llama 3.1 and Mistral Nemo both follow instructions pretty literally. If you say “give me just the code, no explanation,” they do that. Gemma 2 has a tendency to add little summaries or notes at the end even when you told it not to. That’s not the end of the world but it gets annoying when you’re piping output to something else. Phi-3 Medium is the worst offender here. It adds disclaimers to almost everything, especially anything that touches on health, legal stuff, or security. I asked it to write a script that checks for open ports on a server and it gave me half a response with safety notes.
The DeepSeek-R1 thinking behavior I mentioned earlier also changes how you have to prompt. The model visibly “thinks out loud” in a block before giving you the answer. For some reasoning tasks that’s actually useful because you can see where it went wrong. But for anything where you just want a clean output, you have to tell it explicitly to skip the chain-of-thought or strip it out afterward.
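Stripping the reasoning afterward is usually the more reliable of the two options. Here is a minimal sketch, assuming the model emits its reasoning as literal text between think tags, which is how DeepSeek-R1 style models mark it. The function name is illustrative, and a response that gets cut off before the closing tag is not handled.

```python
import re

# DeepSeek-R1 style models wrap their reasoning in <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove any <think>...</think> blocks and return the remaining answer."""
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>\nThe user wants 2 + 2. That is 4.\n</think>\n\nThe answer is 4."
print(strip_thinking(raw))  # The answer is 4.
```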
Qwen2.5 7B is probably the most “obedient” model in this group. Least likely to add unsolicited commentary, most likely to actually follow format instructions. I told it to respond only with JSON for a specific task and it actually did that. Three other models “forgot” mid-response and started adding regular text.
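If you are piping model output into another program, it helps to check for this failure instead of trusting the model. A minimal sketch, with a hypothetical helper name, that accepts a response only when the whole thing parses as JSON:

```python
import json


def parse_strict_json(text: str):
    """Return the parsed object if the whole response is JSON, else None."""
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        return None


good = '{"status": "ok", "items": [1, 2]}'
bad = '{"status": "ok"}\nHope that helps!'

print(parse_strict_json(good))  # {'status': 'ok', 'items': [1, 2]}
print(parse_strict_json(bad))   # None
```

Ollama's API also accepts a "format": "json" field in the request to constrain the output, which may reduce how often this check fails. It is still worth validating on your side.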
The Stuff Nobody Tells You About Setup
Installation is not the hard part. Ollama makes pulling models easy: run “ollama pull mistral” and you’re done. The hard part is figuring out what quantization to use, because the difference between Q4_K_M and Q5_K_S is not obvious from the names and the documentation is not great.
Quick version: Q4_K_M is your go-to for most cases. It’s 4-bit quantized in a way that preserves quality better than the basic Q4 variants. Q5_K_S adds a bit more quality but takes more memory. Q8 (listed as Q8_0 for most models) is close to unquantized quality but the file sizes are huge. Avoid Q2 and Q3 unless you’re desperate for space, because the model will be noticeably stupider.
I also ran into a weird issue with Ollama where it was only using CPU for one of the models even though the GPU should have handled it. Turns out if you pull a model when another is already running in memory, sometimes it offloads the new one incorrectly. The fix was just restarting Ollama. Took me an embarrassing amount of time to figure that out.
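A quick way to catch this is to ask Ollama where the loaded model actually lives. The "ollama ps" command shows it in the terminal, and the same information is available from the local API. The sketch below assumes the /api/ps endpoint on the default port and the size and size_vram fields as I understand them from Ollama's API documentation, so verify the field names against your version. The helper names are illustrative.

```python
import json
import urllib.request

PS_URL = "http://localhost:11434/api/ps"  # default local Ollama address


def gpu_fraction(model_entry: dict) -> float:
    """Share of a loaded model that sits in VRAM, from 0.0 (CPU only) to 1.0."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def report_loaded_models(url: str = PS_URL) -> None:
    """Print how much of each loaded model is on the GPU."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        models = json.loads(resp.read()).get("models", [])
    for entry in models:
        print(f"{entry['name']}: {gpu_fraction(entry):.0%} on GPU")
```

If a model that should fit in VRAM reports 0% on GPU, that is the situation described above, and restarting Ollama is the first thing to try.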
And context length is a thing to watch. A lot of models advertise 128K context but the local quantized versions you can actually run often don’t handle long context well at 4-bit. In practice I found that after about 8K tokens some models start repeating themselves or forgetting earlier parts of the conversation. For normal tasks this is fine. For long documents it becomes a real limitation.
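One thing worth ruling out first is the configured context window, which in Ollama can be smaller than what the model advertises unless you set it. Here is a hedged sketch of a chat request that sets num_ctx explicitly, plus a crude check for whether a conversation is likely to fit. The 4-characters-per-token rule is only a rough rule of thumb for English text, the 1024-token reserve is an arbitrary choice, and the helper names are hypothetical.

```python
def build_chat_request(model: str, messages: list, num_ctx: int = 8192) -> dict:
    """Chat request body for Ollama's /api/chat with an explicit context size."""
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def rough_token_count(text: str) -> int:
    """Very rough estimate: about 4 characters per token for English text."""
    return len(text) // 4


def fits_context(messages: list, num_ctx: int, reserve: int = 1024) -> bool:
    """True if the conversation likely fits, leaving `reserve` tokens for the reply."""
    used = sum(rough_token_count(m["content"]) for m in messages)
    return used + reserve <= num_ctx
```

A larger num_ctx also costs more memory, so on a 12GB card there is a real limit to how far you can push it with the bigger models.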
What These Models Are Actually Good For
So here’s what I kept coming back to during testing. Local LLMs are not ChatGPT replacements for most people. If you’re expecting GPT-4-level quality on consumer hardware, you’ll be let down.
But they are genuinely useful for specific things. Running code analysis on internal files without leaking anything. Quick summaries of private documents. Chatting through ideas when you want zero latency and don’t care if the response is 90% quality instead of 99%. Building small automation scripts where you need LLM output but can’t afford per-API-call pricing at scale.
A friend at a small dev shop in Bangalore started using Mistral Nemo 12B locally for their internal code review assistant in late 2024. Not for everything, just first-pass review before the developer actually reads the diff. They said it cut their review queue by about 30% because a lot of obvious stuff got caught earlier. That’s the kind of use case where local models make sense. Not replacing your best tools. Plugging a gap where cost or privacy makes the cloud option impractical.
The Honest Assessment
If you have a capable GPU (at least 8GB VRAM, ideally 12GB) and you’re okay spending a few hours on setup, local LLMs are worth experimenting with. The best 7B-12B models today are legitimately useful for a lot of daily tasks, and the gap between them and the big cloud models has narrowed compared to even a year ago.
But it’s not for everyone. If you want something that just works with zero friction, local LLMs will frustrate you. The setup can break for weird reasons. Models behave differently across quantizations. Some tasks the models just can’t do well no matter how you prompt them.
The thing I came away believing is that the sweet spot right now is using local models for low-stakes, high-frequency tasks and keeping the paid APIs for anything where quality really matters. That combination is actually quite cost-effective. Most of what I use an LLM for day to day is stuff where a 7B model is good enough.
As of June 2026, Ollama released version 0.5.4 last week with better multi-GPU support, and there are reports that Llama 3.2 in the smaller sizes is coming soon with better multilingual handling. The pace of improvement here is fast enough that anything I tested might look different in a few months. New models drop almost every week now.
So worth watching. Worth trying if you haven’t. Just go in with realistic expectations.