VSCodium + Ollama: Local LLM Coding Setup Guide

A private AI coding assistant - with realistic home-laptop Ollama model picks

Step-by-step local LLM coding setup: install Ollama, pick realistic laptop-local Qwen3.5, Gemma 4, Qwen3.6, or DeepSeek R1 models by hardware tier, and connect VSCodium with Continue.


Time to read: 12 min

Illustration of a local code editor, terminal, and model node representing a private VSCodium and Ollama coding setup.

Why run your LLM locally for coding?

This guide walks through a complete local LLM coding setup using VSCodium, Continue, and Ollama - no cloud, no API costs, your code stays on your machine. By the end you'll have AI code completion, inline chat, and refactoring running entirely offline. Local models now handle a large share of everyday coding tasks at zero ongoing cost, as long as you pick a model that fits your actual laptop.

The important detail: I am talking about models you can actually keep running on a home laptop. A model that needs hundreds of gigabytes of RAM may be "available" in Ollama, but it is not useful for this setup. For a normal laptop, the practical range is 4B-14B. For a 32 GB MacBook Pro or a laptop with a serious GPU, 26B-35B Q4/MLX variants become realistic.

Model recommendations last checked: May 21, 2026. This guide focuses on models you can realistically run on a home laptop with Ollama, not benchmark-only 400 GB checkpoints.

The stack:

  • VSCodium (telemetry-free VS Code builds)
  • Continue (in-editor AI assistant for completions, chat, and refactors)
  • Ollama (local model runtime)

Works on macOS, Linux, and Windows. No subscription required.

Advantages and disadvantages

Advantages

  • Privacy by default: your code never leaves your machine.
  • Zero subscription cost — a GitHub Copilot alternative that runs on hardware you already own.
  • Predictable costs and no vendor lock-in.
  • Full control over model choice, prompts, and context.

Disadvantages

  • Hardware limits how large and fast your models can be.
  • Quality varies across models; you may need to try a few.
  • You are responsible for updates: new releases like Qwen3.6, Gemma 4, or refreshed DeepSeek R1 distills can appear quickly, and you decide when to upgrade.
  • You still need clear prompts and scoped tasks to get good results.

Why VSCodium instead of VS Code

VS Code's source is MIT licensed, but Microsoft's distributed binaries add a separate license and enable telemetry by default. VSCodium ships ready-to-use builds with telemetry disabled, so you start with a privacy-first editor and no extra setup.

Why local AI beats cloud for many teams

Local models do not always beat the best cloud models on raw intelligence. But they win on what matters day-to-day:

  • Privacy by default: your code stays on your machine.
  • Lower cost: no subscriptions or per-seat fees.
  • Offline work: keep shipping even without internet access.
  • Control: pick the model, tune prompts, and switch anytime.

The local stack: VSCodium + Continue + Ollama

Continue lives inside the editor and provides local AI code completion, chat, code edits, refactors, and explanations — all powered by your local Ollama instance. Ollama runs the model and exposes it to Continue via a standard REST API. Together they give you a complete local AI workflow with zero cloud dependencies.
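
As a rough sketch of what that REST API looks like from a terminal, the helper below builds a request body for Ollama's /api/generate endpoint and posts it with curl. The names generate_payload, ask_ollama, and OLLAMA_URL are illustrative and not part of Ollama. The sketch assumes the server listens on the default port and that the prompt contains no double quotes or backslashes.

```shell
# Base URL of the local Ollama server (default port; override if yours differs).
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Build the JSON body for a single, non-streaming completion request.
# Naive on purpose: model and prompt must not contain quotes or backslashes.
generate_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

# Send one prompt to the local server (needs a running Ollama and a pulled model).
ask_ollama() {
  curl -s "$OLLAMA_URL/api/generate" -d "$(generate_payload "$1" "$2")"
}
```

With the server running, ask_ollama qwen3.5:9b "Explain what a mutex is" should return a JSON object whose response field holds the generated text, and curl -s "$OLLAMA_URL/api/tags" lists the models you have installed.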

What you need

Confirm your hardware before installing. Minimum requirements for a working local LLM coding setup:

Minimum (CPU-only):

  • 8 GB RAM
  • macOS, Linux, or Windows
  • 10 GB free disk space
  • Any modern multi-core CPU (4+ cores)

Recommended (comfortable performance):

  • 16 GB RAM, or a GPU with 8 GB VRAM (NVIDIA/AMD/Intel Arc)
  • Apple Silicon M-series (unified memory is ideal - a 16 GB M2 runs 8B-9B models smoothly)

CPU-only note: Ollama runs without a GPU. Expect 2–6 tokens/sec on 3B–4B models. Sufficient for completions and async review; slow for real-time chat on larger models.

Ollama installation (more detail)

macOS

  1. Install with Homebrew: brew install ollama
  2. Start the service: ollama serve
  3. Verify the install: ollama -v

Linux

  1. Install with the official script: curl -fsSL https://ollama.com/install.sh | sh
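  2. Start the service if it is not already running: ollama serve (on systemd distributions the install script usually registers and starts an ollama service for you)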
  3. Verify the install: ollama -v

Windows

  1. Download the installer from https://ollama.com and follow the prompts.
  2. Open a new terminal and verify the install: ollama -v
  3. The installer normally starts Ollama in the background; run ollama serve only if the server is not already running.

First model pull

  • Run ollama pull qwen3.5:9b if you have 16 GB RAM, or ollama pull qwen3.5:4b on tighter hardware.
  • Use ollama list to see what is installed.

10-minute local LLM coding setup checklist

  1. Install VSCodium.
  2. Install Ollama.
  3. Pull a model (start small if your laptop is modest).
  4. Install the Continue extension.
  5. Point Continue to your local Ollama endpoint (http://localhost:11434).
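
Step 5 can be sketched as a small shell function that writes a minimal Continue config. This assumes a Continue version that reads config.yaml from ~/.continue; the function name write_continue_config and the two display names are made up for illustration, and the model tags are simply the ones used in this guide. Treat it as a starting point and check the Continue docs for your version.

```shell
# Write a minimal Continue config that points at a local Ollama server.
# $1 is the target directory (assumed default: ~/.continue).
write_continue_config() {
  dir="${1:-$HOME/.continue}"
  file="$dir/config.yaml"
  # Refuse to overwrite an existing config.
  if [ -e "$file" ]; then
    echo "refusing to overwrite $file" >&2
    return 1
  fi
  mkdir -p "$dir" || return 1
  cat > "$file" <<'EOF'
name: Local Ollama
version: 1.0.0
schema: v1
models:
  - name: Qwen 9B (chat and edits)
    provider: ollama
    model: qwen3.5:9b
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit
      - apply
  - name: Qwen 4B (autocomplete)
    provider: ollama
    model: qwen3.5:4b
    apiBase: http://localhost:11434
    roles:
      - autocomplete
EOF
}
```

Run write_continue_config with no arguments to target ~/.continue, then reload VSCodium. Splitting chat and autocomplete across two models follows the workflow later in this guide: a small model for completions, a larger one for edits.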

Best models for coding with Ollama (2026)

Model | Size | Min RAM | Best for | Ollama pull
qwen3.5:4b | 4B | 8 GB | Entry-level coding, short edits, multimodal prompts | ollama pull qwen3.5:4b
gemma4:e2b | E2B | 8 GB | Fast low-RAM assistant, notes, simple code explanation | ollama pull gemma4:e2b
gemma4:e4b | E4B | 12 GB | Better small-model reasoning and multimodal checks | ollama pull gemma4:e4b
qwen3.5:9b | 9B | 16 GB | Best balanced laptop-local coding default | ollama pull qwen3.5:9b
deepseek-r1:8b | 8B distilled | 16 GB | Local reasoning fallback for debugging and planning | ollama pull deepseek-r1:8b
deepseek-r1:14b | 14B distilled | 24 GB | Better reasoning if you can tolerate slower output | ollama pull deepseek-r1:14b
qwen3.6:27b | 27B | 32 GB unified memory / 24 GB VRAM | Strong laptop-workstation coding, Q4 recommended | ollama pull qwen3.6:27b
qwen3.6:27b-coding-nvfp4 | 27B MLX | 32 GB Apple Silicon | Better Apple Silicon coding run if you can spare ~20 GB | ollama pull qwen3.6:27b-coding-nvfp4
qwen3.6:35b-a3b | 35B MoE | 32 GB unified memory / 24 GB VRAM | Higher-quality agentic coding with efficient active params | ollama pull qwen3.6:35b-a3b
qwen3.6:35b-a3b-coding-nvfp4 | 35B MoE MLX | 32 GB Apple Silicon | Practical MLX coding choice around ~22 GB | ollama pull qwen3.6:35b-a3b-coding-nvfp4
gemma4:26b | 26B MoE | 32 GB unified memory / 24 GB VRAM | Gemma 4 local agent workflows and multimodal reasoning | ollama pull gemma4:26b

Recommended starting point: qwen3.5:9b on 16 GB RAM. It is the current practical balance for a private coding assistant: small enough for laptops, fast enough to keep open all day, and good enough for refactors.

No GPU or tight on RAM? Start with qwen3.5:4b or gemma4:e2b. The goal is a fast local assistant you will actually use, not a huge model that makes every prompt feel like a chore.

Have 32 GB+ unified memory or a 24 GB GPU? Use qwen3.6:27b first. If you mostly want lower latency, try qwen3.6:35b-a3b, a mixture-of-experts model that activates only a small share of its parameters per token; if you want multimodal reasoning and Google ecosystem support, try gemma4:26b. These are MacBook Pro / gaming laptop / mini-workstation picks, not 8 GB ultrabook picks.

What about DeepSeek? DeepSeek's newest official API line is V4-Pro/V4-Flash, and V3.2 remains an open large-model release with strong agent and reasoning performance. Those are not realistic home-laptop local pulls. For a laptop-local DeepSeek choice, use the distilled deepseek-r1:8b or deepseek-r1:14b tags.
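
The RAM tiers in this section can be sketched as a tiny helper that maps memory to a starting tag. The function name pick_model is hypothetical, and the thresholds are just the 8 GB, 16 GB, and 32 GB tiers from this guide, not anything Ollama enforces.

```shell
# Map total RAM (or unified memory) in GB to a starting model tag.
# Thresholds mirror the tiers in this guide; adjust them to taste.
pick_model() {
  ram_gb="$1"
  if [ "$ram_gb" -ge 32 ]; then
    echo "qwen3.6:27b"
  elif [ "$ram_gb" -ge 16 ]; then
    echo "qwen3.5:9b"
  else
    echo "qwen3.5:4b"
  fi
}
```

For example, ollama pull "$(pick_model 16)" pulls qwen3.5:9b. On Linux, free -g reports total memory in GB; on macOS, sysctl -n hw.memsize reports it in bytes.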

Apple Silicon and MLX model tags

If you are on a MacBook, do not ignore the MLX tags. MLX is Apple's machine-learning stack for Apple Silicon, and many Ollama model pages now include MLX-specific variants. They are not magically smaller, but they can be a better fit for M-series unified memory than a generic GGUF tag.

My practical MacBook rule, from running this setup on Apple Silicon:

  • 8 GB MacBook Air: stay with qwen3.5:4b, gemma4:e2b, or other small Q4 models. MLX BF16 tags are usually too heavy.
  • 16 GB MacBook Air/Pro: use qwen3.5:9b or gemma4:e4b as the daily assistant. Keep context around 8k-16k unless you enjoy memory pressure.
  • 32 GB MacBook Pro: this is where MLX gets interesting. Try qwen3.6:27b-coding-nvfp4 or qwen3.6:35b-a3b-coding-nvfp4 before jumping to huge BF16 tags.
  • 64 GB MacBook Pro/Studio: BF16 MLX tags such as qwen3.6:27b-mlx-bf16, qwen3.6:35b-a3b-mlx-bf16, or gemma4:26b-mlx-bf16 become plausible, but they are still workstation choices.

For daily coding on my Mac, I keep the model small enough that it answers quickly and stays in the loop. A huge model that makes every prompt slow stops being useful, even if the benchmark looks better.

Performance optimisation: quantisation and GPU offloading

Most guides skip this section. That is a mistake — these two levers account for the biggest real-world speed differences.

Quantisation explained

A model's weights are normally stored as 16-bit or 32-bit floats. Quantisation reduces this to 4-bit or 8-bit integers, trading a small amount of accuracy for dramatically lower memory use and faster inference.

Ollama uses GGUF format, which encodes quantisation in the model name:

Quantisation | Memory use | Quality loss | When to use
Q8_0 | ~1.8× Q4 | Negligible | You have headroom and want max quality
Q5_K_M | ~1.25× Q4 | Very small | Best quality/speed tradeoff when RAM allows
Q4_K_M | Baseline | Small for coding tasks | Default choice for most setups
Q3_K_M | ~0.75× Q4 | Noticeable | Only when RAM is the hard constraint
Q2_K | ~0.6× Q4 | Significant | Last resort, avoid for serious use

For coding tasks specifically, Q4_K_M is the practical floor. Below Q4, models start hallucinating APIs and method signatures at noticeably higher rates.
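
To see why quantisation matters so much, here is a back-of-the-envelope estimate of weight memory: parameter count times bits per weight, divided by 8. The helper name weights_gb is made up for this sketch, and the result covers weights only, with no KV cache or runtime overhead.

```shell
# Estimate weight memory in GB: parameters (billions) x bits per weight / 8.
# Ignores the KV cache and runtime overhead, so treat it as a lower bound.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 9 16   # 9B model at 16-bit
weights_gb 9 4    # same model at a flat 4 bits per weight
```

The two calls print 18.0 and 4.5. Real Q4_K_M files average somewhat more than 4 bits per weight, so expect the download to be larger than the 4-bit figure.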

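To pull a specific quantisation, append the quantisation suffix listed on the model's tags page. Exact tag names vary by model, so the suffix below is a placeholder:

ollama pull qwen3.6:27b-<quant>

Check which quantisation an installed model uses, or browse every published tag:

ollama show qwen3.6:27b
# or browse https://ollama.com/library/<model-name> for all tags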

GPU layer offloading

This is the biggest performance lever most people never configure. By default, Ollama detects your GPU and offloads as many transformer layers as fit in VRAM. But you can tune this explicitly.

Why it matters: layers that run on the GPU are typically several times faster than the same layers on the CPU. If a model has 32 layers and only 20 fit in VRAM, offloading those 20 still beats pure CPU inference, although the 12 layers left on the CPU set the overall pace.

Set the number of GPU layers with the num_gpu option, either per session or in a Modelfile:

# Start an interactive session, then set the GPU layer count for that session
ollama run qwen3.5:9b
>>> /set parameter num_gpu 24

Or create a custom Modelfile for persistent settings:

FROM qwen3.5:9b
PARAMETER num_gpu 24
PARAMETER num_ctx 8192
PARAMETER num_thread 8

Then build it:

ollama create myqwen -f Modelfile
ollama run myqwen

Practical GPU offloading guide:

VRAM available | Layer strategy
4 GB | Offload part of a 4B-9B model, rest on CPU
8 GB | Offload a 4B-9B model, or partial layers of a 14B model
12 GB | Full 9B if quantised, or a comfortable 4B/E4B setup
16 GB+ | Full 9B, 8B reasoning distills, or partial 14B
24 GB | Full 14B, or cautious 26B/27B at Q4 with reduced context
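
If you want a starting value for num_gpu rather than a guess, the table above can be approximated with integer arithmetic. This is a sketch under a loud assumption: every layer is the same size and the KV cache is covered by the reserve you pass in. The function name layers_that_fit and the example sizes are hypothetical.

```shell
# Rough count of layers that fit in VRAM, assuming equal-sized layers.
# Args: model size (MiB), layer count, VRAM (MiB), VRAM to keep free (MiB).
layers_that_fit() {
  model_mib="$1"; n_layers="$2"; vram_mib="$3"; reserve_mib="$4"
  budget=$(( vram_mib - reserve_mib ))
  if [ "$budget" -le 0 ]; then
    echo 0
    return
  fi
  # Layers that fit = budget divided by the size of one layer, rounded down.
  fit=$(( budget * n_layers / model_mib ))
  # Never report more layers than the model has.
  if [ "$fit" -gt "$n_layers" ]; then
    fit="$n_layers"
  fi
  echo "$fit"
}
```

For a 5000 MiB model with 32 layers, layers_that_fit 5000 32 4096 1024 prints 19 on a 4 GB card, and the same call with 8192 prints 32. Start there, then adjust down if you see out-of-memory errors.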

Context window tuning

The context window (num_ctx) directly affects memory use, because the key-value cache grows linearly with the number of tokens: a 128k context needs roughly 16 times the cache memory of 8k. For most coding tasks, 8k–16k is plenty:

PARAMETER num_ctx 8192

If you hit out-of-memory (OOM) errors, reduce this first, before dropping to a lower quantisation.
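
To put numbers on the context cost, here is a rough key-value cache estimate for a standard transformer: two tensors (keys and values) per layer, each sized KV heads × head dimension × context tokens. The function name kv_cache_mib is made up, and the architecture numbers in the example are illustrative rather than taken from any model in this guide.

```shell
# KV cache size in MiB: 2 (keys and values) x layers x KV heads x head dim
# x context tokens x bytes per element (defaults to 2, i.e. a 16-bit cache).
kv_cache_mib() {
  layers="$1"; kv_heads="$2"; head_dim="$3"; ctx="$4"; bytes="${5:-2}"
  echo $(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1048576 ))
}

kv_cache_mib 32 8 128 8192     # 8k context
kv_cache_mib 32 8 128 131072   # 128k context
```

For this hypothetical 32-layer model the calls print 1024 and 16384, so the cache goes from 1 GiB at 8k to 16 GiB at 128k. That is why trimming num_ctx frees memory faster than almost any other setting.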

Thread count for CPU inference

If you are running primarily on CPU, set num_thread to your physical core count (not threads):

PARAMETER num_thread 8

Extra hyperthreads rarely help and can reduce throughput, because inference is mostly limited by memory bandwidth. Count physical cores only.
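
If you are not sure how many physical cores you have, a sketch like this can find a value for num_thread. It assumes lscpu is available on Linux and sysctl on macOS, and falls back to the logical CPU count elsewhere, so double-check the result on unusual hardware. The function name physical_cores is hypothetical.

```shell
# Print the number of physical CPU cores, with fallbacks.
physical_cores() {
  n=""
  if command -v lscpu >/dev/null 2>&1; then
    # Linux: count unique (core, socket) pairs.
    n="$(lscpu -p=CORE,SOCKET 2>/dev/null | grep -v '^#' | sort -u | wc -l)"
  elif [ "$(uname)" = "Darwin" ]; then
    # macOS: the kernel reports physical cores directly.
    n="$(sysctl -n hw.physicalcpu 2>/dev/null)"
  fi
  # Strip padding that some wc implementations add.
  n="$(echo "$n" | tr -d ' ')"
  # Fall back to the logical CPU count, then to 1.
  case "$n" in
    ''|0|*[!0-9]*) n="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)" ;;
  esac
  echo "$n"
}
```

Use the printed number as the value for PARAMETER num_thread in your Modelfile.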

Recommended workflow for speed and quality

Here is a simple workflow that keeps things fast and surprisingly effective:

  • Use a smaller model for quick completions and short edits.
  • Use a larger model only when you need deeper reasoning.
  • Limit context to open files and target folders, not your whole disk.
  • Add a short rules prompt to keep changes clean and predictable.

Suggested rules prompt:

follow existing project style
do not invent APIs
prefer minimal diffs
ask when unsure

Use AGENTS.md for consistent results

Local agents work best when they have a short, consistent rule set. Many teams keep an AGENTS.md file in the repo root so the assistant sees the same guidance every time. Example template:

# AGENTS.md

## Goals
- Keep changes minimal and focused.
- Preserve existing code style and structure.
- Ask before making broad refactors.

## Behavior
- Do not invent APIs or dependencies.
- Prefer small, testable diffs.
- Call out assumptions and missing context.

## Context
- Use only files referenced in the prompt unless told otherwise.
- For large changes, propose a plan before editing.

This helps your local assistant behave consistently across tasks and teammates.

Limitations to expect

If you want the honest version, here it is:

  • Hardware matters: RAM and VRAM limit which models you can run.
  • Expect tradeoffs between speed, quality, and context length.
  • Local models still need clear prompts and scoped tasks.

Alternatives if you prefer other editors

  • Neovim with local LLM tooling
  • JetBrains IDEs with Continue
  • Other local assistants that support Ollama

Bottom line

For most teams, VSCodium + Continue + Ollama is the right local LLM coding setup: private by default, zero subscription cost, and trivial to reconfigure as better models appear. Add cloud models only when you genuinely need their extra capability.

Frequently asked questions

Can I use Ollama without a GPU?

Yes. Ollama runs on CPU-only machines. On 8 GB RAM without a GPU you can run 3B–4B models at 2–6 tokens/sec. Slower than cloud, but fully private and sufficient for completions and short edits.

Is a local LLM as good as GitHub Copilot?

For daily coding tasks - completions, refactors, code explanation - current local models like qwen3.5:4b, gemma4:e4b, and qwen3.6:27b cover a large share of Copilot-style work with zero subscription cost and full privacy. Cloud coding agents still win on very large repo context and hard multi-file planning. For teams who prioritise data privacy, this local LLM coding setup is a strong GitHub Copilot alternative.

What is the best model for coding with Ollama?

On 8 GB RAM, start with qwen3.5:4b or gemma4:e2b. On 16 GB RAM, qwen3.5:9b or gemma4:e4b is the balanced pick. On a 32 GB MacBook Pro or a laptop with a large GPU, qwen3.6:27b, qwen3.6:35b-a3b, or gemma4:26b becomes realistic at Q4 with a sensible context window. DeepSeek V4 is not a normal home-laptop local model; for local DeepSeek, use deepseek-r1:8b or deepseek-r1:14b.

Does VSCodium work with Continue.dev?

Yes. Continue is published to the Open VSX marketplace that VSCodium uses instead of the Microsoft marketplace. Install it from the Extensions panel inside VSCodium and point it at your Ollama endpoint (http://localhost:11434).

For production automation use cases, see our guide on running a self-hosted LLM in production with Ollama and n8n.

Ready to go further?

This is how we build at Vasilkoff: privacy-first, open-source friendly, and practical. If your team needs a private AI coding assistant integrated into your development workflow — or broader AI development services — we can help. Reach out via our contact page.
