Getting Gemma 4 Running in VS Code: A Complete Setup Guide

Every developer asks the same question: "Can I get a real AI coding assistant without paying $19 a month or uploading my code to someone else's servers?" As of mid-2026, the answer is yes—and Gemma 4 running locally through Ollama inside VS Code is the best way to do it. This isn't a toy. It's not a compromise. It's a genuinely useful coding companion that runs entirely on your machine.

This guide comes from daily use over a month across Flutter, Python, and TypeScript projects—not a quick experiment, but a real workflow built on Gemma 4. Multiple ways exist to connect Gemma 4 to VS Code, each with specific setups that keep things fast instead of frustrating. If you haven't installed Gemma 4 yet, start with a beginner's guide on running it locally, then come back here once Ollama is working.

Before You Start: Is Your Hardware Strong Enough?

Gemma 4 runs entirely on your own hardware—there's no cloud fallback. If your machine isn't powerful enough, you'll hit sluggish completions that interrupt your workflow instead of helping it. Here's what you actually need:

  • Windows or Linux with NVIDIA GPU: 8GB VRAM is the realistic starting point for gemma4:e4b, the best default tag for most developers. If your GPU only has 4-6GB VRAM, start with gemma4:e2b and expect simpler suggestions.
  • Mac with Apple Silicon: Any M1, M2, M3, or M4 chip with 16GB unified memory handles gemma4:e4b comfortably. With 24GB+ unified memory, you can try gemma4:26b, and 32GB+ gives you a real shot at the full gemma4:31b model.
  • No dedicated GPU: gemma4:e2b can still run on CPU, but expect 2-5 seconds per completion instead of under a second. You need a minimum 8GB RAM, ideally 16GB. This setup works for chat workflows but gets annoying for inline text autocomplete.

Quick hardware check: On Windows, press Ctrl+Shift+Esc to open Task Manager, go to Performance > GPU—look for Dedicated GPU memory. On Mac, click the Apple menu and check About This Mac to see chip and memory info. On Linux, run nvidia-smi in your terminal.

Installing VS Code (Skip If You Have It)

Already running VS Code? Jump straight to the Ollama section below.

1. Visit code.visualstudio.com and download the installer for your operating system.

  • Windows: Run the .exe file. During setup, choose Add to PATH and Register Code as an editor for supported file types—both save you headaches later.
  • Mac: Unzip the download, drag VS Code to Applications. On first launch, right-click the icon and choose Open to bypass macOS Gatekeeper warnings.
  • Linux: Download the .deb package and run sudo dpkg -i code_*.deb, or install via snap with sudo snap install code --classic.

2. Open VS Code and press Ctrl+` (the backtick above your Tab key) to open the integrated terminal. You'll need this for the next steps.

Installing Ollama—The Engine Behind Everything

Ollama is the component that actually downloads and runs Gemma 4 on your machine. Think of it as a silent local server running in the background, waiting for VS Code extensions to send it prompts. Every method in this guide depends on it.

1. Visit ollama.com and download the installer.

  • Windows: Run the .exe file. After installation, Ollama starts automatically and appears as an icon in your system tray (bottom right, near the clock).
  • Mac: Open the .dmg file, drag Ollama to Applications, and launch it. You'll see its icon appear in the menu bar.
  • Linux: Run curl -fsSL https://ollama.com/install.sh | sh in your terminal. It automatically installs and runs as a background service.

2. Verify the installation: Open your terminal and run:

ollama --version

If you see a version number, it installed successfully. If you get "command not found," restart your terminal or reboot your computer.

3. Confirm the server is running: Visit http://localhost:11434 in your browser. You should see "Ollama is running." If not, relaunch the Ollama application from your Start menu or Applications folder.

Downloading Gemma 4—One Command, One-Time Download

This step downloads the model weights to your local drive. It happens once—after that, the model loads from memory in seconds each time you start coding.

As of this update, Google's Gemma 4 lineup includes E2B, E4B, 26B, A4B, and 31B. Ollama's Gemma 4 tags follow that naming, so use the specific tags below instead of older 12B or 27B references you might see elsewhere.

1. Open your command line (or use VS Code's built-in terminal with Ctrl+`).

2. Download the E4B model—the best balance of speed and quality for most developers:

ollama pull gemma4:e4b

3. Limited VRAM or CPU? Download the lightest official Ollama tag: ollama pull gemma4:e2b

4. Have 16GB+ VRAM or plenty of unified memory? Download the 26B mixture-of-experts model for noticeably stronger reasoning: ollama pull gemma4:26b

5. Have 24GB+ VRAM or 32GB+ unified memory? Download the top-tier 31B model: ollama pull gemma4:31b. If you want an explicit quantization tag, use ollama pull gemma4:31b-it-q4_K_M.

6. Verify the download: Run ollama list—your model appears with its size.

7. Quick test: Run ollama run gemma4:e4b to open a chat window. Ask something simple like "Write a hello world in Python." If you get working code back, everything's set up right. Type /bye to exit.

Testing Gemma 4 in the Ollama Desktop App (No VS Code Needed)

Recent Ollama builds come with a built-in desktop chat window—the fastest way to confirm your setup works before plugging anything into VS Code. If the desktop app talks to Gemma 4 well, every method below will also work, since they all connect to the same local Ollama server at localhost:11434.

  1. Open the Ollama application from your Start menu (Windows), Applications folder (Mac), or system tray icon.
  2. You'll see a minimal chat interface with a model picker in the bottom right corner. Click it and select your chosen variant—gemma4:e2b, gemma4:e4b, gemma4:26b, or gemma4:31b.
  3. Type a quick prompt like "Write a Python function that reverses a string" and hit Enter. Gemma 4 starts streaming a response within a second or two.
This is Ollama's built-in desktop chat app with Gemma 4 selected. If this works, every VS Code method below will work too—they all access the same local server.
This is Ollama's built-in desktop chat app with Gemma 4 selected. If this works, every VS Code method below will work too—they all access the same local server.

No chat window showing? You're running an older version of Ollama. Update to the latest from ollama.com—the desktop chat UI is built into every new install. The CLI command ollama run gemma4:e4b (above) still works on any version if you prefer terminal.

Method 1: Continue Extension—A Complete Copilot Replacement (Recommended)

This is the recommended approach for most developers. Continue gives you chat, inline code editing, and Tab autocomplete—essentially everything GitHub Copilot does, but pointing at your local Gemma 4 model. If you use Android Studio for Flutter work, the same Continue + Ollama setup works there too.

Setup

  1. In VS Code, press Ctrl+Shift+X (Cmd+Shift+X on Mac) and search for Continue. Install the version published by Continue.dev.
  2. Click the Continue icon on the left sidebar. The setup wizard launches and auto-detects Ollama—it lists every model you've downloaded. Select Ollama as your provider.
  3. If it asks you to sign in, click Skip or Use local models. You don't need an account for local use.
  4. Pick Gemma 4 from the model dropdown at the top of the chat window. Chat and inline editing work immediately after this step.

Enabling Tab Autocomplete (Critical—Disabled by Default)

Continue's chat and inline editing work out of the box, but Tab autocomplete isn't enabled by default. You need to configure it separately:

1. Open Continue's config file. Press Ctrl+Shift+P (Cmd+Shift+P on Mac), type Continue: Open Config, and select it. Newer Continue versions use config.yaml; older installs might still show config.json. The file lives in ~/.continue/ on Mac/Linux or C:\\Users\\YourName\\.continue\\ on Windows.

2. In your config.yaml file, add Gemma 4 to the models section and include roles:

name: Local Gemma 4
version: 0.0.1
schema: v1

models:
  - name: Gemma 4 E4B Chat
    provider: ollama
    model: gemma4:e4b
    roles:
      - chat
      - edit
      - apply

  - name: Gemma 4 E2B Autocomplete
    provider: ollama
    model: gemma4:e2b
    roles:
      - autocomplete
    autocompleteOptions:
      debounceDelay: 350
      maxPromptTokens: 1024

3. If your Continue installation still uses config.json, the old tabAutocompleteModel style might still work, but treat that as a legacy path and switch to YAML when the extension prompts you.

4. Save the file. Continue automatically reloads the config—no need to restart VS Code.

Tip: For faster autocomplete, keep a smaller model like gemma4:e2b dedicated to Tab completion while using gemma4:e4b, gemma4:26b, or gemma4:31b for chat. Speed matters more than quality for inline suggestions.

Three Shortcuts You'll Use Constantly

  • Chat about selected code: Highlight any code block and press Ctrl+L (Cmd+L on Mac). Ask things like "explain this," "find bugs," or "what happens if input is null?" You can also type @file or @codebase in the chat box to reference other files without manually pasting.
  • Edit code inline: Highlight code, press Ctrl+I (Cmd+I on Mac), and type a command—"add error handling," "convert to async/await," "add TypeScript types." You get a diff to review before accepting.
  • Tab autocomplete: Just start typing. Gray faded text appears after a short pause—press Tab to accept the suggestion or keep typing to dismiss it. Press Esc to close.

Troubleshooting

  • No suggestions or chat responses: Open http://localhost:11434 in your browser. If it doesn't say "Ollama is running," restart Ollama from your Start menu or Applications folder.
  • Tab autocomplete isn't showing: Make sure your config.yaml model includes the autocomplete role. If it doesn't, only chat and inline editing work.
  • Suggestions are very slow: Run ollama ps in your terminal. If the processor column shows cpu instead of gpu, switch to a smaller model like gemma4:e2b or update your GPU drivers.

Method 2: CodeGPT Extension—Best for Multi-Chat Workflows

If you spend more time asking about code than writing it—debugging, explaining legacy code, brainstorming architecture—CodeGPT deserves a look. It leans heavily into chat experience and has a cleaner conversation interface than Continue, though its inline editing is slower.

Setup

1. Press Ctrl+Shift+X (Cmd+Shift+X on Mac), search for CodeGPT, and install it.

2. Click the CodeGPT icon in the sidebar and select Ollama as your AI provider.

3. CodeGPT automatically scans for available local models. Pick Gemma 4 from the dropdown. If it doesn't appear, confirm Ollama is running with ollama list and click the refresh button.

4. Optional but recommended: Set this system prompt in CodeGPT's settings to tweak output quality:

You are an expert software developer. Write clean, well-structured code. When explaining, break it down step by step.

5. Test it: Ask:

Write a Python function that checks if a number is prime

If you get working code back, setup is done.

How to Use It

Highlight code in your editor, right-click, and you'll see CodeGPT context menu options—"Explain this code," "Find bugs," "Refactor," "Write tests." CodeGPT also saves your chat history between VS Code sessions, which helps when you're grinding through a multi-hour debugging problem.

Note: CodeGPT's Tab autocomplete with local models is significantly less reliable than Continue. If real-time inline suggestions matter to you, use Continue (Method 1) and save CodeGPT for chat.

Method 3: Ollama Extension—Minimal and Lightweight

If you just want a simple chat window to ask Gemma 4 questions without any bells and whistles, the standalone Ollama extension is fastest. No account, no config files, no learning curve.

Setup

  1. Press Ctrl+Shift+X, search for Ollama, and install the extension with the highest download count.
  2. Press Ctrl+Shift+P (Cmd+Shift+P on Mac), type Ollama, and select Ollama: Chat.
  3. Pick Gemma 4 from the model list. If the list is empty, Ollama isn't running—restart it.
  4. Test it: Ask:
What does the map function do in JavaScript?

If you get a coherent answer, you're done.

This extension offers nothing but a chat window—no inline autocomplete, no inline editing, no workspace indexing. That's the tradeoff for simplicity. It barely impacts VS Code performance, making it a good choice for older machines. For the full experience, use Continue (Method 1).


Description: Learn how to run Google's Gemma 4 AI locally in VS Code using Ollama. Free alternative to Copilot with full privacy and no subscriptions.

Related Articles