🧑🏻‍💻 Run AI on Your Own Computer: Local Models, Privacy & the Tools I Actually Use

Part of the Digital Alchemy series — DIY tech, open systems, and tools that respect your agency.

Before we dive in, this post covers a lot of ground. Here's a quick map of what we'll touch on, with a full glossary at the bottom if any terms trip you up along the way:

Key AI terms you'll encounter: LLM, parameters, tokens, context window, RAM, unified memory, quantisation, inference, MLX, GGUF. Each gets a brief explanation when it first appears, and a fuller definition in the glossary at the end.

Why I Made This

This tutorial grew out of something a bit sideways. I was filming a tarot reading and found myself going on a long tangent about AI — the ethics of it, the privacy implications, what using it makes you complicit in. That tangent turned into this video, and the video then became the source material for demonstrating local transcript extraction in the tutorial this extended post is based on.

So, it's recursive. That feels right.

If you want the longer version of my ethical thinking around AI — including the environmental costs, the data scraping debate, and where I've landed personally — I've already written about it in AI as a Magic Mirror: Ethics, Tips & Disclaimer and you can also hear me talk through it in the tarot reading that started all this. This post is mostly a practical walkthrough, but I didn't want to skip past the why entirely.

Short version: local AI lets you keep your data on your own device, reduce your environmental footprint, and understand what's actually happening under the hood. That's worth knowing about regardless of where you land ethically on AI generally.

What You'll Need To Run It Yourself

  • A computer with a reasonably modern processor (M-series Mac ideal for deeper exploration, but not required)

  • Alternatively, you may have better success running an SLM (small language model) on a modern smartphone. More on that below.

  • At least 8GB of RAM — 16GB gives you more options

  • A few gigabytes of free storage space

  • Audacity (free) + Muse Hub (free)

  • LM Studio (free) — or Ollama if you prefer a more technical setup

  • Some patience while models download and load

No coding required. No subscriptions. No data leaving your machine.

Part One: Getting Your Transcript — Audacity + Whisper

What is Audacity?

Audacity is a free, open source audio editor that's been around since 2000. It runs on Mac, Windows, and Linux. I use it to record and edit audio for my videos, and it's one of my go-to open source tools — I've written more about why I use open source software in Open Source, Open Future if you want the bigger picture.

For this workflow, we're using Audacity not just to edit audio but to run Whisper — an AI transcription model — directly on your computer, no internet required.

What is Whisper? It's a speech-to-text model developed by OpenAI and released as open source. You feed it an audio or video file, and it outputs a written transcript, including timestamps. The clever part for our purposes: because it's open source, other developers have built it into tools like Audacity so you can run it locally without sending anything to OpenAI's servers.

Installing Audacity and OpenVino

  1. Download Audacity from audacityteam.org — make sure you download the version with Muse Hub, not without it. Muse Hub is the plugin manager that gives you access to the AI features.

  2. Once Audacity is installed, open Muse Hub and search for OpenVino AI Tools. Install it.

  3. Restart Audacity. You should now see OpenVino AI Effects listed under the Effect menu.

Running the Transcription

  1. Drag your video or audio file directly into Audacity. It'll import just the audio track.

  2. Go to Select → All (or ⌘A on Mac / Ctrl+A on Windows) to select the full track.

  3. Go to Effect → OpenVino AI Effects → Whisper Transcription.

  4. Settings: I leave it on the Base model, Transcribe mode, and English (or Auto if your content is multilingual). Hit Apply.

  5. Depending on your file length, transcription usually takes a few minutes. On my M3 MacBook Air with 16GB, a 20-minute video processes in under 5 minutes — noticeably faster than real time. Results will vary depending on your hardware.

  6. Once it finishes, you'll see the transcript appear as labels on the timeline — these are timestamped text segments.

  7. Go to Edit → Labels → Label Editor to see the full output, then Export to save it as a plain text file.

You now have a timestamped transcript of your video, generated entirely on your own device.

Part Two: Running a Local Language Model — LM Studio

What is an LLM?

LLM stands for Large Language Model. It's the type of AI that powers things like ChatGPT, Claude, and Gemini. These models are trained on massive amounts of text data and learn to predict what words should follow other words — which, at scale, produces remarkably coherent, contextual responses. When people say "AI" in conversation these days, they usually mean an LLM.
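That "predict what words should follow" idea can be illustrated with a toy sketch. Real LLMs score tens of thousands of possible tokens with a neural network rather than a hand-written lookup table, and the probabilities below are invented for illustration, but the generation loop has the same shape:

```python
# Toy next-token predictor. Real models compute these probabilities with a
# neural network over a huge vocabulary; here it's a tiny hard-coded table.
toy_probs = {
    ("the", "cat"): {"sat": 0.6, "ran": 0.3, "banana": 0.1},
    ("cat", "sat"): {"on": 0.8, "quickly": 0.2},
    ("sat", "on"): {"the": 0.7, "a": 0.3},
}

def continue_text(words, steps=3):
    """Greedily append the most likely next word, one step at a time."""
    words = list(words)
    for _ in range(steps):
        context = tuple(words[-2:])       # look at the last two words
        options = toy_probs.get(context)
        if not options:
            break
        words.append(max(options, key=options.get))  # take the top choice
    return " ".join(words)

print(continue_text(["the", "cat"]))  # "the cat sat on the"
```

Scale that loop up by a few billion parameters and you have, in spirit, what's happening every time the model produces a response.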

Running one locally means the model lives on your hard drive and runs on your own hardware. Nothing is sent to a server. You can use it with aeroplane mode on.

What is LM Studio?

LM Studio is a free desktop application that lets you download, manage, and chat with local LLMs through a clean interface — no command line needed. It's available for Mac, Windows, and Linux, and it's genuinely beginner-friendly while still offering plenty of depth for more technical users.

Download it from lmstudio.ai.

Installing a Model

Once LM Studio is open, click Model Search in the left sidebar. This will show you a library of available models, with LM Studio doing its best to surface ones compatible with your hardware.

Before you download anything, it helps to understand a couple of terms:

Parameters — When you see a model described as "7B" or "8B", the B stands for billion parameters. Parameters are essentially the internal settings the model adjusts during training to learn patterns in language. More parameters generally means a more capable model — but also more RAM required. Think of it like the difference between a small notebook and an encyclopaedia: more information, but heavier to carry.

Quantisation — Models can be compressed to take up less space and RAM at the cost of some precision. A "Q4" quantisation means the model's weights have been compressed to 4-bit precision. This is why a "7B" model might only be 4–5GB on disk rather than the 14GB you'd expect. For most personal use cases the quality difference is minimal. LM Studio will show you the file size before you download.
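The disk-size arithmetic is easy to sanity-check yourself. A rough sketch, ignoring format overhead and the fact that real quantisation schemes mix bit widths (which is why an actual Q4 file lands closer to 4–5GB than the bare 3.5GB):

```python
def approx_model_size_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough on-disk size: parameter count × bits per weight, in gigabytes."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9  # bits → bytes → GB

# A 7B model at full 16-bit precision vs Q4 (4-bit) quantisation:
print(round(approx_model_size_gb(7, 16), 1))  # 14.0 GB
print(round(approx_model_size_gb(7, 4), 1))   # 3.5 GB, before format overhead
```

Same model, same parameters, a quarter of the weight: that's the whole trick of quantisation.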

GGUF vs MLX — These are two different model formats:

  • GGUF is a general-purpose format that works across Mac, Windows, and Linux.

  • MLX is a format developed specifically for Apple Silicon — it's optimised to take full advantage of the M-series chip architecture. If you're on an M-series Mac, always prefer MLX versions when available. They'll be noticeably faster.

Part Three: Picking the Right Model

Why RAM Matters — And Why It's More Complicated on PC

When running a local LLM, the entire model needs to be loaded into memory so your processor can work with it. But which memory depends on your setup — and this is where Mac and PC users have a meaningfully different experience.

On an M-series Mac, there's only one pool of memory to think about: unified memory. The CPU and GPU sit on the same chip and draw from a single shared memory pool. When LM Studio loads a model, it draws from that pool and distributes the work across CPU and GPU automatically. You only need to think about one number — your total RAM — and whether the model fits inside it.

On a PC, it's more complicated. Graphics cards (GPUs) have their own dedicated memory, called VRAM. Local AI models run much faster on a GPU than a CPU — but the model has to fit inside the GPU's VRAM to get that benefit. A typical mid-range gaming GPU might have 8–12GB of VRAM. If your model is larger than that, LM Studio will either refuse to fully load it onto the GPU, or split it between VRAM and system RAM — which works, but is noticeably slower.

So on PC, you're juggling two numbers:

  • System RAM — your computer's main memory (16GB, 32GB etc.)

  • VRAM — your graphics card's dedicated memory (often 6–16GB on consumer cards)

For best performance on PC, you want the model to fit entirely within your GPU's VRAM. If it doesn't, it'll still run via CPU and system RAM — just slower.

Practical PC guide:

  • 6–8GB VRAM: Comfortable with Q4 quantised 7B models. Tight on 8B.

  • 12GB VRAM: Runs most 7B–13B models comfortably.

  • 24GB VRAM: Opens up larger models and faster inference.
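The guide above can be folded into a quick fit check. This is a rough heuristic sketch, not LM Studio's actual loading logic, and the `overhead_gb` headroom for context and activations is an assumed figure:

```python
def offload_plan(model_gb: float, vram_gb: float, system_ram_gb: float,
                 overhead_gb: float = 1.5) -> str:
    """Rough guess at how a model of a given size will load on a PC."""
    need = model_gb + overhead_gb  # model weights plus context headroom
    if need <= vram_gb:
        return "full GPU offload — fastest"
    if need <= vram_gb + system_ram_gb:
        return "split between VRAM and system RAM — works, but slower"
    return "won't load — try a smaller or more heavily quantised model"

print(offload_plan(4.5, 8, 16))   # Q4 7B on an 8GB card: full GPU offload
print(offload_plan(13.0, 8, 16))  # a 13B model on the same card: split
```

On an M-series Mac the first branch is effectively the only one, because the whole unified pool behaves like VRAM.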

This is one of the reasons M-series Macs are unusually competitive for local AI at their price point — 16GB of unified memory on a MacBook Air effectively acts like 16GB of "GPU memory" for these workloads, which on a PC would require a dedicated high-end graphics card costing significantly more than the laptop itself.

Models I Recommend

Gemma 3 — Developed by Google DeepMind and released as open weights. Gemma is one of my favourites for general use — it's well-balanced, follows instructions reliably, and the smaller variants (4B) run well on 16GB. A good starting point.

Qwen — Developed by Alibaba and released as open source. Qwen models are excellent, particularly for longer context tasks. The Qwen 2.5 and Qwen 3 series are worth trying. I use these regularly.

GPT-OSS — OpenAI's open-weight model, released in 2025. This is as close as you'll get to ChatGPT quality running locally. The trade-off: it barely fits on a 16GB machine, and sometimes it will crash mid-session if other apps are also using RAM. Close everything else first. Worth trying if you have 16GB and want to see what local AI can really do.

My general advice: Start with Gemma 3 4B or Qwen 4B, see how it goes, then step up to 8B if you have 16GB. The smaller models are genuinely capable for most tasks — title generation, summarisation, drafting, editing.

Context Window — How Much Can It Hold?

The context window is how much text the model can hold in its "working memory" at once — including both your input and its output. It's measured in tokens (roughly ¾ of a word each, so 1000 tokens ≈ 750 words).

If you paste in a long transcript and the model starts producing strange, repetitive, or off-topic responses, you've likely exceeded its context window. LM Studio shows you context usage as a percentage while the model is running — keep an eye on it. For processing long transcripts, you'll want a model with at least a 4K–8K context window; many modern models now support 32K or more.
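The ¾-of-a-word rule of thumb makes it easy to estimate whether a transcript will fit before you paste it in. A rough sketch — real tokenisers vary by model, so treat this as a ballpark, and the 150-words-per-minute speaking rate is an assumed average:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about 1 token per 0.75 words."""
    return round(len(text.split()) / 0.75)

# A 20-minute video at roughly 150 spoken words per minute:
transcript_words = 20 * 150
print(round(transcript_words / 0.75))  # ~4000 tokens: tight in a 4K window
```

A 20-minute reading already brushes up against a 4K context window once you add the system prompt and the model's own reply, which is exactly why a 32K model is much more comfortable for this workflow.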

Part Four: Alternatives Worth Knowing About

Ollama

Ollama is a tool for downloading and running local models that's extremely lightweight, actively maintained, and popular with developers. It does have a desktop app with a GUI, though its real power is in how it works as a background service — you run models through Ollama and then connect other interfaces to it.

This is actually how I prefer to use it: Ollama runs on my MacBook, serving the model locally, while Open WebUI runs on my Proxmox home server and connects to it over the network. That means I get a polished browser-based chat interface hosted on my server, but all the actual AI processing happens on my MacBook where the RAM is. It's a neat split — the server handles the UI and conversation history, the laptop handles the heavy lifting.
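Under the hood, that connection is just HTTP: Ollama serves an API on port 11434 by default, and Open WebUI (or anything else on the network) posts requests to it. A minimal sketch of what such a request looks like — the model name is an assumption, and `ask_ollama` only works against a machine that's already running Ollama with that model pulled:

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str) -> dict:
    """Shape of a non-streaming request body for Ollama's /api/generate."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(prompt: str, host: str = "http://localhost:11434",
               model: str = "gemma3:4b") -> str:
    """Send one generation request to an Ollama server and return its reply."""
    body = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# From another machine on the network, point host at the laptop's address,
# e.g. ask_ollama("Suggest three video titles", host="http://192.168.1.20:11434")
```

That's all Open WebUI is doing on my Proxmox server: sending requests like this to the MacBook and rendering the replies in a nicer interface.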

If that sounds like too much, LM Studio is the simpler starting point. But if you want to understand how these pieces connect, Ollama is worth knowing about.

Open WebUI

Open WebUI is a self-hosted, browser-based interface for interacting with local models — think ChatGPT's interface but running entirely on your own machine. It connects to Ollama (or LM Studio's local server) and gives you a polished chat experience, conversation history, and more. If you want something that feels like a proper AI app but lives entirely on your hardware, this is it. This will eventually get its own dedicated tutorial.

PocketPal (Mobile)

PocketPal is an iOS and Android app that lets you run small language models directly on your phone. Once the models have downloaded (typically 2–4GB depending on what you choose), no internet connection is required — you can run it entirely offline, including in aeroplane mode.

What I Actually Did With the Output

In the tutorial video, I pasted the full 20-minute tarot reading transcript into LM Studio and asked it to generate title suggestions and a YouTube description. Results were mixed in interesting ways — it picked up real phrases from the video, but didn't always follow the system prompt precisely, occasionally made things up, and gave me more output than I asked for.

That's normal. Local models hallucinate more than cloud models. They don't always follow instructions as consistently. The output is a starting point, not a finished product. I take the bits that resonate, edit them, and put them together myself. The model does the heavy lifting on the first draft; I make it mine.

System Prompts — and Why They Matter

A system prompt is the set of instructions you give a model before any conversation begins. Think of it as briefing a new assistant: here's what I need, here's how I want it formatted, here's what to avoid. The model will try to follow these instructions for every response in that session.

In LM Studio, access your system prompt via Edit System Prompt in the top menu. In PocketPal, each "Pal" (saved persona/configuration) has its own system prompt you can customise.

One important caveat for smaller, local models: keep your system prompt concise. Smaller models have a limited context window (more on that in the glossary), and a long system prompt eats into that budget before you've typed a single word. If you find a model is going off-script or ignoring your instructions, your prompt may be too long or complex for it to hold in memory reliably. With smaller models, fewer, clearer instructions consistently outperform longer, more detailed ones.

With larger cloud-based models — Claude, ChatGPT, Gemini — you have much more room to work with. You can write detailed multi-paragraph system prompts, include examples, specify edge cases, and the model will generally follow them faithfully throughout a long conversation.

Writing a good system prompt: Each major AI company publishes a prompting guide for their models. It's worth reading the one relevant to whatever you're using — they're often surprisingly practical. A useful trick: download the PDF or paste the link into the AI itself, describe what you want your system prompt to do, and ask it to write one for you based on its own guidelines. Meta, but it works.

Here's the system prompt I use in PocketPal for generating titles and descriptions from my tarot reading transcripts — it's a good example of keeping things tight and specific for a smaller model:

  • You are an assistant who can help me create engaging titles and descriptions for my YouTube and social media posts, based on my tarot reading transcripts.

    Here's what I need:

    Australian English: Output in this language only (e.g. Realise instead of Realize).

    5 Title Suggestions: Keep them short, catchy, and under 100 characters. Include 1-2 emoji.

    1 Long Description: This should be 1-2 paragraphs, focusing on the key takeaways from the reading.

    No Card Names: Please avoid referencing specific tarot card names.

    No Notes: No extra advice, just the title and description.

    Hashtags: Suggest 5x relevant hashtags.

    Here's what I'm working with:

Notice what makes this work: it's structured, specific, uses bold labels so the model can parse each requirement clearly, and ends with an open prompt ready to receive the transcript. The model being used here is Gemma 2 2B (Q6_K quantisation) — a very small model by current standards, and it handles this prompt reliably because the instructions are unambiguous.

Putting It All Together: The Workflow

  1. Record your video or audio

  2. Import into Audacity → run Whisper Transcription → export as text

  3. Open LM Studio → load your preferred model

  4. Set your system prompt with instructions and examples

  5. Paste your transcript into the chat

  6. Review, edit, and take what's useful from the output

Total cost: $0. Total data sent to external servers: none.

Glossary

Context window — The total amount of text (measured in tokens) a model can hold in active memory at once, including your input and its responses. If you exceed it, quality degrades.

GGUF — A general-purpose model format that works across Mac, Windows, and Linux. Less optimised than MLX on Apple hardware, but more universally compatible.

Hallucination — When a model generates plausible-sounding but incorrect or fabricated content. More common in smaller local models. Always verify output.

Inference — The process of actually running a model to generate a response. "Inference speed" means how quickly the model produces output.

LLM (Large Language Model) — The type of AI behind ChatGPT, Claude, and similar tools. Trained on massive text datasets to generate coherent, contextual language. When we talk about "running AI locally," we mean running one of these.

MLX — Apple's open source machine learning framework, optimised for M-series chips. Models in MLX format run significantly faster on Apple Silicon than general-purpose formats.

Parameters — The internal numerical values a model learns during training. "7B" means 7 billion parameters. More parameters generally means more capable, but also more resource-hungry.

Quantisation — A compression technique that reduces a model's file size and RAM requirements by storing its parameters at lower numerical precision. Q4 = 4-bit quantisation. Quality trade-off is usually minor for everyday use.

RAM (Random Access Memory) — Your computer's short-term working memory. Local AI models need to fit into RAM to run. More RAM = bigger models.

System prompt — Instructions given to a model before the conversation begins. Used to set tone, format, persona, and task parameters.

Tokens — The units a language model actually processes. Not quite words — punctuation, spaces, and common word fragments each count as tokens. Roughly 1000 tokens ≈ 750 words.

Unified memory — Apple's term for the shared memory pool in M-series chips, where CPU, GPU, and RAM all live on the same chip. Makes data transfer much faster, which is why M-series Macs are unusually good at local AI for their price.

VRAM (Video RAM) — Dedicated memory built into a graphics card (GPU). Local AI models run fastest when they fit entirely within VRAM. Not a concern on M-series Macs, where GPU and system memory share the same unified pool.

Whisper — OpenAI's open source speech-to-text model. Converts audio to written transcripts; in this workflow it runs locally inside Audacity via the OpenVino plugin.


This post is part of the Digital Alchemy series on Fires of Alchemy. More tutorials on open source tools, self-hosting, and using technology with intention are on the way — comment below to request particular topics or expanded explainers.
