LLM Consolidation
Send your transcript to Claude or a local Ollama model to generate polished, structured study notes alongside the verbatim transcript.
What Consolidation Does
The raw transcript produced by WhisperKit is accurate but verbatim — it captures exactly what was said, including filler words, repeated phrases, and the natural looseness of spoken language. LLM Consolidation sends this transcript to a language model and asks it to produce structured Markdown study notes: headings organized by topic, bullet-point summaries, and key takeaways.
The consolidated notes appear as a separate "Consolidated Notes" tab in the transcript panel and in the exported HTML page. They sit alongside the original verbatim transcript — the raw transcript is never replaced.
You can annotate consolidated notes independently of the transcript (highlights, margin notes, inline notes all work in both views).
Choosing a Mode
The mode picker on the Pipeline Step 3 row is a segmented control with three options:
- Off — No LLM is called. The pipeline skips Step 3 entirely and proceeds directly to export. This is the default.
- Claude — Uses the Anthropic API. Your transcript is sent to Anthropic's servers over HTTPS. Requires an API key.
- Ollama — Uses a locally running Ollama instance at
localhost:11434. Fully on-device. Requires Ollama to be installed and a model pulled.
The mode is stored per lecture. Changing the mode on a completed lecture does not automatically re-run consolidation — it only affects the next time you click Begin Processing or trigger consolidation manually.
Claude Setup
To use Claude consolidation, you need an Anthropic API key. Keys are available at console.anthropic.com.
Go to Harvestry → Settings and click the Consolidation tab.
Paste your Anthropic API key into the API Key field. The key is stored securely in the macOS Keychain — it is never stored in plain text.
After the API key is saved, Harvestry fetches your available models from the Anthropic API and shows them in a dropdown. Choose a model. Click the refresh button next to the picker to update the model list if your account gains access to new models.
Claude Model Recommendations
The model list is fetched live from the Anthropic API, so it reflects whatever your account has access to. General guidance:
| Model Family | Best For | Cost |
|---|---|---|
| Claude Sonnet | Recommended for most use. Excellent balance of quality and speed. Produces well-structured notes with good comprehension of technical content. | Moderate |
| Claude Haiku | Fastest and cheapest. Good for straightforward lectures with clear structure. May miss nuance in dense academic content. | Low |
| Claude Opus | Highest quality. Best for highly technical, multi-topic, or non-English content where maximum comprehension matters. | High |
Ollama Setup Pro
Ollama runs language models entirely on your Mac using Apple Silicon's unified memory. No data leaves the device at any point.
Download Ollama from ollama.ai and install it. Ollama runs as a background service on localhost:11434. You can verify it is running by opening http://localhost:11434 in a browser — it should show "Ollama is running".
Open Terminal and pull a model suited to your Mac (see recommendations below). For example:
The first pull downloads the model weights and may take a few minutes depending on size. Subsequent loads are instant.
In the Pipeline Step 3 picker, select Ollama. Harvestry queries localhost:11434/api/tags and shows all installed models in a dropdown. Click the refresh button to re-query if you pull new models while Harvestry is already open.
http://localhost:11434.
Choosing an Ollama Model Pro
Not all models are equally suited to lecture summarization. Two things matter most for this use case:
- Context window. A 60-minute lecture transcript is typically 12,000–15,000 tokens. A 90-minute lecture can reach 20,000+. Models with short context windows (under 32K) will silently truncate the transcript, producing notes that only cover the first portion of the lecture. Always prefer models with 128K+ context.
- Instruction following. Harvestry's consolidation prompt asks for structured Markdown with specific headings and sections. Models that follow formatting instructions reliably produce cleaner, more usable notes.
Apple Silicon's unified memory architecture means your Mac's RAM is the key constraint — the model weights load directly into the memory pool shared by CPU and GPU. As a rule of thumb, a model requires roughly 1.2× its on-disk size in RAM to run comfortably. Any model that would push your system past ~80% total RAM usage will swap to SSD and run unusably slowly.
8 GB RAM (M1 / M2 MacBook Air, Mac mini)
With 8 GB unified memory, you can comfortably fit models up to about 4–5 GB on disk. Larger models will swap to SSD and run too slowly to be practical.
| Model | Pull Command | Disk | Context | Notes |
|---|---|---|---|---|
| Qwen3 4B ★ | ollama pull qwen3:4b |
2.5 GB | 256K | Best structured-output quality at this tier. Handles long transcripts without truncation. |
| Phi-4 Mini | ollama pull phi4-mini |
2.5 GB | 128K | Very fast generation. Slightly less consistent formatting than Qwen3 4B. |
| Gemma4 E2B | ollama pull gemma4:e2b |
7.2 GB | 128K | Technically fits on 8 GB but leaves little headroom — may swap. Not recommended for this tier. |
16 GB RAM (MacBook Pro M2 / M3, Mac mini M4)
16 GB unlocks 8–14B models and is the sweet spot for most users. You get genuinely useful note quality at comfortable speeds.
| Model | Pull Command | Disk | Context | Notes |
|---|---|---|---|---|
| Qwen3 14B ★ | ollama pull qwen3:14b |
9.3 GB | 256K | Top recommendation at this tier. Excellent instruction following. Fits with ~5 GB to spare. |
| Qwen3 8B | ollama pull qwen3:8b |
5.2 GB | 256K | Faster than 14B with slightly lower quality. Good starting point. |
| Qwen3 30B (MoE) | ollama pull qwen3:30b |
19 GB | 256K | MoE model (30B-A3B — 3B active parameters). Reportedly outperforms QwQ-32B at a fraction of the inference cost. Excellent if you want to push quality at this tier. |
| Gemma4 E4B | ollama pull gemma4:e4b |
9.6 GB | 128K | Google's Mixture-of-Experts model. Strong summarization quality; comparable to Qwen3 8B. |
32 GB RAM (MacBook Pro M3 Pro / Max, Mac Studio M2)
32 GB is where on-device models become genuinely excellent. Both Gemma4 26B and Qwen3 32B produce output that holds up well against Claude API quality.
| Model | Pull Command | Disk | Context | Notes |
|---|---|---|---|---|
| Gemma4 26B ★ | ollama pull gemma4:26b |
18 GB | 256K | Google's flagship MoE model. Runs efficiently — only ~3.8B parameters are active per token, so it's faster than a dense 18B model while drawing on 26B of total knowledge. Excellent prose and structure. |
| Qwen3 32B | ollama pull qwen3:32b |
20 GB | 256K | Dense 32B model. Slightly slower than Gemma4 26B but outstanding at following complex formatting instructions. |
| Qwen3.6 35B | ollama pull qwen3.6:35b |
24 GB | 256K | Newest Qwen generation (post-3.5). Efficient architecture — 35B total parameters, fast inference. Excellent for long lectures. |
| Mistral Small | ollama pull mistral-small:22b |
13 GB | 128K | 22B model with a compact 13 GB footprint. Fast generation, clean prose. Good if you want to leave more RAM free for other apps. |
64 GB+ RAM (Mac Studio M2 Ultra / M4 Max, Mac Pro)
On high-memory machines, local models match or exceed what you'd get from most commercial APIs.
| Model | Pull Command | Disk | Context | Notes |
|---|---|---|---|---|
| Qwen3.6 35B ★ | ollama pull qwen3.6:35b |
24 GB | 256K | Newest Qwen generation (post-3.5). 256K context handles the longest lectures comfortably. At 24 GB it leaves abundant headroom on a 64 GB machine. Top recommendation for this tier. |
| Llama 3.3 70B | ollama pull llama3.3 |
43 GB | 128K | Meta's dense 70B. Highest raw parameter count that fits at this tier. Strong quality but limited to 128K context — may truncate very long lectures. |
| Gemma4 31B | ollama pull gemma4:31b |
20 GB | 256K | Google's dense 31B. Fast inference, excellent quality, and 256K context. Good alternative if you prefer a Google model. |
gemma4:26b is an 18 GB download but runs like a sharp 4B model with 26B of knowledge behind it.
The Consolidation Prompt
The system prompt sent to the LLM instructs it how to format the output. The default prompt asks the model to produce structured Markdown with:
- A brief summary paragraph at the top
- Headings for each major topic covered
- Bullet-point summaries under each heading
- A "Key Takeaways" section at the end
You can customize the prompt in Settings → Consolidation → System Prompt. The text editor in Settings accepts any Markdown-aware instructions. A Reset to Default button restores the original prompt.
Per-Lecture Control
The consolidation mode (Off / Claude / Ollama) and model selection are stored per lecture. You can have different lectures using different modes:
- Some lectures consolidated with Claude Sonnet for maximum quality
- Others consolidated with Ollama for privacy
- Others with consolidation off for quick reference
Changing the mode on a pending lecture takes effect when you next click Begin Processing. Changing the mode on a completed lecture triggers an alert asking if you want to generate or regenerate consolidated notes now.
Disabling After the Fact
To remove consolidated notes from a completed lecture:
- On the Step 3 row, switch the mode picker from Claude or Ollama to Off.
- A confirmation alert appears: "Remove consolidated notes from this lecture? The HTML export will be regenerated without the notes tab."
- Confirm. The consolidated notes are deleted and the HTML is re-exported.
Re-running Consolidation
If a lecture is already complete and you switch its consolidation mode to Claude or Ollama, Harvestry shows an alert: "Generate consolidated notes now?" Confirming runs only Step 3 and then re-exports the HTML. The transcript and screenshots are unchanged.
Privacy
The privacy implications differ significantly between modes:
- Off — No LLM call. Nothing leaves the device.
- Claude — Your transcript text is sent to Anthropic's API servers over HTTPS. This is the one intentionally off-device step in Harvestry's pipeline. Anthropic's data handling is governed by their Privacy Policy.
- Ollama — Fully on-device. The transcript is sent over localhost to the Ollama daemon running on your Mac. No data leaves the device.
Consider using Ollama mode for lectures containing sensitive, confidential, or personally identifiable content.