Harvestry Documentation LLM Consolidation

LLM Consolidation

Send your transcript to Claude or a local Ollama model to generate polished, structured study notes alongside the verbatim transcript.

What Consolidation Does

The raw transcript produced by WhisperKit is accurate but verbatim — it captures exactly what was said, including filler words, repeated phrases, and the natural looseness of spoken language. LLM Consolidation sends this transcript to a language model and asks it to produce structured Markdown study notes: headings organized by topic, bullet-point summaries, and key takeaways.

The consolidated notes appear as a separate "Consolidated Notes" tab in the transcript panel and in the exported HTML page. They sit alongside the original verbatim transcript — the raw transcript is never replaced.

You can annotate consolidated notes independently of the transcript (highlights, margin notes, inline notes all work in both views).

Choosing a Mode

The mode picker on the Pipeline Step 3 row is a segmented control with three options:

The mode is stored per lecture. Changing the mode on a completed lecture does not automatically re-run consolidation — it only affects the next time you click Begin Processing or trigger consolidation manually.

Claude Setup

To use Claude consolidation, you need an Anthropic API key. Keys are available at console.anthropic.com.

1
Open Settings → Consolidation

Go to Harvestry → Settings and click the Consolidation tab.

2
Enter your API key

Paste your Anthropic API key into the API Key field. The key is stored securely in the macOS Keychain — it is never stored in plain text.

3
Select a model

After the API key is saved, Harvestry fetches your available models from the Anthropic API and shows them in a dropdown. Choose a model. Click the refresh button next to the picker to update the model list if your account gains access to new models.

Claude Model Recommendations

The model list is fetched live from the Anthropic API, so it reflects whatever your account has access to. General guidance:

Model Family Best For Cost
Claude Sonnet Recommended for most use. Excellent balance of quality and speed. Produces well-structured notes with good comprehension of technical content. Moderate
Claude Haiku Fastest and cheapest. Good for straightforward lectures with clear structure. May miss nuance in dense academic content. Low
Claude Opus Highest quality. Best for highly technical, multi-topic, or non-English content where maximum comprehension matters. High
⚠️
API usage is billed by Anthropic. Consolidating a long lecture can consume a significant number of tokens (input + output). A 90-minute lecture transcript is typically 8,000–15,000 input tokens. Check your Anthropic dashboard to monitor usage.

Ollama Setup Pro

Ollama runs language models entirely on your Mac using Apple Silicon's unified memory. No data leaves the device at any point.

1
Install Ollama

Download Ollama from ollama.ai and install it. Ollama runs as a background service on localhost:11434. You can verify it is running by opening http://localhost:11434 in a browser — it should show "Ollama is running".

2
Pull a model

Open Terminal and pull a model suited to your Mac (see recommendations below). For example:

ollama pull qwen3:8b

The first pull downloads the model weights and may take a few minutes depending on size. Subsequent loads are instant.

3
Harvestry auto-detects your models

In the Pipeline Step 3 picker, select Ollama. Harvestry queries localhost:11434/api/tags and shows all installed models in a dropdown. Click the refresh button to re-query if you pull new models while Harvestry is already open.

ℹ️
Ollama endpoint. If you run Ollama on a non-default port or a remote host, you can change the endpoint URL in Settings → Consolidation → Ollama Endpoint. Default: http://localhost:11434.

Choosing an Ollama Model Pro

Not all models are equally suited to lecture summarization. Two things matter most for this use case:

Apple Silicon's unified memory architecture means your Mac's RAM is the key constraint — the model weights load directly into the memory pool shared by CPU and GPU. As a rule of thumb, a model requires roughly 1.2× its on-disk size in RAM to run comfortably. Any model that would push your system past ~80% total RAM usage will swap to SSD and run unusably slowly.

8 GB RAM (M1 / M2 MacBook Air, Mac mini)

With 8 GB unified memory, you can comfortably fit models up to about 4–5 GB on disk. Larger models will swap to SSD and run too slowly to be practical.

ModelPull CommandDiskContextNotes
Qwen3 4B ollama pull qwen3:4b 2.5 GB 256K Best structured-output quality at this tier. Handles long transcripts without truncation.
Phi-4 Mini ollama pull phi4-mini 2.5 GB 128K Very fast generation. Slightly less consistent formatting than Qwen3 4B.
Gemma4 E2B ollama pull gemma4:e2b 7.2 GB 128K Technically fits on 8 GB but leaves little headroom — may swap. Not recommended for this tier.
💡
8 GB honest assessment. Summarization quality on 4B models is noticeably limited compared to larger models — expect good structure but potentially shallow summaries on dense academic content. If quality matters, consider using Claude API instead, which has no local RAM constraint.

16 GB RAM (MacBook Pro M2 / M3, Mac mini M4)

16 GB unlocks 8–14B models and is the sweet spot for most users. You get genuinely useful note quality at comfortable speeds.

ModelPull CommandDiskContextNotes
Qwen3 14B ollama pull qwen3:14b 9.3 GB 256K Top recommendation at this tier. Excellent instruction following. Fits with ~5 GB to spare.
Qwen3 8B ollama pull qwen3:8b 5.2 GB 256K Faster than 14B with slightly lower quality. Good starting point.
Qwen3 30B (MoE) ollama pull qwen3:30b 19 GB 256K MoE model (30B-A3B — 3B active parameters). Reportedly outperforms QwQ-32B at a fraction of the inference cost. Excellent if you want to push quality at this tier.
Gemma4 E4B ollama pull gemma4:e4b 9.6 GB 128K Google's Mixture-of-Experts model. Strong summarization quality; comparable to Qwen3 8B.

32 GB RAM (MacBook Pro M3 Pro / Max, Mac Studio M2)

32 GB is where on-device models become genuinely excellent. Both Gemma4 26B and Qwen3 32B produce output that holds up well against Claude API quality.

ModelPull CommandDiskContextNotes
Gemma4 26B ollama pull gemma4:26b 18 GB 256K Google's flagship MoE model. Runs efficiently — only ~3.8B parameters are active per token, so it's faster than a dense 18B model while drawing on 26B of total knowledge. Excellent prose and structure.
Qwen3 32B ollama pull qwen3:32b 20 GB 256K Dense 32B model. Slightly slower than Gemma4 26B but outstanding at following complex formatting instructions.
Qwen3.6 35B ollama pull qwen3.6:35b 24 GB 256K Newest Qwen generation (post-3.5). Efficient architecture — 35B total parameters, fast inference. Excellent for long lectures.
Mistral Small ollama pull mistral-small:22b 13 GB 128K 22B model with a compact 13 GB footprint. Fast generation, clean prose. Good if you want to leave more RAM free for other apps.

64 GB+ RAM (Mac Studio M2 Ultra / M4 Max, Mac Pro)

On high-memory machines, local models match or exceed what you'd get from most commercial APIs.

ModelPull CommandDiskContextNotes
Qwen3.6 35B ollama pull qwen3.6:35b 24 GB 256K Newest Qwen generation (post-3.5). 256K context handles the longest lectures comfortably. At 24 GB it leaves abundant headroom on a 64 GB machine. Top recommendation for this tier.
Llama 3.3 70B ollama pull llama3.3 43 GB 128K Meta's dense 70B. Highest raw parameter count that fits at this tier. Strong quality but limited to 128K context — may truncate very long lectures.
Gemma4 31B ollama pull gemma4:31b 20 GB 256K Google's dense 31B. Fast inference, excellent quality, and 256K context. Good alternative if you prefer a Google model.
💡
About Mixture-of-Experts (MoE) models. Models labelled as MoE — including Gemma4 26B and Qwen3 30B-A3B — are larger than they appear. They route each token through only a fraction of their total parameters, so they run with the speed and memory footprint of a smaller model while retaining the knowledge of a larger one. gemma4:26b is an 18 GB download but runs like a sharp 4B model with 26B of knowledge behind it.

The Consolidation Prompt

The system prompt sent to the LLM instructs it how to format the output. The default prompt asks the model to produce structured Markdown with:

You can customize the prompt in Settings → Consolidation → System Prompt. The text editor in Settings accepts any Markdown-aware instructions. A Reset to Default button restores the original prompt.

Per-Lecture Control

The consolidation mode (Off / Claude / Ollama) and model selection are stored per lecture. You can have different lectures using different modes:

Changing the mode on a pending lecture takes effect when you next click Begin Processing. Changing the mode on a completed lecture triggers an alert asking if you want to generate or regenerate consolidated notes now.

Disabling After the Fact

To remove consolidated notes from a completed lecture:

  1. On the Step 3 row, switch the mode picker from Claude or Ollama to Off.
  2. A confirmation alert appears: "Remove consolidated notes from this lecture? The HTML export will be regenerated without the notes tab."
  3. Confirm. The consolidated notes are deleted and the HTML is re-exported.

Re-running Consolidation

If a lecture is already complete and you switch its consolidation mode to Claude or Ollama, Harvestry shows an alert: "Generate consolidated notes now?" Confirming runs only Step 3 and then re-exports the HTML. The transcript and screenshots are unchanged.

Privacy

The privacy implications differ significantly between modes:

Consider using Ollama mode for lectures containing sensitive, confidential, or personally identifiable content.