Transcription
On-device speech recognition using WhisperKit and the Apple Neural Engine. No audio ever leaves your Mac.
How Transcription Works
Harvestry transcribes video audio using WhisperKit, an optimized implementation of OpenAI's Whisper automatic speech recognition (ASR) model built for Apple Silicon. The model runs directly on the Apple Neural Engine (ANE) — the dedicated machine-learning accelerator present in every M-series chip.
Key properties of the transcription engine:
- Fully on-device. No audio or text is sent anywhere. The ANE processes everything locally.
- Word-level timestamps. WhisperKit returns timestamps for every individual word, not just sentence or segment boundaries. These timestamps power the per-word highlight sync in the exported audio player.
- Auto language detection. Whisper detects the spoken language automatically. It performs best on content where a single language dominates.
- Background processing. Transcription runs asynchronously and does not block the UI. You can continue using Harvestry while transcription runs.
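To illustrate how word-level timestamps can drive per-word highlighting, here is a minimal Python sketch. It is not Harvestry's player code: the `Word` record, the `active_word_index` helper, and the sample timings are all hypothetical. It assumes each word carries a start and end time in seconds and that words are sorted by start time.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """Hypothetical word record: text plus start/end times in seconds."""
    text: str
    start: float
    end: float


def active_word_index(words: List[Word], t: float) -> Optional[int]:
    """Return the index of the word spoken at playback time t, or None in a gap."""
    starts = [w.start for w in words]
    i = bisect_right(starts, t) - 1  # last word that starts at or before t
    if i >= 0 and t < words[i].end:
        return i
    return None


# Illustrative timings only, not real transcription output.
words = [
    Word("No", 0.00, 0.18),
    Word("audio", 0.22, 0.55),
    Word("leaves", 0.60, 0.95),
]
```

A player would call a lookup like this on every playback time update and highlight the returned word. Binary search keeps each lookup fast even for long transcripts.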
Whisper Models
Five model sizes are available. Choose based on how you want to balance accuracy against speed and disk space.
| Model | Disk Size | Speed | Accuracy | Best For |
|---|---|---|---|---|
| Tiny | ~75 MB | Fastest | Basic | Testing; very short clips; clearly spoken English where accuracy is less critical |
| Base | ~150 MB | Very fast | Good | Good balance of size and quality for simple content; podcasts; meeting recordings |
| Small ★ | ~250 MB | Fast | Better | Recommended starting point; handles most lectures, accented speakers, and technical terms well |
| Medium | ~770 MB | Moderate | High | Dense academic content; multiple speakers; stronger accents; non-English material |
| Large Turbo | ~800 MB | Moderate | Best | Maximum accuracy; complex or highly technical lectures; situations where every word matters |
All five models produce word-level timestamps. The accuracy difference is most noticeable with technical jargon, accented speech, and non-English languages. ★ Small is selected by default.
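If disk space is the main constraint, the trade-off in the table can be expressed as a small lookup. The sketch below is only an illustration: the `MODELS` list copies the approximate sizes from the table, and `most_accurate_within` is a hypothetical helper, not part of Harvestry.

```python
from typing import Optional

# (name, approximate disk size in MB), ordered from least to most accurate,
# using the figures from the table above.
MODELS = [
    ("Tiny", 75),
    ("Base", 150),
    ("Small", 250),
    ("Medium", 770),
    ("Large Turbo", 800),
]


def most_accurate_within(budget_mb: int) -> Optional[str]:
    """Return the most accurate model whose size fits the budget, or None."""
    fitting = [name for name, size_mb in MODELS if size_mb <= budget_mb]
    return fitting[-1] if fitting else None
```

For example, with about 300 MB to spare, this picks Small, which is also the default.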
Downloading a Model
Models are not bundled with the app — they are downloaded on demand from Hugging Face and cached in ~/Library/Caches/. You must download at least one model before processing your first lecture.
- Click the gear icon in the top-right toolbar, or use the menu bar: Harvestry → Settings.
- Open the Transcription tab, the first tab in Settings. It lists the five available models and their download status.
- Click a model row to select it, then click Download. A progress bar tracks the download, which ranges from roughly 75 MB to 800 MB depending on the model.
- When the download finishes, the model status changes to Ready with a green checkmark. You can close Settings and begin processing.
Selecting a Model Per Lecture
Each lecture can be transcribed with any downloaded model, regardless of the global default set in Settings. On the lecture detail view, the Transcription step row shows the currently selected model in its subtitle. Click the model name to open a menu picker and select a different model before starting processing.
The model choice is stored per lecture and shown in the step row even after processing completes, so you can always see which model was used.
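The relationship between the per-lecture choice and the global default can be sketched as a simple fallback rule. This is an assumption about the behavior described above, not Harvestry's actual code, and `resolve_model` is a hypothetical name.

```python
from typing import Optional, Set


def resolve_model(lecture_choice: Optional[str],
                  global_default: str,
                  downloaded: Set[str]) -> str:
    """Pick the model for a lecture: its own choice if set, else the global default."""
    model = lecture_choice or global_default
    if model not in downloaded:
        # Only downloaded models can be used for transcription.
        raise ValueError(f"{model} has not been downloaded")
    return model
```

Here a lecture with no explicit choice falls back to the global default, while an explicit choice wins as long as that model has been downloaded.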
Model Mismatch Warning
If you change the global default model in Settings after a lecture has already been transcribed, the step row for that lecture will appear in orange. This indicates a mismatch between the model that was used and the current global default — it does not mean anything is wrong, just that the transcript was generated with a different model than your current preference.
To clear the mismatch warning, either:
- Change the per-lecture model selection back to match the transcript, or
- Retranscribe the lecture with the new model (see below).
Retranscribing
You can re-run transcription on a completed lecture at any time, for example to use a more accurate model or to pick up improvements after updating WhisperKit.
- Select the lecture in the sidebar.
- In the detail view, click the ⋯ overflow menu in the toolbar.
- Choose Retranscribe.
- A confirmation dialog warns that the existing transcript will be replaced. Confirm to proceed.
Retranscription runs only steps 1 and 4 of the pipeline — your existing screenshots are preserved. The HTML export is regenerated after the new transcript is ready.
Updating WhisperKit Models
Harvestry checks Hugging Face for model updates every 24 hours by comparing the SHA of each local model snapshot against the Hugging Face Hub HEAD. When an update is available, the model row in Settings shows an update badge. Click Update to download the newer weights.
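The update check described above comes down to two decisions: whether a check is due, and whether the hashes differ. The following Python sketch shows that logic under stated assumptions. The function names are hypothetical, and how Harvestry stores the last check time or fetches the remote hash is not documented here.

```python
from datetime import datetime, timedelta
from typing import Optional

CHECK_INTERVAL = timedelta(hours=24)  # interval stated in this section


def is_check_due(last_checked: Optional[datetime], now: datetime) -> bool:
    """A check is due if none has run yet or at least 24 hours have passed."""
    return last_checked is None or now - last_checked >= CHECK_INTERVAL


def update_available(local_sha: str, remote_head_sha: str) -> bool:
    """An update exists when the local snapshot differs from the Hub HEAD."""
    return local_sha != remote_head_sha
```

In a Python tool, the remote hash could be read with the `huggingface_hub` package, for example `HfApi().model_info(repo_id).sha`. The sketch leaves that network call out so it stays self-contained.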