AI Starter Package
Open Source · Apache-2.0 License · 8.8K Stars

Free Audio Transcription with Insanely Fast Whisper

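Transcribe 150 minutes of audio in 98 seconds (the project's published benchmark, run on an NVIDIA A100 80GB with Flash Attention 2). Zero cost, fully local, with speaker diarization and translation into English from the 99 languages Whisper supports. One of the fastest Whisper setups available.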

150 min in 98 seconds · $0 total cost · 8.8K GitHub stars · 99 languages
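
As a quick sanity check on the headline figure, the arithmetic below converts "150 minutes in 98 seconds" into a real-time speedup and uses it for a rough estimate on a shorter file. This is only a sketch: the helper names speedup and eta_seconds are illustrative, and the estimate assumes your hardware performs like the benchmark machine, which it may not.

```shell
# speedup AUDIO_MINUTES WALL_SECONDS: how many times faster than real time
speedup() {
  awk -v m="$1" -v s="$2" 'BEGIN { printf "%.1f\n", (m * 60) / s }'
}

# eta_seconds AUDIO_MINUTES SPEEDUP: rough wall-clock seconds at that speedup
eta_seconds() {
  awk -v m="$1" -v x="$2" 'BEGIN { printf "%.0f\n", (m * 60) / x }'
}

speedup 150 98        # prints 91.8 (150 min of audio in 98 s)
eta_seconds 60 91.8   # prints 39 (a one-hour file at the same rate)
```

Treat the result as a best case. A smaller GPU, a lower batch size, or running without Flash Attention 2 will all reduce the speedup.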

Why Pay for Transcription?

Most transcription services charge per minute. Insanely Fast Whisper runs locally on your GPU for free.

Service                 Per Minute   Per Hour    Note
OpenAI Whisper API      $0.006/min   $0.36/hr    Cloud-dependent
Google Speech-to-Text   $0.024/min   $1.44/hr    Premium tier
Rev (Human)             $1.50/min    $90.00/hr   Manual turnaround
Otter.ai                n/a          n/a         Subscription required (~$8.33/mo, $100/yr)
AWS Transcribe          $0.024/min   $1.44/hr    Cloud-dependent
Insanely Fast Whisper   $0           $0          Open source, local
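
To see what the per-minute rates add up to, the sketch below multiplies a recording length by a rate taken from the table. The function name estimate_cost is illustrative, and the rates are the ones listed above, so check each provider's current pricing before relying on them.

```shell
# estimate_cost MINUTES RATE_PER_MINUTE: charge in dollars, two decimals
estimate_cost() {
  awk -v m="$1" -v r="$2" 'BEGIN { printf "%.2f\n", m * r }'
}

estimate_cost 150 0.006   # OpenAI Whisper API rate: prints 0.90
estimate_cost 150 0.024   # Google / AWS rate:       prints 3.60
estimate_cost 150 1.50    # Rev (human) rate:        prints 225.00
```

The local option has no per-minute charge, but it does need a suitable GPU and the electricity to run it.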

Quick Start

# Install with pipx (isolated environment)
pipx install insanely-fast-whisper

# Basic transcription
insanely-fast-whisper --file-name audio.mp3

# With specific model and language
insanely-fast-whisper \
  --file-name interview.wav \
  --model-name openai/whisper-large-v3 \
  --language en \
  --transcript-path output.json

# Ultra-fast with distil model
insanely-fast-whisper \
  --file-name podcast.mp3 \
  --model-name distil-whisper/distil-large-v3 \
  --batch-size 24
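
The CLI takes one file per invocation, so a folder of recordings needs a loop. The sketch below is one way to do it, using only the --file-name and --transcript-path flags shown above. The function name transcribe_dir and the DRY_RUN switch are illustrative additions, not part of the tool.

```shell
# transcribe_dir DIR: transcribe every .mp3 in DIR, writing NAME.json beside it.
# Files that already have a .json transcript are skipped.
# Set DRY_RUN=1 to print the commands instead of running them.
transcribe_dir() {
  local f out
  for f in "$1"/*.mp3; do
    if [ ! -e "$f" ]; then continue; fi      # glob matched nothing
    out="${f%.mp3}.json"
    if [ -e "$out" ]; then continue; fi      # already transcribed
    if [ "${DRY_RUN:-0}" = "1" ]; then
      printf '%s\n' "insanely-fast-whisper --file-name $f --transcript-path $out"
    else
      insanely-fast-whisper --file-name "$f" --transcript-path "$out"
    fi
  done
}
```

Run DRY_RUN=1 transcribe_dir ./recordings first to confirm which files will be processed, then run it again without DRY_RUN. Skipping files that already have a transcript makes the loop safe to restart after an interruption.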

Speaker Diarization

Automatically identify who said what. Essential for interviews, meetings, and multi-speaker podcasts.

# Enable speaker diarization by passing a Hugging Face token
insanely-fast-whisper \
  --file-name meeting.mp3 \
  --model-name openai/whisper-large-v3 \
  --hf-token YOUR_HF_TOKEN \
  --num-speakers 3 \
  --transcript-path meeting.json

Example Output

SPEAKER_00: Welcome to the show. Today we have a special guest.
SPEAKER_01: Thanks for having me. Excited to be here.
SPEAKER_00: Let's start with your background in AI research.
SPEAKER_02: Actually, can I jump in with a question first?
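
Speaker-labelled lines like the ones above can be produced from the JSON transcript with jq. This is a minimal sketch that assumes the file written by --transcript-path has a top-level "speakers" array whose entries carry "speaker" and "text" fields. Inspect your own output file first, since the layout may differ between versions. The function name speaker_lines is illustrative.

```shell
# speaker_lines FILE: print one "SPEAKER_XX: text" line per diarized segment.
# Assumes FILE contains {"speakers": [{"speaker": ..., "text": ...}, ...]}.
speaker_lines() {
  jq -r '.speakers[] | .speaker + ": " + (.text | ltrimstr(" "))' "$1"
}

# Usage:
#   speaker_lines meeting.json > meeting.txt
```

The ltrimstr call drops the single leading space that Whisper segments usually start with.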

Translation

Transcribe and translate audio from 99 languages into English in a single pass.

# Translate Spanish audio to English text
insanely-fast-whisper \
  --file-name spanish_interview.mp3 \
  --model-name openai/whisper-large-v3 \
  --task translate \
  --transcript-path english_output.json

# Transcribe in original language (no translation)
insanely-fast-whisper \
  --file-name japanese_podcast.mp3 \
  --task transcribe \
  --language ja

99 supported languages · Single-pass transcribe + translate · Large-v3 for best accuracy

How It Compares

Feature           Insanely Fast Whisper    OpenAI Whisper API  Google STT       AWS Transcribe
Accuracy          Large-v3 (best)          Large-v2            Varies by model  Custom models
Speed             150 min in 98 sec        API-limited         API-limited      Near realtime
Cost              Free (local GPU)         $0.006/min          $0.024/min       $0.024/min
Privacy           Fully local              Cloud upload        Cloud upload     Cloud upload
Speaker ID        Yes (diarization)        No                  Yes              Yes
Translation       99 languages to English  To English          No               No
Self-hosted       Yes                      No                  No               No
Batch processing  Yes                      Limited             Yes              Yes

Use Cases

Podcasters

Generate full episode transcripts and show notes in seconds, not hours

Journalists

Transcribe interviews with speaker labels to identify who said what

Researchers

Process hundreds of hours of recorded interviews for qualitative analysis

Lawyers

Transcribe depositions, hearings, and client calls with zero cloud exposure

Content Creators

Turn YouTube videos into blog posts, captions, and social content

Students

Convert lecture recordings into searchable, study-ready text notes

Tips

  1. Use a GPU with at least 10GB VRAM for large-v3. On smaller GPUs, use distil-large-v3, which is nearly as accurate and about 6x faster but supports English only.
  2. For long files (2hr+), enable Flash Attention 2 with --flash True and keep --batch-size 24 (the default) to maximize throughput. If you run out of memory, lower the batch size.
  3. Speaker diarization requires a Hugging Face token. Accept the terms for pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0 on the Hub first.
  4. Specify --language whenever you know the source language. Auto-detection looks only at the first 30 seconds of audio, so it can pick the wrong language when a recording opens with music, silence, or a different language.
  5. Use jq for clean JSON processing: write the transcript with --transcript-path output.json, then parse that file with jq.
  6. On noisy audio, band-limiting to the speech range with ffmpeg before transcribing can help: ffmpeg -i input.mp3 -af "highpass=f=200,lowpass=f=3000" clean.wav
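
The jq tip above can be sketched as two small helpers. They assume the transcript JSON has a top-level "text" string and a "chunks" array whose entries hold a "timestamp" pair ([start, end] in seconds) and a "text" field. Verify that against your own output.json, and note that the function names are illustrative.

```shell
# transcript_text FILE: print the full transcript as plain text
transcript_text() {
  jq -r '.text | ltrimstr(" ")' "$1"
}

# transcript_chunks FILE: print one tab-separated line per chunk:
#   start_seconds <TAB> end_seconds <TAB> text
transcript_chunks() {
  jq -r '.chunks[]
         | [.timestamp[0], .timestamp[1], (.text | ltrimstr(" "))]
         | @tsv' "$1"
}

# Usage:
#   transcript_text output.json > output.txt
#   transcript_chunks output.json > output.tsv
```

The tab-separated form opens directly in a spreadsheet and is a convenient starting point for building captions.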
