Open Source · MIT License · 8.8K Stars

Free Audio Transcription with Insanely Fast Whisper

Transcribe 150 minutes of audio in 98 seconds (the project's benchmark figure, measured on an Nvidia A100 80GB GPU). Zero cost, fully local, with speaker diarization and translation from 99 languages into English. One of the fastest Whisper implementations available.

150 min in 98 seconds · $0 total cost · 8.8K GitHub stars · 99 languages

Why Pay for Transcription?

Most transcription services charge per minute. Insanely Fast Whisper runs locally on your GPU for free.

Service	Per Minute	Per Hour	Note
OpenAI Whisper API	$0.006/min	$0.36/hr	Cloud-dependent
Google Speech-to-Text	$0.024/min	$1.44/hr	Premium tier
Rev (Human)	$1.50/min	$90.00/hr	Manual turnaround
Otter.ai	Subscription	Subscription	~$8.33/mo ($100/yr), subscription required
AWS Transcribe	$0.024/min	$1.44/hr	Cloud-dependent
Insanely Fast Whisper	$0	$0	Open source, local
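
For a rough sense of scale, the per-minute rates in the table can be multiplied out for a specific workload. The sketch below is illustrative only: `cost_for_minutes` is a hypothetical helper introduced here, the rates are copied from the table above, and real invoices may differ because of rounding, tiers, or price changes.

```shell
# cost_for_minutes RATE_PER_MIN MINUTES
# Prints the cost in dollars, rounded to two decimals.
# Hypothetical helper; rates come from the pricing table above.
cost_for_minutes() {
  LC_ALL=C awk -v rate="$1" -v minutes="$2" \
    'BEGIN { printf "%.2f\n", rate * minutes }'
}

# Cost of the 150-minute example at each per-minute rate
cost_for_minutes 0.006 150   # OpenAI Whisper API
cost_for_minutes 0.024 150   # Google Speech-to-Text / AWS Transcribe
cost_for_minutes 1.50 150    # Rev (Human)
```

The local option has no per-minute fee, but the comparison leaves out the cost of the GPU and the electricity to run it.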

Quick Start

# Install with pipx (isolated environment)
pipx install insanely-fast-whisper

# Basic transcription
insanely-fast-whisper --file-name audio.mp3

# With specific model and language
insanely-fast-whisper \
  --file-name interview.wav \
  --model-name openai/whisper-large-v3 \
  --language en \
  --transcript-path output.json

# Ultra-fast with distil model
insanely-fast-whisper \
  --file-name podcast.mp3 \
  --model-name distil-whisper/distil-large-v3 \
  --batch-size 24
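
The Quick Start commands each handle a single file. For a folder of recordings, a loop along the following lines may help. It is a sketch that uses only the --file-name and --transcript-path flags from the examples in this section; `transcribe_dir` and the `IFW_BIN` override are hypothetical names introduced here so the loop can be dry-run with `echo` before spending GPU time.

```shell
# IFW_BIN is a hypothetical override: set it to `echo` for a dry run.
IFW_BIN="${IFW_BIN:-insanely-fast-whisper}"

# transcribe_dir DIR
# Transcribes every .mp3 in DIR and writes NAME.json next to NAME.mp3.
transcribe_dir() {
  dir="$1"
  for audio in "$dir"/*.mp3; do
    [ -e "$audio" ] || continue      # glob matched nothing
    out="${audio%.mp3}.json"
    [ -e "$out" ] && continue        # already transcribed, skip
    "$IFW_BIN" --file-name "$audio" --transcript-path "$out"
  done
}

# Dry run: prints the arguments for each pending file instead of transcribing
IFW_BIN=echo transcribe_dir ./recordings
```

Skipping files that already have a JSON next to them makes the loop safe to rerun after an interruption.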

Speaker Diarization

Automatically identify who said what. Essential for interviews, meetings, and multi-speaker podcasts.

# Enable speaker diarization (passing --hf-token turns it on)
insanely-fast-whisper \
  --file-name meeting.mp3 \
  --model-name openai/whisper-large-v3 \
  --hf-token YOUR_HF_TOKEN \
  --num-speakers 3 \
  --transcript-path meeting.json

Example Output

SPEAKER_00: Welcome to the show. Today we have a special guest.

SPEAKER_01: Thanks for having me. Excited to be here.

SPEAKER_00: Let's start with your background in AI research.

SPEAKER_02: Actually, can I jump in with a question first?
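
The transcript itself is written as JSON, so speaker-labeled lines like the ones above have to be rendered from it. The sketch below shows one way to do that with jq. It assumes the JSON has a top-level `speakers` array whose entries carry `speaker`, `timestamp`, and `text` keys; that layout is an assumption to check against your own output file, and `sample_diarized.json` and `speaker_lines` are hypothetical names used only for illustration.

```shell
# Hypothetical sample that mimics the assumed diarized transcript layout
cat > sample_diarized.json <<'EOF'
{
  "speakers": [
    {"speaker": "SPEAKER_00", "timestamp": [0.0, 3.2], "text": " Welcome to the show."},
    {"speaker": "SPEAKER_01", "timestamp": [3.2, 5.0], "text": " Thanks for having me."}
  ],
  "chunks": [],
  "text": ""
}
EOF

# speaker_lines FILE
# Prints "SPEAKER_XX: text" for each entry, trimming the leading space.
speaker_lines() {
  jq -r '.speakers[] | (.text | ltrimstr(" ")) as $t | "\(.speaker): \($t)"' "$1"
}

speaker_lines sample_diarized.json
```

Consecutive entries from the same speaker print as separate lines; merging them is left out to keep the filter short.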

Translation

Transcribe and translate audio from 99 languages into English in a single pass.

# Translate Spanish audio to English text
insanely-fast-whisper \
  --file-name spanish_interview.mp3 \
  --model-name openai/whisper-large-v3 \
  --task translate \
  --transcript-path english_output.json

# Transcribe in original language (no translation)
insanely-fast-whisper \
  --file-name japanese_podcast.mp3 \
  --task transcribe \
  --language ja

99 supported languages · Single pass: transcribe + translate · Large-v3: best accuracy model

How It Compares

Feature	Insanely Fast Whisper	OpenAI Whisper API	Google STT	AWS Transcribe
Accuracy	Large-v3 (best)	Large-v2	Varies by model	Custom models
Speed	150min in 98sec	API-limited	API-limited	Near realtime
Cost	Free (local GPU)	$0.006/min	$0.024/min	$0.024/min
Privacy	Fully local	Cloud upload	Cloud upload	Cloud upload
Speaker ID	Yes (diarize)	No	Yes	Yes
Translation	99 languages to English	Yes (to English)	No	No
Self-hosted	Yes	No	No	No
Batch processing	Yes	Limited	Yes	Yes
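
The "150min in 98sec" figure can be turned into a back-of-the-envelope estimate for larger backlogs. The sketch below simply scales that ratio linearly; `estimate_seconds` and `realtime_factor` are hypothetical helpers, and actual throughput depends on your GPU, model, and batch size, so treat the numbers as a best case.

```shell
# realtime_factor
# How many seconds of audio are processed per second at 150 min / 98 s.
realtime_factor() {
  LC_ALL=C awk 'BEGIN { printf "%.1f\n", 150 * 60 / 98 }'
}

# estimate_seconds AUDIO_MINUTES
# Wall-clock seconds for AUDIO_MINUTES of audio at the same rate.
estimate_seconds() {
  LC_ALL=C awk -v minutes="$1" 'BEGIN { printf "%.0f\n", minutes * 98 / 150 }'
}

realtime_factor          # roughly 92x faster than real time
estimate_seconds 6000    # 100 hours of audio
```

At the quoted rate, 100 hours of audio works out to a little over an hour of processing.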

Use Cases

Podcasters

Generate full episode transcripts and show notes in seconds, not hours

Journalists

Transcribe interviews with speaker labels to identify who said what

Researchers

Process hundreds of hours of recorded interviews for qualitative analysis

Lawyers

Transcribe depositions, hearings, and client calls with zero cloud exposure

Content Creators

Turn YouTube videos into blog posts, captions, and social content

Students

Convert lecture recordings into searchable, study-ready text notes

Tips

1. Use a GPU with at least 10GB VRAM for large-v3. For smaller GPUs, use distil-large-v3, which offers nearly the same accuracy at 6x speed.
2. For long files (2hr+), enable Flash Attention 2 (--flash True) with --batch-size 24 to maximize throughput. Lower the batch size if you run out of memory.
3. Speaker diarization requires a Hugging Face token. Accept the terms for pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0 on the Hub first.
4. Always specify --language if you know the source language. Auto-detection works, but it adds a detection step and can guess wrong when a recording opens with music, silence, or a different language.
5. Write the transcript to a file with --transcript-path output.json, then parse it with jq for clean JSON processing.
6. For maximum accuracy on noisy audio, preprocess with ffmpeg: ffmpeg -i input.mp3 -af "highpass=f=200,lowpass=f=3000" clean.wav
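
As an illustration of the jq tip, the sketch below pulls the plain text and the timestamped chunks out of a transcript file. It assumes the JSON has a top-level `text` string and a `chunks` array of `{timestamp, text}` entries; verify that against your own output before relying on it. `sample_transcript.json`, `transcript_text`, and `transcript_chunks` are hypothetical names used only here.

```shell
# Hypothetical sample that mimics the assumed transcript layout
cat > sample_transcript.json <<'EOF'
{
  "speakers": [],
  "chunks": [
    {"timestamp": [0.0, 2.4], "text": " Welcome to the show."},
    {"timestamp": [2.4, 5.1], "text": " Today we have a special guest."}
  ],
  "text": " Welcome to the show. Today we have a special guest."
}
EOF

# transcript_text FILE: the full transcript as one plain-text string
transcript_text() {
  jq -r '.text' "$1"
}

# transcript_chunks FILE: one tab-separated line per chunk (start, end, text)
transcript_chunks() {
  jq -r '.chunks[] | [.timestamp[0], .timestamp[1], .text] | @tsv' "$1"
}

transcript_text sample_transcript.json
transcript_chunks sample_transcript.json
```

The tab-separated form is convenient for spreadsheets or for building caption files with a follow-up script.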
