Speech-to-Text
Transcribe audio to text with speaker detection and timestamps
Transcribe audio files to text with support for multiple languages, speaker detection (diarization), timestamps, and translation.
Endpoint
POST /audio/transcriptions
Required scope: stt
Request Body
{
"file": "https://example.com/audio.mp3",
"model": "fal-ai/whisper",
"language": "auto",
"response_format": "json",
"timestamp_granularities": ["segment"],
"diarization": false,
"translation": false
}
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| file | string | Yes | - | Audio URL or base64 data |
| model | string | No | fal-ai/whisper | STT model to use |
| language | string | No | auto | ISO language code or auto |
| response_format | string | No | json | Output format |
| timestamp_granularities | array | No | ["segment"] | Timestamp detail level |
| diarization | boolean | No | false | Enable speaker detection |
| translation | boolean | No | false | Translate to English |
| model_inputs | object | No | - | Additional model-specific parameters |
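The file parameter accepts base64 data as well as a URL. The sketch below shows one way to prepare a local recording for the request body. It assumes the API accepts a bare base64 string in file; this page does not say whether a data URI prefix is expected instead, so treat the encoding as an assumption. The helper name build_payload_from_file is hypothetical.

```python
import base64
from pathlib import Path


def build_payload_from_file(path: str, **options) -> dict:
    """Build a request body whose `file` field holds base64-encoded audio.

    Assumption: the API accepts a bare base64 string (no data-URI prefix).
    """
    audio_bytes = Path(path).read_bytes()
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    # Extra keyword arguments become additional request parameters,
    # for example language="en" or diarization=True.
    return {"file": encoded, **options}
```

The returned dictionary can be passed as the JSON body of the POST request, in place of a payload that carries a URL.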
Available Models
| Model ID | Tier | Features |
|---|---|---|
| fal-ai/whisper | Fast | Standard transcription (default) |
| fal-ai/wizper | Standard | Enhanced accuracy |
| fal-ai/speech-to-text/turbo | Fast | Fast transcription |
| fal-ai/elevenlabs/speech-to-text | Premium | ElevenLabs quality |
Response Formats
| Format | Description |
|---|---|
| json | JSON with text, segments, and metadata |
| text | Plain text only |
| srt | SubRip subtitle format |
| vtt | WebVTT subtitle format |
Timestamp Granularities
- segment: Sentence/phrase-level timestamps
- word: Individual word timestamps
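Since response_format and timestamp_granularities each take a small set of documented values, it can help to check them before sending a request. The following is a minimal client-side sketch; the function name build_request is hypothetical, and the server may accept values beyond the ones listed on this page.

```python
ALLOWED_FORMATS = {"json", "text", "srt", "vtt"}
ALLOWED_GRANULARITIES = {"segment", "word"}


def build_request(file, response_format="json",
                  timestamp_granularities=("segment",), **extra):
    """Validate the documented options locally and return the request body."""
    if not file:
        raise ValueError("file is required")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    unknown = set(timestamp_granularities) - ALLOWED_GRANULARITIES
    if unknown:
        raise ValueError(
            f"unsupported timestamp_granularities: {sorted(unknown)}"
        )
    return {
        "file": file,
        "response_format": response_format,
        "timestamp_granularities": list(timestamp_granularities),
        **extra,  # e.g. model, language, diarization, translation
    }
```

Catching a misspelled option locally avoids a round trip that would otherwise end in an error response.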
Response (JSON format)
{
"text": "Hello, this is a test recording.",
"segments": [
{
"start": 0.0,
"end": 2.5,
"text": "Hello, this is a test recording.",
"speaker": null
}
],
"language": "en",
"duration": 2.5,
"model": "fal-ai/whisper",
"model_tier": "fast",
"usage": {
"seconds_used": 3,
"quota_multiplier": 1.0,
"seconds_remaining": 3597
}
}
Quota & Pricing
STT uses the stt_seconds quota bucket. The audio duration is multiplied by the multiplier for the model's tier:
| Tier | Multiplier | Example Models |
|---|---|---|
| Fast | 1x | Whisper |
| Standard | 1.25x | Wizper |
| Premium | 1.5x | Whisper Diarize |
Plan limits:
| Plan | STT Seconds/Month |
|---|---|
| Free | - |
| Pro | 3,600 (60 min) |
| Business | 18,000 (300 min) |
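The quota arithmetic above can be sketched as a small estimator. The multipliers come from the tier table; rounding up to a whole second is an assumption inferred from the sample response (a 2.5-second clip reporting seconds_used of 3), and this page does not state whether rounding happens before or after the multiplier is applied. The function name estimate_seconds_used is hypothetical.

```python
import math

# Multipliers from the tier table above.
TIER_MULTIPLIERS = {"fast": 1.0, "standard": 1.25, "premium": 1.5}


def estimate_seconds_used(duration_seconds: float, tier: str) -> int:
    """Estimate the stt_seconds consumed by one transcription.

    Assumption: the multiplied duration is rounded up to a whole second.
    """
    return math.ceil(duration_seconds * TIER_MULTIPLIERS[tier])
```

Under these assumptions, a 10-minute recording on a Standard-tier model would consume 600 × 1.25 = 750 of the Pro plan's 3,600 monthly seconds.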
Examples
curl -X POST https://www.novakit.ai/api/v1/audio/transcriptions \
-H "Authorization: Bearer sk_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"file": "https://example.com/podcast.mp3",
"language": "auto",
"diarization": true
}'

import requests

# Request a diarized transcript with both segment- and word-level timestamps.
response = requests.post(
    "https://www.novakit.ai/api/v1/audio/transcriptions",
    headers={
        "Authorization": "Bearer sk_your_api_key",
        "Content-Type": "application/json"
    },
    json={
        "file": "https://example.com/podcast.mp3",
        "language": "auto",
        "diarization": True,
        "timestamp_granularities": ["segment", "word"]
    }
)
# Raise on 4xx/5xx responses instead of parsing an error body as a transcript.
response.raise_for_status()
data = response.json()
print(f"Transcript: {data['text']}")
print(f"Duration: {data['duration']} seconds")
print(f"Detected language: {data['language']}")

const response = await fetch(
  "https://www.novakit.ai/api/v1/audio/transcriptions",
  {
    method: "POST",
    headers: {
      "Authorization": "Bearer sk_your_api_key",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      file: "https://example.com/podcast.mp3",
      language: "auto",
      diarization: true,
    }),
  }
);
const data = await response.json();
console.log(`Transcript: ${data.text}`);
Speaker Diarization
Set diarization: true to detect different speakers. Each segment in the response then carries a speaker label:
{
"text": "Hello. Hi there. How are you?",
"segments": [
{
"start": 0.0,
"end": 0.8,
"text": "Hello.",
"speaker": "SPEAKER_00"
},
{
"start": 1.0,
"end": 1.5,
"text": "Hi there.",
"speaker": "SPEAKER_01"
},
{
"start": 2.0,
"end": 2.8,
"text": "How are you?",
"speaker": "SPEAKER_00"
}
]
}
Speaker labels (SPEAKER_00, SPEAKER_01, etc.) are consistent within a single transcription but may vary between requests.
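A diarized response is often easier to read as a dialogue. The sketch below merges consecutive segments from the same speaker into one line, using only the segment fields shown in the response above. The function name format_dialogue and the UNKNOWN fallback label are illustrative choices, not part of the API.

```python
def format_dialogue(segments):
    """Merge consecutive segments from the same speaker into dialogue lines."""
    lines = []
    for seg in segments:
        # `speaker` is null when diarization is off; fall back to a placeholder.
        speaker = seg.get("speaker") or "UNKNOWN"
        text = seg["text"].strip()
        if lines and lines[-1][0] == speaker:
            # Same speaker as the previous segment: extend their line.
            lines[-1][1] += " " + text
        else:
            lines.append([speaker, text])
    return [f"{speaker}: {text}" for speaker, text in lines]


segments = [
    {"start": 0.0, "end": 0.8, "text": "Hello.", "speaker": "SPEAKER_00"},
    {"start": 1.0, "end": 1.5, "text": "Hi there.", "speaker": "SPEAKER_01"},
    {"start": 2.0, "end": 2.8, "text": "How are you?", "speaker": "SPEAKER_00"},
]
for line in format_dialogue(segments):
    print(line)
```

Because labels may vary between requests, any mapping from a label to a real name should be kept per transcription rather than reused across calls.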
Translation
Set translation: true to translate non-English audio to English:
{
"file": "https://example.com/spanish-audio.mp3",
"translation": true
}
The response will contain English text regardless of the source language. The original language is still detected and returned in the language field.
Generating Subtitles
Request SRT or VTT format for video subtitles:
curl -X POST https://www.novakit.ai/api/v1/audio/transcriptions \
-H "Authorization: Bearer sk_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"file": "https://example.com/video.mp4",
"response_format": "srt"
}'
Response:
1
00:00:00,000 --> 00:00:02,500
Hello, this is a test recording.

2
00:00:03,000 --> 00:00:05,200
And this is the second sentence.

curl -X POST https://www.novakit.ai/api/v1/audio/transcriptions \
-H "Authorization: Bearer sk_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"file": "https://example.com/video.mp4",
"response_format": "vtt"
}'
Response:
WEBVTT

00:00:00.000 --> 00:00:02.500
Hello, this is a test recording.

00:00:03.000 --> 00:00:05.200
And this is the second sentence.

Word-Level Timestamps
Request word-level timestamps for precise synchronization:
{
"file": "https://example.com/audio.mp3",
"timestamp_granularities": ["word"]
}
Response includes:
{
"segments": [
{
"start": 0.0,
"end": 2.5,
"text": "Hello world",
"words": [
{"word": "Hello", "start": 0.0, "end": 0.5},
{"word": "world", "start": 0.6, "end": 1.0}
]
}
]
}
Supported Languages
The API supports 90+ languages including:
| Code | Language | Code | Language |
|---|---|---|---|
| en | English | de | German |
| es | Spanish | fr | French |
| zh | Chinese | ja | Japanese |
| ko | Korean | pt | Portuguese |
| ar | Arabic | hi | Hindi |
| ru | Russian | it | Italian |
| nl | Dutch | pl | Polish |
| tr | Turkish | vi | Vietnamese |
Use language: "auto" for automatic language detection.
Error Handling
Common errors you may encounter:
| Status | Error | Solution |
|---|---|---|
| 400 | file is required | Provide audio URL or base64 |
| 402 | STT limit exceeded | Upgrade plan or wait for reset |
| 403 | Model tier not allowed | Upgrade plan for premium models |
| 500 | Transcription failed | Check audio format/quality |
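The table above can be turned into a small lookup that maps a status code to a suggested next step. This is a sketch only: the shape of the error response body is not documented here, so it relies solely on the HTTP status code, and the name explain_status is hypothetical.

```python
# Suggested actions, taken from the error table above.
ERROR_HINTS = {
    400: "Provide an audio URL or base64 data in `file`.",
    402: "STT limit exceeded: upgrade the plan or wait for the reset.",
    403: "Model tier not allowed: upgrade the plan for premium models.",
    500: "Transcription failed: check the audio format and quality.",
}


def explain_status(status_code: int) -> str:
    """Return a human-readable hint for a transcription response status."""
    if 200 <= status_code < 300:
        return "OK"
    # Codes outside the documented table get a generic message.
    return ERROR_HINTS.get(status_code, f"Unexpected status {status_code}")
```

With the requests example shown earlier, calling explain_status(response.status_code) before response.json() gives a readable message for the documented failure cases.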
Supported Audio Formats
- MP3
- WAV
- M4A
- FLAC
- OGG
- WebM
For best results, use audio with:
- Clear speech without excessive background noise
- Sample rate of 16kHz or higher
- Single speaker or clear speaker separation