Speech-to-Text

Transcribe audio files to text with support for multiple languages, speaker detection (diarization), timestamps, and translation.

Endpoint

POST /audio/transcriptions

Required scope: `stt`

Request Body

{
  "file": "https://example.com/audio.mp3",
  "model": "fal-ai/whisper",
  "language": "auto",
  "response_format": "json",
  "timestamp_granularities": ["segment"],
  "diarization": false,
  "translation": false
}

Parameters

Parameter	Type	Required	Default	Description
`file`	string	Yes	-	Audio URL or base64 data
`model`	string	No	`fal-ai/whisper`	STT model to use
`language`	string	No	`auto`	ISO language code or `auto`
`response_format`	string	No	`json`	Output format
`timestamp_granularities`	array	No	`["segment"]`	Timestamp detail level
`diarization`	boolean	No	`false`	Enable speaker detection
`translation`	boolean	No	`false`	Translate to English
`model_inputs`	object	No	-	Additional model-specific parameters

Available Models

Model ID	Tier	Features
`fal-ai/whisper`	Fast	Standard transcription (default)
`fal-ai/wizper`	Standard	Enhanced accuracy
`fal-ai/speech-to-text/turbo`	Fast	Fast transcription
`fal-ai/elevenlabs/speech-to-text`	Premium	ElevenLabs quality

Response Formats

Format	Description
`json`	JSON with text, segments, and metadata
`text`	Plain text only
`srt`	SubRip subtitle format
`vtt`	WebVTT subtitle format

Timestamp Granularities

`segment`: sentence- or phrase-level timestamps
`word`: individual word timestamps
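
The defaults and allowed values listed above can be checked locally before a request is sent. The following is a minimal sketch, not part of any official client: `build_transcription_request` is a hypothetical helper, and whether the server rejects unknown parameters is an assumption.

```python
# Allowed values copied from the tables on this page.
RESPONSE_FORMATS = {"json", "text", "srt", "vtt"}
GRANULARITIES = {"segment", "word"}


def build_transcription_request(file, **options):
    """Merge caller options over the documented defaults and validate them.

    Hypothetical helper for illustration only.
    """
    body = {
        "file": file,
        "model": "fal-ai/whisper",
        "language": "auto",
        "response_format": "json",
        "timestamp_granularities": ["segment"],
        "diarization": False,
        "translation": False,
    }
    # model_inputs has no default, so it is only included when supplied.
    unknown = set(options) - set(body) - {"model_inputs"}
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    body.update(options)

    if not body["file"]:
        raise ValueError("file is required")
    if body["response_format"] not in RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {body['response_format']}")
    bad = set(body["timestamp_granularities"]) - GRANULARITIES
    if bad:
        raise ValueError(f"unsupported timestamp granularities: {sorted(bad)}")
    return body
```

The returned dictionary can be passed as the `json=` argument of `requests.post`.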

Response (JSON format)

{
  "text": "Hello, this is a test recording.",
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, this is a test recording.",
      "speaker": null
    }
  ],
  "language": "en",
  "duration": 2.5,
  "model": "fal-ai/whisper",
  "model_tier": "fast",
  "usage": {
    "seconds_used": 3,
    "quota_multiplier": 1.0,
    "seconds_remaining": 3597
  }
}

Quota & Pricing

STT usage is counted against the `stt_seconds` quota bucket. The audio duration is multiplied by the multiplier for the model's tier:

Tier	Multiplier	Example Models
Fast	1x	Whisper
Standard	1.25x	Wizper
Premium	1.5x	Whisper Diarize

Plan limits:

Plan	STT Seconds/Month
Free	-
Pro	3,600 (60 min)
Business	18,000 (300 min)

Examples

curl -X POST https://www.novakit.ai/api/v1/audio/transcriptions \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "file": "https://example.com/podcast.mp3",
    "language": "auto",
    "diarization": true
  }'

import requests

response = requests.post(
    "https://www.novakit.ai/api/v1/audio/transcriptions",
    headers={
        "Authorization": "Bearer sk_your_api_key",
        "Content-Type": "application/json"
    },
    json={
        "file": "https://example.com/podcast.mp3",
        "language": "auto",
        "diarization": True,
        "timestamp_granularities": ["segment", "word"]
    }
)

response.raise_for_status()  # stop here on 4xx/5xx instead of parsing an error body
data = response.json()
print(f"Transcript: {data['text']}")
print(f"Duration: {data['duration']} seconds")
print(f"Detected language: {data['language']}")

const response = await fetch(
  "https://www.novakit.ai/api/v1/audio/transcriptions",
  {
    method: "POST",
    headers: {
      "Authorization": "Bearer sk_your_api_key",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      file: "https://example.com/podcast.mp3",
      language: "auto",
      diarization: true,
    }),
  }
);

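// Stop on 4xx/5xx instead of reading fields from an error body
if (!response.ok) {
  throw new Error(`Transcription request failed: HTTP ${response.status}`);
}
const data = await response.json();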
console.log(`Transcript: ${data.text}`);

Speaker Diarization

Set `diarization: true` to label each segment with the speaker who said it:

{
  "text": "Hello. Hi there. How are you?",
  "segments": [
    {
      "start": 0.0,
      "end": 0.8,
      "text": "Hello.",
      "speaker": "SPEAKER_00"
    },
    {
      "start": 1.0,
      "end": 1.5,
      "text": "Hi there.",
      "speaker": "SPEAKER_01"
    },
    {
      "start": 2.0,
      "end": 2.8,
      "text": "How are you?",
      "speaker": "SPEAKER_00"
    }
  ]
}

Speaker labels (`SPEAKER_00`, `SPEAKER_01`, etc.) are consistent within a single transcription but may vary between requests.
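
A diarized response can be turned into a readable dialogue by merging consecutive segments from the same speaker. This is a minimal sketch over the segment shape shown above; `format_dialogue` is a hypothetical helper, and the `UNKNOWN` fallback for a `null` speaker is an illustrative choice.

```python
def format_dialogue(segments):
    """Render segments as 'SPEAKER: text' lines, merging consecutive turns."""
    turns = []  # list of (speaker, [texts]) in spoken order
    for seg in segments:
        speaker = seg.get("speaker") or "UNKNOWN"
        if turns and turns[-1][0] == speaker:
            # Same speaker is still talking: extend the current turn.
            turns[-1][1].append(seg["text"])
        else:
            turns.append((speaker, [seg["text"]]))
    return "\n".join(f"{speaker}: {' '.join(texts)}" for speaker, texts in turns)
```

Because labels may differ between requests, map them to real names per transcription rather than storing a global mapping.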

Translation

Set `translation: true` to translate non-English audio to English:

{
  "file": "https://example.com/spanish-audio.mp3",
  "translation": true
}

The response will contain English text regardless of the source language. The original language is still detected and returned in the language field.

Generating Subtitles

Request SRT or VTT format for video subtitles:

curl -X POST https://www.novakit.ai/api/v1/audio/transcriptions \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "file": "https://example.com/video.mp4",
    "response_format": "srt"
  }'

Response:

1
00:00:00,000 --> 00:00:02,500
Hello, this is a test recording.

2
00:00:03,000 --> 00:00:05,200
And this is the second sentence.

curl -X POST https://www.novakit.ai/api/v1/audio/transcriptions \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "file": "https://example.com/video.mp4",
    "response_format": "vtt"
  }'

Response:

WEBVTT

00:00:00.000 --> 00:00:02.500
Hello, this is a test recording.

00:00:03.000 --> 00:00:05.200
And this is the second sentence.
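
If a `json` response is already on hand, the same subtitles can be produced locally instead of issuing a second request. The sketch below builds SRT text from the `segments` array; the helper names are hypothetical, and the output is not guaranteed to match the API's own SRT line for line.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text: numbered cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        cues.append(f"{index}\n{start} --> {end}\n{seg['text']}")
    return "\n\n".join(cues) + "\n"
```

WebVTT differs mainly in the `WEBVTT` header line and in using a period rather than a comma before the milliseconds.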

Word-Level Timestamps

Request word-level timestamps for precise synchronization:

{
  "file": "https://example.com/audio.mp3",
  "timestamp_granularities": ["word"]
}

Each segment in the response then includes a `words` array:

{
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Hello world",
      "words": [
        {"word": "Hello", "start": 0.0, "end": 0.5},
        {"word": "world", "start": 0.6, "end": 1.0}
      ]
    }
  ]
}
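
Word timestamps are typically used to highlight the word being spoken at a given playback position. The following is a minimal sketch over the response shape shown above; `iter_words` and `word_at` are hypothetical helpers, and treating the end time as exclusive is an illustrative choice.

```python
def iter_words(response):
    """Yield every word entry across all segments, in order."""
    for segment in response.get("segments", []):
        yield from segment.get("words", [])


def word_at(response, position_seconds):
    """Return the word spoken at the given time, or None during a gap."""
    for entry in iter_words(response):
        if entry["start"] <= position_seconds < entry["end"]:
            return entry["word"]
    return None
```

A linear scan is enough for short clips; for long recordings, a binary search over the start times would avoid rescanning on every playback tick.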

Supported Languages

The API supports 90+ languages including:

Code	Language	Code	Language
`en`	English	`de`	German
`es`	Spanish	`fr`	French
`zh`	Chinese	`ja`	Japanese
`ko`	Korean	`pt`	Portuguese
`ar`	Arabic	`hi`	Hindi
`ru`	Russian	`it`	Italian
`nl`	Dutch	`pl`	Polish
`tr`	Turkish	`vi`	Vietnamese

Use `language: "auto"` for automatic language detection.

Error Handling

Common errors you may encounter:

Status	Error	Solution
400	`file is required`	Provide audio URL or base64
402	`STT limit exceeded`	Upgrade plan or wait for reset
403	`Model tier not allowed`	Upgrade plan for premium models
500	`Transcription failed`	Check audio format/quality
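
The status codes in the table above can be mapped to an actionable hint on the client side. This sketch works from the HTTP status alone, because the shape of the error response body is not documented here; `TranscriptionError` and `raise_for_transcription_error` are hypothetical names.

```python
# Hints paraphrased from the error table above.
SOLUTIONS = {
    400: "Provide an audio URL or base64 data in `file`.",
    402: "Upgrade the plan or wait for the quota to reset.",
    403: "Upgrade the plan to use premium models.",
    500: "Check the audio format and quality.",
}


class TranscriptionError(Exception):
    """Raised for non-2xx responses, carrying the status and a hint."""

    def __init__(self, status, hint):
        super().__init__(f"HTTP {status}: {hint}")
        self.status = status
        self.hint = hint


def raise_for_transcription_error(status):
    """Do nothing on 2xx; otherwise raise with the documented solution."""
    if 200 <= status < 300:
        return
    hint = SOLUTIONS.get(status, "Unexpected error; inspect the response body.")
    raise TranscriptionError(status, hint)
```

With `requests`, this would be called as `raise_for_transcription_error(response.status_code)` before reading `response.json()`.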

Supported Audio Formats

MP3
WAV
M4A
FLAC
OGG
WebM

For best results, use audio with:

Clear speech without excessive background noise
Sample rate of 16kHz or higher
Single speaker or clear speaker separation
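
For local files, the format list above can be checked before uploading, and the file can be encoded for the `file` parameter. This is a sketch under stated assumptions: the extension check is only a local heuristic, and sending raw base64 (rather than a `data:` URI) is an assumption, since the expected encoding is not specified on this page. Both function names are hypothetical.

```python
import base64
from pathlib import Path

# Extensions derived from the supported formats listed above.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm"}


def is_supported_audio(path):
    """Check the file extension against the documented formats."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def encode_audio_file(path):
    """Return the file contents as a base64 string (raw base64 is assumed)."""
    if not is_supported_audio(path):
        raise ValueError(f"unsupported audio format: {Path(path).suffix or path}")
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")
```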
