NovaKit v1.0

Speech-to-Text

Transcribe audio to text with speaker detection and timestamps

Transcribe audio files to text with support for multiple languages, speaker detection (diarization), timestamps, and translation.

Endpoint

POST /audio/transcriptions

Required scope: stt

Request Body

{
  "file": "https://example.com/audio.mp3",
  "model": "fal-ai/whisper",
  "language": "auto",
  "response_format": "json",
  "timestamp_granularities": ["segment"],
  "diarization": false,
  "translation": false
}

Parameters

Parameter                Type     Required  Default         Description
file                     string   Yes       -               Audio URL or base64 data
model                    string   No        fal-ai/whisper  STT model to use
language                 string   No        auto            ISO language code or auto
response_format          string   No        json            Output format
timestamp_granularities  array    No        ["segment"]     Timestamp detail level
diarization              boolean  No        false           Enable speaker detection
translation              boolean  No        false           Translate to English
model_inputs             object   No        -               Additional model-specific parameters
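
As a rough sketch, the request body could be assembled by a small helper that fills in the documented defaults and rejects values this page does not list. The function name build_transcription_request is hypothetical and not part of any NovaKit SDK; the allowed formats and granularities are copied from this page.

```python
from typing import Any, Dict, List, Optional

# Values copied from this page; the helper itself is illustrative only.
RESPONSE_FORMATS = {"json", "text", "srt", "vtt"}
TIMESTAMP_GRANULARITIES = {"segment", "word"}


def build_transcription_request(
    file: str,
    model: str = "fal-ai/whisper",
    language: str = "auto",
    response_format: str = "json",
    timestamp_granularities: Optional[List[str]] = None,
    diarization: bool = False,
    translation: bool = False,
    model_inputs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return a JSON-serializable body for POST /audio/transcriptions."""
    if not file:
        raise ValueError("file is required (audio URL or base64 data)")
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")

    granularities = timestamp_granularities or ["segment"]
    unknown = set(granularities) - TIMESTAMP_GRANULARITIES
    if unknown:
        raise ValueError(f"unsupported timestamp granularities: {sorted(unknown)}")

    body: Dict[str, Any] = {
        "file": file,
        "model": model,
        "language": language,
        "response_format": response_format,
        "timestamp_granularities": granularities,
        "diarization": diarization,
        "translation": translation,
    }
    # model_inputs is optional and has no documented default, so omit it when unset.
    if model_inputs is not None:
        body["model_inputs"] = model_inputs
    return body
```

Validating on the client side only mirrors what this page documents; the server remains the authority on which values it accepts.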

Available Models

Model ID                          Tier      Features
fal-ai/whisper                    Fast      Standard transcription (default)
fal-ai/wizper                     Standard  Enhanced accuracy
fal-ai/speech-to-text/turbo       Fast      Fast transcription
fal-ai/elevenlabs/speech-to-text  Premium   ElevenLabs quality

Response Formats

Format  Description
json    JSON with text, segments, and metadata
text    Plain text only
srt     SubRip subtitle format
vtt     WebVTT subtitle format

Timestamp Granularities

  • segment - Sentence/phrase level timestamps
  • word - Individual word timestamps

Response (JSON format)

{
  "text": "Hello, this is a test recording.",
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, this is a test recording.",
      "speaker": null
    }
  ],
  "language": "en",
  "duration": 2.5,
  "model": "fal-ai/whisper",
  "model_tier": "fast",
  "usage": {
    "seconds_used": 3,
    "quota_multiplier": 1.0,
    "seconds_remaining": 3597
  }
}
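
To illustrate how the fields above fit together, the following sketch parses the sample response and prints a one-line header, one line per segment, and the usage summary. The function name summarize_transcription is hypothetical; only the field names come from the sample.

```python
import json

# The sample response from this page, embedded so the sketch runs offline.
SAMPLE_RESPONSE = """
{
  "text": "Hello, this is a test recording.",
  "segments": [
    {"start": 0.0, "end": 2.5, "text": "Hello, this is a test recording.", "speaker": null}
  ],
  "language": "en",
  "duration": 2.5,
  "model": "fal-ai/whisper",
  "model_tier": "fast",
  "usage": {"seconds_used": 3, "quota_multiplier": 1.0, "seconds_remaining": 3597}
}
"""


def summarize_transcription(payload: dict) -> str:
    """Render a JSON-format transcription response as readable text."""
    lines = [f"[{payload['language']}] {payload['duration']:.1f}s via {payload['model']}"]
    for seg in payload["segments"]:
        # speaker is null unless diarization was requested
        speaker = seg.get("speaker") or "-"
        lines.append(f"{seg['start']:6.2f} -> {seg['end']:6.2f}  {speaker}  {seg['text']}")
    usage = payload["usage"]
    lines.append(f"used {usage['seconds_used']}s, {usage['seconds_remaining']}s remaining")
    return "\n".join(lines)


summary = summarize_transcription(json.loads(SAMPLE_RESPONSE))
print(summary)
```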

Quota & Pricing

Transcription draws from the stt_seconds quota bucket. The audio duration is multiplied by the multiplier of the model's tier:

Tier      Multiplier  Example Models
Fast      1x          Whisper
Standard  1.25x       Wizper
Premium   1.5x        Whisper Diarize

Plan limits:

Plan      STT Seconds/Month
Free      -
Pro       3,600 (60 min)
Business  18,000 (300 min)
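
The quota arithmetic can be estimated ahead of time with a sketch like the one below. The multipliers come from the tier table; rounding up to a whole second after multiplying is an assumption inferred from the sample response (2.5 seconds of audio reported as seconds_used: 3) and is not stated anywhere on this page. The function name billed_seconds is hypothetical.

```python
import math

# Multipliers from the tier table on this page.
TIER_MULTIPLIERS = {"fast": 1.0, "standard": 1.25, "premium": 1.5}


def billed_seconds(duration_seconds: float, tier: str) -> int:
    """Estimate the stt_seconds deducted for one transcription.

    Assumption: the service multiplies first and then rounds up to a
    whole second. The real rounding rule is not documented.
    """
    return math.ceil(duration_seconds * TIER_MULTIPLIERS[tier])


# A 10-minute file on each tier, against the Pro plan's 3,600 seconds.
for tier in ("fast", "standard", "premium"):
    used = billed_seconds(600, tier)
    print(f"{tier}: {used}s used, {3600 - used}s left on Pro")
```

Under these assumptions, a 10-minute file costs 600, 750, and 900 quota seconds on the Fast, Standard, and Premium tiers respectively.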

Examples

cURL:

curl -X POST https://www.novakit.ai/api/v1/audio/transcriptions \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "file": "https://example.com/podcast.mp3",
    "language": "auto",
    "diarization": true
  }'

Python:

import requests

# Request a diarized transcript with segment- and word-level timestamps.
response = requests.post(
    "https://www.novakit.ai/api/v1/audio/transcriptions",
    headers={
        "Authorization": "Bearer sk_your_api_key",
        "Content-Type": "application/json"
    },
    json={
        "file": "https://example.com/podcast.mp3",
        "language": "auto",
        "diarization": True,
        "timestamp_granularities": ["segment", "word"]
    }
)
# Raise an exception on 4xx/5xx instead of parsing an error body as a transcript.
response.raise_for_status()

data = response.json()
print(f"Transcript: {data['text']}")
print(f"Duration: {data['duration']} seconds")
print(f"Detected language: {data['language']}")
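
JavaScript:

// Top-level await requires an ES module (or wrap this in an async function).
const response = await fetch(
  "https://www.novakit.ai/api/v1/audio/transcriptions",
  {
    method: "POST",
    headers: {
      "Authorization": "Bearer sk_your_api_key",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      file: "https://example.com/podcast.mp3",
      language: "auto",
      diarization: true,
    }),
  }
);

// fetch does not reject on HTTP errors, so check the status explicitly.
if (!response.ok) {
  throw new Error(`Transcription request failed with status ${response.status}`);
}

const data = await response.json();
console.log(`Transcript: ${data.text}`);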

Speaker Diarization

Set diarization: true to label each segment with the speaker who said it:

{
  "text": "Hello. Hi there. How are you?",
  "segments": [
    {
      "start": 0.0,
      "end": 0.8,
      "text": "Hello.",
      "speaker": "SPEAKER_00"
    },
    {
      "start": 1.0,
      "end": 1.5,
      "text": "Hi there.",
      "speaker": "SPEAKER_01"
    },
    {
      "start": 2.0,
      "end": 2.8,
      "text": "How are you?",
      "speaker": "SPEAKER_00"
    }
  ]
}

Speaker labels (SPEAKER_00, SPEAKER_01, etc.) are consistent within a single transcription but may vary between requests.
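
One way to work with diarized output is to merge consecutive segments from the same speaker into turns and total the speaking time per label. The sketch below does this for the sample response above; the function names speaker_turns and talk_time are hypothetical, and segments without a speaker label are grouped under "UNKNOWN" as an illustrative choice.

```python
from collections import defaultdict
from typing import Dict, List

SEGMENTS = [
    {"start": 0.0, "end": 0.8, "text": "Hello.", "speaker": "SPEAKER_00"},
    {"start": 1.0, "end": 1.5, "text": "Hi there.", "speaker": "SPEAKER_01"},
    {"start": 2.0, "end": 2.8, "text": "How are you?", "speaker": "SPEAKER_00"},
]


def speaker_turns(segments: List[dict]) -> List[dict]:
    """Merge adjacent segments that share a speaker label into one turn."""
    turns: List[dict] = []
    for seg in segments:
        speaker = seg.get("speaker") or "UNKNOWN"
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker continues: extend the current turn.
            turns[-1]["text"] += " " + seg["text"]
            turns[-1]["end"] = seg["end"]
        else:
            turns.append({"speaker": speaker, "start": seg["start"],
                          "end": seg["end"], "text": seg["text"]})
    return turns


def talk_time(segments: List[dict]) -> Dict[str, float]:
    """Sum segment durations (in seconds) per speaker label."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.get("speaker") or "UNKNOWN"] += seg["end"] - seg["start"]
    return dict(totals)


for turn in speaker_turns(SEGMENTS):
    print(f"{turn['speaker']}: {turn['text']}")
```

Because labels may differ between requests, totals like these are only meaningful within a single response.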

Translation

Set translation: true to translate non-English audio to English:

{
  "file": "https://example.com/spanish-audio.mp3",
  "translation": true
}

The response will contain English text regardless of the source language. The original language is still detected and returned in the language field.

Generating Subtitles

Request SRT or VTT format for video subtitles:

curl -X POST https://www.novakit.ai/api/v1/audio/transcriptions \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "file": "https://example.com/video.mp4",
    "response_format": "srt"
  }'

Response:

1
00:00:00,000 --> 00:00:02,500
Hello, this is a test recording.

2
00:00:03,000 --> 00:00:05,200
And this is the second sentence.

For WebVTT, set response_format to vtt:

curl -X POST https://www.novakit.ai/api/v1/audio/transcriptions \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "file": "https://example.com/video.mp4",
    "response_format": "vtt"
  }'

Response:

WEBVTT

00:00:00.000 --> 00:00:02.500
Hello, this is a test recording.

00:00:03.000 --> 00:00:05.200
And this is the second sentence.
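
If a JSON response is already on hand, the same subtitle text can be produced locally instead of making a second request. The following is a minimal sketch of that conversion for SRT, assuming segments shaped like the JSON examples on this page; the function names are hypothetical, and the output is not guaranteed to match the server's segmentation or formatting choices.

```python
from typing import List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[dict]) -> str:
    """Build SRT text: numbered cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{index}\n{timing}\n{seg['text'].strip()}")
    return "\n\n".join(cues) + "\n"


segments = [
    {"start": 0.0, "end": 2.5, "text": "Hello, this is a test recording."},
    {"start": 3.0, "end": 5.2, "text": "And this is the second sentence."},
]
print(segments_to_srt(segments))
```

A WebVTT variant would differ mainly in starting with a WEBVTT header line, using a period instead of a comma before the milliseconds, and making cue numbers optional.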

Word-Level Timestamps

Request word-level timestamps for precise synchronization:

{
  "file": "https://example.com/audio.mp3",
  "timestamp_granularities": ["word"]
}

Each segment in the response then includes a words array:

{
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Hello world",
      "words": [
        {"word": "Hello", "start": 0.0, "end": 0.5},
        {"word": "world", "start": 0.6, "end": 1.0}
      ]
    }
  ]
}
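
For synchronization tasks such as highlighting the word being spoken, the per-segment words arrays can be flattened and searched by playback time. The sketch below assumes the response shape shown above; flatten_words and word_at are hypothetical helper names.

```python
from typing import List, Optional

RESPONSE = {
    "segments": [
        {
            "start": 0.0,
            "end": 2.5,
            "text": "Hello world",
            "words": [
                {"word": "Hello", "start": 0.0, "end": 0.5},
                {"word": "world", "start": 0.6, "end": 1.0},
            ],
        }
    ]
}


def flatten_words(response: dict) -> List[dict]:
    """Collect the word entries of every segment into one ordered list."""
    return [word
            for seg in response.get("segments", [])
            for word in seg.get("words", [])]


def word_at(words: List[dict], t: float) -> Optional[str]:
    """Return the word spoken at time t (seconds), or None during a gap."""
    for word in words:
        if word["start"] <= t < word["end"]:
            return word["word"]
    return None


words = flatten_words(RESPONSE)
print(word_at(words, 0.7))
```

A linear scan is adequate for short clips; for long recordings, a binary search over the start times (for example with the bisect module) would avoid rescanning the list on every lookup.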

Supported Languages

The API supports 90+ languages including:

Code  Language  Code  Language
en    English   de    German
es    Spanish   fr    French
zh    Chinese   ja    Japanese
ko    Korean    pt    Portuguese
ar    Arabic    hi    Hindi
ru    Russian   it    Italian
nl    Dutch     pl    Polish
tr    Turkish   vi    Vietnamese

Use language: "auto" for automatic language detection.

Error Handling

Common errors you may encounter:

Status  Error                   Solution
400     file is required        Provide audio URL or base64
402     STT limit exceeded      Upgrade plan or wait for reset
403     Model tier not allowed  Upgrade plan for premium models
500     Transcription failed    Check audio format/quality
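
In client code, the table above could be turned into an exception that carries the suggested fix. This is only a sketch: TranscriptionError and check_status are hypothetical names, and the shape of the error response body is not documented on this page, so the message is passed in as plain text.

```python
class TranscriptionError(Exception):
    """Raised for non-2xx responses; carries the HTTP status and a hint."""

    def __init__(self, status: int, message: str, hint: str) -> None:
        super().__init__(f"{status} {message} ({hint})")
        self.status = status
        self.hint = hint


# Hints copied from the error table on this page.
ERROR_HINTS = {
    400: "Provide audio URL or base64",
    402: "Upgrade plan or wait for reset",
    403: "Upgrade plan for premium models",
    500: "Check audio format/quality",
}


def check_status(status: int, message: str = "") -> None:
    """Do nothing for 2xx; otherwise raise TranscriptionError with a hint."""
    if 200 <= status < 300:
        return
    hint = ERROR_HINTS.get(status, "See the response body for details")
    raise TranscriptionError(status, message or "request failed", hint)


try:
    check_status(402, "STT limit exceeded")
except TranscriptionError as exc:
    print(exc)
```

With the requests library, this could be called as check_status(response.status_code, response.text) before reading the transcript.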

Supported Audio Formats

  • MP3
  • WAV
  • M4A
  • FLAC
  • OGG
  • WebM

For best results, use audio with:

  • Clear speech without excessive background noise
  • Sample rate of 16kHz or higher
  • Single speaker or clear speaker separation
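
A client might check a file name against the listed formats before uploading. The sketch below compares only the extension, which is an illustrative shortcut: it does not inspect the audio itself, and the helper name has_listed_extension is hypothetical. Note that the subtitle example on this page submits an .mp4 URL, so the list may not be exhaustive and the check is best treated as advisory.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions for the formats listed on this page.
LISTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm"}


def has_listed_extension(url_or_path: str) -> bool:
    """Return True if the URL or path ends in one of the listed extensions."""
    # urlparse drops any query string, so "a.mp3?token=x" still matches.
    path = urlparse(url_or_path).path or url_or_path
    return PurePosixPath(path).suffix.lower() in LISTED_EXTENSIONS


for candidate in ("https://example.com/podcast.mp3", "interview.WAV", "notes.txt"):
    print(candidate, has_listed_extension(candidate))
```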
