Beyond Text: Building Multimodal AI Apps with Image, Video, and Audio
Text is just the beginning. Learn how to build applications that generate, analyze, and transform images, videos, and audio using modern AI APIs.
ChatGPT taught the world that AI can write. But text is just the beginning.
Modern AI can:
- Generate photorealistic images from descriptions
- Create videos from text prompts
- Compose original music
- Clone voices with seconds of sample audio
- Understand what's in images and videos
- Transform one medium into another
This guide shows you how to build applications that work across modalities.
The Multimodal Landscape
Generation (Create from scratch)
| Modality | What It Does | Leading Models |
|---|---|---|
| Text → Image | Create images from descriptions | FLUX, Stable Diffusion, DALL-E 3 |
| Text → Video | Generate video from prompts | Runway Gen-3, Pika, Kling |
| Text → Audio | Generate music and sound | Suno, Udio, MusicGen |
| Text → Speech | Natural voice synthesis | ElevenLabs, OpenAI TTS |
Understanding (Analyze existing content)
| Modality | What It Does | Leading Models |
|---|---|---|
| Image → Text | Describe and analyze images | GPT-4V, Claude Vision, Gemini |
| Video → Text | Understand video content | Gemini 1.5, GPT-4V |
| Audio → Text | Transcribe speech | Whisper, AssemblyAI |
Transformation (Convert between modalities)
| Input → Output | What It Does | Leading Models |
|---|---|---|
| Image → Image | Edit, style transfer, upscale | FLUX, Stable Diffusion |
| Image → Video | Animate still images | Runway, Pika |
| Speech → Speech | Voice conversion | ElevenLabs |
Image Generation
Getting Started with FLUX
FLUX is the current leader for text-to-image:
import fal_client

def generate_image(prompt, aspect_ratio="16:9"):
    # fal's FLUX endpoint expects a preset image_size name rather than a raw ratio string,
    # so map the human-friendly ratios used in this guide to those presets.
    size_presets = {
        "1:1": "square_hd",
        "16:9": "landscape_16_9",
        "9:16": "portrait_16_9",
    }
    result = fal_client.subscribe(
        "fal-ai/flux/dev",
        arguments={
            "prompt": prompt,
            "image_size": size_presets.get(aspect_ratio, "landscape_16_9"),
            "num_images": 1,
            "enable_safety_checker": True
        }
    )
    return result["images"][0]["url"]
Crafting Effective Image Prompts
Bad prompt:
"A dog"
Good prompt:
"A golden retriever puppy playing in autumn leaves, warm afternoon sunlight, shallow depth of field, professional pet photography, Canon EOS R5"
Prompt components that matter (a small builder sketch follows this list):
- Subject: What's in the image
- Setting: Where and when
- Style: Photography style, art style, mood
- Technical: Camera, lighting, composition
- Quality: Modifiers for fidelity
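A small helper keeps prompts assembled from these components consistent. A minimal sketch; the function and field names are illustrative, not part of any API:

def build_image_prompt(subject, setting, style, technical="", quality="highly detailed, sharp focus"):
    """Assemble a structured image prompt from the components above."""
    parts = [subject, setting, style, technical, quality]
    # Drop empty components and join everything into one comma-separated prompt
    return ", ".join(part.strip() for part in parts if part.strip())

# Roughly reproduces the "good prompt" example above
prompt = build_image_prompt(
    subject="A golden retriever puppy playing in autumn leaves",
    setting="warm afternoon sunlight",
    style="professional pet photography",
    technical="shallow depth of field, Canon EOS R5"
)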
Image Editing
Transform existing images:
def edit_image(image_url, prompt, mask_url=None):
    arguments = {
        "image_url": image_url,
        "prompt": prompt,
    }
    if mask_url:
        arguments["mask_url"] = mask_url  # For inpainting
    result = fal_client.subscribe(
        "fal-ai/flux/dev/image-to-image",
        arguments=arguments
    )
    return result["images"][0]["url"]
Use cases:
- Inpainting: Replace parts of an image (sketched after this list)
- Outpainting: Extend image beyond borders
- Style transfer: Apply artistic styles
- Upscaling: Increase resolution
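As a concrete example, inpainting with the edit_image helper above might look like this. The image and mask URLs are placeholders, and the exact mask convention (which region gets replaced) depends on the endpoint you use:

# Hypothetical URLs for illustration only
photo_url = "https://example.com/living-room.jpg"
sofa_mask_url = "https://example.com/living-room-sofa-mask.png"  # marks the region to replace

# Replace the masked region while keeping the rest of the photo intact
new_image_url = edit_image(
    photo_url,
    "a mid-century modern leather sofa, warm afternoon light",
    mask_url=sofa_mask_url
)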
Practical Example: Product Image Generator
class ProductImageGenerator:
    def __init__(self):
        self.base_style = """
        professional product photography,
        clean white background,
        soft studio lighting,
        high resolution,
        e-commerce style
        """

    def generate_product_image(self, product_name, description, angle="front"):
        prompt = f"""
        {product_name}, {description},
        {angle} view,
        {self.base_style}
        """
        return generate_image(prompt, aspect_ratio="1:1")

    def generate_lifestyle_image(self, product_name, setting):
        prompt = f"""
        {product_name} in use,
        {setting},
        lifestyle photography,
        natural lighting,
        editorial style
        """
        return generate_image(prompt, aspect_ratio="16:9")

# Usage
generator = ProductImageGenerator()

product_shot = generator.generate_product_image(
    "wireless headphones",
    "matte black, premium design",
    angle="45 degree"
)

lifestyle_shot = generator.generate_lifestyle_image(
    "wireless headphones",
    "person working at modern desk, urban apartment"
)
Video Generation
Text-to-Video with Runway/Kling
def generate_video(prompt, duration=4):
    result = fal_client.subscribe(
        "fal-ai/kling-video/v1/standard/text-to-video",
        arguments={
            "prompt": prompt,
            "duration": duration,  # seconds
            "aspect_ratio": "16:9"
        }
    )
    return result["video"]["url"]
Image-to-Video (Animation)
Turn static images into videos:
def animate_image(image_url, motion_prompt):
    result = fal_client.subscribe(
        "fal-ai/runway-gen3/turbo/image-to-video",
        arguments={
            "image_url": image_url,
            "prompt": motion_prompt  # Describe the motion
        }
    )
    return result["video"]["url"]
Motion prompts focus on movement:
"Camera slowly zooms in, leaves gently swaying in the wind" "Subject turns head to look at camera, subtle smile" "Smooth pan from left to right, clouds drifting"
Practical Example: Marketing Video Generator
class MarketingVideoGenerator:
    def create_product_reveal(self, product_image_url, product_name):
        # Animate the product image
        motion = "slow zoom out revealing full product, subtle rotation, premium feel"
        video_url = animate_image(product_image_url, motion)
        return video_url

    def create_text_animation(self, headline, subtext):
        prompt = f"""
        Kinetic typography animation,
        main text "{headline}" appears with impact,
        then "{subtext}" fades in below,
        modern, clean design, dark background
        """
        return generate_video(prompt, duration=3)

    def create_scene_video(self, scene_description):
        prompt = f"""
        {scene_description},
        cinematic quality,
        smooth camera movement,
        professional commercial style
        """
        return generate_video(prompt, duration=4)
Audio Generation
Text-to-Speech
Natural voice synthesis:
from openai import OpenAI

client = OpenAI()

def generate_speech(text, voice="alloy"):
    response = client.audio.speech.create(
        model="tts-1-hd",
        voice=voice,  # alloy, echo, fable, onyx, nova, shimmer
        input=text
    )
    return response.content  # Audio bytes

# With ElevenLabs for more control
# (this uses the classic elevenlabs SDK interface; newer SDK versions expose the
# same options through a client object)
from elevenlabs import generate, Voice, VoiceSettings

def generate_speech_elevenlabs(text, voice_id, settings=None):
    audio = generate(
        text=text,
        voice=Voice(
            voice_id=voice_id,
            settings=settings or VoiceSettings(
                stability=0.5,
                similarity_boost=0.75
            )
        ),
        model="eleven_multilingual_v2"
    )
    return audio  # Audio bytes
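Both helpers return raw audio bytes (MP3 by default for the OpenAI endpoint), so saving the result is a plain file write; the filename here is arbitrary:

# Generate narration and write it to disk
audio_bytes = generate_speech("Welcome to the demo.", voice="nova")
with open("narration.mp3", "wb") as f:
    f.write(audio_bytes)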
Speech-to-Text
Transcribe audio:
def transcribe_audio(audio_file):
    response = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json"
    )
    return {
        "text": response.text,
        "segments": response.segments,  # With timestamps
        "language": response.language
    }
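The transcription endpoint expects an open file object, so a typical call looks like this; the filename is a placeholder:

# Transcribe a local recording; common formats like mp3, wav, and m4a are accepted
with open("interview.mp3", "rb") as audio_file:
    result = transcribe_audio(audio_file)

print(result["language"])
for segment in result["segments"]:
    print(segment)  # each segment carries its text plus start/end timestamps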
Music Generation
Generate original music:
def generate_music(prompt, duration=30):
    result = fal_client.subscribe(
        "fal-ai/stable-audio",
        arguments={
            "prompt": prompt,
            "seconds_total": duration
        }
    )
    return result["audio_file"]["url"]
Music prompts:
"Upbeat electronic music, energetic, modern, suitable for product video" "Calm acoustic guitar, peaceful, nature documentary style" "Epic orchestral, building intensity, trailer music"
Practical Example: Podcast Generator
class PodcastGenerator:
    def __init__(self):
        self.host_voice = "voice_id_host"
        self.guest_voice = "voice_id_guest"

    def generate_intro(self, podcast_name, episode_title):
        intro_text = f"""
        Welcome to {podcast_name}!
        Today's episode: {episode_title}.
        Let's dive in.
        """
        # Generate intro music
        music = generate_music(
            "podcast intro, professional, friendly, 5 seconds",
            duration=5
        )
        # Generate voice
        voice = generate_speech_elevenlabs(intro_text, self.host_voice)
        return {"music": music, "voice": voice}

    def generate_dialogue(self, script):
        """
        Script format:
        [
            {"speaker": "host", "text": "So tell us about..."},
            {"speaker": "guest", "text": "Well, it started..."}
        ]
        """
        audio_segments = []
        for line in script:
            voice_id = self.host_voice if line["speaker"] == "host" else self.guest_voice
            audio = generate_speech_elevenlabs(line["text"], voice_id)
            audio_segments.append(audio)
        return audio_segments
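generate_dialogue returns one audio chunk per line of script; stitching them into a single episode track is a separate step. A minimal sketch using pydub, which is an assumption here (it is not one of the APIs above and requires ffmpeg) and which assumes the chunks are MP3 bytes:

import io

from pydub import AudioSegment  # assumed dependency: pydub (requires ffmpeg)

def stitch_segments(audio_segments, pause_ms=300):
    """Concatenate MP3 byte chunks into one track with a short pause between lines."""
    episode = AudioSegment.silent(duration=0)
    pause = AudioSegment.silent(duration=pause_ms)
    for chunk in audio_segments:
        episode += AudioSegment.from_file(io.BytesIO(chunk), format="mp3") + pause
    return episode

# episode = stitch_segments(PodcastGenerator().generate_dialogue(script))
# episode.export("episode.mp3", format="mp3")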
Vision: Understanding Images and Video
Image Analysis
def analyze_image(image_url, question):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]
            }
        ]
    )
    return response.choices[0].message.content

# Usage
description = analyze_image(
    "https://example.com/product.jpg",
    "Describe this product in detail for an e-commerce listing"
)

issues = analyze_image(
    "https://example.com/damage.jpg",
    "Identify any damage or issues visible in this image"
)
Video Analysis
For videos, extract frames or use video-capable models:
import time

import google.generativeai as genai

def analyze_video(video_path, question):
    # With Gemini 1.5 (supports direct video): upload the file, wait until it has
    # been processed, then reference it in the prompt. Remote URLs should be
    # downloaded to a local path first.
    video_file = genai.upload_file(path=video_path)
    while video_file.state.name == "PROCESSING":
        time.sleep(2)
        video_file = genai.get_file(video_file.name)
    model = genai.GenerativeModel("gemini-1.5-pro")
    response = model.generate_content([question, video_file])
    return response.text

# Or extract frames and analyze them individually
def analyze_video_frames(video_path, question, frame_count=5):
    frames = extract_frames(video_path, count=frame_count)  # helper sketched below
    responses = []
    for frame in frames:
        analysis = analyze_image(frame, question)
        responses.append(analysis)
    # Synthesize the per-frame analyses into one answer
    summary = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Summarize these video frame analyses: {responses}"
        }]
    ).choices[0].message.content
    return summary
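extract_frames is left undefined above. One possible implementation samples evenly spaced frames with OpenCV and returns them as base64 data URLs, which analyze_image can consume directly; both OpenCV and the data-URL approach are assumptions, not part of the original snippet:

import base64

import cv2  # assumed dependency: opencv-python

def extract_frames(video_path, count=5):
    """Sample `count` evenly spaced frames and return them as base64 data URLs."""
    capture = cv2.VideoCapture(video_path)
    total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(count):
        # Jump to an evenly spaced frame index, then grab and JPEG-encode that frame
        capture.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / count))
        ok, frame = capture.read()
        if not ok:
            break
        ok, buffer = cv2.imencode(".jpg", frame)
        if ok:
            encoded = base64.b64encode(buffer.tobytes()).decode("utf-8")
            frames.append(f"data:image/jpeg;base64,{encoded}")
    capture.release()
    return frames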
Practical Example: Content Moderation
import json

class ContentModerator:
    def __init__(self):
        self.policies = """
        Flag content containing:
        - Violence or gore
        - Explicit adult content
        - Hate symbols or speech
        - Personal information visible
        - Copyright violations (logos, characters)
        """

    def moderate_image(self, image_url):
        analysis = analyze_image(
            image_url,
            f"""
            Analyze this image for policy violations.
            Policies: {self.policies}
            Return JSON:
            {{
                "safe": boolean,
                "violations": ["list of issues"],
                "confidence": 0-1,
                "recommendation": "approve" | "reject" | "review"
            }}
            """
        )
        return json.loads(analysis)

    def moderate_video(self, video_url):
        # Sample frames throughout the video
        analysis = analyze_video(
            video_url,
            f"Check this video for policy violations: {self.policies}"
        )
        return analysis
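In practice, the recommendation field drives what happens next. A minimal usage sketch, where the print statements stand in for whatever publish, reject, or review-queue logic an app would have:

moderator = ContentModerator()
verdict = moderator.moderate_image("https://example.com/upload.jpg")

if verdict["recommendation"] == "approve":
    print("auto-approved")
elif verdict["recommendation"] == "reject":
    print("rejected:", verdict["violations"])
else:
    print("queued for human review")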
Building Multimodal Applications
Example 1: AI-Powered Content Studio
class ContentStudio:
    def create_social_post(self, topic, platform):
        # Generate copy (`ai.chat` is the text-generation client used throughout this guide)
        copy = ai.chat(
            message=f"Write a {platform} post about: {topic}"
        )
        # Generate matching image
        image_prompt = ai.chat(
            message=f"Create an image prompt for: {copy.text}"
        )
        image = generate_image(image_prompt.text)
        return {
            "copy": copy.text,
            "image": image
        }

    def create_video_ad(self, product, features, target_audience):
        # Generate script
        script = ai.chat(
            message=f"""
            Write a 15-second video ad script for {product}.
            Features: {features}
            Target audience: {target_audience}
            """
        )
        # Generate voiceover
        voiceover = generate_speech(script.text)
        # Generate background music
        music = generate_music(
            f"advertising music, {target_audience}, energetic, 15 seconds"
        )
        # Generate visuals
        scenes = script.text.split("\n\n")
        videos = [generate_video(scene) for scene in scenes[:3]]
        return {
            "script": script.text,
            "voiceover": voiceover,
            "music": music,
            "videos": videos
        }
Example 2: Accessibility Tool
class AccessibilityTool:
    def describe_image_for_blind(self, image_url):
        return analyze_image(
            image_url,
            """
            Describe this image for a blind user.
            Include:
            - Main subjects and their positions
            - Colors and visual style
            - Any text visible
            - Emotional tone or mood
            - Relevant context
            """
        )

    def generate_audio_description(self, image_url):
        description = self.describe_image_for_blind(image_url)
        audio = generate_speech(description, voice="nova")
        return audio

    def transcribe_video_for_deaf(self, video_url):
        # Get the audio transcription (Whisper needs an audio file, so extract
        # the video's audio track first -- see the sketch below)
        transcript = transcribe_audio(video_url)
        # Add visual descriptions
        visual_desc = analyze_video(
            video_url,
            "Describe important visual elements not conveyed by dialogue"
        )
        return {
            "transcript": transcript,
            "visual_descriptions": visual_desc
        }
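transcribe_audio expects an audio file, so the video's audio track has to be pulled out first. A minimal sketch that shells out to ffmpeg, an assumed external dependency that is not part of the APIs above:

import subprocess

def extract_audio(video_path, audio_path="audio.mp3"):
    """Extract the audio track from a local video file with ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-acodec", "libmp3lame", audio_path],
        check=True
    )
    return audio_path

# audio_path = extract_audio("lecture.mp4")
# with open(audio_path, "rb") as f:
#     transcript = transcribe_audio(f)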
Example 3: E-Learning Content Generator
class ELearningGenerator:
    def create_lesson(self, topic, target_age):
        # Generate lesson content
        content = ai.chat(
            message=f"Create a lesson about {topic} for age {target_age}"
        )
        # Generate illustrations for the key concepts
        # (extract_key_concepts is a helper -- one possible sketch follows below)
        concepts = extract_key_concepts(content.text)
        illustrations = {}
        for concept in concepts:
            prompt = f"Educational illustration of {concept}, friendly style, age {target_age}"
            illustrations[concept] = generate_image(prompt)
        # Generate narration
        narration = generate_speech(
            content.text,
            voice="nova"  # Friendly, clear voice
        )
        # Generate quiz
        quiz = ai.chat(
            message=f"Create a 5-question quiz about: {content.text}"
        )
        return {
            "content": content.text,
            "illustrations": illustrations,
            "narration": narration,
            "quiz": quiz.text
        }
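extract_key_concepts is not defined above. One simple way to implement it is to ask a chat model for a short list; this sketch reuses the OpenAI client from earlier, though any text model would do:

def extract_key_concepts(lesson_text, max_concepts=5):
    """Ask a chat model for the lesson's key concepts, one per line."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"List the {max_concepts} most important concepts in this lesson, "
                f"one per line, with no numbering:\n\n{lesson_text}"
            )
        }]
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()][:max_concepts]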
NovaKit Multimodal Features
NovaKit provides unified access to multimodal AI:
- Image Generation: FLUX, Stable Diffusion with a simple API
- Video Generation: Text-to-video, image-to-video
- Audio: Text-to-speech, speech-to-text, music generation
- Vision: Image analysis and understanding
- Unified Interface: Same patterns across modalities
from novakit import Image, Video, Audio
# Generate image
image = Image.generate("Product photo of wireless earbuds")
# Animate to video
video = Video.from_image(image, motion="slow rotation, studio lighting")
# Add voiceover
audio = Audio.speak("Introducing our new wireless earbuds...")
# Combine
final = Video.compose(video, audio)
Best Practices
- Prompt engineering matters: Spend time crafting effective prompts
- Quality vs speed tradeoffs: Fast models for drafts, quality models for finals
- Handle failures gracefully: Generation can fail; add retries and fallbacks
- Cache when possible: Regenerating identical content wastes money (a retry-and-cache sketch follows this list)
- Respect content policies: All platforms have usage policies
- Consider copyright: AI-generated content has legal nuances
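To make the failure-handling and caching points concrete, here is a minimal retry-and-cache sketch wrapped around generate_image; the in-memory dict is illustrative only, and a real app would persist the cache somewhere durable:

import hashlib
import time

_image_cache = {}  # prompt hash -> image URL (illustrative in-memory cache)

def generate_image_cached(prompt, aspect_ratio="16:9", retries=3):
    """Return a cached result when available, otherwise generate with retries."""
    key = hashlib.sha256(f"{prompt}|{aspect_ratio}".encode("utf-8")).hexdigest()
    if key in _image_cache:
        return _image_cache[key]

    for attempt in range(retries):
        try:
            url = generate_image(prompt, aspect_ratio=aspect_ratio)
            _image_cache[key] = url
            return url
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; let the caller fall back
            time.sleep(2 ** attempt)  # simple exponential backoff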
Ready to build multimodal applications? NovaKit provides unified APIs for image, video, and audio generation—all in one platform.