
Beyond Text: Building Multimodal AI Apps with Image, Video, and Audio

Text is just the beginning. Learn how to build applications that generate, analyze, and transform images, videos, and audio using modern AI APIs.

15 min read

ChatGPT taught the world that AI can write. But text is just the beginning.

Modern AI can:

  • Generate photorealistic images from descriptions
  • Create videos from text prompts
  • Compose original music
  • Clone voices with seconds of sample audio
  • Understand what's in images and videos
  • Transform one medium into another

This guide shows you how to build applications that work across modalities.

The Multimodal Landscape

Generation (Create from scratch)

Modality | What It Does | Leading Models
Text → Image | Create images from descriptions | FLUX, Stable Diffusion, DALL-E 3
Text → Video | Generate video from prompts | Runway Gen-3, Pika, Kling
Text → Audio | Generate music and sound | Suno, Udio, MusicGen
Text → Speech | Natural voice synthesis | ElevenLabs, OpenAI TTS

Understanding (Analyze existing content)

Modality | What It Does | Leading Models
Image → Text | Describe and analyze images | GPT-4V, Claude Vision, Gemini
Video → Text | Understand video content | Gemini 1.5, GPT-4V
Audio → Text | Transcribe speech | Whisper, AssemblyAI

Transformation (Convert between modalities)

Input → Output | What It Does | Leading Models
Image → Image | Edit, style transfer, upscale | FLUX, Stable Diffusion
Image → Video | Animate still images | Runway, Pika
Speech → Speech | Voice conversion | ElevenLabs

Image Generation

Getting Started with FLUX

FLUX is currently one of the strongest text-to-image models:

import fal_client

# map friendly aspect ratios to fal's named image-size presets
SIZE_PRESETS = {
    "16:9": "landscape_16_9",
    "9:16": "portrait_16_9",
    "1:1": "square_hd",
}

def generate_image(prompt, aspect_ratio="16:9"):
    result = fal_client.subscribe(
        "fal-ai/flux/dev",
        arguments={
            "prompt": prompt,
            # fal expects a preset name (or a width/height dict), not a raw ratio
            "image_size": SIZE_PRESETS.get(aspect_ratio, "landscape_16_9"),
            "num_images": 1,
            "enable_safety_checker": True
        }
    )

    return result["images"][0]["url"]
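
A quick usage sketch (the prompt is illustrative):

result_url = generate_image(
    "A lighthouse on a rocky coast at dawn, dramatic clouds, cinematic lighting",
    aspect_ratio="16:9"
)
print(result_url)  # hosted URL of the generated image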

Crafting Effective Image Prompts

Bad prompt:

"A dog"

Good prompt:

"A golden retriever puppy playing in autumn leaves, warm afternoon sunlight, shallow depth of field, professional pet photography, Canon EOS R5"

Prompt components that matter (see the sketch after this list):

  1. Subject: What's in the image
  2. Setting: Where and when
  3. Style: Photography style, art style, mood
  4. Technical: Camera, lighting, composition
  5. Quality: Modifiers for fidelity
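
One way to keep these components consistent across a project is to assemble prompts from labeled parts. A minimal sketch (the helper name and defaults are our own):

def build_image_prompt(subject, setting, style,
                       technical="", quality="high resolution, sharp focus"):
    # join the five components, skipping any left empty
    parts = [subject, setting, style, technical, quality]
    return ", ".join(p.strip() for p in parts if p.strip())

prompt = build_image_prompt(
    subject="a golden retriever puppy playing in autumn leaves",
    setting="park at golden hour",
    style="professional pet photography, warm mood",
    technical="shallow depth of field, Canon EOS R5"
)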

Image Editing

Transform existing images:

def edit_image(image_url, prompt, mask_url=None):
    arguments = {
        "image_url": image_url,
        "prompt": prompt,
    }

    if mask_url:
        # restrict edits to the masked region (inpainting); some providers
        # expose this through a dedicated inpainting endpoint instead
        arguments["mask_url"] = mask_url

    result = fal_client.subscribe(
        "fal-ai/flux/dev/image-to-image",
        arguments=arguments
    )

    return result["images"][0]["url"]

Use cases:

  • Inpainting: Replace parts of an image (see the example below)
  • Outpainting: Extend image beyond borders
  • Style transfer: Apply artistic styles
  • Upscaling: Increase resolution
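
To make the inpainting path concrete, here is a usage sketch; the URLs are placeholders, and the mask convention (white = replace) depends on the endpoint:

edited_url = edit_image(
    image_url="https://example.com/living-room.jpg",    # hypothetical source image
    prompt="a mid-century modern armchair in the corner",
    mask_url="https://example.com/living-room-mask.png"  # white region marks pixels to replace
)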

Practical Example: Product Image Generator

class ProductImageGenerator:
    def __init__(self):
        self.base_style = """
            professional product photography,
            clean white background,
            soft studio lighting,
            high resolution,
            e-commerce style
        """

    def generate_product_image(self, product_name, description, angle="front"):
        prompt = f"""
            {product_name}, {description},
            {angle} view,
            {self.base_style}
        """

        return generate_image(prompt, aspect_ratio="1:1")

    def generate_lifestyle_image(self, product_name, setting):
        prompt = f"""
            {product_name} in use,
            {setting},
            lifestyle photography,
            natural lighting,
            editorial style
        """

        return generate_image(prompt, aspect_ratio="16:9")

# Usage
generator = ProductImageGenerator()
product_shot = generator.generate_product_image(
    "wireless headphones",
    "matte black, premium design",
    angle="45 degree"
)
lifestyle_shot = generator.generate_lifestyle_image(
    "wireless headphones",
    "person working at modern desk, urban apartment"
)

Video Generation

Text-to-Video with Runway/Kling

def generate_video(prompt, duration=5):
    result = fal_client.subscribe(
        "fal-ai/kling-video/v1/standard/text-to-video",
        arguments={
            "prompt": prompt,
            "duration": duration,  # seconds; supported values vary by model (Kling offers 5 or 10)
            "aspect_ratio": "16:9"
        }
    )

    return result["video"]["url"]
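
Generation can take a minute or more even through the blocking subscribe call, and the result is a hosted URL. A small helper sketch for pulling the file down with requests:

import requests

def download_file(url, path):
    # stream the generated asset to disk
    response = requests.get(url, stream=True)
    response.raise_for_status()
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)

video_url = generate_video("waves rolling onto a beach at sunset, aerial shot")
download_file(video_url, "beach.mp4")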

Image-to-Video (Animation)

Turn static images into videos:

def animate_image(image_url, motion_prompt):
    result = fal_client.subscribe(
        "fal-ai/runway-gen3/turbo/image-to-video",
        arguments={
            "image_url": image_url,
            "prompt": motion_prompt  # Describe the motion
        }
    )

    return result["video"]["url"]

Motion prompts focus on movement:

"Camera slowly zooms in, leaves gently swaying in the wind" "Subject turns head to look at camera, subtle smile" "Smooth pan from left to right, clouds drifting"

Practical Example: Marketing Video Generator

class MarketingVideoGenerator:
    def create_product_reveal(self, product_image_url, product_name):
        # Animate the product image
        motion = "slow zoom out revealing full product, subtle rotation, premium feel"
        video_url = animate_image(product_image_url, motion)

        return video_url

    def create_text_animation(self, headline, subtext):
        prompt = f"""
            Kinetic typography animation,
            main text "{headline}" appears with impact,
            then "{subtext}" fades in below,
            modern, clean design, dark background
        """

        return generate_video(prompt, duration=5)

    def create_scene_video(self, scene_description):
        prompt = f"""
            {scene_description},
            cinematic quality,
            smooth camera movement,
            professional commercial style
        """

        return generate_video(prompt, duration=5)
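
Putting the class to work (the image URL is a placeholder):

generator = MarketingVideoGenerator()
reveal = generator.create_product_reveal(
    "https://example.com/headphones.png",  # hypothetical product shot
    "wireless headphones"
)
headline_clip = generator.create_text_animation(
    "Hear Everything", "Noise cancelling, reimagined"
)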

Audio Generation

Text-to-Speech

Natural voice synthesis:

from openai import OpenAI

client = OpenAI()

def generate_speech(text, voice="alloy"):
    response = client.audio.speech.create(
        model="tts-1-hd",
        voice=voice,  # alloy, echo, fable, onyx, nova, shimmer
        input=text
    )

    return response.content  # Audio bytes
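
# The client returns raw audio bytes (MP3 by default), so writing them
# to disk yields a playable file:
audio_bytes = generate_speech("Welcome to the show!", voice="nova")
with open("welcome.mp3", "wb") as f:
    f.write(audio_bytes)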

# With ElevenLabs for more control (legacy SDK interface)
from elevenlabs import generate, Voice, VoiceSettings

def generate_speech_elevenlabs(text, voice_id, settings=None):
    audio = generate(
        text=text,
        voice=Voice(
            voice_id=voice_id,
            settings=settings or VoiceSettings(
                stability=0.5,
                similarity_boost=0.75
            )
        ),
        model="eleven_multilingual_v2"
    )

    return audio  # raw audio bytes

Speech-to-Text

Transcribe audio:

def transcribe_audio(audio_file):
    response = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json"
    )

    return {
        "text": response.text,
        "segments": response.segments,  # With timestamps
        "language": response.language
    }
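
Note that Whisper expects an open binary file handle, not a path or URL:

with open("interview.mp3", "rb") as audio_file:
    result = transcribe_audio(audio_file)

print(result["language"])
print(result["text"][:200])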

Music Generation

Generate original music:

def generate_music(prompt, duration=30):
    result = fal_client.subscribe(
        "fal-ai/stable-audio",
        arguments={
            "prompt": prompt,
            "seconds_total": duration
        }
    )

    return result["audio_file"]["url"]

Music prompts:

"Upbeat electronic music, energetic, modern, suitable for product video" "Calm acoustic guitar, peaceful, nature documentary style" "Epic orchestral, building intensity, trailer music"

Practical Example: Podcast Generator

class PodcastGenerator:
    def __init__(self):
        self.host_voice = "voice_id_host"
        self.guest_voice = "voice_id_guest"

    def generate_intro(self, podcast_name, episode_title):
        intro_text = f"""
            Welcome to {podcast_name}!
            Today's episode: {episode_title}.
            Let's dive in.
        """

        # Generate intro music
        music = generate_music(
            "podcast intro, professional, friendly, 5 seconds",
            duration=5
        )

        # Generate voice
        voice = generate_speech_elevenlabs(intro_text, self.host_voice)

        return {"music": music, "voice": voice}

    def generate_dialogue(self, script):
        """
        Script format:
        [
            {"speaker": "host", "text": "So tell us about..."},
            {"speaker": "guest", "text": "Well, it started..."}
        ]
        """
        audio_segments = []

        for line in script:
            voice_id = self.host_voice if line["speaker"] == "host" else self.guest_voice
            audio = generate_speech_elevenlabs(line["text"], voice_id)
            audio_segments.append(audio)

        return audio_segments
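
generate_dialogue returns a list of per-line audio clips; to ship an episode you still need to stitch them together. A minimal sketch using pydub (which requires ffmpeg), assuming each segment is MP3 bytes and that podcast and script are an instance and a script list in the docstring's format:

import io
from pydub import AudioSegment

def stitch_segments(audio_segments, pause_ms=300):
    # concatenate clips with a short pause between speakers
    combined = AudioSegment.silent(duration=0)
    pause = AudioSegment.silent(duration=pause_ms)
    for raw in audio_segments:
        combined += AudioSegment.from_file(io.BytesIO(raw), format="mp3") + pause
    return combined

episode = stitch_segments(podcast.generate_dialogue(script))
episode.export("episode.mp3", format="mp3")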

Vision: Understanding Images and Video

Image Analysis

def analyze_image(image_url, question):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]
            }
        ]
    )

    return response.choices[0].message.content

# Usage
description = analyze_image(
    "https://example.com/product.jpg",
    "Describe this product in detail for an e-commerce listing"
)

issues = analyze_image(
    "https://example.com/damage.jpg",
    "Identify any damage or issues visible in this image"
)

Video Analysis

For videos, extract frames or use video-capable models:

import google.generativeai as genai

def analyze_video(video_path, question):
    # Gemini 1.5 accepts video directly: upload the local file, then
    # reference it in the prompt (large uploads may take a moment to process)
    video_file = genai.upload_file(video_path)
    model = genai.GenerativeModel("gemini-1.5-pro")
    response = model.generate_content([question, video_file])

    return response.text

# Or extract frames and analyze
def analyze_video_frames(video_path, question, frame_count=5):
    frames = extract_frames(video_path, count=frame_count)

    responses = []
    for frame in frames:
        analysis = analyze_image(frame, question)
        responses.append(analysis)

    # Synthesize
    summary = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Summarize these video frame analyses: {responses}"
        }]
    ).choices[0].message.content

    return summary
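
analyze_video_frames relies on an extract_frames helper that isn't defined above. A minimal sketch using OpenCV, returning frames as base64 data URLs, which the chat vision API accepts in place of public image URLs:

import base64
import cv2  # pip install opencv-python

def extract_frames(video_path, count=5):
    # sample `count` evenly spaced frames from the video
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(count):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / count))
        ok, frame = cap.read()
        if not ok:
            continue
        ok, buffer = cv2.imencode(".jpg", frame)
        if ok:
            encoded = base64.b64encode(buffer).decode()
            frames.append(f"data:image/jpeg;base64,{encoded}")
    cap.release()
    return frames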

Practical Example: Content Moderation

import json

class ContentModerator:
    def __init__(self):
        self.policies = """
            Flag content containing:
            - Violence or gore
            - Explicit adult content
            - Hate symbols or speech
            - Personal information visible
            - Copyright violations (logos, characters)
        """

    def moderate_image(self, image_url):
        analysis = analyze_image(
            image_url,
            f"""
            Analyze this image for policy violations.
            Policies: {self.policies}

            Return JSON:
            {{
                "safe": boolean,
                "violations": ["list of issues"],
                "confidence": 0-1,
                "recommendation": "approve" | "reject" | "review"
            }}
            """
        )

        # assumes the model returns bare JSON; strip markdown fences if it doesn't
        return json.loads(analysis)

    def moderate_video(self, video_url):
        # Sample frames throughout video
        analysis = analyze_video(
            video_url,
            f"Check this video for policy violations: {self.policies}"
        )

        return analysis

Building Multimodal Applications

Example 1: AI-Powered Content Studio

# `ai` stands in for a chat client (e.g. a NovaKit-style SDK) whose
# .chat(message=...) call returns a response with a .text attribute
class ContentStudio:
    def create_social_post(self, topic, platform):
        # Generate copy
        copy = ai.chat(
            message=f"Write a {platform} post about: {topic}"
        )

        # Generate matching image
        image_prompt = ai.chat(
            message=f"Create an image prompt for: {copy.text}"
        )
        image = generate_image(image_prompt.text)

        return {
            "copy": copy.text,
            "image": image
        }

    def create_video_ad(self, product, features, target_audience):
        # Generate script
        script = ai.chat(
            message=f"""
            Write a 15-second video ad script for {product}.
            Features: {features}
            Target audience: {target_audience}
            """
        )

        # Generate voiceover
        voiceover = generate_speech(script.text)

        # Generate background music
        music = generate_music(
            f"advertising music, {target_audience}, energetic, 15 seconds"
        )

        # Generate visuals
        scenes = script.text.split("\n\n")
        videos = [generate_video(scene) for scene in scenes[:3]]

        return {
            "script": script.text,
            "voiceover": voiceover,
            "music": music,
            "videos": videos
        }

Example 2: Accessibility Tool

class AccessibilityTool:
    def describe_image_for_blind(self, image_url):
        return analyze_image(
            image_url,
            """
            Describe this image for a blind user.
            Include:
            - Main subjects and their positions
            - Colors and visual style
            - Any text visible
            - Emotional tone or mood
            - Relevant context
            """
        )

    def generate_audio_description(self, image_url):
        description = self.describe_image_for_blind(image_url)
        audio = generate_speech(description, voice="nova")
        return audio

    def transcribe_video_for_deaf(self, video_path):
        # Whisper needs an audio file: extract the track first
        # (extract_audio_track is a hypothetical ffmpeg-style helper)
        with open(extract_audio_track(video_path), "rb") as audio_file:
            transcript = transcribe_audio(audio_file)

        # Add visual descriptions
        visual_desc = analyze_video(
            video_path,
            "Describe important visual elements not conveyed by dialogue"
        )

        return {
            "transcript": transcript,
            "visual_descriptions": visual_desc
        }

Example 3: E-Learning Content Generator

class ELearningGenerator:
    def create_lesson(self, topic, target_age):
        # Generate lesson content
        content = ai.chat(
            message=f"Create a lesson about {topic} for age {target_age}"
        )

        # Generate illustrations
        concepts = extract_key_concepts(content.text)
        illustrations = {}
        for concept in concepts:
            prompt = f"Educational illustration of {concept}, friendly style, age {target_age}"
            illustrations[concept] = generate_image(prompt)

        # Generate narration
        narration = generate_speech(
            content.text,
            voice="nova"  # Friendly, clear voice
        )

        # Generate quiz
        quiz = ai.chat(
            message=f"Create a 5-question quiz about: {content.text}"
        )

        return {
            "content": content.text,
            "illustrations": illustrations,
            "narration": narration,
            "quiz": quiz.text
        }
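
extract_key_concepts isn't defined above; a minimal sketch that leans on the same ai.chat client (the parsing is deliberately naive):

def extract_key_concepts(text, max_concepts=5):
    response = ai.chat(
        message=f"List up to {max_concepts} key concepts from this lesson, one per line:\n{text}"
    )
    # one concept per line, stripped of list markers
    return [line.strip("-• ").strip() for line in response.text.splitlines() if line.strip()]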

NovaKit Multimodal Features

NovaKit provides unified access to multimodal AI:

  • Image Generation: FLUX, Stable Diffusion with a simple API
  • Video Generation: Text-to-video, image-to-video
  • Audio: Text-to-speech, speech-to-text, music generation
  • Vision: Image analysis and understanding
  • Unified Interface: Same patterns across modalities

from novakit import Image, Video, Audio

# Generate image
image = Image.generate("Product photo of wireless earbuds")

# Animate to video
video = Video.from_image(image, motion="slow rotation, studio lighting")

# Add voiceover
audio = Audio.speak("Introducing our new wireless earbuds...")

# Combine
final = Video.compose(video, audio)

Best Practices

  1. Prompt engineering matters: Spend time crafting effective prompts
  2. Quality vs speed tradeoffs: Fast models for drafts, quality models for finals
  3. Handle failures gracefully: Generation can fail; have fallbacks
  4. Cache when possible: Regenerating identical content wastes money (see the sketch after this list)
  5. Respect content policies: All platforms have usage policies
  6. Consider copyright: AI-generated content has legal nuances
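
For point 4, a minimal content-addressed cache keyed on the model name and prompt; the disk layout is our own convention:

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("generation_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_generate(model_name, prompt, generate_fn):
    # identical requests hash to the same key, so we pay for generation once
    key = hashlib.sha256(json.dumps([model_name, prompt]).encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())["url"]
    url = generate_fn(prompt)
    cache_file.write_text(json.dumps({"url": url}))
    return url

image_url = cached_generate("flux-dev", "a red bicycle against a white wall", generate_image)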

Ready to build multimodal applications? NovaKit provides unified APIs for image, video, and audio generation—all in one platform.
