Video Summarization
Video summarization uses Vision Language Models (VLMs) to analyze video frames and generate text descriptions. The service samples frames from a video, processes them through a multimodal model, and returns a structured summary that integrates persona context.
How It Works
The video summarization pipeline:
- Frame Sampling: Extract frames from the video using a specified strategy (uniform, keyframe-based, or adaptive)
- Frame Preprocessing: Resize and normalize frames for model input (sketched after this list)
- VLM Inference: Process frames through vision-language model
- Persona Context Integration: Include persona information in prompts for domain-specific summaries
- Response Generation: Return structured summary with frame indices and confidence scores
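As a concrete illustration of the preprocessing step, here is a minimal Python sketch using OpenCV. The 448x448 target size is an assumption for illustration; each VLM defines its own expected input resolution.

import cv2
import numpy as np

def preprocess_frame(frame: np.ndarray, size: tuple[int, int] = (448, 448)) -> np.ndarray:
    """Resize a BGR frame and normalize pixel values to [0, 1] for model input."""
    resized = cv2.resize(frame, size, interpolation=cv2.INTER_AREA)
    return resized.astype(np.float32) / 255.0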
Available Models
Llama-4-Maverick
Model ID: meta-llama/Llama-4-Maverick
Type: Mixture of Experts (MoE) with 17B active parameters
Characteristics:
- VRAM: 62 GB (4-bit), 240 GB (full precision)
- Context length: 1M tokens
- Speed: Fast
- Multimodal: Yes
Best for:
- Long video sequences
- Complex scene understanding
- Multi-frame reasoning
Example:
curl -X POST http://localhost:8000/api/summarize \
-H "Content-Type: application/json" \
-d '{
"video_id": "game-video-123",
"persona_id": "baseball-analyst-456",
"frame_sample_rate": 2,
"max_frames": 16
}'
Gemma-3-27b
Model ID: google/gemma-3-27b-it
Type: Dense model with 27B parameters
Characteristics:
- VRAM: 14 GB (4-bit), 54 GB (full precision)
- Context length: 128K tokens
- Speed: Very fast
- Multimodal: Yes
Best for:
- Document analysis in videos
- OCR and text extraction
- Multilingual video content
- Limited VRAM environments
Example:
curl -X POST http://localhost:8000/api/summarize \
-H "Content-Type: application/json" \
-d '{
"video_id": "presentation-123",
"persona_id": "document-analyst-789",
"frame_sample_rate": 1,
"max_frames": 8
}'
InternVL3-78B
Model ID: OpenGVLab/InternVL3-78B
Type: Dense model with 78B parameters
Characteristics:
- VRAM: 40 GB (4-bit), 156 GB (full precision)
- Context length: 32K tokens
- Speed: Medium
- Multimodal: Yes
Best for:
- High accuracy requirements
- Scientific and technical analysis
- Detailed scene descriptions
- Benchmark-quality results
Example:
curl -X POST http://localhost:8000/api/summarize \
-H "Content-Type: application/json" \
-d '{
"video_id": "medical-procedure-123",
"persona_id": "medical-researcher-456",
"frame_sample_rate": 1,
"max_frames": 12
}'
Pixtral-Large
Model ID: mistralai/Pixtral-Large-Instruct-2411
Type: Large multimodal model
Characteristics:
- VRAM: 62 GB (4-bit), 240 GB (full precision)
- Context length: 128K tokens
- Speed: Medium
- Multimodal: Yes
Best for:
- Long context requirements
- Batch processing multiple videos
- Complex multi-frame analysis
- Extended video sequences
Example:
curl -X POST http://localhost:8000/api/summarize \
-H "Content-Type: application/json" \
-d '{
"video_id": "conference-video-123",
"persona_id": "business-analyst-456",
"frame_sample_rate": 2,
"max_frames": 32
}'
Qwen2.5-VL-72B
Model ID: Qwen/Qwen2.5-VL-72B-Instruct
Type: Dense model with 72B parameters
Characteristics:
- VRAM: 36 GB (4-bit), 144 GB (full precision)
- Context length: 32K tokens
- Speed: Fast
- Multimodal: Yes
Best for:
- General-purpose summarization
- Stable baseline results
- Balanced speed and accuracy
- Production deployments
Example:
curl -X POST http://localhost:8000/api/summarize \
-H "Content-Type: application/json" \
-d '{
"video_id": "surveillance-footage-123",
"persona_id": "security-analyst-456",
"frame_sample_rate": 1,
"max_frames": 10
}'
Frame Sampling Strategies
Uniform Sampling (Default)
Extract frames at evenly spaced intervals.
Algorithm:
frame_indices = linspace(0, total_frames, frame_count, endpoint=False)
Best for:
- General video analysis
- Unknown content structure
- Consistent temporal coverage
Example:
{
"sampling_strategy": "uniform",
"frame_count": 8
}
For a 240-frame video, this extracts frames 0, 30, 60, 90, 120, 150, 180, and 210.
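A minimal sketch of this strategy in Python with OpenCV; endpoint=False keeps every index within range:

import cv2
import numpy as np

def uniform_sample(video_path: str, frame_count: int) -> list[np.ndarray]:
    """Extract evenly spaced frames; e.g. 240 frames, count 8 -> 0, 30, ..., 210."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total, frame_count, endpoint=False).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames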
Keyframe-Based Sampling
Extract frames at scene changes and visual transitions.
Algorithm:
- Detect scene changes using frame difference
- Select frames at transition points
- Fill remaining count with uniform samples
Best for:
- Videos with distinct scenes
- Presentations and slideshows
- Action sequences with cuts
Example:
{
"sampling_strategy": "keyframe",
"frame_count": 12,
"scene_threshold": 30.0
}
Parameters:
scene_threshold: Pixel difference threshold for scene detection (default: 30.0)
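A sketch of the keyframe strategy, approximating scene changes as a mean absolute difference between consecutive grayscale frames above scene_threshold; the service's actual detector may differ:

import cv2
import numpy as np

def keyframe_indices(video_path: str, frame_count: int,
                     scene_threshold: float = 30.0) -> list[int]:
    """Select indices at scene transitions, topping up with uniform samples."""
    cap = cv2.VideoCapture(video_path)
    indices, prev_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Flag a scene change when the average pixel difference is large.
        if prev_gray is not None and cv2.absdiff(gray, prev_gray).mean() > scene_threshold:
            indices.append(idx)
        prev_gray = gray
        idx += 1
    cap.release()
    if len(indices) < frame_count:
        # Fill remaining slots with uniform samples across the whole video.
        fill = np.linspace(0, idx, frame_count - len(indices), endpoint=False).astype(int)
        indices = sorted(set(indices) | set(fill.tolist()))
    return indices[:frame_count]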
Adaptive Sampling
Dynamically allocate frames based on visual complexity.
Algorithm:
- Calculate visual complexity per segment
- Allocate more frames to complex segments
- Fewer frames to static segments
Best for:
- Mixed content (static and dynamic)
- Efficient frame allocation
- Unknown video characteristics
Example:
{
"sampling_strategy": "adaptive",
"frame_count": 16,
"complexity_metric": "gradient"
}
Parameters:
complexity_metric: Metric used to score segment complexity (gradient, entropy, or motion)
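A sketch of adaptive allocation using the gradient metric: score one probe frame per segment by mean Sobel gradient magnitude, then hand out frames in proportion to those scores. The segment count and scoring are illustrative assumptions, not the service's implementation:

import cv2
import numpy as np

def adaptive_indices(video_path: str, frame_count: int, segments: int = 8) -> list[int]:
    """Allocate more frame samples to visually complex segments ('gradient' metric)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    bounds = np.linspace(0, total, segments + 1).astype(int)
    scores = []
    for start in bounds[:-1]:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(start))
        ok, frame = cap.read()
        if not ok:
            scores.append(0.0)
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
        scores.append(float(np.hypot(gx, gy).mean()))  # mean gradient magnitude
    cap.release()
    weights = np.asarray(scores) / (sum(scores) or 1.0)
    alloc = np.maximum(1, np.round(weights * frame_count)).astype(int)
    indices: list[int] = []
    for (start, end), n in zip(zip(bounds[:-1], bounds[1:]), alloc):
        indices.extend(np.linspace(start, end, int(n), endpoint=False).astype(int).tolist())
    return sorted(set(indices))[:frame_count]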
Persona Context Integration
Persona information is automatically included in the summarization prompt to provide domain-specific, context-aware summaries. This feature allows the same video to be summarized differently based on the analyst's perspective.
How It Works
When a summarization request includes a persona_id, the model service:
- Fetches persona data from the backend (role, information need, description)
- Constructs context-aware prompt including:
- Analyst role (e.g., "Sports Analyst", "Security Analyst")
- Information need (e.g., "Track player movements", "Identify security threats")
- Domain-specific vocabulary from the persona's ontology
- Guides VLM generation to focus on relevant aspects of the video
Persona-Specific Summaries
Example: Same baseball video, different personas
Sports Analyst Persona:
- Role: "Sports Analyst"
- Information Need: "Track player performance and game statistics"
Generated Summary:
"The pitcher delivers a fastball at 94 mph. The batter swings and makes contact, sending the ball to deep left field. The left fielder tracks the ball and makes a diving catch at the warning track. The runner on first base tags up but holds at second base after the catch."
Security Analyst Persona:
- Role: "Security Analyst"
- Information Need: "Monitor crowd behavior and identify security concerns"
Generated Summary:
"Large crowd visible in the stadium stands. Multiple access points to the field are secured with barriers. Security personnel positioned at regular intervals along the perimeter. Camera coverage shows clear sight lines across the venue. No unusual crowd movements or gatherings detected."
Persona Data Structure
The persona context passed to the model includes:
{
"role": "Sports Analyst",
"informationNeed": "Track player performance metrics",
"description": "Analyzing baseball games for player statistics",
"ontologyContext": {
"entityTypes": ["Player", "Ball", "Umpire"],
"eventTypes": ["Pitch", "Hit", "Catch"]
}
}
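A minimal sketch of how a persona record like the one above could be folded into the summarization prompt; the actual template is internal to the service:

def build_persona_prompt(persona: dict) -> str:
    """Fold persona fields into a context-aware prompt (illustrative template)."""
    ontology = persona.get("ontologyContext", {})
    vocab = ", ".join(ontology.get("entityTypes", []) + ontology.get("eventTypes", []))
    return (
        f"You are a {persona['role']}. {persona['description']}.\n"
        f"Information need: {persona['informationNeed']}.\n"
        f"Prefer this domain vocabulary where it applies: {vocab}.\n"
        "Summarize the video frames from this perspective."
    )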
Benefits
- Domain Relevance: Summaries focus on what matters to each analyst
- Terminology Alignment: Uses vocabulary from persona's ontology
- Reduced Noise: Filters out irrelevant details
- Multi-Perspective Analysis: Same video analyzed from multiple viewpoints
Best Practices
- Define clear roles: Specific roles produce better summaries than generic ones
- Articulate information needs: Clear goals guide what the model focuses on
- Build rich ontologies: More context helps the model understand domain vocabulary
- Use consistent personas: Reuse personas across videos for comparable summaries
API Endpoint
Request
POST /api/summarize
Content-Type: application/json
Request Schema:
{
"video_id": "abc-123",
"persona_id": "persona-456",
"frame_sample_rate": 1,
"max_frames": 30,
"enable_audio": false,
"audio_language": null,
"enable_speaker_diarization": false,
"fusion_strategy": "sequential"
}
Core Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| video_id | string | Yes | - | Unique identifier for the video |
| persona_id | string | Yes | - | Unique identifier for the persona |
| frame_sample_rate | integer | No | 1 | Frames to sample per second (1-10) |
| max_frames | integer | No | 30 | Maximum frames to process (1-100) |
Audio Parameters (Optional):
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| enable_audio | boolean | No | false | Enable audio transcription |
| audio_language | string \| null | No | null | Language code (e.g., "en", "es"). Auto-detects if null |
| enable_speaker_diarization | boolean | No | false | Enable speaker identification |
| fusion_strategy | string \| null | No | "sequential" | Audio-visual fusion strategy: "sequential", "timestamp_aligned", "native_multimodal", "hybrid" |
See the Audio Transcription API for detailed audio parameter documentation.
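The same request can be made from Python; a minimal sketch with the requests library, mirroring the host and port used in the curl examples:

import requests

payload = {
    "video_id": "abc-123",
    "persona_id": "persona-456",
    "frame_sample_rate": 1,
    "max_frames": 30,
}

resp = requests.post(
    "http://localhost:8000/api/summarize",
    json=payload,
    timeout=120,  # inference can take several seconds per request
)
resp.raise_for_status()
print(resp.json()["summary"])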
Response
Status: 200 OK
Without Audio:
{
"id": "summary-123",
"video_id": "abc-123",
"persona_id": "persona-456",
"summary": "The video shows a baseball game with a pitcher throwing from the mound. The batter prepares to hit. The pitch is thrown, and the batter swings, making contact.",
"visual_analysis": "Frame 0: Pitcher on mound. Frame 150: Batter at plate...",
"audio_transcript": null,
"key_frames": [
{
"frame_number": 0,
"timestamp": 0.0,
"description": "Pitcher on mound",
"confidence": 0.95
}
],
"confidence": 0.92,
"transcript_json": null,
"audio_language": null,
"speaker_count": null,
"audio_model_used": null,
"visual_model_used": "llama-4-maverick",
"fusion_strategy": null,
"processing_time_audio": null,
"processing_time_visual": 8.3,
"processing_time_fusion": null
}
With Audio Enabled:
{
"id": "summary-124",
"video_id": "abc-123",
"persona_id": "persona-456",
"summary": "The video shows a baseball game. The announcer describes the pitcher throwing a fastball as the batter prepares to swing.",
"visual_analysis": "Frame 0: Pitcher on mound. Frame 150: Batter at plate...",
"audio_transcript": "And here comes the pitch. It's a fastball, right down the middle.",
"key_frames": [
{
"frame_number": 0,
"timestamp": 0.0,
"description": "Pitcher on mound",
"confidence": 0.95
}
],
"confidence": 0.94,
"transcript_json": {
"segments": [
{
"start": 2.5,
"end": 6.8,
"text": "And here comes the pitch. It's a fastball, right down the middle.",
"speaker": "Speaker 1",
"confidence": 0.92
}
],
"language": "en",
"speaker_count": 1
},
"audio_language": "en",
"speaker_count": 1,
"audio_model_used": "whisper-large-v3",
"visual_model_used": "llama-4-maverick",
"fusion_strategy": "sequential",
"processing_time_audio": 12.5,
"processing_time_visual": 8.3,
"processing_time_fusion": 1.2
}
Core Response Fields:
| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier for this summary |
| video_id | string | Video identifier |
| persona_id | string | Persona identifier |
| summary | string | Generated text summary |
| visual_analysis | string \| null | Detailed visual content analysis |
| audio_transcript | string \| null | Transcribed audio content (if audio enabled) |
| key_frames | array | Key frames with descriptions and timestamps |
| confidence | float | Overall confidence score (0.0-1.0) |
Audio Response Fields (when audio is enabled):
| Field | Type | Description |
|---|---|---|
| transcript_json | object \| null | Structured transcript with segments and speakers |
| audio_language | string \| null | Detected or specified language code |
| speaker_count | number \| null | Number of distinct speakers identified |
| audio_model_used | string \| null | Audio transcription model name |
| visual_model_used | string \| null | Visual analysis model name |
| fusion_strategy | string \| null | Fusion strategy applied |
| processing_time_audio | number \| null | Audio processing time in seconds |
| processing_time_visual | number \| null | Visual processing time in seconds |
| processing_time_fusion | number \| null | Fusion processing time in seconds |
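A short sketch that walks the response fields above, printing key frames and (when audio is enabled) transcript segments:

def digest(summary: dict) -> None:
    """Print the main fields of a /api/summarize response."""
    print(f"Summary ({summary['confidence']:.2f} confidence): {summary['summary']}")
    for kf in summary["key_frames"]:
        print(f"  frame {kf['frame_number']} @ {kf['timestamp']:.1f}s: {kf['description']}")
    transcript = summary.get("transcript_json")
    if transcript:  # present only when audio is enabled
        for seg in transcript["segments"]:
            print(f"  [{seg['start']:.1f}-{seg['end']:.1f}s] "
                  f"{seg.get('speaker', '?')}: {seg['text']}")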
Error Responses
400 Bad Request:
{
"error": "Validation Error",
"message": "frame_count must be between 1 and 64",
"details": {
"field": "frame_count",
"value": 100,
"constraint": "max:64"
}
}
404 Not Found:
{
"error": "Not Found",
"message": "Video file not found: /data/missing.mp4"
}
500 Internal Server Error:
{
"error": "Inference Error",
"message": "Model inference failed: CUDA out of memory",
"model": "llama-4-maverick",
"frame_index": 45
}
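One way a client might handle these error shapes; the retry policy is an assumption, not a service guarantee:

import time
import requests

def summarize_with_retry(payload: dict, retries: int = 2) -> dict:
    """POST to /api/summarize, retrying 5xx errors (e.g. transient CUDA OOM)."""
    err = {}
    for attempt in range(retries + 1):
        resp = requests.post("http://localhost:8000/api/summarize",
                             json=payload, timeout=120)
        if resp.status_code == 200:
            return resp.json()
        err = resp.json()
        if resp.status_code in (400, 404):
            # Validation and missing-video errors will not succeed on retry.
            raise ValueError(f"{err['error']}: {err['message']}")
        time.sleep(2 ** attempt)  # back off before retrying a 5xx
    raise RuntimeError(f"Inference failed after {retries + 1} attempts: {err.get('message')}")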
Persona Context Examples
Persona context provides domain-specific knowledge to the model, improving summary relevance.
Without Persona Context
Request:
{
"video_id": "game-video-123",
"persona_id": "generic-analyst-456",
"max_frames": 8
}
Response summary:
"The video shows people on a field. One person throws an object. Another person holds a stick."
With Persona Context
Request:
{
"video_id": "game-video-123",
"persona_id": "baseball-analyst-456",
"max_frames": 8
}
(Persona has role="Baseball analyst" and information_need="tracking pitcher mechanics and pitch outcomes")
Response summary:
"The pitcher delivers a fastball from the stretch position with a three-quarters arm slot. The pitch crosses the plate at approximately 92 mph. The batter takes a full swing, making contact on the outer third of the plate. The result is a fly ball to center field."
Effective Persona Context
Good contexts:
- "Medical researcher analyzing surgical procedures"
- "Wildlife biologist studying animal behavior"
- "Security analyst reviewing surveillance footage"
- "Sports coach evaluating player technique"
Poor contexts:
- "Person" (too generic)
- "Expert" (vague)
- "Someone who knows about this" (unclear domain)
Best practices:
- Include domain expertise
- Specify what to track or analyze
- Use 5-15 words
- Avoid generic terms
Performance Comparison
CPU vs GPU Processing
| Model | CPU (frames/sec) | GPU T4 (frames/sec) | GPU A100 (frames/sec) |
|---|---|---|---|
| Gemma-3-27b | 0.5 | 4 | 12 |
| Qwen2.5-VL-72B | 0.2 | 2 | 8 |
| Llama-4-Maverick | 0.1 | 1.5 | 6 |
| InternVL3-78B | 0.1 | 1 | 4 |
| Pixtral-Large | 0.1 | 1.5 | 6 |
Note: CPU inference is suitable for development only; production deployments require a GPU.
Latency Breakdown
For 8-frame request with Qwen2.5-VL-72B on GPU T4:
| Step | Duration | Percentage |
|---|---|---|
| Frame extraction | 120 ms | 6% |
| Frame preprocessing | 80 ms | 4% |
| Model loading (first request) | 15,000 ms | - |
| Model inference | 1,600 ms | 85% |
| Response formatting | 50 ms | 3% |
| Network overhead | 40 ms | 2% |
| Total (cached model) | 1,890 ms | 100% |
Frame Count Impact
Using Qwen2.5-VL-72B on GPU T4:
| Frame Count | Inference Time | Tokens Generated |
|---|---|---|
| 1 | 400 ms | 20 |
| 4 | 850 ms | 35 |
| 8 | 1,600 ms | 48 |
| 16 | 3,100 ms | 72 |
| 32 | 6,200 ms | 110 |
| 64 | 12,500 ms | 180 |
Inference time scales approximately linearly with frame count.
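Fitting a line to the table above gives a rough rule of thumb for this model/GPU pairing (illustrative, derived only from these measurements):

import numpy as np

frames = np.array([1, 4, 8, 16, 32, 64])
ms = np.array([400, 850, 1600, 3100, 6200, 12500])
slope, intercept = np.polyfit(frames, ms, 1)
# Roughly: inference_ms ~= 193 * frame_count + 84 (Qwen2.5-VL-72B on a T4)
print(f"inference_ms ~= {slope:.0f} * frames + {intercept:.0f}")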
Use Cases and Limitations
When to Use Video Summarization
- Initial video analysis: Generate overviews before detailed annotation
- Content discovery: Identify videos of interest in large collections
- Annotation context: Provide summaries to guide annotators
- Quality control: Verify video content matches metadata
- Search indexing: Create searchable text descriptions
Limitations
- Temporal reasoning: Models analyze individual frames, not continuous motion
- Small object detection: Details smaller than 5% of frame may be missed
- Rare events: Models may not recognize uncommon objects or actions
- Context limitations: 8-16 frames cannot capture hour-long narratives
- Hallucinations: Models may describe plausible but incorrect details
Accuracy Expectations
| Scenario | Expected Accuracy |
|---|---|
| General object identification | 85-95% |
| Action recognition (common) | 75-85% |
| Text extraction (OCR) | 90-98% |
| Fine-grained classification | 60-75% |
| Spatial relationships | 70-85% |
| Temporal sequences | 50-70% |
Use summaries as starting points, not ground truth. Always verify critical details.
Troubleshooting
Summary Quality Issues
Symptom: Generic or uninformative summaries
Causes:
- Insufficient frame count
- Poor frame sampling strategy
- Missing persona context
Solutions:
- Increase frame count:
{
"frame_count": 16 // Up from 8
}
- Use keyframe sampling for scene-based videos:
{
"sampling_strategy": "keyframe"
}
- Add specific persona context:
{
"persona_context": "Sports analyst tracking game events and player actions"
}
Slow Inference
Symptom: Summaries take 10+ seconds
Causes:
- CPU mode (inherently slow)
- Large model on limited GPU
- High frame count
Solutions:
- Switch to GPU mode (see Configuration)
- Use a faster model:
{
"model": "gemma-3-27b" // Fastest option
}
- Reduce frame count:
{
"frame_count": 4 // Down from 8
}
CUDA Out of Memory
Symptom: Error during inference
Causes:
- Model too large for GPU
- High frame count exceeds batch limit
Solutions:
- Enable 4-bit quantization in config.
- Reduce frame count:
{
"frame_count": 4
}
- Switch to a smaller model:
{
"model": "gemma-3-27b" // Only 14GB VRAM
}
Missing Persona Context
Symptom: Generic summaries despite providing persona context
Cause: Persona context not applied correctly
Solutions:
- Verify persona_applied in the response:
{
"persona_applied": true // Should be true
}
- Check the persona context format (5-15 words, specific domain).
- Review logs for context processing errors:
docker compose logs model-service | grep persona
Example Workflows
Workflow 1: Quick Video Overview
Generate a quick summary for initial review:
curl -X POST http://localhost:8000/api/summarize \
-H "Content-Type: application/json" \
-d '{
"video_id": "new-video-123",
"persona_id": "general-analyst-456",
"frame_sample_rate": 1,
"max_frames": 4
}'
Workflow 2: Detailed Analysis with Audio
Get comprehensive summary with audio transcription:
curl -X POST http://localhost:8000/api/summarize \
-H "Content-Type: application/json" \
-d '{
"video_id": "game-footage-123",
"persona_id": "basketball-coach-456",
"frame_sample_rate": 2,
"max_frames": 16,
"enable_audio": true,
"enable_speaker_diarization": true,
"fusion_strategy": "timestamp_aligned"
}'
Workflow 3: OCR and Text Extraction
Extract text from presentation or document video:
curl -X POST http://localhost:8000/api/summarize \
-H "Content-Type: application/json" \
-d '{
"video_id": "presentation-123",
"persona_id": "document-analyst-456",
"frame_sample_rate": 1,
"max_frames": 12
}'
Next Steps
- Audio Transcription API - Add audio transcription to summaries
- Audio-Visual Fusion Strategies - Learn about fusion approaches
- Configure models for your hardware
- Set up object detection
- Enable video tracking
- Use ontology augmentation
- Return to overview