Integrating SAM2 and Vision Models for Automated Video Reporting Dashboards

Tracking objects across hours of raw CCTV or drone footage used to be a manual nightmare. Last month, I had to build a pipeline that turns raw video streams into structured, queryable reports. The goal was simple: identify specific entities (like construction machinery or retail customers), track them across frames, and push the metadata into a dashboard—all without human intervention.

Meta’s SAM2 (Segment Anything Model 2) changed the game for me here. Unlike the original SAM, which was strictly image-based, SAM2 handles video by maintaining a memory bank of object states. By pairing this with a lightweight detection model like YOLOv11, I could automate the tracking process with high precision.

The Architectural Blueprint

I structured the system into three distinct stages:

Detection (The Trigger): I use YOLOv11 to run inference on keyframes (every 30th frame). This keeps compute costs low.
Tracking (The Memory): I feed the YOLO detections as "prompts" into SAM2. SAM2 then propagates these masks across the intermediate frames using its internal memory state.
Reporting (The Dashboard): I extract the mask coordinates, convert them to bounding box centroids, and push the telemetry to a PostgreSQL instance (using PostGIS for spatial queries) and a React dashboard.
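
To make the hand-off between the stages concrete, here is a minimal sketch of how keyframe detections could be collected into the prompt dictionary that the tracking stage consumes, plus the box-to-centroid step used for reporting. The names keyframe_indices, build_prompts, and box_centroid are hypothetical, and detect stands in for whatever detector wrapper you use; it is assumed to return one [x1, y1, x2, y2] box or None.

```python
from typing import Callable, Dict, List, Optional, Tuple

Box = List[float]  # [x1, y1, x2, y2] in pixels


def keyframe_indices(total_frames: int, stride: int = 30) -> List[int]:
    """Frame indices the detector runs on (every `stride`-th frame)."""
    return list(range(0, total_frames, stride))


def build_prompts(
    total_frames: int,
    detect: Callable[[int], Optional[Box]],  # hypothetical detector wrapper
    stride: int = 30,
) -> Dict[int, Box]:
    """Map keyframe index -> detected box, skipping keyframes with no detection."""
    prompts: Dict[int, Box] = {}
    for idx in keyframe_indices(total_frames, stride):
        box = detect(idx)
        if box is not None:
            prompts[idx] = box
    return prompts


def box_centroid(box: Box) -> Tuple[float, float]:
    """Centre point (x, y) of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
```

The stride of 30 mirrors the keyframe interval described above; everything else is an illustrative choice.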

Implementation: The Tracking Loop

The core logic revolves around initializing the SAM2 predictor and feeding it the detection prompts. Here is how I set up the tracking loop in Python.

import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Load the model. I use the 'hiera_l' (large) checkpoint on A100s; the smaller
# checkpoints trade some mask quality for speed and lower VRAM usage.
checkpoint = "./checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

def calculate_centroid(mask):
    # mask: 2D boolean array (H, W).
    # Returns (x, y) in pixels, or None if the mask is empty.
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())

def run_tracking(video_path, detections):
    # detections: dict mapping frame index -> one [x1, y1, x2, y2] box in pixels
    inference_state = predictor.init_state(video_path)
    
    for frame_idx, box in detections.items():
        # Convert the box to a torch tensor
        box_tensor = torch.tensor(box, dtype=torch.float32)
        
        # Add the detection as a prompt to SAM2
        predictor.add_new_points_or_box(
            inference_state,
            frame_idx=frame_idx,
            obj_id=1,  # Assuming a single object for this example
            box=box_tensor,
        )

    # Propagate the mask through the video (propagate_in_video is a generator)
    video_segments = predictor.propagate_in_video(inference_state)
    
    # Extract mask centroids for the reporting dashboard
    report_data = []
    for frame_idx, out_obj_ids, out_mask_logits in video_segments:
        # Logits have shape (num_objects, 1, H, W); take the single tracked object
        mask = (out_mask_logits[0, 0] > 0.0).cpu().numpy()
        centroid = calculate_centroid(mask)
        report_data.append({"frame": frame_idx, "pos": centroid})
        
    return report_data
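
The list returned by the tracking loop still has to be shaped into rows before it can reach PostgreSQL. The sketch below shows one way that might look: it converts frame indices to seconds and drops frames with no position. The table name track_points, its columns, the camera_id field, and the to_rows helper are all hypothetical; ST_MakePoint is a PostGIS function, and the %s placeholders assume a psycopg-style driver.

```python
from typing import Dict, List, Tuple

# Hypothetical table and column names, shown only to illustrate the row shape.
INSERT_SQL = (
    "INSERT INTO track_points (camera_id, frame, t_seconds, geom) "
    "VALUES (%s, %s, %s, ST_MakePoint(%s, %s))"
)

Row = Tuple[str, int, float, float, float]


def to_rows(report_data: List[Dict], fps: float, camera_id: str) -> List[Row]:
    """Turn {"frame": int, "pos": (x, y) or None} items into insertable rows."""
    rows: List[Row] = []
    for item in report_data:
        pos = item.get("pos")
        if pos is None:
            # No usable position for this frame, so nothing to report.
            continue
        x, y = pos
        frame = int(item["frame"])
        rows.append((camera_id, frame, frame / fps, float(x), float(y)))
    return rows
```

With a database cursor in hand, the rows could then be passed to something like cursor.executemany(INSERT_SQL, rows); batching and error handling are left out here.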

Operational Trade-offs

I learned the hard way that SAM2 is memory-hungry. Running this on 4K footage will exhaust your VRAM if you try to process the whole video at once.

Chunking: I split long videos into 5-minute chunks. This keeps the memory state manageable and allows for parallel processing across multiple GPU nodes.
Prompt Frequency: You don't need to prompt SAM2 on every frame. If your object moves predictably, prompting every 60 frames is usually enough. If the object moves erratically, I trigger an auto-re-detection via YOLO to "reset" the mask.
Drift: If the object disappears behind an obstacle, SAM2 might lose the ID. I implemented a simple Kalman filter to bridge these gaps: if the mask vanishes, I use the Kalman-predicted position to search the following frames until the object reappears.
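
The gap-bridging idea can be sketched with a constant-velocity Kalman filter run independently on the x and y coordinates of the centroid. This is an illustration under stated assumptions, not the filter used in the pipeline: the class AxisKalman, the function bridge_gaps, the one-frame time step, and the noise values q and r are all hypothetical choices, and the process noise is simplified to a diagonal term.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


class AxisKalman:
    """1-D constant-velocity Kalman filter; state is (position, velocity)."""

    def __init__(self, pos: float, q: float = 1.0, r: float = 4.0):
        self.x = [pos, 0.0]                    # state estimate
        self.P = [[r, 0.0], [0.0, 100.0]]      # covariance; velocity starts uncertain
        self.q, self.r = q, r                  # process / measurement noise (illustrative)

    def predict(self, dt: float = 1.0) -> float:
        p, v = self.x
        self.x = [p + v * dt, v]
        P = self.P
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I
        p00 = P[0][0] + dt * (P[0][1] + P[1][0]) + dt * dt * P[1][1] + self.q
        p01 = P[0][1] + dt * P[1][1]
        p10 = P[1][0] + dt * P[1][1]
        p11 = P[1][1] + self.q
        self.P = [[p00, p01], [p10, p11]]
        return self.x[0]

    def update(self, z: float) -> None:
        # Measurement model H = [1, 0]: only position is observed.
        innovation = z - self.x[0]
        s = self.P[0][0] + self.r
        k0, k1 = self.P[0][0] / s, self.P[1][0] / s
        self.x = [self.x[0] + k0 * innovation, self.x[1] + k1 * innovation]
        P = self.P
        # P <- (I - K H) P
        self.P = [
            [(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
            [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]],
        ]


def bridge_gaps(track: List[Optional[Point]]) -> List[Optional[Point]]:
    """Replace None entries (mask lost) with predicted positions once a track exists."""
    fx: Optional[AxisKalman] = None
    fy: Optional[AxisKalman] = None
    out: List[Optional[Point]] = []
    for c in track:
        if fx is None or fy is None:
            if c is not None:
                fx, fy = AxisKalman(c[0]), AxisKalman(c[1])
            out.append(c)  # nothing to predict from before the first observation
            continue
        px, py = fx.predict(), fy.predict()
        if c is None:
            out.append((px, py))
        else:
            fx.update(c[0])
            fy.update(c[1])
            out.append(c)
    return out
```

The predicted points are only estimates; in practice they would define a search region for re-detection rather than be reported as measured positions.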

Debugging Tips

If your dashboard shows erratic tracking labels, check your box format first. SAM2 expects boxes in [x1, y1, x2, y2] pixel format. I spent three hours debugging a "jumping" mask, only to realize I was passing [x, y, w, h] from my YOLO output.
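
A small conversion helper makes this class of bug harder to reintroduce. The sketch below is an assumption-laden example: xywh_to_xyxy and looks_like_xyxy are hypothetical names, and the center flag exists because some detectors report (x, y) as the box centre while others report the top-left corner, so check which convention your detector output uses.

```python
from typing import List, Sequence


def xywh_to_xyxy(box: Sequence[float], center: bool = True) -> List[float]:
    """Convert [x, y, w, h] to [x1, y1, x2, y2].

    center=True treats (x, y) as the box centre; center=False treats it
    as the top-left corner.
    """
    x, y, w, h = box
    if center:
        return [x - w / 2, y - h / 2, x + w / 2, y + h / 2]
    return [x, y, x + w, y + h]


def looks_like_xyxy(box: Sequence[float], frame_w: float, frame_h: float) -> bool:
    """Heuristic sanity check: corners ordered and inside the frame."""
    x1, y1, x2, y2 = box
    return 0 <= x1 < x2 <= frame_w and 0 <= y1 < y2 <= frame_h
```

The check is only a heuristic, since an [x, y, w, h] box can pass it by coincidence, but it catches many of the obviously wrong cases before they reach the tracker.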

Also, keep an eye on the mask_logits. If the values are consistently near zero, your detection prompt is likely too loose (including too much background). I usually apply a 0.85 confidence threshold on the YOLO detections before sending them to SAM2; this significantly reduces false positives.
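
Both checks are easy to automate. The following sketch filters detections by confidence and flags a mask whose logits cluster around zero. The function names, the dictionary layout with a "conf" key, and the eps and frac thresholds are hypothetical; only the 0.85 default comes from the text above.

```python
from typing import Dict, List, Sequence


def filter_detections(dets: List[Dict], min_conf: float = 0.85) -> List[Dict]:
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in dets if d["conf"] >= min_conf]


def is_low_margin(logits: Sequence[float], eps: float = 1.0, frac: float = 0.9) -> bool:
    """True when at least `frac` of the logits lie within `eps` of zero.

    `logits` is a flat sequence of mask logit values (for example, a
    flattened sample of the mask); eps and frac are illustrative values.
    """
    if not logits:
        return False
    near_zero = sum(1 for v in logits if abs(v) < eps)
    return near_zero / len(logits) >= frac
```

A frame flagged by is_low_margin is a reasonable candidate for re-prompting with a tighter box.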

By offloading the segmentation to SAM2, I’ve moved from manual data entry to a system that generates reliable, time-series data for my clients. It’s not perfect—occlusions still require custom logic—but it’s the most robust pipeline I’ve built for video analytics to date.

Aditya Shenvi