SAM Series (for images and videos)

Introduction

SAM (Segment Anything Model) is a family of models published by Meta AI that aims to segment components out of content. It started with image segmentation and gradually expanded to video, 3D geometry, and audio.

In this post, we will focus on the image and video segmentation models. Segmentation is a classic computer vision task that has been studied extensively for decades, but the SAM series changed the field in the following ways:

From expert model to foundation model: Traditional segmentation models are usually trained for a specific task or dataset. SAM introduced a "Promptable Segmentation" task and does not require any fine-tuning for zero-shot segmentation.
Unprecedented dataset scale and strong generalization ability: Previous models are usually trained on ~100k images and ~1M masks. SAM v1 is trained on 11M images and 1.1B masks, which exceeds the scale of previous datasets by more than 100x.

The SAM series models are evolving over time and here is a summary of the key facts:

| Version | Goal / Task | Key Contributions |
| --- | --- | --- |
| SAM (Segment Anything) | Promptable Segmentation Task: return a valid segmentation mask given any prompt (point, box, mask, text) on a static image. Aimed at zero-shot generalization to new distributions. | • Model: Proposed the SAM architecture (Heavy Image Encoder + Light Prompt Encoder/Mask Decoder) for real-time interaction. • Data: Created SA-1B dataset (11M images, 1.1B masks) using a 3-stage data engine (Manual $\to$ Semi-Auto $\to$ Fully Auto). |
| SAM 2 | Promptable Visual Segmentation (PVS): unify image and video segmentation. Extends the "segment anything" capability to video by maintaining temporal consistency and handling object occlusions/re-appearance. | • Model: Introduced a streaming architecture with Memory Attention and a Memory Bank to condition outputs on past frames. • Data: Created SA-V dataset (51K videos, 642K masklets) using a model-in-the-loop data engine. • Performance: Faster (6x speedup over SAM) and more accurate in both image and video domains. |
| SAM 3 | Promptable Concept Segmentation (PCS): segment all instances matching a specific concept (text phrase or visual exemplar) in images and videos. Moving from "segment one specific target" to "segment all matching targets". | • Model: Introduced a decoupled architecture with a Detector (for recognition) and Tracker (from SAM 2), plus a Presence Head to separate recognition from localization. • Data: Created SA-Co dataset (1.7B image-concept pairs) with 4M+ unique concepts specific to semantic understanding. |

SAM v1: Promptable Segmentation and Scalable Data Generation

Define the task

To serve as a foundation model for image segmentation, SAM needs strong generalization capability. For this purpose, the authors defined a "Promptable Segmentation" task: given an image and a prompt, the model should return the segmentation result.

SAM produces high-quality masks for all objects in an image.

To achieve this goal, there are three main issues to resolve:

What is the architecture of the model? It should be able to process both the image and the prompts and efficiently interact with the prompts.
How to generate sufficient high-quality training data and train the model? High-quality training data is critical and was very scarce before this work. The authors designed a model-in-the-loop data engine that iteratively generates data and retrains the model.
How to deal with the ambiguity of the prompts? A point on a human face can mean either the face or the entire body. To deal with this ambiguity, the authors proposed predicting multiple candidate masks for each prompt.

SAM v1 model architecture

Remarks:

Image Encoder module:
- At training time, the parameters of the image encoder are learnable. Based on empirical observations with the MAE-pretrained ViT-H/16, fine-tuning (making the parameters learnable) can yield much better performance than linear probing (freezing the parameters and using the encoder as a fixed feature extractor).
Prompt Encoder module:
- For text, the authors use a frozen CLIP Text Encoder for zero-shot text-to-mask transfer.
  - Training: Since SA-1B has no text, they use the CLIP Image Encoder (frozen) on random crops around masks to generate "pseudo-text" prompt embeddings.
  - Inference: They use the CLIP Text Encoder (frozen) on the input text. Because CLIP aligns image and text spaces, this works zero-shot without explicit text supervision.
Mask Decoder module:
- There are 4 learned tokens (1 IoU + 3 Mask) that act as "queries". Initially identical for all inputs, they gather information from the image and prompts via the Transformer's attention mechanism. They can be perceived as [CLS] tokens or registers in ViT. I did not find any part of the paper indicating that each output token is a "query" for a specific kind of output mask (for example, that learned output token 1 is the query for the "whole object" mask). Therefore, I think it would not matter much if there were 3, 4, 5, or even 6 output tokens.
- The transformer module inside the mask decoder combines self-attention and cross-attention. My understanding of the architecture is this.
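
To make the role of the IoU output concrete, here is a minimal sketch of how a caller might pick one mask out of the three candidates by ranking them on the predicted IoU. It assumes the decoder returns each candidate paired with a predicted IoU score; the `Candidate` type and the `pick_best` helper are hypothetical names for illustration, not SAM's actual API, and the scores are made up.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate mask for a prompt (hypothetical container)."""
    name: str             # placeholder for the actual mask data
    predicted_iou: float  # the model's own estimate of mask quality


def pick_best(candidates):
    """Return the candidate with the highest predicted IoU."""
    return max(candidates, key=lambda c: c.predicted_iou)


# Made-up scores for the three masks predicted from one ambiguous click.
candidates = [
    Candidate("mask_0", 0.71),
    Candidate("mask_1", 0.93),
    Candidate("mask_2", 0.88),
]
best = pick_best(candidates)
print(best.name)
```

Note that nothing in this sketch depends on which token produced which mask, which is consistent with the remark above that the tokens need not have fixed roles.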

Data Engine and training recipe

The final dataset generated in the paper has more than 1.1B masks, and it is impossible to label all of them manually (napkin math: 1.1B masks × 50s per mask (estimated in the paper) ≈ 1,700 years of nonstop work for a single annotator). The authors therefore proposed a model-in-the-loop (MITL) training strategy that iteratively trains the model and generates the dataset. Training started with some public segmentation datasets (the paper did not specify which, but I assume COCO, LVIS, etc.) to obtain a very early version of SAM. The process then iterates: generate masks, have human annotators correct them, and retrain the model. The detailed process is summarized in the three stages below.
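
The napkin math can be checked in a few lines. This is only a back-of-the-envelope sketch: it takes the 50s-per-mask figure quoted above at face value and assumes a single annotator working around the clock, which gives roughly 1,700 years.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def annotation_years(num_masks, seconds_per_mask):
    """Years of nonstop work for one annotator to label every mask by hand."""
    return num_masks * seconds_per_mask / SECONDS_PER_YEAR


years = annotation_years(1.1e9, 50)
print(round(years))  # 1744
```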

Stage 1: Assisted-Manual

Focus: Human-in-the-loop

Input: Initial SAM trained on public datasets

Process:

Professional annotators use browser-based tool
Click fg/bg points → SAM predicts mask
Refine with brush/eraser

Feedback: Retrain SAM 6 times (ViT-B → ViT-H)

Output: 4.3M masks / 120K images

Stage 2: Semi-Automatic

Focus: Diversity

Input: SAM + Box Detector

Process:

Auto-detect confident objects
Annotators add additional objects

Feedback: Retrain SAM 5 times

Output: 5.9M new masks / 180K images (10.2M total)

Stage 3: Fully Automatic

Focus: Scale (Zero human effort)

Input: SAM with ambiguity-aware output

Process:

32×32 point grid prompt
Predict 3 masks/point
Filter by IoU/Stability/NMS

Feedback: None

Output: 1.1B masks / 11M images
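
The Stage 3 recipe can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, the thresholds (0.88 for predicted IoU, 0.95 for stability) are illustrative assumptions, mask logits are represented as a flat list of numbers, and the NMS step is left out. The stability score here is the IoU between the mask obtained with a stricter threshold and the one obtained with a looser threshold, so a mask that barely changes when the threshold moves scores close to 1.

```python
def point_grid(n_per_side):
    """Return n_per_side**2 (x, y) prompts, evenly spaced in normalized [0, 1] coordinates."""
    offset = 1 / (2 * n_per_side)
    coords = [offset + i / n_per_side for i in range(n_per_side)]
    return [(x, y) for y in coords for x in coords]


def stability_score(logits, threshold=0.0, offset=1.0):
    """IoU between the mask cut at (threshold + offset) and the mask cut at (threshold - offset).

    The stricter mask is always contained in the looser one, so the IoU
    reduces to a ratio of pixel counts.
    """
    high = sum(1 for v in logits if v > threshold + offset)
    low = sum(1 for v in logits if v > threshold - offset)
    return high / low if low else 0.0


def keep_mask(predicted_iou, logits, iou_thresh=0.88, stability_thresh=0.95):
    """Keep a mask only if it is both confident and stable (assumed thresholds)."""
    return predicted_iou >= iou_thresh and stability_score(logits) >= stability_thresh


grid = point_grid(32)
print(len(grid))  # 1024 point prompts per image
```

With 3 masks predicted per point, a 32×32 grid yields up to 3,072 candidate masks per image before filtering, which is why the filtering step matters so much at this stage.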

SAM v2: Unified Video and Image Segmentation

SAM 2 unifies image and video segmentation with a single model.

Define the task

Moving from static images to dynamic videos, the authors introduced the Promptable Visual Segmentation (PVS) task. The goal is to extend the "Promptable Segmentation" capability to the temporal dimension:

Input: Prompts (clicks, boxes, masks) on any frame of a video.
Output: A valid spatio-temporal masklet (a 3D mask across time) for the target object.
Goal: The masklet should be consistent across frames, and the model should be flexible enough for interactive use. It should handle a variety of challenges in video segmentation, such as object appearance changes, occlusions, and background clutter.

PVS unifies image and video segmentation because an image is simply a 1-frame video.

The PVS task requires segmenting an object across video frames given prompts on any frame.

SAM v2 model architecture

The core challenge in video is Memory: "What did the object look like in previous frames?" and "Where was it?". SAM v2 introduces a streaming memory mechanism to solve this.

SAM 2 Architecture with Streaming Memory.

Remarks:

Hiera Backbone: SAM v2 uses Hiera, a hierarchical Vision Transformer, instead of the plain ViT in SAM v1. This allows for multiscale features and is significantly faster, enabling real-time video processing.
Streaming Memory:
- Spatial Memory: The model stores "fused" feature maps of the image and the predicted mask from past frames. This captures "where" the object was and "what" it looked like.
- Object Pointers: Lightweight vectors (from the mask decoder output tokens) that capture high-level semantic information (the "identity" of the object) to prevent drifting to similar looking distractors.
- Memory Bank: A FIFO queue stores the last $N$ frames of spatial memory and up to $M$ "prompted" frames (frames where users explicitly interacted).
Occlusion Handling: A new head predicts whether the object is visible or occluded in the current frame. If it is occluded, the model can "skip" predicting a mask but still maintain memory.
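
The Memory Bank bookkeeping can be sketched with two bounded queues. This is a toy illustration under stated assumptions: the `MemoryBank` class and its method names are hypothetical, the "memory" is an opaque placeholder rather than a real feature map, and the prompted frames are also kept in a bounded FIFO queue of size $M$.

```python
from collections import deque


class MemoryBank:
    """Toy memory bank: up to N recent frames plus up to M prompted frames."""

    def __init__(self, max_recent, max_prompted):
        # deque(maxlen=...) silently drops the oldest entry when full (FIFO).
        self.recent = deque(maxlen=max_recent)
        self.prompted = deque(maxlen=max_prompted)

    def add(self, frame_idx, memory, prompted=False):
        """Store a frame's memory; prompted frames go to their own queue."""
        target = self.prompted if prompted else self.recent
        target.append((frame_idx, memory))

    def frame_indices(self):
        """Indices of all frames the current frame can attend to."""
        return sorted(idx for idx, _ in list(self.prompted) + list(self.recent))


bank = MemoryBank(max_recent=2, max_prompted=1)
bank.add(0, "memory of frame 0", prompted=True)  # the user clicked on frame 0
for idx in (1, 2, 3):
    bank.add(idx, f"memory of frame {idx}")
print(bank.frame_indices())  # [0, 2, 3]
```

Frame 1 has been evicted from the recent queue, but the prompted frame 0 survives. Keeping prompted frames separately means the user's explicit input is not pushed out by a long run of unprompted frames.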

Data Engine: SA-V Dataset

To train this video foundation model, the authors built the SA-V dataset (50.9K videos, 642.6K masklets). They used a model-in-the-loop scheme similar to SAM v1's, but adapted for video.

Comparison of manual annotation (left) vs. Model-assisted annotation (right).

Phase 1: SAM per frame

Method: Brute force

Annotators use SAM v1 to segment the object in every single frame.
No temporal tracking assistance.
Very slow (37.8s / frame) but high quality.

Phase 2: SAM + SAM 2 Mask

Method: Temporal Propagation

Annotators segment one frame with SAM v1.
SAM 2 Mask (accepts only mask prompts) propagates prediction to other frames.
Refine errors by re-segmenting frames from scratch.
5.1x faster.

Phase 3: SAM 2

Method: Fully Interactive

Annotators use the full SAM 2.
Click/Box on any frame to validate/correct the track.
Memory mechanism handles consistency.
8.4x faster than Phase 1.
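
To put the speedups in absolute terms, the per-frame annotation time of each phase can be derived from the Phase 1 figure. This small sketch assumes both speedups are measured relative to Phase 1.

```python
PHASE1_SECONDS_PER_FRAME = 37.8

# Speedup of each phase relative to Phase 1 (assumed baseline for both).
speedups = {"Phase 1": 1.0, "Phase 2": 5.1, "Phase 3": 8.4}

seconds_per_frame = {
    phase: round(PHASE1_SECONDS_PER_FRAME / factor, 1)
    for phase, factor in speedups.items()
}
print(seconds_per_frame)  # {'Phase 1': 37.8, 'Phase 2': 7.4, 'Phase 3': 4.5}
```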

SAM v3: Segment Anything with Concepts

Define the task

While SAM v1/v2 excel at "Segment this object" (where this is defined by a click), they struggle with "Segment all cats". SAM v3 introduces Promptable Concept Segmentation (PCS).

Goal: Segment all instances matching a specific concept.
Input:
- Text: "Yellow school bus", "Person wearing red shirt".
- Visual Exemplar: A box around one instance of the object.
Output: Masks and unique IDs for all matching objects in the image or video.

Promptable Concept Segmentation: Segmenting all instances of a concept defined by text or visual exemplar.

This bridges the gap between Generic Segmentation (SAM) and Open-Vocabulary Detection (OWL-ViT, GroundingDINO).

SAM v3 model architecture

The core innovation is "Shared Eyes, Split Brain". A single image backbone supports two distinct tasks: Detection (Category-aware, Identity-agnostic) and Tracking (Identity-aware, Category-agnostic).

SAM 3 "Shared Eyes, Split Brain" Architecture.

Remarks:

Shared Perception Encoder: Uses a high-performance vision-language backbone (Bolya 2025) that is effective for both tasks.
The "Presence Head":
- A classic problem in open-vocabulary detection is Hallucination. If you ask a model to find "dragons" in an empty room, standard detectors force their queries to find the "most dragon-like" patch, often yielding false positives.
- SAM v3 adds a Global Presence Token that first predicts: "Is the concept 'dragon' present in this image at all?".
- The final score for an object is $P(\text{Object} \mid \text{Present}) \times P(\text{Present})$. If the Presence Head says "No", all object scores drop to zero.
Unified Logic: The Detector finds objects, and the Tracker (inherited from SAM 2) propagates them through time.
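
The effect of the Presence Head on the final scores can be shown with a few lines of arithmetic. This is a minimal sketch of the scoring rule above, with made-up probabilities; `final_scores` is a hypothetical helper, not part of any SAM 3 API.

```python
def final_scores(p_present, p_object_given_present):
    """Scale each per-object score by the global presence probability."""
    return [p_present * p for p in p_object_given_present]


# Two detector queries that both look fairly "dragon-like" in isolation.
query_scores = [0.9, 0.6]

print(final_scores(0.0, query_scores))  # presence head says "No": [0.0, 0.0]
print(final_scores(1.0, query_scores))  # presence head says "Yes": [0.9, 0.6]
```

The per-query scores only have to answer "which patch is the best match, assuming the concept is present", while the single presence score answers "is it there at all". This is the separation of localization from recognition described above.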

Data Engine: SA-Co Dataset

To solve the scale and ambiguity of concepts, Meta built the SA-Co (Segment Anything with Concepts) dataset (~4M concepts). The data engine focuses heavily on AI Verification to scale up.

The SA-Co Data Engine pipeline, featuring human and AI verification stages.

Phase 1: Human Verification

Input: Noisy image-text pairs.
Action: Human annotators verify mask quality and exhaustivity (did we catch all instances?).
Outcome: ~4.3M high-quality pairs. Used to train initial models.

Phase 2: AI Verification

Innovation: Replaced humans with Llama 3.2.
Fine-tuned Llama on Phase 1 data to act as a "Verifier".
It rates masks as "Correct/Incorrect" and checks exhaustivity.
Result: Doubles the throughput of the data engine.

Phase 3: Scaling & Ontology

Expansion: Mining concepts from image alt-text and a massive Wikidata knowledge graph.
Introduction of Hard Negatives (e.g., if prompting for "Lion", ensure the model doesn't segment "Tiger").

Phase 4: Video

Extends the engine to video by sampling frames, annotating them, and propagating with SAM v2.
Prioritizes crowded scenes and motion to capture hard tracking cases.

References

SAM: arXiv:2304.02643 - Segment Anything
SAM 2: arXiv:2408.00714 - SAM 2: Segment Anything in Images and Videos
SAM 3: arXiv:2511.16719 - SAM 3: Segment Anything with Concepts
