Hierarchical dense action annotation pipeline (Chen et al., 2026 — Action100M) ported to 50 EgoSchema 3-minute egocentric clips. Each clip is segmented by V-JEPA 2 + Ward agglomerative clustering, captioned by Llama-3.2-Vision + Perception-LM, and aggregated by GPT-4o with 3-round Self-Refine. Click any card to open an interactive viewer with the source video, hierarchical timeline, and per-node annotations.
A person shapes mud into bricks using a wooden mold.
A person crafts a paper flower at a small table.
A woman packs groceries and interacts with a customer.
A person repairs a scooter in a workshop.
A person cleans the kitchen and explores the house.
A man prepares scrambled eggs in a messy kitchen.
A woman plays cards while feeding a lizard.
A person exercises and makes coffee.
A person crafts bricks using clay and sand.
A person cuts and peels dried fruits at a table.
Two men engage in card playing and note-taking at a table.
A person organizes groceries and cleans the kitchen.
A person cleans and examines books on the floor.
A person prepares and cooks a dish.
A woman cuts yellow fabric with scissors.
A person prepares and cooks a meal in a kitchen.
A man welds and smooths a metal pipe.
A person navigates a house, interacting with items and observing surroundings.
A person assembles a wooden project using glue and small blocks.
A person controls a robot vacuum while another cleans the kitchen and someone else uses a phone.
A woman irons various fabrics on an ironing board.
A person knits at a table with various items.
A person organizes clothes by taking them out of a wardrobe and placing them on a bed.
A man paints a wooden door and board yellow.
A person washes dishes in a kitchen sink.
A person cooks and prepares a meal in a cluttered kitchen.
A lab technician conducts experiments using pipettes and test tubes.
A person sews a small pouch using a sewing machine.
A person crafts a clay sculpture at a table.
Two people play a game of checkers on a wooden table.
A person washes clothes in a bathtub while intermittently watching a video on their phone.
A person washes dishes at a kitchen sink.
A woman crafts clay pots on the ground.
A person organizes and cleans books on the floor.
A person cuts and prepares cardboard for a project.
A person prepares a meal by adding milk and water to a pot and organizing kitchen items.
A person cooks and cleans in the kitchen.
A person photographs a field and interacts with a group.
A person creates a craft project at a table.
A person gardens by weeding and planting in a raised bed and pot.
A person prepares a meal in a modern kitchen.
A man works on a woodworking project in a workshop.
A person walks through a house, brushes their teeth, and exits the bathroom.
A person repots a plant using a trowel and soil bag.
A person folds a cloth and transfers items between bags.
A man sands and polishes a metal pipe using power tools.
Two people work together to complete a 1000-piece emoji jigsaw puzzle.
A person knits a purple item in a living room.
A person prepares potatoes in the kitchen.
A person cleans windows in a room.
Stage 1. V-JEPA 2 ViT-g-384 frame embeddings (window=64 frames, stride=8, resolution 384×384) → temporally contiguous Ward agglomerative clustering → 600+ tree nodes per clip.
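The temporal-contiguity constraint in Stage 1 can be sketched with scikit-learn by passing a tridiagonal connectivity matrix to Ward clustering, so windows may only merge with their immediate temporal neighbours. This is a minimal illustration with synthetic stand-in embeddings, not the actual V-JEPA 2 pipeline; the function name and embedding shapes are assumptions.

```python
import numpy as np
from scipy.sparse import diags
from sklearn.cluster import AgglomerativeClustering

def temporal_ward_clusters(embeddings: np.ndarray, n_clusters: int) -> np.ndarray:
    """Ward agglomerative clustering restricted to temporally adjacent windows.

    `embeddings` is (T, D): one vector per sliding window (stand-ins here
    for V-JEPA 2 window embeddings). The tridiagonal connectivity matrix
    allows merges only between neighbours in time, so every resulting
    cluster is a contiguous temporal segment.
    """
    t = len(embeddings)
    # 1s on the sub-/super-diagonals: window i may merge with i-1 and i+1.
    connectivity = diags([np.ones(t - 1), np.ones(t - 1)], offsets=[-1, 1])
    model = AgglomerativeClustering(
        n_clusters=n_clusters, linkage="ward", connectivity=connectivity
    )
    return model.fit_predict(embeddings)

# Two well-separated synthetic "shots": labels should switch exactly once.
rng = np.random.default_rng(0)
emb = np.concatenate([rng.normal(0, 0.1, (20, 16)), rng.normal(5, 0.1, (20, 16))])
labels = temporal_ward_clusters(emb, n_clusters=2)
print(labels)
```

In the real pipeline the full merge tree (not a fixed cut at `n_clusters`) would be retained, yielding the hierarchy of 600+ nodes per clip.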
Stage 2. Leaf nodes are captioned by Llama-3.2-11B-Vision on the midpoint frame;
internal nodes are captioned by Perception-LM-3B on 32 evenly spaced frames at 320×320.
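The frame-selection rule in Stage 2 reduces to a small function over tree nodes: a single midpoint frame for leaves, 32 evenly spaced frames for internal nodes. The `Node` dataclass and function name below are illustrative assumptions, not the project's actual data structures.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Node:
    start: int                 # first frame index covered by this node
    end: int                   # last frame index (inclusive)
    children: list = field(default_factory=list)

def frames_for_node(node: Node, n_internal: int = 32) -> list[int]:
    """Frame indices to caption for one tree node.

    Leaves get a single midpoint frame (fed to the image captioner);
    internal nodes get `n_internal` evenly spaced frames (fed to the
    multi-frame video captioner).
    """
    if not node.children:      # leaf -> midpoint frame
        return [(node.start + node.end) // 2]
    # internal -> evenly spaced frames across the node's full span
    return np.linspace(node.start, node.end, n_internal).round().astype(int).tolist()

leaf = Node(100, 163)
root = Node(0, 4499, children=[leaf])
print(frames_for_node(leaf))        # [131]
print(len(frames_for_node(root)))   # 32
```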
Stage 3. Nodes spanning ≥4 s are aggregated by gpt-4o-2024-08-06,
prompted with the global tree context plus the current subtree rendered as markdown, using 3-round Self-Refine and JSON-Schema-strict
structured outputs ({summary, action}).
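The 3-round Self-Refine loop in Stage 3 can be sketched as draft → critique → revise, repeated three times, with the final answer validated against the {summary, action} schema. The `ask(messages) -> str` interface below is a hypothetical stand-in for a chat-model call (e.g. gpt-4o with JSON-Schema-strict structured outputs); prompt wording is illustrative only.

```python
import json

SCHEMA_KEYS = {"summary", "action"}  # mirrors the pipeline's {summary, action} output

def self_refine(ask, context: str, rounds: int = 3) -> dict:
    """Self-Refine aggregation: draft, then critique + revise for `rounds` rounds.

    `context` is the aggregation prompt (global tree context plus the
    current subtree as markdown). `ask` is any callable wrapping a chat
    model; here it is injected so the loop itself stays model-agnostic.
    """
    draft = ask([{"role": "user",
                  "content": 'Summarise this subtree as JSON '
                             '{"summary": ..., "action": ...}:\n' + context}])
    for _ in range(rounds):
        critique = ask([{"role": "user",
                         "content": f"Critique this annotation:\n{draft}"}])
        draft = ask([{"role": "user",
                      "content": f"Revise the annotation using the critique.\n"
                                 f"Annotation: {draft}\nCritique: {critique}\n"
                                 f"Reply with JSON only."}])
    out = json.loads(draft)
    assert set(out) == SCHEMA_KEYS, "model output must match the schema"
    return out

# Stub model for illustration: always returns a schema-valid answer.
fake = lambda messages: '{"summary": "s", "action": "a"}'
print(self_refine(fake, "subtree markdown", rounds=3))
```

In practice the schema check would be enforced server-side via strict structured outputs rather than an assertion, but the control flow is the same.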
Full method documentation: README
Source code (private): streaming_benchmark/src/{stage1,stage2,stage3}_*.py