Scaling Video Training with Parallelism

📝 Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Weian Mao, Song Han
📅 June 3, 2026 ⏱️ 12 min read

Long-video training changes the unit of distributed computation. A short video sample can belong to one GPU. A long video sample may already be too large, too irregular, or too objective-dependent for a single worker to own.

The previous post argued that video generation is becoming an infrastructure problem. This post zooms in on one specific infrastructure question: how do we train on a single video sequence that is too long for a single GPU, without breaking the semantics of the model, the modality, or the training objective?

The answer is not simply “use more GPUs.” Data parallelism gives more samples to more workers. Tensor parallelism splits matrix dimensions. Model parallelism, including pipeline-style layer splits, splits the model itself. For long videos, the painful dimension is often inside one sample: the temporal/context sequence itself, whether represented as visual tokens or latents, plus the attention masks and loss-bearing targets that define the objective.

1. What sequence parallelism actually parallelizes

Sequence parallelism (SP) is the parallelism axis for the inside of one sample. Data parallelism splits samples; tensor and model parallelism split model computation; sequence/context parallelism splits the time or context dimension itself.

This matters once a single video becomes hundreds or thousands of frames, but the bottleneck is not always token count alone. LongVILA is the token-heavy case: 1400 frames become about 274K tokens, with training contexts up to 2M tokens. LongLive-2.0 is different: autoregressive (AR) teacher forcing represents each temporal chunk in two streams, clean history and noisy target. If generic SP shards the concatenated clean/noisy sequence, some ranks can become clean-heavy while others carry the target loss, and VAE encoding can still be replicated on every rank. Balanced SP changes the ownership unit: each rank keeps the clean/noisy pair for the same temporal chunk and VAE-encodes only its local chunk plus a left halo [1] [2].

Short video training: many samples  -> split the batch
Long video training:  one sample    -> split inside the sample

The mental model is:

Figure 1. Generated animated diagram. Data parallelism splits samples, tensor parallelism splits hidden dimensions, model/pipeline parallelism splits model depth, and sequence/context parallelism splits inside a single sample.

Parallelism	What it splits	What it solves	Why it is not enough alone for long videos
Data parallelism / FSDP / ZeRO	Batch, parameters, gradients, optimizer state	Parameter and optimizer memory; throughput across samples	A single video sample can still be too long for one rank.
Tensor parallelism	Hidden dimensions, attention heads, MLP dimensions	Large matrix operations inside a layer	The sequence activation may still be too large.
Model / pipeline parallelism	Transformer layers or other model partitions	Deep models that do not fit on one device	Each stage or partition may still see the full sequence.
Sequence / context parallelism	Sequence, context, or time dimension	Long sequences whose activations and attention do not fit on one device	The shard must match modality, objective, masks, and hardware.

In this post, SP means the broad form of sequence/context parallelism: partitioning the context, time, or token dimension of one sample across ranks. For long video, that partition must also respect modality, objective, masks, and hardware topology [4] [9].
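As a minimal sketch of what "partitioning the token dimension of one sample across ranks" means, the helper below computes one contiguous token range per SP rank. The name `shard_ranges` and the near-equal contiguous split are illustrative assumptions; real systems may pad, reorder, or balance by cost rather than by token count.

```python
from typing import List, Tuple


def shard_ranges(seq_len: int, sp_size: int) -> List[Tuple[int, int]]:
    """Return one half-open [start, end) token range per SP rank.

    Contiguous, near-equal split: the first `seq_len % sp_size` ranks
    receive one extra token. (Illustrative assumption, not a library API.)
    """
    if sp_size <= 0:
        raise ValueError("sp_size must be positive")
    base, extra = divmod(seq_len, sp_size)
    ranges = []
    start = 0
    for rank in range(sp_size):
        length = base + (1 if rank < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


# One long sample of about 274K tokens split across 8 SP ranks.
ranges = shard_ranges(274_000, 8)
for rank, (start, end) in enumerate(ranges):
    print(f"rank {rank}: tokens [{start}, {end})")
```

Only the index arithmetic is shown; which tensors follow these ranges, and how attention communicates across them, is what the systems in the next section differ on.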

2. A short map of SP systems

Before looking at video, it helps to place LongVILA and LongLive-2.0 on the SP map.

Sequence Parallelism from a system perspective. Li et al. framed sequence parallelism as a way to break the input sequence length limitation by splitting a long sequence into chunks and distributing those chunks across devices [3].

Megatron sequence parallelism and context parallelism. Megatron-style SP reduces activation memory and interacts naturally with tensor parallelism. Megatron Core’s later context parallelism generalizes the idea by partitioning the sequence dimension for network inputs and activations [4] [9].

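DeepSpeed-Ulysses. Ulysses partitions input data along the sequence dimension and uses all-to-all communication during attention, so that each rank attends over the full sequence for a subset of attention heads. This is efficient when the number of attention heads is divisible by the SP degree, which also caps the SP degree at the head count [5].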
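The layout change behind Ulysses-style all-to-all can be sketched with plain bookkeeping. The function below is not the DeepSpeed API; `ulysses_layouts` and the `Cell` type are hypothetical names, and the sketch assumes both the sequence length and the head count divide evenly by the SP size.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (token index, head index); hypothetical bookkeeping type


def ulysses_layouts(seq_len: int, num_heads: int, sp_size: int):
    """Simulate which (token, head) cells each rank owns before and after
    the all-to-all that precedes attention. Pure bookkeeping, no tensors."""
    if seq_len % sp_size or num_heads % sp_size:
        raise ValueError("sketch assumes seq_len and num_heads divide evenly by sp_size")
    tokens_per_rank = seq_len // sp_size
    heads_per_rank = num_heads // sp_size

    # Before: rank r owns a contiguous token shard, for every head.
    before: List[List[Cell]] = [
        [(t, h)
         for t in range(r * tokens_per_rank, (r + 1) * tokens_per_rank)
         for h in range(num_heads)]
        for r in range(sp_size)
    ]

    # All-to-all: every rank sends the cells of head group d to rank d.
    after: List[List[Cell]] = [[] for _ in range(sp_size)]
    for src in range(sp_size):
        for token, head in before[src]:
            after[head // heads_per_rank].append((token, head))
    return before, after


before, after = ulysses_layouts(seq_len=8, num_heads=4, sp_size=2)
rank0_tokens = sorted({t for t, _ in after[0]})  # every token in the sequence
rank0_heads = sorted({h for _, h in after[0]})   # only rank 0's head group
print(rank0_tokens, rank0_heads)
```

After the exchange, each rank can run ordinary attention over the whole sequence for its own heads, which is why the head count bounds the usable SP degree in this style.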

Ring Attention. Ring Attention uses blockwise attention and ring communication of key-value blocks, letting devices stream KV chunks while computing local attention [6]. Because each device holds only its own query block and one KV block at a time, the feasible context length grows with the number of devices.
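The ring pattern can be checked on a toy example: each rank keeps its query block, KV blocks rotate around the ring, and a running (online) softmax merges the partial results. The sketch below is a single-process simulation in plain Python, non-causal and without the usual 1/sqrt(d) scaling, and every function name in it is an illustrative assumption rather than code from the Ring Attention paper.

```python
import math
from typing import List

Vec = List[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def full_attention(q: List[Vec], k: List[Vec], v: List[Vec]) -> List[Vec]:
    """Reference: softmax(q k^T) v computed on one device."""
    out = []
    for qi in q:
        scores = [dot(qi, kj) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / norm
                    for c in range(len(v[0]))])
    return out


def ring_attention(q_blocks, k_blocks, v_blocks):
    """Each rank keeps its Q block; KV blocks visit every rank once."""
    p = len(q_blocks)
    dim = len(v_blocks[0][0])
    run_max = [[-math.inf] * len(qb) for qb in q_blocks]
    run_den = [[0.0] * len(qb) for qb in q_blocks]
    run_num = [[[0.0] * dim for _ in qb] for qb in q_blocks]
    for step in range(p):
        for rank in range(p):
            src = (rank - step) % p  # KV block resident on this rank at this step
            for i, qi in enumerate(q_blocks[rank]):
                for kj, vj in zip(k_blocks[src], v_blocks[src]):
                    s = dot(qi, kj)
                    new_max = max(run_max[rank][i], s)
                    rescale = math.exp(run_max[rank][i] - new_max)
                    w = math.exp(s - new_max)
                    run_den[rank][i] = run_den[rank][i] * rescale + w
                    run_num[rank][i] = [a * rescale + w * x
                                        for a, x in zip(run_num[rank][i], vj)]
                    run_max[rank][i] = new_max
    return [[[a / run_den[r][i] for a in run_num[r][i]]
             for i in range(len(q_blocks[r]))] for r in range(p)]


def split_blocks(rows, parts):
    n = len(rows)
    return [rows[r * n // parts:(r + 1) * n // parts] for r in range(parts)]


tokens = 8
q = [[math.sin(i), math.cos(2 * i)] for i in range(tokens)]
k = [[math.cos(i), math.sin(3 * i)] for i in range(tokens)]
v = [[float(i), float(i % 3)] for i in range(tokens)]
ring = ring_attention(split_blocks(q, 4), split_blocks(k, 4), split_blocks(v, 4))
flat = [row for block in ring for row in block]
reference = full_attention(q, k, v)
max_err = max(abs(a - b) for r1, r2 in zip(flat, reference) for a, b in zip(r1, r2))
print(f"max abs difference vs. full attention: {max_err:.2e}")
```

The point of the check is that blockwise streaming changes where KV lives at each step, not the attention result.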

USP and LoongTrain. USP unifies Ulysses-style and Ring-style approaches into a broader sequence-parallel design space [7]. LoongTrain pushes the same direction toward 2D-Attention and head-context parallelism for long-sequence LLM training [8].

The map gives us vocabulary, not the final answer. These systems answer how to distribute long transformer sequences. Long videos add another layer: the sequence is produced by a multimodal pipeline or by a structured generation objective.

3. SP for long-video understanding

LongVILA is a long-context VLM system for long-video understanding. Algorithmically, it extends the VILA training recipe with context extension and long-video supervised fine-tuning. System-wise, its key idea is Multi-Modal Sequence Parallelism, or MM-SP [1].

The headline numbers show the regime: LongVILA extends VILA from 8 video frames to 2048 frames and reports 99.8% accuracy in a 6000-frame needle-in-a-haystack evaluation, where the video can exceed 1M tokens [1].

Figure 2. Generated animated diagram based on LongVILA. MM-SP first balances the image/frame workload and then balances token workload; its 2D-Attention communication uses intra-node All-to-All and inter-node P2P. Source: LongVILA [1].

Why text-only SP is not enough

Ring-style or text-centric SP can shard a token sequence. But in a VLM, the model does not begin with a clean token sequence. It begins with frames/images and text, then uses the vision encoder to manufacture visual tokens. If the system only shards after this point, the vision tower can remain imbalanced.

LongVILA’s MM-SP therefore uses a two-stage sharding strategy:

Stage 1: shard by images or frames. Frames are distributed across SP ranks to balance the vision tower workload.
Stage 2: shard by tokens. After visual embeddings and text are assembled, the resulting sequence is balanced across ranks for the LLM.

This is a small but important shift in perspective. The SP boundary moves earlier in the pipeline. The system does not wait until the LLM sees a long token sequence; it starts balancing from the moment video becomes visual work.
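The two stages can be sketched as a small planning function. This is an illustration of the idea, not LongVILA's implementation: `two_stage_plan`, the 196 tokens per frame, and the 1000 text tokens are hypothetical values chosen only to make the imbalance visible.

```python
from typing import Dict, List


def even_split(total: int, parts: int) -> List[int]:
    """Split `total` items into `parts` near-equal counts."""
    base, extra = divmod(total, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]


def two_stage_plan(num_frames: int, tokens_per_frame: int,
                   num_text_tokens: int, sp_size: int) -> Dict[str, List[int]]:
    # Stage 1: balance vision-encoder work by giving each rank a share of frames.
    frames_per_rank = even_split(num_frames, sp_size)

    # If tokens simply stayed where they were produced (text on the last rank,
    # an assumption for this sketch), the LLM workload would be skewed.
    tokens_if_not_rebalanced = [f * tokens_per_frame for f in frames_per_rank]
    tokens_if_not_rebalanced[-1] += num_text_tokens

    # Stage 2: rebalance the assembled visual + text sequence by token count.
    total_tokens = num_frames * tokens_per_frame + num_text_tokens
    tokens_per_rank = even_split(total_tokens, sp_size)

    return {
        "frames_per_rank": frames_per_rank,
        "tokens_if_not_rebalanced": tokens_if_not_rebalanced,
        "tokens_per_rank": tokens_per_rank,
    }


plan = two_stage_plan(num_frames=16, tokens_per_frame=196,
                      num_text_tokens=1000, sp_size=4)
for name, counts in plan.items():
    print(f"{name}: {counts}")
```

The two stages balance different costs: frame counts drive the vision tower, while token counts drive the LLM, so a split that is even for one is not automatically even for the other.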

Communication should match the hardware topology

LongVILA also shows that SP is not only about slicing tensors. It is about choosing a communication pattern that matches the machine. The paper contrasts Ring-style SP, which relies on point-to-point communication, with MM-SP’s 2D-Attention design: intra-node All-to-All uses fast NVLink bandwidth, while inter-node P2P handles the slower cross-node path [1].
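One way to picture a topology-aware layout is as two families of process groups: one group per node for All-to-All, and one cross-node ring per local GPU index for P2P. The sketch below only computes rank lists; the node-major rank numbering and the names `build_2d_groups` and `ring_next` are assumptions for illustration, not details taken from the paper.

```python
from typing import List, Tuple


def build_2d_groups(num_nodes: int, gpus_per_node: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (all_to_all_groups, p2p_ring_groups), assuming node-major ranks."""
    # Intra-node groups: ranks that share a node and its fast interconnect.
    a2a_groups = [
        [node * gpus_per_node + local for local in range(gpus_per_node)]
        for node in range(num_nodes)
    ]
    # Inter-node rings: ranks with the same local index on different nodes.
    ring_groups = [
        [node * gpus_per_node + local for node in range(num_nodes)]
        for local in range(gpus_per_node)
    ]
    return a2a_groups, ring_groups


def ring_next(rank: int, ring: List[int]) -> int:
    """Rank that receives this rank's KV block in the next ring step."""
    return ring[(ring.index(rank) + 1) % len(ring)]


a2a_groups, ring_groups = build_2d_groups(num_nodes=2, gpus_per_node=4)
print("all-to-all groups:", a2a_groups)
print("P2P ring groups:  ", ring_groups)
```

In a real job these rank lists would be handed to the communication library to create the corresponding groups; the design choice is simply that the bandwidth-hungry collective stays inside a node.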

LongVILA takeaway: for long-video understanding, SP has to become multi-modal SP. The system must know where visual tokens come from, not only where transformer tokens go.

SP for reinforcement learning. LongVILA-R1 extends the same SP idea to reinforcement learning, where one long video is reused across many rollouts plus policy/reference-model prefilling. Its Multi-modal Reinforcement Sequence Parallelism (MR-SP) first shards video frames across GPUs for rollout-time vision encoding, all-gathers and caches the resulting video embeddings for reuse, and then applies sequence-parallel prefilling to the long multimodal prefix for both policy and reference models. The key idea is still to move SP before the LLM, but the ownership target becomes the RL loop: frame encoding, cached embeddings, and prefix tokens are distributed so repeated rollouts do not repeatedly encode the full video. The paper reports up to 2.1x speedup on 512-frame RL training and scales to 1024 frames without OOM on a single 8xA100 node [10].
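The encode-once, reuse-many pattern described above can be sketched with a toy cache. This is not the LongVILA-R1 code: `ShardedEmbeddingCache` is a hypothetical class, strings stand in for frames and embeddings, and a Python loop stands in for the SP ranks and the all-gather.

```python
from typing import Dict, List


class ShardedEmbeddingCache:
    """Toy model of sharded vision encoding with an all-gathered cache."""

    def __init__(self, sp_size: int) -> None:
        self.sp_size = sp_size
        self.cache: Dict[str, List[str]] = {}
        self.frames_encoded_per_rank = [0] * sp_size

    def _encode_shard(self, rank: int, frames: List[str]) -> List[str]:
        # Count the work so the effect of caching is observable.
        self.frames_encoded_per_rank[rank] += len(frames)
        return [f"emb({frame})" for frame in frames]

    def get(self, video_id: str, frames: List[str]) -> List[str]:
        if video_id not in self.cache:
            gathered: List[str] = []
            n = len(frames)
            for rank in range(self.sp_size):
                shard = frames[rank * n // self.sp_size:(rank + 1) * n // self.sp_size]
                gathered.extend(self._encode_shard(rank, shard))  # stands in for all-gather
            self.cache[video_id] = gathered
        return self.cache[video_id]


frames = [f"f{i}" for i in range(512)]
cache = ShardedEmbeddingCache(sp_size=8)
for _ in range(8):  # eight rollouts reuse the same video
    embeddings = cache.get("video-0", frames)
print(cache.frames_encoded_per_rank)
```

In this toy setup each rank encodes its 64-frame shard once, however many rollouts follow; the prefix prefilling for the policy and reference models is outside the sketch.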

4. SP for long-video generation

LongLive-2.0 attacks a different problem: long-video generation infrastructure. The full system includes NVFP4 training and inference, KV-cache compression, parallel dequantization, and asynchronous VAE decoding. For this post, the key training-side idea is Balanced SP [2].

The difference from LongVILA is important. In understanding, the bottleneck is how frames/images become visual tokens and then a long VLM sequence. In LongLive-2.0, the bottleneck comes from AR teacher forcing: the logical sequence has clean-history and noisy-target streams. After Ulysses All-to-All, the implementation can construct the teacher-forcing mask in the resulting attention order, but it still needs the clean/noisy temporal identity to be preserved. Balanced SP is therefore not about manually pairing mask entries; it is about choosing an ownership unit where each rank keeps matched clean/noisy chunks and a balanced share of loss-bearing target work.

Figure 3a. Generated animated diagram based on LongLive-2.0. In traditional SP, VAE preparation is centralized and sharding over the concatenated clean/noisy sequence can leave loss-bearing target tokens on only a few ranks. Source: LongLive-2.0 [2].

Figure 3b. Generated animated diagram based on LongLive-2.0. Balanced SP assigns each GPU a temporal clean/noisy pair, so local VAE encoding, teacher-forcing mask construction in Ulysses order, and loss-bearing targets are distributed across ranks. Source: LongLive-2.0 [2].

The naive layout: clean-only ranks, target-heavy ranks

The efficient teacher-forcing formulation builds a sequence like:

[ clean history latents ; noisy target latents ]

If ordinary SP slices this concatenated sequence without understanding the AR objective, some ranks may contain mostly clean context while others contain loss-bearing noisy target tokens. The sequence is partitioned, but the training work is not balanced.

The Balanced SP answer: paired chunks on each rank

Balanced SP changes the data layout. Each SP rank locally constructs the clean latents and noisy latents from the same temporal chunk. That gives every rank both context tokens and loss-bearing target tokens. The teacher-forcing mask is then constructed in the Ulysses attention order from those clean/noisy identities, without materializing a separate global permutation [2].
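To see why a mask can be built from clean/noisy identities alone, consider the sketch below. The visibility rule is an assumption for illustration (a common block-causal teacher-forcing rule), not the mask specified by LongLive-2.0: a clean chunk sees clean chunks up to and including itself, a noisy chunk sees strictly earlier clean chunks plus itself, and nothing else attends to noisy tokens. The names `can_attend` and `build_mask` are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (stream, temporal chunk index)


def can_attend(query: Token, key: Token) -> bool:
    """Assumed block-causal teacher-forcing rule (illustrative only)."""
    q_stream, q_chunk = query
    k_stream, k_chunk = key
    if k_stream == "clean":
        # Clean queries see clean history including their own chunk;
        # noisy queries see only strictly earlier clean chunks.
        return k_chunk <= q_chunk if q_stream == "clean" else k_chunk < q_chunk
    # Noisy keys are visible only to the noisy query of the same chunk.
    return q_stream == "noisy" and k_chunk == q_chunk


def build_mask(order: List[Token]) -> List[List[bool]]:
    """Build the mask directly in whatever order the tokens arrive."""
    return [[can_attend(q, k) for k in order] for q in order]


num_chunks = 3
# Order after gathering paired chunks rank by rank: clean0, noisy0, clean1, ...
interleaved = [(s, c) for c in range(num_chunks) for s in ("clean", "noisy")]
# Logical order of the concatenated formulation: all clean, then all noisy.
concatenated = ([("clean", c) for c in range(num_chunks)]
                + [("noisy", c) for c in range(num_chunks)])

mask_interleaved = build_mask(interleaved)
mask_concatenated = build_mask(concatenated)
for token, row in zip(interleaved, mask_interleaved):
    print(token, ["x" if visible else "." for visible in row])
```

Because visibility depends only on each token's (stream, chunk) identity, the mask built in the interleaved order is the concatenated mask with rows and columns relabeled, so no separate global permutation has to be materialized.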

Traditional SP:
GPU0: clean z0
GPU1: clean z1
GPU2: clean z2
GPU3: noisy z3 + loss

Balanced SP:
GPU0: clean z0 + noisy z0 + local loss
GPU1: clean z1 + noisy z1 + local loss
GPU2: clean z2 + noisy z2 + local loss
GPU3: clean z3 + noisy z3 + local loss
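The imbalance in the two layouts above can be counted directly. The sketch below is a toy model with hypothetical function names; it assumes equal-sized chunks, a contiguous split for traditional SP, and a chunk count divisible by the SP size for Balanced SP.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (stream, temporal chunk index)


def count_loss_tokens(shards: List[List[Token]]) -> List[int]:
    """Loss-bearing (noisy target) tokens held by each rank."""
    return [sum(1 for stream, _ in shard if stream == "noisy") for shard in shards]


def traditional_shards(num_chunks: int, tokens_per_chunk: int, sp_size: int) -> List[List[Token]]:
    """Contiguous split of the concatenated [clean ; noisy] sequence."""
    sequence: List[Token] = (
        [("clean", c) for c in range(num_chunks) for _ in range(tokens_per_chunk)]
        + [("noisy", c) for c in range(num_chunks) for _ in range(tokens_per_chunk)]
    )
    n = len(sequence)
    return [sequence[r * n // sp_size:(r + 1) * n // sp_size] for r in range(sp_size)]


def balanced_shards(num_chunks: int, tokens_per_chunk: int, sp_size: int) -> List[List[Token]]:
    """Each rank owns the clean and noisy tokens of its own temporal chunks."""
    if num_chunks % sp_size:
        raise ValueError("sketch assumes num_chunks divides evenly by sp_size")
    per_rank = num_chunks // sp_size
    shards = []
    for r in range(sp_size):
        shard: List[Token] = []
        for c in range(r * per_rank, (r + 1) * per_rank):
            shard += [("clean", c)] * tokens_per_chunk
            shard += [("noisy", c)] * tokens_per_chunk
        shards.append(shard)
    return shards


traditional = count_loss_tokens(traditional_shards(4, 2, 4))
balanced = count_loss_tokens(balanced_shards(4, 2, 4))
print("traditional SP loss tokens per rank:", traditional)
print("balanced SP loss tokens per rank:   ", balanced)
```

Both layouts give every rank the same number of tokens; only the balanced one gives every rank the same amount of loss-bearing work.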

SP starts before the transformer

The most interesting detail is that Balanced SP reaches before the DiT. Each rank VAE-encodes only its local raw-video chunk plus a left halo covering the VAE temporal receptive field, then discards the halo latent and keeps the local latent chunk [2].

This is a video-specific lesson. If the transformer is sharded but the VAE pipeline is replicated, the system has not actually made long-video training scale. SP must begin where the expensive sequence is created.
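The local-chunk-plus-left-halo idea reduces to index arithmetic. In the sketch below, `halo_plan` and `ChunkPlan` are hypothetical names, the halo of 3 frames is an arbitrary placeholder for the VAE's temporal receptive field, and the plan is expressed in raw frames without modelling temporal compression.

```python
from typing import List, NamedTuple


class ChunkPlan(NamedTuple):
    read_start: int  # first raw frame the rank loads (includes the left halo)
    keep_start: int  # first frame whose latent the rank keeps
    keep_end: int    # one past the last frame the rank keeps


def halo_plan(num_frames: int, sp_size: int, halo: int) -> List[ChunkPlan]:
    """Per-rank frame ranges: encode [read_start, keep_end), keep [keep_start, keep_end)."""
    plans = []
    for rank in range(sp_size):
        keep_start = rank * num_frames // sp_size
        keep_end = (rank + 1) * num_frames // sp_size
        # The first rank has no history, so its halo is clipped at frame 0.
        plans.append(ChunkPlan(max(0, keep_start - halo), keep_start, keep_end))
    return plans


plans = halo_plan(num_frames=64, sp_size=4, halo=3)
frames_encoded = [p.keep_end - p.read_start for p in plans]
for rank, p in enumerate(plans):
    print(f"rank {rank}: encode [{p.read_start}, {p.keep_end}), "
          f"keep [{p.keep_start}, {p.keep_end})")
```

In this example each rank encodes at most 19 frames instead of all 64, and the halo frames exist only to give the encoder the left context it needs before being discarded.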

The performance lesson

LongLive-2.0 reports that NVFP4 plus Balanced SP is the fastest training configuration: for 16s, 32s, and 64s videos, iteration time is 40.1s, 119.3s, and 639.5s, giving 1.3×, 1.4×, and 2.1× speedups over the BF16+SP baseline. The paper also reports up to 2.15× training speedup, 1.84× inference speedup, and 45.7 FPS inference [2].

LongLive-2.0 takeaway: for long-video generation, the right SP unit is not just tokens. It is a temporal chunk responsible for clean history, noisy target, local VAE encoding with a left halo, Ulysses-order teacher-forcing mask construction, and loss-bearing targets.

5. Differences between understanding and generation systems

LongVILA and LongLive-2.0 both split inside a sample, but they assign ownership to different semantic units: multimodal tokens for understanding, and clean/noisy latent chunks for generation.

Dimension	LongVILA	LongLive-2.0
Task	Long-video understanding / VLM	Long-video generation / AR diffusion infrastructure
Sequence unit	Visual tokens plus text tokens	Video latent chunks
Main bottleneck	Vision encoder workload, LLM context length, attention communication	Clean/noisy layout, loss-bearing target imbalance, VAE latent preparation, DiT activation
Why naive SP fails	Text-only token sharding ignores where visual tokens come from	Concatenated clean/noisy sharding can leave some ranks target-heavy and others mostly clean-only
Core design	MM-SP: shard by frames/images, then by tokens	Balanced SP: each rank locally constructs matched clean/noisy latents from one temporal chunk
General lesson	Modality-aware ownership	Objective-aware temporal ownership

Figure 4. Generated animated diagram. The shared SP principle is to split inside a sample; the video-specific question is which semantic unit becomes the owner of a shard.

The common abstraction is semantic ownership. A rank should not merely own a contiguous slice of a tensor. It should own a slice that makes the upstream encoder, the attention communication, the loss, and the hardware topology behave well together.

6. Design principles for long-video training systems

Principle 1: Shard the real bottleneck

Do not split the easiest tensor; split the work that actually limits scale. For LongVILA, that includes frame/image encoding. For LongLive-2.0, it includes VAE preparation and loss-bearing target distribution.

Principle 2: Preserve objective semantics

SP should not change temporal order, position identity, attention visibility, loss masks, or target ownership. The distributed layout must remain equivalent to the unsharded objective.
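A quick equivalence check makes this principle concrete for the loss. The sketch assumes the unsharded objective is a mean over loss-bearing tokens; under that assumption, ranks should combine sums and counts rather than average their local means. All names are illustrative, and plain Python sums stand in for all-reduce operations.

```python
from typing import List


def reference_loss(per_token_losses: List[float]) -> float:
    """Unsharded objective: mean over all loss-bearing tokens."""
    return sum(per_token_losses) / len(per_token_losses)


def sharded_loss(per_rank_losses: List[List[float]]) -> float:
    """Combine per-rank sums and counts (stand-ins for two sum all-reduces)."""
    total = sum(sum(losses) for losses in per_rank_losses)
    count = sum(len(losses) for losses in per_rank_losses)
    return total / count


def mean_of_rank_means(per_rank_losses: List[List[float]]) -> float:
    """Tempting shortcut that is wrong when ranks hold unequal token counts."""
    means = [sum(losses) / len(losses) if losses else 0.0 for losses in per_rank_losses]
    return sum(means) / len(means)


# Three ranks with unequal numbers of loss-bearing tokens; one has none.
per_rank = [[1.0, 2.0, 3.0], [4.0], []]
flat = [x for losses in per_rank for x in losses]

print("reference:         ", reference_loss(flat))
print("sum/count sharded: ", sharded_loss(per_rank))
print("mean of rank means:", mean_of_rank_means(per_rank))
```

The same style of test applies to the other invariants in this principle: compare the sharded layout against a small unsharded run before trusting it at scale.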

Principle 3: Match the hardware topology

Ring, Ulysses, 2D-Attention, USP, and LoongTrain differ mainly in how they communicate. A good video system chooses All-to-All, P2P, intra-node traffic, and inter-node traffic deliberately.

Principle 4: Start before the transformer

Video sequence construction begins before attention: frame loading, vision encoding, VAE encoding with chunk halos, latent chunking, and mask construction. If SP starts only inside transformer blocks, imbalance may already be baked in.

Principle 5: Check what each rank owns

Across samples?                         -> DP / FSDP / ZeRO
Inside one long sample?                 -> SP / context parallelism
Mostly text and enough attention heads? -> Ulysses-style SP
Need many nodes or beyond head limits?  -> Ring / USP / 2D-Attention / LoongTrain-style SP
Heavy multimodal encoder work?          -> MM-SP-style two-stage sharding
Clean/noisy AR video streams?           -> Balanced-SP-style temporal ownership

The final diagnostic is simple: after sharding, every rank should have meaningful work. If one rank owns the loss, every rank re-encodes the same video, or communication ignores the hardware topology, the layout is probably wrong.
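Two of these checks are easy to automate from per-rank statistics. The sketch below is a hypothetical audit helper, not part of either system: `RankReport` and `audit_layout` are invented names, and the communication-topology check is left out because it depends on the cluster.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RankReport:
    rank: int
    loss_tokens: int                 # loss-bearing tokens owned by this rank
    encoded_frames: Tuple[int, int]  # half-open [start, end) of frames it encoded


def audit_layout(reports: List[RankReport], num_frames: int) -> List[str]:
    """Return human-readable warnings for an SP layout; empty means no flags."""
    warnings: List[str] = []
    idle = [r.rank for r in reports if r.loss_tokens == 0]
    if idle:
        warnings.append(f"ranks {idle} own no loss-bearing tokens")
    if len(reports) > 1 and all(r.encoded_frames == (0, num_frames) for r in reports):
        warnings.append("every rank encodes the full video")
    return warnings


# A skewed layout: three ranks hold no targets and all ranks encode everything.
skewed = [RankReport(r, 0 if r < 3 else 4096, (0, 64)) for r in range(4)]
# A balanced layout: equal targets, local chunk plus a 3-frame left halo.
balanced = [RankReport(r, 1024, (max(0, r * 16 - 3), (r + 1) * 16)) for r in range(4)]

print(audit_layout(skewed, num_frames=64))
print(audit_layout(balanced, num_frames=64))
```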

Closing: The timeline is the new batch dimension

Long-video training breaks a basic assumption: one sample does not necessarily belong to one GPU. Once a sample becomes hundreds or thousands of frames, the system has to distribute work inside the sample itself.

This is where a generic “split the sequence” SP story stops being enough. LongVILA cannot treat the input as only a long token list; the split has to respect how frames/images become visual tokens. LongLive-2.0 cannot simply shard the concatenated clean/noisy sequence; the split has to preserve each temporal chunk's clean history, noisy target, left halo for VAE encoding, and loss-bearing tokens.

In long-video training, the timeline is the new batch dimension. Sequence parallelism is how we scale it.

That is why SP belongs in the infrastructure stack for long video. A beautiful long-video demo proves capability. A well-designed SP system makes that capability trainable.

References

[1] LongVILA: Scaling Long-Context Visual Language Models for Long Videos. Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han. arXiv preprint, 2024. arXiv:2408.10188
[2] LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation. Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, Song Han. arXiv preprint, 2026. arXiv:2605.18739
[3] Sequence Parallelism: Long Sequence Training from System Perspective. Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You. Annual Meeting of the Association for Computational Linguistics (ACL), 2023 / arXiv preprint, 2021. arXiv:2105.13120
[4] Reducing Activation Recomputation in Large Transformer Models. Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro. MLSys, 2023 / arXiv preprint, 2022. arXiv:2205.05198
[5] DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He. arXiv preprint, 2023. arXiv:2309.14509
[6] Ring Attention with Blockwise Transformers for Near-Infinite Context. Hao Liu, Matei Zaharia, Pieter Abbeel. arXiv preprint, 2023. arXiv:2310.01889
[7] USP: A Unified Sequence Parallelism Approach for Long Context Generative AI. Jiarui Fang, Shangchun Zhao. arXiv preprint, 2024. arXiv:2405.07719
[8] LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism. Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, Yonggang Wen, Tianwei Zhang, Xin Jin, Xuanzhe Liu. arXiv preprint, 2024. arXiv:2406.18485
[9] Context Parallelism. NVIDIA Megatron Core Documentation. NVIDIA Docs
[10] Scaling RL to Long Videos. Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han. NeurIPS, 2025 / arXiv preprint, 2025. arXiv:2507.07966