Scaling Video Training with Parallelism
Long-video training changes the unit of distributed computation. A short video sample can belong to one GPU. A long video sample may already be too large, too irregular, or too objective-dependent for a single worker to own.
The previous post argued that video generation is becoming an infrastructure problem. This post zooms in on one specific infrastructure question: how do we train on a single video sequence that is too long for a single GPU, without breaking the semantics of the model, the modality, or the training objective?
The answer is not simply “use more GPUs.” Data parallelism gives more samples to more workers. Tensor parallelism splits matrix dimensions. Model parallelism, including pipeline-style layer splits, splits the model itself. For long videos, the painful dimension is often inside one sample: time, context length, visual tokens, latent chunks, masks, and loss-bearing targets.
Two recent systems make the idea concrete. LongVILA scales long-video understanding with Multi-Modal Sequence Parallelism (MM-SP), where the system first balances vision work over frames/images and then balances tokens for the LLM [1]. LongLive-2.0 scales long-video generation with Balanced SP, where the system assigns each GPU a temporally meaningful chunk that owns clean history, noisy target, VAE work, attention mask, and loss [2].
1. What sequence parallelism actually parallelizes
SP is the parallelism axis for the inside of one sample. Data parallelism splits samples; tensor and model parallelism split model computation. Sequence/context parallelism splits the time or context dimension itself.
This matters once a single video becomes hundreds or thousands of frames. LongVILA reports 1400-frame sequences at about 274K tokens and 2M-context training; LongLive-2.0 adds clean/noisy latents, VAE halos, masks, and target loss to the same long-sample problem [1] [2].
Short video training: many samples -> split the batch
Long video training: one sample -> split inside the sample
The mental model is:
| Parallelism | What it splits | What it solves | Why it is not enough alone for long videos |
|---|---|---|---|
| Data parallelism / FSDP / ZeRO | Batch, parameters, gradients, optimizer state | Parameter and optimizer memory; throughput across samples | A single video sample can still be too long for one rank. |
| Tensor parallelism | Hidden dimensions, attention heads, MLP dimensions | Large matrix operations inside a layer | The sequence activation may still be too large. |
| Model / pipeline parallelism | Transformer layers or other model partitions | Deep models that do not fit on one device | Each stage or partition may still see the full sequence. |
| Sequence / context parallelism | Sequence, context, or time dimension | Long sequences whose activations and attention do not fit on one device | The shard must match modality, objective, masks, and hardware. |
In this post, SP means the broad form of sequence/context parallelism: partitioning the context, time, or token dimension of one sample across ranks. For long video, that partition must also respect modality, objective, masks, and hardware topology [4] [9].
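The contrast between splitting the batch and splitting inside one sample can be sketched in a few lines. This is a toy illustration only: the helper names (`split_evenly`, `data_parallel_shards`, `sequence_parallel_shards`) and the frame-id strings are hypothetical, and no real framework is involved.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(items: Sequence[T], parts: int) -> List[List[T]]:
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


def data_parallel_shards(batch, world_size):
    """DP: every rank receives whole samples."""
    return split_evenly(batch, world_size)


def sequence_parallel_shards(sample, world_size):
    """SP: every rank receives a contiguous slice of ONE sample."""
    return split_evenly(sample, world_size)


# Toy batch: 4 videos, each a list of 8 frame ids.
batch = [[f"v{v}_f{t}" for t in range(8)] for v in range(4)]

dp = data_parallel_shards(batch, world_size=4)        # rank 0 owns all of video 0
sp = sequence_parallel_shards(batch[0], world_size=4)  # rank 0 owns frames 0-1 of video 0

print(len(dp[0][0]), sp[0])
```

Under DP, rank 0 still has to hold every frame of its video; under SP it holds only its slice, which is the property long samples need.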
2. A short map of SP systems
Before looking at video, it helps to place LongVILA and LongLive-2.0 on the SP map.
Sequence Parallelism from a system perspective. Li et al. framed sequence parallelism as a way to break the input sequence length limitation by splitting a long sequence into chunks and distributing those chunks across devices [3].
Megatron sequence parallelism and context parallelism. Megatron-style SP shards the LayerNorm and dropout activations along the sequence dimension across the tensor-parallel group, removing activation memory that tensor parallelism alone would replicate. Megatron Core’s later context parallelism generalizes the idea by partitioning the sequence dimension for network inputs and activations [4] [9].
DeepSpeed-Ulysses. Ulysses partitions input data along the sequence dimension and uses all-to-all communication during attention, so that each rank attends over the full sequence for a subset of heads. This is efficient when the number of attention heads is divisible by the SP degree, which also caps how far the approach can scale [5].
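The Ulysses layout change can be illustrated without any communication library by tracking which (token, head) entries each rank holds before and after the all-to-all. This is a bookkeeping sketch under simplifying assumptions (even divisibility, one sample); the function names are hypothetical and it does not use the DeepSpeed API.

```python
from typing import List, Set, Tuple

Entry = Tuple[int, int]  # (token index, head index)


def sequence_sharded(seq_len: int, num_heads: int, world_size: int) -> List[Set[Entry]]:
    """Before attention: rank r holds a contiguous token slice for ALL heads."""
    assert seq_len % world_size == 0
    per_rank = seq_len // world_size
    return [
        {(t, h) for t in range(r * per_rank, (r + 1) * per_rank) for h in range(num_heads)}
        for r in range(world_size)
    ]


def all_to_all_to_head_sharded(
    shards: List[Set[Entry]], num_heads: int, world_size: int
) -> List[Set[Entry]]:
    """Simulated all-to-all: afterwards rank r holds ALL tokens for its slice of heads."""
    if num_heads % world_size != 0:
        raise ValueError("head count must be divisible by the SP degree")
    heads_per_rank = num_heads // world_size
    out: List[Set[Entry]] = [set() for _ in range(world_size)]
    for src in range(world_size):
        for (t, h) in shards[src]:
            out[h // heads_per_rank].add((t, h))  # destination is decided by the head
    return out


before = sequence_sharded(seq_len=16, num_heads=8, world_size=4)
after = all_to_all_to_head_sharded(before, num_heads=8, world_size=4)

# Rank 0: tokens 0-3 with 8 heads before, tokens 0-15 with heads 0-1 after.
print(len(before[0]), len(after[0]))
```

The per-rank volume stays the same while the sharded axis changes from tokens to heads, which is why attention can then run locally over the full sequence.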
Ring Attention. Ring Attention uses blockwise attention and ring communication of key-value blocks, letting devices stream KV chunks while computing local attention [6]. Because each device holds only its own query block and a small number of KV blocks at any time, the reachable context length grows roughly in proportion to the number of devices.
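The blockwise accumulation behind ring-style attention can be sketched in plain Python. This is a single-process simulation, not the paper's implementation: ranks are looped over sequentially instead of running concurrently, attention is non-causal, and the names `ring_attention`, `full_attention`, and the toy inputs are hypothetical.

```python
import math
from typing import List

Vec = List[float]


def score(q: Vec, k: Vec) -> float:
    """Scaled dot-product score q.k / sqrt(d)."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def full_attention(q: List[Vec], k: List[Vec], v: List[Vec]) -> List[Vec]:
    """Reference: ordinary softmax attention over the whole sequence."""
    out = []
    for qi in q:
        s = [score(qi, kj) for kj in k]
        m = max(s)
        w = [math.exp(x - m) for x in s]
        z = sum(w)
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) / z for d in range(len(v[0]))])
    return out


def ring_attention(q_blocks, k_blocks, v_blocks):
    """Each rank keeps its Q block and consumes one KV block per ring step."""
    world, dim = len(q_blocks), len(v_blocks[0][0])
    outputs = []
    for rank in range(world):
        n = len(q_blocks[rank])
        running_max = [-math.inf] * n
        denom = [0.0] * n
        acc = [[0.0] * dim for _ in range(n)]
        for step in range(world):
            src = (rank - step) % world  # KV block that has reached this rank
            for i, qi in enumerate(q_blocks[rank]):
                s = [score(qi, kj) for kj in k_blocks[src]]
                new_max = max(running_max[i], max(s))
                rescale = math.exp(running_max[i] - new_max)  # 0.0 on the first step
                w = [math.exp(x - new_max) for x in s]
                denom[i] = denom[i] * rescale + sum(w)
                acc[i] = [
                    a * rescale + sum(wj * vj[d] for wj, vj in zip(w, v_blocks[src]))
                    for d, a in enumerate(acc[i])
                ]
                running_max[i] = new_max
        outputs.append([[a / denom[i] for a in acc[i]] for i in range(n)])
    return outputs


def blocks(x, world):
    size = len(x) // world
    return [x[r * size:(r + 1) * size] for r in range(world)]


seq_len, dim, world = 8, 2, 4
q = [[math.sin(t + d) for d in range(dim)] for t in range(seq_len)]
k = [[math.cos(2 * t + d) for d in range(dim)] for t in range(seq_len)]
v = [[float(t), float(t * t % 5)] for t in range(seq_len)]

reference = full_attention(q, k, v)
ringed = [row for blk in ring_attention(blocks(q, world), blocks(k, world), blocks(v, world)) for row in blk]
max_err = max(abs(a - b) for ra, rb in zip(reference, ringed) for a, b in zip(ra, rb))
print(max_err < 1e-9)
```

The running max and denominator are what let each rank fold in one KV block at a time and still match full softmax attention.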
USP and LoongTrain. USP unifies Ulysses-style and Ring-style approaches into a broader sequence-parallel design space [7]. LoongTrain pushes the same direction toward 2D-Attention and head-context parallelism for long-sequence LLM training [8].
3. LongVILA: Multi-Modal SP for long-video understanding
LongVILA is a long-context VLM system for long-video understanding. Algorithmically, it extends the VILA training recipe with context extension and long-video supervised fine-tuning. System-wise, its key idea is Multi-Modal Sequence Parallelism, or MM-SP [1].
The headline numbers show the regime: LongVILA extends VILA from 8 video frames to 2048 frames and reports 99.8% accuracy in a 6000-frame needle-in-a-haystack evaluation, where the video can exceed 1M tokens [1].
Why text-only SP is not enough
Ring-style or text-centric SP can shard a token sequence. But in a VLM, the model does not begin with a clean token sequence. It begins with frames/images and text, then uses the vision encoder to manufacture visual tokens. If the system only shards after this point, the vision tower can remain imbalanced.
LongVILA’s MM-SP therefore uses a two-stage sharding strategy:
- Stage 1: shard by images or frames. Frames are distributed across SP ranks to balance the vision tower workload.
- Stage 2: shard by tokens. After visual embeddings and text are assembled, the resulting sequence is balanced across ranks for the LLM.
This is a small but important shift in perspective. The SP boundary moves earlier in the pipeline. The system does not wait until the LLM sees a long token sequence; it starts balancing from the moment video becomes visual work.
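The two-stage idea can be sketched as two independent balancing passes. This is a minimal illustration and not LongVILA's implementation: contiguous near-equal splits are an assumption, and the frame count, tokens-per-frame value, text length, and function names are all hypothetical.

```python
from typing import List


def even_sizes(total: int, parts: int) -> List[int]:
    """Sizes of `parts` near-equal pieces that sum to `total`."""
    base, extra = divmod(total, parts)
    return [base + (1 if p < extra else 0) for p in range(parts)]


def stage1_shard_frames(num_frames: int, world_size: int) -> List[List[int]]:
    """Stage 1: hand each rank a group of frame indices for the vision tower."""
    groups, start = [], 0
    for size in even_sizes(num_frames, world_size):
        groups.append(list(range(start, start + size)))
        start += size
    return groups


def stage2_shard_tokens(num_frames: int, tokens_per_frame: int,
                        num_text_tokens: int, world_size: int) -> List[range]:
    """Stage 2: rebalance the assembled visual+text sequence by token count."""
    total = num_frames * tokens_per_frame + num_text_tokens
    shards, start = [], 0
    for size in even_sizes(total, world_size):
        shards.append(range(start, start + size))
        start += size
    return shards


# Hypothetical sample: 10 frames, 196 visual tokens per frame, 40 text tokens, 4 ranks.
frame_groups = stage1_shard_frames(num_frames=10, world_size=4)
token_shards = stage2_shard_tokens(10, 196, 40, world_size=4)

print([len(g) for g in frame_groups])   # vision-tower work per rank, in frames
print([len(s) for s in token_shards])   # LLM work per rank, in tokens
```

The two balances are different quantities: frames per rank for the encoder, tokens per rank for the LLM, which is why one split cannot serve both stages.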
Communication should match the hardware topology
LongVILA also shows that SP is not only about slicing tensors. It is about choosing a communication pattern that matches the machine. The paper contrasts Ring-style SP, which relies on point-to-point communication, with MM-SP’s 2D-Attention design: intra-node All-to-All uses fast NVLink bandwidth, while inter-node P2P handles the slower cross-node path [1].
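One way to picture the 2D layout is as a rank grid with two families of communication groups. The sketch below assumes ranks are numbered node by node and that every node has the same GPU count; `build_2d_groups` is a hypothetical helper, not an API from LongVILA or any framework.

```python
from typing import List, Tuple


def build_2d_groups(world_size: int, gpus_per_node: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (intra-node groups for All-to-All, inter-node groups for ring P2P)."""
    if world_size % gpus_per_node != 0:
        raise ValueError("world_size must be a multiple of gpus_per_node")
    num_nodes = world_size // gpus_per_node
    # Ranks that share a node exchange heads over the fast intra-node links.
    intra = [
        [node * gpus_per_node + local for local in range(gpus_per_node)]
        for node in range(num_nodes)
    ]
    # Ranks with the same local index form a ring across nodes for KV passing.
    inter = [
        [node * gpus_per_node + local for node in range(num_nodes)]
        for local in range(gpus_per_node)
    ]
    return intra, inter


intra_groups, inter_groups = build_2d_groups(world_size=16, gpus_per_node=8)
print(intra_groups[0])  # ranks on node 0
print(inter_groups[0])  # local rank 0 on every node
```

Keeping the bandwidth-heavy All-to-All inside a node and only the P2P ring across nodes is the design choice the paragraph above describes.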
4. LongLive-2.0: Balanced SP for long-video generation
LongLive-2.0 attacks a different problem: long-video generation infrastructure. The full system includes NVFP4 training and inference, KV-cache compression, parallel dequantization, and asynchronous VAE decoding. For this post, the key training-side idea is Balanced SP [2].
The difference from LongVILA is important. In understanding, the challenge is multi-modal token production and long-context VLM execution. In generation, the challenge is that the physical sequence layout encodes the training objective.
The naive layout: clean everywhere, loss somewhere
The efficient teacher-forcing formulation builds a sequence like:
[ clean history latents ; noisy target latents ]
If ordinary SP slices this concatenated sequence without understanding the AR objective, some ranks may contain mostly clean context while another rank contains noisy target tokens that carry the loss. The sequence is partitioned, but the training work is not balanced.
The Balanced SP answer: same temporal chunk, same rank
Balanced SP changes the data layout. Each SP rank owns the clean latents and noisy latents from the same temporal chunk. That gives every rank both context tokens and target/loss tokens, and it lets the teacher-forcing attention mask be built naturally after the Ulysses-style All-to-All layout [2].
Traditional SP:
GPU0: clean z0
GPU1: clean z1
GPU2: clean z2
GPU3: noisy z3 + loss
Balanced SP:
GPU0: clean z0 + noisy z0 + local loss
GPU1: clean z1 + noisy z1 + local loss
GPU2: clean z2 + noisy z2 + local loss
GPU3: clean z3 + noisy z3 + local loss
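The imbalance can be made concrete by counting loss-bearing tokens per rank under each layout. This is a toy model under stated assumptions: the sequence is taken to be all clean chunks followed by all noisy chunks, only noisy tokens carry loss, and the function names and sizes are hypothetical rather than taken from LongLive-2.0.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (kind, temporal chunk index)


def naive_sp(num_chunks: int, tokens_per_chunk: int, world_size: int) -> List[List[Token]]:
    """Cut the concatenated [clean ; noisy] sequence into contiguous slices."""
    seq = [("clean", c) for c in range(num_chunks) for _ in range(tokens_per_chunk)]
    seq += [("noisy", c) for c in range(num_chunks) for _ in range(tokens_per_chunk)]
    per_rank = len(seq) // world_size
    return [seq[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]


def balanced_sp(num_chunks: int, tokens_per_chunk: int, world_size: int) -> List[List[Token]]:
    """Give each rank the clean AND noisy tokens of its own temporal chunks."""
    assert num_chunks % world_size == 0
    chunks_per_rank = num_chunks // world_size
    shards = []
    for r in range(world_size):
        own = range(r * chunks_per_rank, (r + 1) * chunks_per_rank)
        shard = [("clean", c) for c in own for _ in range(tokens_per_chunk)]
        shard += [("noisy", c) for c in own for _ in range(tokens_per_chunk)]
        shards.append(shard)
    return shards


def loss_tokens_per_rank(shards: List[List[Token]]) -> List[int]:
    return [sum(1 for kind, _ in shard if kind == "noisy") for shard in shards]


naive = loss_tokens_per_rank(naive_sp(num_chunks=4, tokens_per_chunk=3, world_size=4))
balanced = loss_tokens_per_rank(balanced_sp(num_chunks=4, tokens_per_chunk=3, world_size=4))
print(naive, balanced)
```

Both layouts give every rank the same number of tokens; only the balanced one gives every rank the same share of the loss.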
SP starts before the transformer
The most interesting detail is that Balanced SP reaches before the DiT. Each rank VAE-encodes only its local raw-video chunk plus a left halo covering the VAE temporal receptive field, then discards the halo latent and keeps the local latent chunk [2].
This is a video-specific lesson. If the transformer is sharded but the VAE pipeline is replicated, the system has not actually made long-video training scale. SP must begin where the expensive sequence is created.
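The halo mechanism can be checked on a toy causal temporal filter standing in for the VAE encoder. The stand-in is an assumption made for illustration: a real video VAE also downsamples in time and space, which is ignored here, and `causal_filter` and `encode_with_halo` are hypothetical names.

```python
from typing import List


def causal_filter(frames: List[float], halo: int) -> List[float]:
    """Toy encoder: output t depends on frames t-halo .. t (a causal receptive field)."""
    return [sum(frames[max(0, t - halo):t + 1]) for t in range(len(frames))]


def encode_with_halo(frames: List[float], world_size: int, halo: int) -> List[List[float]]:
    """Each rank encodes its chunk plus a left halo, then drops the halo outputs."""
    assert len(frames) % world_size == 0
    per_rank = len(frames) // world_size
    outputs = []
    for rank in range(world_size):
        start, end = rank * per_rank, (rank + 1) * per_rank
        lo = max(0, start - halo)              # rank 0 has no frames to its left
        local = causal_filter(frames[lo:end], halo)
        outputs.append(local[start - lo:])     # discard outputs computed for the halo
    return outputs


frames = [float(t) for t in range(16)]
full = causal_filter(frames, halo=2)
sharded = encode_with_halo(frames, world_size=4, halo=2)
flattened = [x for chunk in sharded for x in chunk]
print(flattened == full)
```

Each rank touches `per_rank + halo` frames instead of all of them, and the kept outputs match the unsharded encoder because every kept position sees its full receptive field.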
The performance lesson
LongLive-2.0 reports that NVFP4 plus Balanced SP is the fastest training configuration: for 16s, 32s, and 64s videos, iteration time is 40.1s, 119.3s, and 639.5s, giving 1.3×, 1.4×, and 2.1× speedups over the BF16+SP baseline. The paper also reports up to 2.15× training speedup, 1.84× inference speedup, and 45.7 FPS inference [2].
5. Differences between long-video understanding and generation systems
LongVILA and LongLive-2.0 both split inside a sample, but they assign ownership to different semantic units: multimodal tokens for understanding, and clean/noisy latent chunks for generation.
| Dimension | LongVILA | LongLive-2.0 |
|---|---|---|
| Task | Long-video understanding / VLM | Long-video generation / AR diffusion infrastructure |
| Sequence unit | Visual tokens plus text tokens | Video latent chunks |
| Main bottleneck | Vision encoder workload, LLM context length, attention communication | Clean/noisy layout, loss imbalance, VAE latent preparation, DiT activation |
| Why naive SP fails | Text-only token sharding ignores where visual tokens come from | Concatenated clean/noisy sharding can concentrate target/loss work |
| Core design | MM-SP: shard by frames/images, then by tokens | Balanced SP: each rank owns clean and noisy latents from the same temporal chunk |
| General lesson | Modality-aware ownership | Objective-aware temporal ownership |
The common abstraction is semantic ownership. A rank should not merely own a contiguous slice of a tensor. It should own a slice that makes the upstream encoder, the attention communication, the loss, and the hardware topology behave well together.
6. Design principles for long-video training systems
Principle 1: Shard the real bottleneck
Do not split the easiest tensor; split the work that actually limits scale. For LongVILA, that includes frame/image encoding. For LongLive-2.0, it includes VAE preparation and target-loss ownership.
Principle 2: Preserve objective semantics
SP should not change temporal order, position identity, attention visibility, loss masks, or target ownership. The distributed layout must remain equivalent to the unsharded objective.
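One practical reading of this principle is that attention visibility should be derived from token identity, not from physical position after re-layout. The sketch below assumes a common teacher-forcing rule (clean tokens see clean tokens up to their own chunk; noisy tokens see earlier clean chunks and their own noisy chunk); that rule and all names are assumptions for illustration, not the mask specified by LongLive-2.0.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (kind, temporal chunk index)


def visible(query: Token, key: Token) -> bool:
    """Assumed teacher-forcing visibility, defined on identity rather than position."""
    q_kind, q_chunk = query
    k_kind, k_chunk = key
    if q_kind == "clean":
        return k_kind == "clean" and k_chunk <= q_chunk
    if k_kind == "clean":
        return k_chunk < q_chunk      # noisy target sees only earlier clean history
    return k_chunk == q_chunk         # and the noisy tokens of its own chunk


def build_mask(layout: List[Token]) -> List[List[bool]]:
    return [[visible(q, k) for k in layout] for q in layout]


num_chunks = 4
canonical = [("clean", c) for c in range(num_chunks)] + [("noisy", c) for c in range(num_chunks)]
balanced = [tok for c in range(num_chunks) for tok in (("clean", c), ("noisy", c))]

mask_canonical = build_mask(canonical)
mask_balanced = build_mask(balanced)

# Map each position in the balanced layout back to its canonical position.
perm = [canonical.index(tok) for tok in balanced]
same_semantics = all(
    mask_balanced[i][j] == mask_canonical[perm[i]][perm[j]]
    for i in range(len(balanced)) for j in range(len(balanced))
)

# A purely positional causal mask would let clean chunk 1 see noisy chunk 0.
positional = [[j <= i for j in range(len(balanced))] for i in range(len(balanced))]
print(same_semantics, positional == mask_balanced)
```

The identity-based mask survives the re-layout unchanged, while the positional one silently changes what each token can see.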
Principle 3: Match the hardware topology
Ring, Ulysses, 2D-Attention, USP, and LoongTrain differ mainly in how they communicate. A good video system chooses All-to-All, P2P, intra-node traffic, and inter-node traffic deliberately.
Principle 4: Start before the transformer
Video sequence construction begins before attention: frame loading, vision encoding, VAE encoding, latent chunking, and mask construction. If SP starts only inside transformer blocks, imbalance may already be baked in.
Principle 5: Check what each rank owns
Across samples? -> DP / FSDP / ZeRO
Inside one long sample? -> SP / context parallelism
Mostly text and enough attention heads? -> Ulysses-style SP
Need many nodes or beyond head limits? -> Ring / USP / 2D-Attention / LoongTrain-style SP
Heavy multimodal encoder work? -> MM-SP-style two-stage sharding
Clean/noisy AR video streams? -> Balanced-SP-style temporal ownership
The final diagnostic is simple: after sharding, every rank should have meaningful work. If one rank owns the loss, every rank re-encodes the same video, or communication ignores the hardware topology, the layout is probably wrong.
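Two of these checks are easy to automate from per-rank counters. The sketch below is a hypothetical helper with made-up numbers; `RankReport`, `diagnose`, and the counters are illustrative names, and the topology check is left out because it depends on the cluster.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RankReport:
    loss_tokens: int      # tokens on this rank that contribute to the loss
    encoded_frames: int   # frames this rank pushes through the VAE or vision encoder


def diagnose(reports: List[RankReport], total_frames: int) -> List[str]:
    """Flag layouts where loss or encoder work is not actually distributed."""
    problems = []
    if any(r.loss_tokens == 0 for r in reports):
        problems.append("some ranks own no loss tokens")
    if len(reports) > 1 and all(r.encoded_frames >= total_frames for r in reports):
        problems.append("every rank re-encodes the full video")
    return problems


# Hypothetical 64-frame sample on 4 ranks.
naive = [RankReport(0, 64), RankReport(0, 64), RankReport(0, 64), RankReport(512, 64)]
balanced = [RankReport(128, 18) for _ in range(4)]  # 16 local frames + 2 halo frames

print(diagnose(naive, total_frames=64))
print(diagnose(balanced, total_frames=64))
```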
Closing: The timeline is the new batch dimension
Long-video training breaks a basic assumption: one sample does not necessarily belong to one GPU. Once a sample becomes hundreds or thousands of frames, the system has to distribute work inside the sample itself.
But video also shows the limit of a generic SP story. LongVILA needs SP that understands modality and visual token production. LongLive-2.0 needs SP that understands clean/noisy latent streams, teacher forcing, VAE halos, and loss ownership.
In long-video training, the timeline is the new batch dimension. Sequence parallelism is how we scale it.
That is why SP belongs in the infrastructure stack for long video. A beautiful long-video demo proves capability. A well-designed SP system makes that capability trainable.
References
[1] LongVILA: Scaling Long-Context Visual Language Models for Long Videos. Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han. arXiv preprint, 2024. arXiv:2408.10188
[2] LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation. Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, Song Han. arXiv preprint, 2026. arXiv:2605.18739
[3] Sequence Parallelism: Long Sequence Training from System Perspective. Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You. Annual Meeting of the Association for Computational Linguistics (ACL), 2023 / arXiv preprint, 2021. arXiv:2105.13120
[4] Reducing Activation Recomputation in Large Transformer Models. Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro. MLSys, 2023 / arXiv preprint, 2022. arXiv:2205.05198
[5] DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He. arXiv preprint, 2023. arXiv:2309.14509
[6] Ring Attention with Blockwise Transformers for Near-Infinite Context. Hao Liu, Matei Zaharia, Pieter Abbeel. arXiv preprint, 2023. arXiv:2310.01889
[7] USP: A Unified Sequence Parallelism Approach for Long Context Generative AI. Jiarui Fang, Shangchun Zhao. arXiv preprint, 2024. arXiv:2405.07719
[8] LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism. Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, Yonggang Wen, Tianwei Zhang, Xin Jin, Xuanzhe Liu. arXiv preprint, 2024. arXiv:2406.18485
[9] Context Parallelism. NVIDIA Megatron Core Documentation. NVIDIA Docs