Scaling Video Training with Parallelism

📝 Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Weian Mao, Song Han
📅 June 3, 2026 ⏱️ 12 min read

Long-video training changes the unit of distributed computation. A short video sample can belong to one GPU. A long video sample may already be too large, too irregular, or too objective-dependent for a single worker to own.

The previous post argued that video generation is becoming an infrastructure problem. This post zooms in on one specific infrastructure question: how do we train on a single video sequence that is too long for a single GPU, without breaking the semantics of the model, the modality, or the training objective?

The answer is not simply “use more GPUs.” Data parallelism gives more samples to more workers. Tensor parallelism splits matrix dimensions. Model parallelism, including pipeline-style layer splits, splits the model itself. For long videos, the painful dimension is often inside one sample: time, context length, visual tokens, latent chunks, masks, and loss-bearing targets.

Two recent systems make the idea concrete. LongVILA scales long-video understanding with Multi-Modal Sequence Parallelism (MM-SP), where the system first balances vision work over frames/images and then balances tokens for the LLM [1]. LongLive-2.0 scales long-video generation with Balanced SP, where the system assigns each GPU a temporally meaningful chunk that owns clean history, noisy target, VAE work, attention mask, and loss [2].

Diagram: Longer videos need sequence parallelism. A short video has few tokens and fits on one GPU. A longer video has more tokens and needs more memory and compute. A very long video has too many tokens and does not fit on one GPU. Sequence parallelism (SP) splits the sequence across GPUs to balance memory, compute, and workload.

1. What sequence parallelism actually parallelizes

Sequence parallelism (SP) is the parallelism axis that works inside one sample. Data parallelism splits samples; tensor and model parallelism split model computation. Sequence/context parallelism splits the time or context dimension itself.

This matters once a single video becomes hundreds or thousands of frames. LongVILA reports 1400-frame sequences at about 274K tokens and 2M-context training; LongLive-2.0 adds clean/noisy latents, VAE halos, masks, and target loss to the same long-sample problem [1] [2].

Short video training: many samples  -> split the batch
Long video training:  one sample    -> split inside the sample
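As a rough back-of-the-envelope check on the numbers quoted above, the sketch below derives an average tokens-per-frame figure from the 1400-frame, roughly 274K-token data point and shows how the largest per-rank shard shrinks as the SP degree grows. The SP degrees and the helper name `tokens_per_rank` are illustrative assumptions, not settings reported in [1].

```python
import math

FRAMES = 1400            # frame count quoted from LongVILA [1]
TOTAL_TOKENS = 274_000   # "about 274K tokens" as quoted above (approximate)

# Derived here, not reported: average number of tokens contributed by each frame.
tokens_per_frame = TOTAL_TOKENS / FRAMES

def tokens_per_rank(total_tokens: int, sp_degree: int) -> int:
    """Size of the largest shard when a sequence is split as evenly as possible."""
    return math.ceil(total_tokens / sp_degree)

print(f"about {tokens_per_frame:.0f} tokens per frame")
for sp_degree in (1, 2, 4, 8):  # hypothetical SP group sizes
    shard = tokens_per_rank(TOTAL_TOKENS, sp_degree)
    print(f"SP={sp_degree}: up to {shard:,} tokens per rank")
```

The point is only the scaling: the per-rank sequence length falls with the SP degree, while a data-parallel split would leave every rank holding the full sequence.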

The mental model is:

Diagram: Parallelism axes: what does each GPU own? The usual axes split batches, parameters, hidden dimensions, or layers; SP splits inside one sample. Data parallelism splits the batch (sample A, sample B). Tensor parallelism splits heads or hidden dims (H/2, H/2). Model parallelism (pipeline-style layer split) splits model depth (L1, L2, L3). Sequence parallelism splits context / time (t0, t1, t2, t3).
Figure 1. Generated animated diagram. Data parallelism splits samples, tensor parallelism splits hidden dimensions, model/pipeline parallelism splits model depth, and sequence/context parallelism splits inside a single sample.

| Parallelism | What it splits | What it solves | Why it is not enough alone for long videos |
| --- | --- | --- | --- |
| Data parallelism / FSDP / ZeRO | Batch, parameters, gradients, optimizer state | Parameter and optimizer memory; throughput across samples | A single video sample can still be too long for one rank. |
| Tensor parallelism | Hidden dimensions, attention heads, MLP dimensions | Large matrix operations inside a layer | The sequence activation may still be too large. |
| Model / pipeline parallelism | Transformer layers or other model partitions | Deep models that do not fit on one device | Each stage or partition may still see the full sequence. |
| Sequence / context parallelism | Sequence, context, or time dimension | Long sequences whose activations and attention do not fit on one device | The shard must match modality, objective, masks, and hardware. |

In this post, SP means the broad form of sequence/context parallelism: partitioning the context, time, or token dimension of one sample across ranks. For long video, that partition must also respect modality, objective, masks, and hardware topology [4] [9].
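One way to make the axes concrete is to look at which dimension of a single activation tensor each scheme cuts. The sketch below is a deliberately simplified NumPy illustration: the `[batch, sequence, hidden]` layout, the toy sizes, and the `shard` helper are assumptions for this example, and real tensor parallelism shards weights as well as activations. Pipeline parallelism splits layers rather than a tensor axis, so it is not shown.

```python
import numpy as np

B, S, H = 4, 16, 8        # toy batch, sequence, and hidden sizes (assumed)
WORLD = 2                 # toy number of ranks (assumed)
x = np.zeros((B, S, H))   # one activation tensor laid out as [batch, sequence, hidden]

def shard(tensor: np.ndarray, axis: int, world: int = WORLD) -> list:
    """Split `tensor` into `world` equal pieces along `axis`."""
    return np.split(tensor, world, axis=axis)

dp_shards = shard(x, axis=0)  # data parallelism: each rank sees different samples
sp_shards = shard(x, axis=1)  # sequence/context parallelism: each rank sees part of every sample
tp_shards = shard(x, axis=2)  # tensor parallelism: each rank sees part of the hidden dimension

print("DP shard:", dp_shards[0].shape)  # (2, 16, 8)
print("SP shard:", sp_shards[0].shape)  # (4, 8, 8)
print("TP shard:", tp_shards[0].shape)  # (4, 16, 4)
```

Only the SP shard shortens the sequence axis, which is the axis that grows with video length.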

2. A short map of SP systems

Before looking at video, it helps to place LongVILA and LongLive-2.0 on the SP map.

Sequence Parallelism from a system perspective. Li et al. framed sequence parallelism as a way to break the input sequence length limitation by splitting a long sequence into chunks and distributing those chunks across devices [3].

Megatron sequence parallelism and context parallelism. Megatron-style SP reduces activation memory by sharding, along the sequence dimension, the LayerNorm and dropout activations that tensor parallelism would otherwise replicate, so it composes naturally with tensor parallelism. Megatron Core’s later context parallelism generalizes the idea by partitioning the sequence dimension for network inputs and activations [4] [9].

DeepSpeed-Ulysses. Ulysses partitions input data along the sequence dimension and uses all-to-all communication during attention, which can be efficient as long as the attention heads can be divided evenly across the SP ranks [5].
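The layout change behind the Ulysses all-to-all can be simulated on one machine. The sketch below is a single-process NumPy toy, not DeepSpeed code: the sizes, the `[sequence, heads, head_dim]` layout, and the helper `all_to_all_seq_to_head` are assumptions made for illustration. Each simulated rank starts with a sequence chunk that contains all heads and ends with the full sequence for a subset of heads, which is the layout attention needs.

```python
import numpy as np

P, S, HEADS, D = 4, 8, 4, 2   # toy sizes (assumed); HEADS must be divisible by P
rng = np.random.default_rng(0)
full = rng.normal(size=(S, HEADS, D))    # unsharded [sequence, heads, head_dim]

# Before attention: each rank holds a contiguous sequence chunk with all heads.
seq_shards = np.split(full, P, axis=0)   # P arrays of shape [S/P, HEADS, D]

def all_to_all_seq_to_head(shards: list) -> list:
    """Simulate the all-to-all: rank j collects head group j from every rank."""
    world = len(shards)
    out = []
    for j in range(world):
        # Head group j of every sequence chunk, kept in sequence order.
        pieces = [np.split(chunk, world, axis=1)[j] for chunk in shards]
        out.append(np.concatenate(pieces, axis=0))   # [S, HEADS/P, D]
    return out

head_shards = all_to_all_seq_to_head(seq_shards)
print(seq_shards[0].shape, "->", head_shards[0].shape)  # (2, 4, 2) -> (8, 1, 2)
```

After the exchange each rank can run ordinary attention over the whole sequence for its own heads; a second all-to-all in the opposite direction would restore the sequence-sharded layout.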

Ring Attention. Ring Attention uses blockwise attention and ring communication of key-value blocks, letting devices stream KV chunks while computing local attention [6]. This makes context length scale naturally with the number of devices.
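The key numerical trick is that softmax attention can be accumulated one key-value block at a time. The following single-process NumPy sketch simulates that idea for non-causal attention and checks it against ordinary attention. It is an illustration under simplifying assumptions, not the implementation from [6]: the function names and sizes are made up here, and real systems overlap the ring communication with computation.

```python
import numpy as np

def full_attention(q, k, v):
    """Reference softmax attention over the whole sequence."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def ring_attention(q_blocks, k_blocks, v_blocks):
    """Each simulated rank keeps its Q block and sees one K/V block per step,
    merging partial results with an online (streaming) softmax."""
    world = len(q_blocks)
    outputs = []
    for rank, q in enumerate(q_blocks):
        m = np.full((q.shape[0], 1), -np.inf)               # running row-wise max
        l = np.zeros((q.shape[0], 1))                       # running softmax denominator
        acc = np.zeros((q.shape[0], v_blocks[0].shape[-1])) # running unnormalized output
        for step in range(world):
            src = (rank - step) % world                     # K/V block arriving at this step
            scores = q @ k_blocks[src].T / np.sqrt(q.shape[-1])
            m_new = np.maximum(m, scores.max(axis=-1, keepdims=True))
            scale = np.exp(m - m_new)                       # rescale old statistics to the new max
            p = np.exp(scores - m_new)
            l = l * scale + p.sum(axis=-1, keepdims=True)
            acc = acc * scale + p @ v_blocks[src]
            m = m_new
        outputs.append(acc / l)
    return np.concatenate(outputs, axis=0)

rng = np.random.default_rng(0)
S, D, WORLD = 16, 4, 4                                      # toy sizes (assumed)
q, k, v = (rng.normal(size=(S, D)) for _ in range(3))
out = ring_attention(np.split(q, WORLD), np.split(k, WORLD), np.split(v, WORLD))
print(np.allclose(out, full_attention(q, k, v)))
```

No rank ever materializes the full score matrix: it only holds its own queries, one key-value block, and three small running statistics.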

USP and LoongTrain. USP unifies Ulysses-style and Ring-style approaches into a broader sequence-parallel design space [7]. LoongTrain pushes the same direction toward 2D-Attention and head-context parallelism for long-sequence LLM training [8].

The map gives us vocabulary, not the final answer. These systems answer how to distribute long transformer sequences. Long videos add another layer: the sequence is produced by a multimodal pipeline or by a structured generation objective.
Diagram: Why video is not just longer text. In video understanding, tokens are produced: video frames pass through a vision encoder to become visual tokens for the LLM, so the system must balance encoding and token sharding. In video generation, layout is the objective: clean history and noisy target chunks share one sequence, with the loss on the noisy side. Video SP must respect token origin and token meaning.

3. LongVILA: Multi-Modal SP for long-video understanding

LongVILA is a long-context VLM system for long-video understanding. Algorithmically, it extends the VILA training recipe with context extension and long-video supervised fine-tuning. System-wise, its key idea is Multi-Modal Sequence Parallelism, or MM-SP [1].

The headline numbers show the regime: LongVILA extends VILA from 8 video frames to 2048 frames and reports 99.8% accuracy in a 6000-frame needle-in-a-haystack evaluation, where the video can exceed 1M tokens [1].

Diagram: LongVILA MM-SP, two-stage sharding. MM-SP first balances image/frame work, then re-shards visual and text tokens for the LLM. Baseline Ring SP splits tokens, but the vision encoder workload is not balanced, and KV blocks move by P2P. MM-SP example with two GPUs, four images of 100 tokens each, and 300 text tokens: Stage 1 (by images) gives GPU 0 two images and GPU 1 two images plus the text; Stage 2 (by tokens) gives GPU 0 a final shard of 100 + 100 + 100 + 50 image tokens and GPU 1 a final shard of 50 image tokens + 300 text tokens. Topology-aware communication: Ring SP uses P2P everywhere, while MM-SP 2D-Attention uses All-to-All inside each node and P2P across nodes.
Figure 2. Generated animated diagram based on LongVILA. MM-SP first balances the image/frame workload and then balances token workload; its 2D-Attention communication uses intra-node All-to-All and inter-node P2P. Source: LongVILA [1].

Why text-only SP is not enough

Ring-style or text-centric SP can shard a token sequence. But in a VLM, the model does not begin with a clean token sequence. It begins with frames/images and text, then uses the vision encoder to manufacture visual tokens. If the system only shards after this point, the vision tower can remain imbalanced.

LongVILA’s MM-SP therefore uses a two-stage sharding strategy:

  1. Stage 1: shard by images or frames. Frames are distributed across SP ranks to balance the vision tower workload.
  2. Stage 2: shard by tokens. After visual embeddings and text are assembled, the resulting sequence is balanced across ranks for the LLM.

This is a small but important shift in perspective. The SP boundary moves earlier in the pipeline. The system does not wait until the LLM sees a long token sequence; it starts balancing from the moment video becomes visual work.
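The two stages can be sketched as a pair of small partitioning functions. This is a minimal illustration that reuses the toy numbers from the Figure 2 diagram (two ranks, four images of 100 tokens each, 300 text tokens); the function names, the contiguous assignment policy, and the segment representation are assumptions for this example, not the LongVILA implementation.

```python
def shard_images(num_images: int, world: int) -> list[list[int]]:
    """Stage 1: hand out image indices in contiguous, near-equal groups."""
    base, extra = divmod(num_images, world)
    groups, start = [], 0
    for rank in range(world):
        count = base + (1 if rank < extra else 0)
        groups.append(list(range(start, start + count)))
        start += count
    return groups

def shard_tokens(segments: list[tuple[str, int]], world: int) -> list[list[tuple[str, int]]]:
    """Stage 2: cut the assembled token sequence into near-equal contiguous shards."""
    total = sum(n for _, n in segments)
    base, extra = divmod(total, world)
    targets = [base + (1 if r < extra else 0) for r in range(world)]
    shards = [[] for _ in range(world)]
    rank, room = 0, targets[0]
    for name, n in segments:
        while n > 0:
            if room == 0:              # current rank is full, move to the next one
                rank += 1
                room = targets[rank]
            take = min(n, room)
            shards[rank].append((name, take))
            n -= take
            room -= take
    return shards

image_groups = shard_images(num_images=4, world=2)
segments = [("img0", 100), ("img1", 100), ("img2", 100), ("img3", 100), ("text", 300)]
token_shards = shard_tokens(segments, world=2)
print(image_groups)   # [[0, 1], [2, 3]]
print(token_shards)   # rank 0 gets 350 image tokens; rank 1 gets 50 image + 300 text tokens
```

Stage 1 balances encoder work (two images per rank), while Stage 2 balances LLM work (350 tokens per rank) even though that means one image's tokens straddle two ranks.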

Communication should match the hardware topology

LongVILA also shows that SP is not only about slicing tensors. It is about choosing a communication pattern that matches the machine. The paper contrasts Ring-style SP, which relies on point-to-point communication, with MM-SP’s 2D-Attention design: intra-node All-to-All uses fast NVLink bandwidth, while inter-node P2P handles the slower cross-node path [1].
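A topology-aware design starts by deciding which ranks talk to each other over which link. The sketch below builds the two kinds of rank groups such a 2D scheme would need, assuming ranks are numbered node by node (ranks 0-3 on node 0, ranks 4-7 on node 1). The function name and the numbering convention are assumptions for illustration; in a PyTorch job, lists like these could be passed to `torch.distributed.new_group` to create the actual communicators.

```python
def build_groups(num_nodes: int, gpus_per_node: int) -> tuple[list[list[int]], list[list[int]]]:
    """Return (intra_node_groups, inter_node_groups) as lists of global ranks.

    Assumes node-major rank numbering: global_rank = node * gpus_per_node + local_rank.
    """
    # Ranks on the same node: candidates for All-to-All over the fast intra-node link.
    intra = [[node * gpus_per_node + local for local in range(gpus_per_node)]
             for node in range(num_nodes)]
    # Ranks with the same local index on different nodes: candidates for P2P ring traffic.
    inter = [[node * gpus_per_node + local for node in range(num_nodes)]
             for local in range(gpus_per_node)]
    return intra, inter

intra, inter = build_groups(num_nodes=2, gpus_per_node=4)
print("intra-node (All-to-All):", intra)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print("inter-node (P2P):", inter)         # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Every rank belongs to exactly one group of each kind, so the bandwidth-hungry collective stays inside a node and only the lighter point-to-point traffic crosses nodes.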

LongVILA takeaway: for long-video understanding, SP has to become multi-modal SP. The system must know where visual tokens come from, not only where transformer tokens go.

4. LongLive-2.0: Balanced SP for long-video generation

LongLive-2.0 attacks a different problem: long-video generation infrastructure. The full system includes NVFP4 training and inference, KV-cache compression, parallel dequantization, and asynchronous VAE decoding. For this post, the key training-side idea is Balanced SP [2].

The difference from LongVILA is important. In understanding, the challenge is multi-modal token production and long-context VLM execution. In generation, the challenge is that the physical sequence layout encodes the training objective.

Figure 3a. Generated animated diagram based on LongLive-2.0. In traditional SP, VAE preparation is centralized and sharding over the concatenated clean/noisy sequence can concentrate the target loss on one rank. Source: LongLive-2.0 [2].
Figure 3b. Generated animated diagram based on LongLive-2.0. Balanced SP assigns each GPU a temporal clean/noisy pair, so VAE work, attention-mask ownership, and loss are distributed across ranks. Source: LongLive-2.0 [2].

The naive layout: clean everywhere, loss somewhere

The efficient teacher-forcing formulation builds a sequence like:

[ clean history latents ; noisy target latents ]

If ordinary SP slices this concatenated sequence without understanding the AR objective, some ranks may contain mostly clean context while another rank contains noisy target tokens that carry the loss. The sequence is partitioned, but the training work is not balanced.

The Balanced SP answer: same temporal chunk, same rank

Balanced SP changes the data layout. Each SP rank owns the clean latents and noisy latents from the same temporal chunk. That gives every rank both context tokens and target/loss tokens, and it lets the teacher-forcing attention mask be built naturally after the Ulysses-style All-to-All layout [2].

Traditional SP:
GPU0: clean z0
GPU1: clean z1
GPU2: clean z2
GPU3: noisy z3 + loss

Balanced SP:
GPU0: clean z0 + noisy z0 + local loss
GPU1: clean z1 + noisy z1 + local loss
GPU2: clean z2 + noisy z2 + local loss
GPU3: clean z3 + noisy z3 + local loss
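The imbalance is easy to quantify by counting loss-bearing chunks per rank. The sketch below compares a contiguous split of the concatenated sequence with a Balanced-SP-style split. It is a toy under stated assumptions: it uses as many noisy chunks as clean chunks, a chunk count divisible by the number of ranks, and made-up helper names, so the exact per-rank contents differ from the simplified listing above even though the conclusion is the same.

```python
def contiguous_layout(num_chunks: int, world: int) -> list[list[str]]:
    """Plain SP: slice the concatenated [clean ; noisy] sequence into contiguous pieces."""
    sequence = ([f"clean z{i}" for i in range(num_chunks)]
                + [f"noisy z{i}" for i in range(num_chunks)])
    per_rank = len(sequence) // world
    return [sequence[r * per_rank:(r + 1) * per_rank] for r in range(world)]

def balanced_layout(num_chunks: int, world: int) -> list[list[str]]:
    """Balanced-SP-style: clean and noisy latents of the same temporal chunk share a rank."""
    per_rank = num_chunks // world
    layout = []
    for r in range(world):
        chunks = range(r * per_rank, (r + 1) * per_rank)
        layout.append([f"{kind} z{i}" for i in chunks for kind in ("clean", "noisy")])
    return layout

def loss_units(layout: list[list[str]]) -> list[int]:
    """Count loss-bearing (noisy) chunks on each rank."""
    return [sum(item.startswith("noisy") for item in rank) for rank in layout]

print(loss_units(contiguous_layout(num_chunks=4, world=4)))  # [0, 0, 2, 2]
print(loss_units(balanced_layout(num_chunks=4, world=4)))    # [1, 1, 1, 1]
```

Both layouts hold the same number of chunks per rank; only the balanced one also gives every rank a share of the loss.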

SP starts before the transformer

The most interesting detail is that Balanced SP reaches before the DiT. Each rank VAE-encodes only its local raw-video chunk plus a left halo covering the VAE temporal receptive field, then discards the halo latent and keeps the local latent chunk [2].
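The halo trick works because a causal temporal encoder only looks a fixed distance into the past. The toy below stands in for the VAE with a one-dimensional causal convolution over scalar "frames" and checks that encoding each chunk with a left halo, then discarding the halo outputs, reproduces the unsharded result. Everything here is an assumption made for illustration: a real video VAE also downsamples in time and space, and its halo width depends on its architecture.

```python
import numpy as np

def causal_temporal_conv(frames: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Toy causal encoder: output[t] depends only on frames[t-K+1 .. t]."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), frames])   # zero history before the first frame
    return np.array([padded[t:t + k] @ kernel for t in range(len(frames))])

def encode_local_chunk(video: np.ndarray, kernel: np.ndarray, start: int, stop: int) -> np.ndarray:
    """Encode video[start:stop] on one rank using only a left halo of at most K-1 frames."""
    halo = min(len(kernel) - 1, start)      # the first chunk has no real history to borrow
    window = video[start - halo:stop]
    encoded = causal_temporal_conv(window, kernel)
    return encoded[halo:]                   # discard halo outputs, keep the local chunk

rng = np.random.default_rng(0)
video = rng.normal(size=32)                 # 32 toy frames (assumed)
kernel = rng.normal(size=5)                 # temporal receptive field of 5 frames (assumed)
WORLD, CHUNK = 4, 8

local = [encode_local_chunk(video, kernel, r * CHUNK, (r + 1) * CHUNK) for r in range(WORLD)]
reference = causal_temporal_conv(video, kernel)
print(np.allclose(np.concatenate(local), reference))
```

Each simulated rank touches at most `CHUNK + K - 1` frames instead of all of them, which is what keeps the pre-transformer stage from being replicated.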

This is a video-specific lesson. If the transformer is sharded but the VAE pipeline is replicated, the system has not actually made long-video training scale. SP must begin where the expensive sequence is created.

The performance lesson

LongLive-2.0 reports that NVFP4 plus Balanced SP is the fastest training configuration: for 16s, 32s, and 64s videos, iteration time is 40.1s, 119.3s, and 639.5s, giving 1.3×, 1.4×, and 2.1× speedups over the BF16+SP baseline. The paper also reports up to 2.15× training speedup, 1.84× inference speedup, and 45.7 FPS inference [2].

LongLive-2.0 takeaway: for long-video generation, the right SP unit is not just tokens. It is a temporal chunk that owns clean history, noisy target, VAE work, attention mask, and loss.

5. Differences between long-video understanding and generation systems

LongVILA and LongLive-2.0 both split inside a sample, but they assign ownership to different semantic units: multimodal tokens for understanding, and clean/noisy latent chunks for generation.

| Dimension | LongVILA | LongLive-2.0 |
| --- | --- | --- |
| Task | Long-video understanding / VLM | Long-video generation / AR diffusion infrastructure |
| Sequence unit | Visual tokens plus text tokens | Video latent chunks |
| Main bottleneck | Vision encoder workload, LLM context length, attention communication | Clean/noisy layout, loss imbalance, VAE latent preparation, DiT activation |
| Why naive SP fails | Text-only token sharding ignores where visual tokens come from | Concatenated clean/noisy sharding can concentrate target/loss work |
| Core design | MM-SP: shard by frames/images, then by tokens | Balanced SP: each rank owns clean and noisy latents from the same temporal chunk |
| General lesson | Modality-aware ownership | Objective-aware temporal ownership |

Figure 4. Generated animated diagram. The shared SP principle is to split inside a sample; the video-specific question is which semantic unit becomes the owner of a shard.

The common abstraction is semantic ownership. A rank should not merely own a contiguous slice of a tensor. It should own a slice that makes the upstream encoder, the attention communication, the loss, and the hardware topology behave well together.

6. Design principles for long-video training systems

Principle 1: Shard the real bottleneck

Do not split the easiest tensor; split the work that actually limits scale. For LongVILA, that includes frame/image encoding. For LongLive-2.0, it includes VAE preparation and target-loss ownership.

Principle 2: Preserve objective semantics

SP should not change temporal order, position identity, attention visibility, loss masks, or target ownership. The distributed layout must remain equivalent to the unsharded objective.
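One concrete way equivalence can break is loss normalization. If ranks hold different numbers of loss-bearing tokens, averaging the per-rank means does not equal the unsharded mean; summing losses and counts across ranks does. The sketch below shows the difference with made-up per-token losses; the function names and the uneven split are assumptions for illustration, and the Python `sum` calls stand in for what would be an all-reduce in a distributed job.

```python
import numpy as np

def global_mean_loss(per_rank_losses: list) -> float:
    """Combine local sums and local counts, then divide once."""
    total = sum(float(np.sum(losses)) for losses in per_rank_losses)
    count = sum(len(losses) for losses in per_rank_losses)
    return total / count

def mean_of_local_means(per_rank_losses: list) -> float:
    """Naive alternative: average each rank's mean; biased when shard sizes differ."""
    return float(np.mean([np.mean(losses) for losses in per_rank_losses]))

token_losses = np.arange(10, dtype=float)          # toy per-token losses of one unsharded sample
uneven = [token_losses[:7], token_losses[7:]]      # ranks with 7 and 3 loss tokens (assumed split)

print(token_losses.mean())          # 4.5, the unsharded objective
print(global_mean_loss(uneven))     # 4.5, matches
print(mean_of_local_means(uneven))  # 5.5, does not match
```

A layout that gives every rank the same number of loss tokens makes the two reductions agree, but normalizing by the global count keeps the objective correct either way.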

Principle 3: Match the hardware topology

Ring, Ulysses, 2D-Attention, USP, and LoongTrain differ mainly in how they communicate. A good video system chooses All-to-All, P2P, intra-node traffic, and inter-node traffic deliberately.

Principle 4: Start before the transformer

Video sequence construction begins before attention: frame loading, vision encoding, VAE encoding, latent chunking, and mask construction. If SP starts only inside transformer blocks, imbalance may already be baked in.

Principle 5: Check what each rank owns

Across samples?                         -> DP / FSDP / ZeRO
Inside one long sample?                 -> SP / context parallelism
Mostly text and enough attention heads? -> Ulysses-style SP
Need many nodes or beyond head limits?  -> Ring / USP / 2D-Attention / LoongTrain-style SP
Heavy multimodal encoder work?          -> MM-SP-style two-stage sharding
Clean/noisy AR video streams?           -> Balanced-SP-style temporal ownership

The final diagnostic is simple: after sharding, every rank should have meaningful work. If one rank owns the loss, every rank re-encodes the same video, or communication ignores the hardware topology, the layout is probably wrong.
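Part of this diagnostic can be turned into a small sanity check that runs once per layout. The sketch below is hypothetical throughout: the `RankReport` fields, the `diagnose` helper, and the imbalance threshold are assumptions for illustration, and it covers only loss ownership and encoder replication, not communication topology.

```python
from dataclasses import dataclass

@dataclass
class RankReport:
    """Hypothetical per-rank summary collected after sharding one sample."""
    loss_tokens: int      # tokens on this rank that contribute to the loss
    encoded_frames: int   # frames this rank pushes through the encoder or VAE

def diagnose(reports: list[RankReport], total_frames: int, max_imbalance: float = 1.5) -> list[str]:
    """Return human-readable warnings; an empty list means no red flags were found."""
    warnings = []
    losses = [r.loss_tokens for r in reports]
    if min(losses) == 0:
        warnings.append("some ranks own no loss-bearing tokens")
    elif max(losses) / min(losses) > max_imbalance:   # threshold is an arbitrary assumption
        warnings.append("loss tokens are unevenly distributed")
    if len(reports) > 1 and all(r.encoded_frames >= total_frames for r in reports):
        warnings.append("every rank re-encodes the whole video")
    return warnings

traditional = [RankReport(0, 64), RankReport(0, 64), RankReport(0, 64), RankReport(1000, 64)]
balanced = [RankReport(250, 18) for _ in range(4)]   # 16 local frames + 2 halo frames (toy numbers)

print(diagnose(traditional, total_frames=64))
print(diagnose(balanced, total_frames=64))           # []
```

A check like this is cheap enough to run on a dry-run batch before launching a long job.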

Closing: The timeline is the new batch dimension

Long-video training breaks a basic assumption: one sample does not necessarily belong to one GPU. Once a sample becomes hundreds or thousands of frames, the system has to distribute work inside the sample itself.

But video also shows the limit of a generic SP story. LongVILA needs SP that understands modality and visual token production. LongLive-2.0 needs SP that understands clean/noisy latent streams, teacher forcing, VAE halos, and loss ownership.

In long-video training, the timeline is the new batch dimension. Sequence parallelism is how we scale it.

That is why SP belongs in the infrastructure stack for long video. A beautiful long-video demo proves capability. A well-designed SP system makes that capability trainable.

References

  1. LongVILA: Scaling Long-Context Visual Language Models for Long Videos. Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han. arXiv preprint, 2024. arXiv:2408.10188
  2. LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation. Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, Song Han. arXiv preprint, 2026. arXiv:2605.18739
  3. Sequence Parallelism: Long Sequence Training from System Perspective. Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You. Annual Meeting of the Association for Computational Linguistics (ACL), 2023 / arXiv preprint, 2021. arXiv:2105.13120
  4. Reducing Activation Recomputation in Large Transformer Models. Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro. MLSys, 2023 / arXiv preprint, 2022. arXiv:2205.05198
  5. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He. arXiv preprint, 2023. arXiv:2309.14509
  6. Ring Attention with Blockwise Transformers for Near-Infinite Context. Hao Liu, Matei Zaharia, Pieter Abbeel. arXiv preprint, 2023. arXiv:2310.01889
  7. USP: A Unified Sequence Parallelism Approach for Long Context Generative AI. Jiarui Fang, Shangchun Zhao. arXiv preprint, 2024. arXiv:2405.07719
  8. LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism. Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, Yonggang Wen, Tianwei Zhang, Xin Jin, Xuanzhe Liu. arXiv preprint, 2024. arXiv:2406.18485
  9. Context Parallelism. NVIDIA Megatron Core Documentation. NVIDIA Docs