Why Video Gen Is an Infra Problem

📝 Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Weian Mao, Song Han
📅 May 26, 2026 ⏱️ 10 min read

The first wave of modern video generation was about capability: Sora changed how people thought about the field by showing that scaled video models could move beyond short clips toward minute-long, high-fidelity videos across variable durations, resolutions, and aspect ratios [1]. But capability is only the first shock. The next wave is about complexity: models such as Seedance 2.0 point to a future where video generation is not only longer or prettier, but increasingly multimodal, controllable, editable, audio-video synchronized, and low-latency [2].

This shift changes the question.

Can the model generate a beautiful video? → Can the system generate a long, consistent, controllable video under real memory, latency, and deployment constraints?

Once video generation moves from short isolated clips to long, controllable, multimodal systems, the bottleneck is no longer only the denoising model. The bottleneck becomes infrastructure.

A video generator is no longer just a model. It is an operating system for visual tokens.

Contents

Why a 60-second video is not just six 10-second videos
Bottleneck migration; End-to-end system for video generation
KV cache is the memory wall
VAE is not post-processing
Low-bit training-inference alignment
Parallelism requires system–algorithm co-design
Closing: Video Gen Is an Infra Problem

1. Why a 60-second video is not just six 10-second videos

A tempting mental model is:

If a model can generate a good 10-second video, then a 60-second video should just be six 10-second videos stitched together.

This is wrong. A 60-second video is not only longer. It changes the nature of the problem.

When duration grows, the system does not simply pay 6× compute. It also accumulates memory, KV cache, synchronization overhead, VAE decoding latency, temporal drift, scene-level consistency constraints, and train-test mismatch. A character has to remain the same person. A room has to preserve its layout. Motion has to evolve naturally. A shot transition should change what needs to change, but preserve what should remain global. The system must remember, forget, compress, decode, and schedule.

A useful distinction follows: theoretical duration is not the same as effective duration. A model may technically produce 60 seconds of video, but the useful memory, visual consistency, and deployment efficiency across those 60 seconds are separate questions. The longer the video becomes, the more the system has to decide what to remember, what to compress, what to decode, and when to synchronize.
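To make the "more than 6× compute" point concrete, here is a minimal sketch of how attention cost alone scales. It assumes dense self-attention over every token in the sequence, which real long-video systems avoid with windows, sinks, and caching; the token count per clip is a hypothetical number chosen only for illustration.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by dense self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


# Hypothetical token count for one 10-second clip (illustrative only).
TOKENS_PER_CLIP = 30_000
NUM_CLIPS = 6

# Six clips generated independently: cost grows linearly with duration.
stitched = NUM_CLIPS * attention_pairs(TOKENS_PER_CLIP)

# One 60-second sequence attended densely: cost grows quadratically.
joint = attention_pairs(NUM_CLIPS * TOKENS_PER_CLIP)

ratio = joint // stitched
print(f"joint / stitched attention cost: {ratio}x")
```

Under this assumption, the single 60-second sequence costs 36 times the attention work of one clip, which is 6 times what six independent clips would cost together. That gap is before counting KV cache, synchronization, or decoding.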

Long video generation is not short video generation repeated over time. It is an online system with memory, compression, scheduling, synchronization, and error accumulation.

Video 1. A 60-second generated video example. Long-video generation is a stateful process, not a simple concatenation of independent 10-second clips.

2. Bottleneck migration; End-to-end system for video generation

In many-step diffusion pipelines, the DiT denoiser dominates latency, so it is natural to focus on reducing sampling steps or accelerating transformer throughput. But as video generation moves toward fewer steps, autoregressive decoding, KV caching, and low-precision inference, hidden system costs start to surface: VAE decoding, KV-cache updates, synchronization, memory transfer, and runtime scheduling.

CausVid illustrates this shift by turning bidirectional video diffusion into an autoregressive few-step generator, enabling streaming generation at 9.4 FPS on a single GPU with KV caching [4]. LTX-Video shows a similar trend from another angle: it co-designs the Video-VAE and denoising transformer with a highly compressed latent space, reporting faster-than-real-time generation [5].

When the DiT is slow, the DiT is the bottleneck. Once the DiT becomes fast, the bottleneck moves to the rest of the system. A mature video generation system is not one where every component is fast in isolation, but one where no hidden component (VAE, KV cache, precision runtime, synchronization, or data movement) silently dominates end-to-end latency.

One way to see this is to write video generation as a system rather than a model.

VideoGen System =
    Tokenizer / VAE
  + Denoising Engine
  + Temporal Memory
  + Precision Runtime
  + Parallel Execution
  + Decoder Scheduler

And a simplified end-to-end latency equation looks like:

T_e2e ≈ T_tokenize
      + N_chunks · T_denoise
      + N_chunks · T_KV
      + T_sync
      + T_transfer
      + T_decode

Most discussions focus on T_denoise. But users experience T_e2e.
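The latency equation can be turned into a small calculator to see bottleneck migration directly. This is a sketch under stated assumptions: the class name, field names, and every number are hypothetical, and the model is purely additive, so it ignores any overlap between stages.

```python
from dataclasses import dataclass, replace


@dataclass
class LatencyBudget:
    """Per-component latencies in seconds (all values are hypothetical)."""
    tokenize: float
    denoise_per_chunk: float
    kv_per_chunk: float
    sync: float
    transfer: float
    decode: float

    def breakdown(self, n_chunks: int) -> dict[str, float]:
        # Mirrors the terms of the simplified T_e2e equation.
        return {
            "tokenize": self.tokenize,
            "denoise": n_chunks * self.denoise_per_chunk,
            "kv": n_chunks * self.kv_per_chunk,
            "sync": self.sync,
            "transfer": self.transfer,
            "decode": self.decode,
        }

    def end_to_end(self, n_chunks: int) -> float:
        return sum(self.breakdown(n_chunks).values())


def dominant(budget: LatencyBudget, n_chunks: int) -> str:
    """Name of the component with the largest share of end-to-end latency."""
    parts = budget.breakdown(n_chunks)
    return max(parts, key=parts.get)


# A many-step denoiser versus the same system with a 10x faster denoiser.
slow = LatencyBudget(tokenize=0.2, denoise_per_chunk=4.0, kv_per_chunk=0.1,
                     sync=0.5, transfer=0.5, decode=6.0)
fast = replace(slow, denoise_per_chunk=0.4)

for name, budget in (("slow DiT", slow), ("fast DiT", fast)):
    print(name, round(budget.end_to_end(12), 2), dominant(budget, 12))
```

With these made-up numbers, a 10× faster denoiser cuts end-to-end latency by only about 4×, and the largest term shifts from denoising to decoding.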

If a paper or demo reports only denoising FPS, it may miss the actual latency before the user receives pixels. In long video generation, VAE decoding, KV-cache movement, GPU synchronization, CPU-GPU transfer, and multi-shot scheduling can become large enough to change the architecture of the system.

This is why “model FPS” is an incomplete metric for video generation. It is useful, but it is not the full story.

A video model is not truly fast unless the user receives pixels fast.

This is also why LongLive 2.0 was designed as an end-to-end infrastructure problem rather than only a model optimization problem. LongLive 2.0 targets training and inference together: balanced sequence-parallel (Balanced SP) training, NVFP4 training and inference, W4A4 execution, NVFP4 KV cache, parallel dequantization, and asynchronous VAE decoding are part of the same system [3].

Animated LongLive 2.0 end-to-end training and inference infrastructure overview.

Figure 1. LongLive 2.0 as an end-to-end video generation infrastructure. Training-side and inference-side components, including AR training, NVFP4 training, Balanced SP, NVFP4 W4A4/KV, parallel dequantization, and asynchronous decoding, are optimized together.

3. KV cache is the memory wall

In autoregressive video generation, KV cache is not just an implementation detail. It is the working memory of the video generator. If the system cannot store, compress, and update history efficiently, the model not only becomes slower; it also starts to lose identity, scene layout, and motion consistency over long horizons.

Long-video memory is also structured. For multi-shot generation, some memory should persist across the entire video, while some should be refreshed when the scene changes. In LongLive 2.0, we use a global-level sink to preserve long-horizon anchors, and a shot-level sink to maintain the current scene. When a new shot arrives, the global memory remains, while the shot-level memory is updated.

Animated multi-shot attention sink diagram showing Shot 1, Shot 2, and Shot 3 with global-level and shot-level sinks.

Figure 2. Multi-shot attention sink for streaming multi-shot inference. The global-level sink remains across shots, while the shot-level sink is refreshed as the scene changes.
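The two-level memory described above can be sketched as a toy cache index. This is an illustration of the idea, not the LongLive 2.0 implementation: the class name, the sink and window sizes, and the choice to clear the rolling window at a shot boundary are all assumptions, and integer chunk ids stand in for real key/value tensors.

```python
from collections import deque


class TwoLevelSinkCache:
    """Toy KV-cache index with a global sink, a shot sink, and a rolling window."""

    def __init__(self, sink_size: int, window_size: int):
        self.sink_size = sink_size
        self.global_sink: list[int] = []   # persists for the whole video
        self.shot_sink: list[int] = []     # refreshed at every shot change
        self.window: deque[int] = deque(maxlen=window_size)  # recent chunks

    def start_shot(self) -> None:
        """Begin a new shot: drop shot-level state, keep global anchors."""
        self.shot_sink.clear()
        self.window.clear()  # assumption: recent context is also reset

    def append(self, chunk_id: int) -> None:
        """Route a new chunk to the first memory tier that still has room."""
        if len(self.global_sink) < self.sink_size:
            self.global_sink.append(chunk_id)
        elif len(self.shot_sink) < self.sink_size:
            self.shot_sink.append(chunk_id)
        else:
            self.window.append(chunk_id)  # oldest entry is evicted when full

    def visible(self) -> list[int]:
        """Chunk ids the next chunk would attend to."""
        return self.global_sink + self.shot_sink + list(self.window)


cache = TwoLevelSinkCache(sink_size=1, window_size=2)
for chunk in range(5):          # shot 1: chunks 0..4
    cache.append(chunk)
shot1_view = cache.visible()

cache.start_shot()
for chunk in range(5, 9):       # shot 2: chunks 5..8
    cache.append(chunk)
shot2_view = cache.visible()
```

In this toy run, chunk 0 stays visible in both shots as the global anchor, while the shot-level anchor changes from chunk 1 to chunk 5 when the second shot begins.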

This is why KV-cache compression is not only about saving memory. It is about preserving enough useful working memory for the video to remain coherent. Quant VideoGen also identifies KV-cache memory as a key bottleneck for autoregressive video diffusion, where constrained KV-cache budgets can hurt long-horizon consistency in identity, layout, and motion [6]. In LongLive 2.0, NVFP4 KV cache, parallel dequantization, and asynchronous decoding are designed together so that long-video memory is both algorithmically meaningful and systemically affordable [3].
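A back-of-the-envelope calculation shows why the cache budget matters. The model shape and token count below are hypothetical, not taken from any cited system. The 4-bit case assumes a block-scaled format that stores one 8-bit scale per 16 values, which is my understanding of NVFP4; any per-tensor scale is ignored.

```python
def kv_cache_bytes(num_layers: int, num_heads: int, head_dim: int,
                   num_tokens: int, bits_per_value: float) -> float:
    """Bytes needed to store keys and values for num_tokens cached tokens."""
    values = 2 * num_layers * num_heads * head_dim * num_tokens  # 2 = K and V
    return values * bits_per_value / 8


# Hypothetical transformer shape and history length (illustrative only).
shape = dict(num_layers=30, num_heads=12, head_dim=128, num_tokens=100_000)

bf16 = kv_cache_bytes(**shape, bits_per_value=16)

# Assumed 4-bit block format: 4 bits per value plus one 8-bit scale per 16 values.
fp4_bits = 4 + 8 / 16
fp4 = kv_cache_bytes(**shape, bits_per_value=fp4_bits)

print(f"BF16 cache: {bf16 / 1e9:.2f} GB")
print(f"4-bit cache: {fp4 / 1e9:.2f} GB")
print(f"same budget holds {bf16 / fp4:.2f}x more history")
```

Read the other way around, a fixed memory budget holds roughly 3.5× more history at 4 bits under these assumptions, which is the sense in which compression buys working memory rather than just saving bytes.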

4. VAE is not post-processing

A common benchmarking mistake is to report denoising speed and treat VAE decoding as a fixed tax.

For short videos, this approximation may be acceptable. For long videos, it becomes misleading.

VAE decoding is not simply the final step that converts latents into pixels. It affects end-to-end latency, peak memory, streaming behavior, and the shape of the whole runtime pipeline.

Consider two possible pipelines.

Naive pipeline:
DiT chunk 1 → DiT chunk 2 → ... → DiT chunk N
→ VAE chunk 1 → VAE chunk 2 → ... → VAE chunk N

Streaming pipeline:
DiT chunk 1 → DiT chunk 2 → DiT chunk 3 → ...
              VAE chunk 1 → VAE chunk 2 → ...

Animated chunk-by-chunk asynchronous VAE decoding pipeline.

Figure 3. Chunk-by-chunk asynchronous VAE decoding. The DiT side continues generating later latent chunks while the VAE side decodes earlier chunks into video.

In the streaming pipeline, decoding is overlapped with denoising. The user does not have to wait for all latent chunks to finish before decoding begins. The VAE becomes part of the runtime scheduler.
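The overlap can be sketched as a producer-consumer pair connected by a bounded queue. This is a minimal sketch with placeholder functions: `denoise_chunk` and `decode_chunk` are hypothetical stand-ins for the DiT and the VAE, and a real system would more likely overlap GPU work with separate streams or devices than with Python threads.

```python
import queue
import threading


def denoise_chunk(i: int) -> str:
    """Stand-in for the DiT producing one latent chunk (hypothetical)."""
    return f"latent-{i}"


def decode_chunk(latent: str) -> str:
    """Stand-in for the VAE decoding one latent chunk (hypothetical)."""
    return latent.replace("latent", "frames")


def generate_streaming(num_chunks: int, max_pending: int = 2) -> list[str]:
    """Denoise on the calling thread while a worker decodes earlier chunks."""
    latents: queue.Queue = queue.Queue(maxsize=max_pending)
    decoded: list[str] = []

    def decoder() -> None:
        while True:
            latent = latents.get()
            if latent is None:  # sentinel: no more chunks will arrive
                break
            decoded.append(decode_chunk(latent))

    worker = threading.Thread(target=decoder)
    worker.start()
    for i in range(num_chunks):
        # Blocks when max_pending chunks are waiting, so undecoded latents
        # cannot pile up without bound.
        latents.put(denoise_chunk(i))
    latents.put(None)
    worker.join()
    return decoded


frames = generate_streaming(4)
```

The bounded queue is the design choice worth noting: it lets decoding start as soon as the first chunk exists, and it applies backpressure to the denoiser when the decoder falls behind, which caps peak memory held in pending latents.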

This is why LongLive 2.0 uses asynchronous streaming VAE decoding. The paper emphasizes that real deployment needs more than diffusion-model FPS because KV cache, VAE decoding, and multi-shot continuous generation latency all have to be handled at the system level [3].

The VAE is not post-processing. It is the pixel runtime.

5. Low-bit training-inference alignment

Low precision is often described as compression, but for long video generation it is also an alignment problem. In autoregressive video, quantization errors are not isolated: they can enter the generated history, be stored in the KV cache, and condition future chunks. Therefore, a mismatch between training and inference precision can affect not only single-frame quality, but also long-horizon stability.

Qualitative comparison between post-training NVFP4 and training-aware NVFP4 across three shots.

Figure 4. Training-aware NVFP4 versus post-training NVFP4. The comparison shows that training-inference alignment helps preserve visual details across multiple shots.

This is why we treat NVFP4 as part of the full training-and-inference infrastructure, rather than a simple post-training quantization trick. In LongLive 2.0, NVFP4-aware training, W4A4 inference, KV-cache quantization, LoRA handling, and efficient dequantization are designed together to reduce the gap between how the model is trained and how it is deployed [3].
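A common way to expose a model to inference-time precision during training is fake quantization: quantize and immediately dequantize values in the forward pass. The sketch below shows that round trip for a 4-bit E2M1 value grid with one scale per block. It is an illustration of the general technique, not the LongLive 2.0 recipe; the block size of 16 and the round-to-nearest rule are assumptions.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is stored separately).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed number of values sharing one scale


def fake_quantize(values: list[float]) -> list[float]:
    """Quantize then dequantize with a per-block scale (round to nearest)."""
    out = []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        # Map the largest magnitude in the block onto the top of the grid;
        # fall back to 1.0 for an all-zero block to avoid dividing by zero.
        scale = max(abs(v) for v in block) / FP4_GRID[-1] or 1.0
        for v in block:
            mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append((mag if v >= 0 else -mag) * scale)
    return out


history = [0.1 * i - 0.8 for i in range(16)]  # toy values, e.g. cached keys
restored = fake_quantize(history)
worst = max(abs(a - b) for a, b in zip(history, restored))
print(f"worst-case round-trip error: {worst:.4f}")
```

In an autoregressive loop, `restored` rather than `history` is what later chunks would condition on, so a model trained only on full-precision history sees a different input distribution at inference than it did during training.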

For long video generation, precision is not just a number. It is an interface between the algorithm and the system.

6. Parallelism requires system–algorithm co-design

Parallelism is often treated as a low-level implementation detail, but for long video generation it is also part of the algorithm design. In autoregressive video training, the sequence is not a flat list of tokens: it contains clean history, noisy target chunks, teacher-forcing masks, VAE latents, and loss-bearing tokens. If we shard this structure naively, we can create load imbalance and unnecessary duplicated computation.

Animated traditional sequence parallelism under teacher forcing showing computation imbalance.

Figure 5. Teacher forcing plus traditional sequence parallelism can create computation imbalance: clean chunks are distributed across ranks, while the noisy target and loss concentrate on one rank.

This motivates Balanced SP in LongLive 2.0. Instead of partitioning clean-context and noisy-target latents as a flat sequence, Balanced SP pairs clean and noisy latents from the same temporal chunk on each GPU. This balances loss-bearing computation across ranks and aligns teacher forcing, sequence-parallel execution, and chunked VAE encoding within the same layout [3].
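The difference between flat and paired sharding can be seen by counting loss-bearing chunks per rank. This is a toy layout chosen for illustration: it assumes one noisy target per clean chunk and a chunk count divisible by the number of ranks, and the function names are hypothetical. The exact layout used by Balanced SP in LongLive 2.0 may differ.

```python
Chunk = tuple[str, int]  # (kind, temporal index), kind is "clean" or "noisy"


def flat_shard(num_chunks: int, num_ranks: int) -> list[list[Chunk]]:
    """Split [all clean chunks, then all noisy chunks] into contiguous slices."""
    seq = [("clean", i) for i in range(num_chunks)]
    seq += [("noisy", i) for i in range(num_chunks)]
    per_rank = len(seq) // num_ranks
    return [seq[r * per_rank:(r + 1) * per_rank] for r in range(num_ranks)]


def paired_shard(num_chunks: int, num_ranks: int) -> list[list[Chunk]]:
    """Place the clean and noisy latents of each temporal chunk on one rank."""
    per_rank = num_chunks // num_ranks
    return [
        [(kind, i)
         for i in range(r * per_rank, (r + 1) * per_rank)
         for kind in ("clean", "noisy")]
        for r in range(num_ranks)
    ]


def loss_chunks_per_rank(assignment: list[list[Chunk]]) -> list[int]:
    """Count noisy (loss-bearing) chunks held by each rank."""
    return [sum(1 for kind, _ in rank if kind == "noisy") for rank in assignment]


flat_load = loss_chunks_per_rank(flat_shard(num_chunks=8, num_ranks=4))
paired_load = loss_chunks_per_rank(paired_shard(num_chunks=8, num_ranks=4))
print("flat:  ", flat_load)
print("paired:", paired_load)
```

In the flat layout, half of the ranks hold no loss-bearing chunks at all, while the paired layout gives every rank the same share.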

Good parallelism is not invisible. Good parallelism makes the algorithm look simpler.

Closing: Video Gen Is an Infra Problem

The Sora moment showed that video generation could scale into long, high-fidelity visual worlds. The Seedance moment points toward controllable, multimodal, interactive video systems. The next step is infrastructure.

The next generation of video models will not be defined only by better denoisers. It will be defined by systems that can remember longer, decode faster, quantize safely, parallelize naturally, and deliver pixels under real latency and memory budgets.

Video generation is becoming an infrastructure problem.

References

[1] Video generation models as world simulators. OpenAI. OpenAI Blog, 2024.
[2] Seedance 2.0: Advancing Video Generation for World Complexity. Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, et al. arXiv preprint, 2026. arXiv:2604.14148
[3] LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation. Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, Song Han. arXiv preprint, 2026. arXiv:2605.18739
[4] From Slow Bidirectional to Fast Autoregressive Video Diffusion Models. Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Frédo Durand, Eli Shechtman, Xun Huang. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2412.07772
[5] LTX-Video: Realtime Video Latent Diffusion. Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, Ofir Bibi. arXiv preprint, 2025. arXiv:2501.00103
[6] Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization. Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, Zhiying Xu, Jun Wu, Chenfeng Xu, Ion Stoica, Song Han, Kurt Keutzer. arXiv preprint, 2026. arXiv:2602.02958