NVIDIA Research
Yukang Chen 陈玉康
Research Scientist | Long AI Systems
Email · Google Scholar · GitHub · Homepage
I am a Research Scientist at NVIDIA Research, working with
Prof. Song Han. I received my Ph.D. in Computer Science from CUHK.
🔬 Research Focus
My research focuses on Long AI Systems through algorithm-system co-design: co-designing model algorithms, data/training recipes, distributed training systems, memory-efficient inference, and low-precision deployment to scale AI to long horizons efficiently.
- My work spans systems for long-video generation, long-reasoning inference acceleration, long-video reinforcement learning, and long-video understanding training, as well as long-context large language models.
- Recent systems include LongLive-2.0 for FP4 long-video generation infrastructure, TriAttention for long-reasoning inference acceleration across vLLM/SGLang/TensorRT/OpenClaw, Long-RL/MR-SP for hour-level long-video RL, and LongVILA/MM-SP for 2M-token VLM training.
- If you are interested in Long AI Systems and collaboration, please feel free to contact me via Email.
🚀 Representative Systems & Algorithms
Long-video Generation System
FP4/NVFP4 long-video generation infrastructure with Balanced SP, teacher-forcing layout co-design, W4A4 inference, KV cache compression, parallel dequantization, and asynchronous streaming VAE decoding.
Long-reasoning Inference Acceleration System
Training-free KV cache compression for long reasoning, integrated with vLLM, SGLang, a TensorRT deployment path, LongLive KV-compressed video generation, and OpenClaw custom-provider deployment.
Long-video Reinforcement Learning System
A full-stack long-video RL system combining LongVideo-Reason, CoT-SFT/RL, sequence parallelism, vLLM-based rollout/prefill, and cached video embeddings for hour-level video reasoning.
Long-video Understanding Training System
Algorithm-system co-design for long-video VLMs, enabling 2M-token context training on 256 GPUs without gradient checkpointing through Multi-Modal Sequence Parallelism.
Long-context Large Language Model
Efficient long-context fine-tuning via shifted sparse attention and improved LoRA, extending Llama2-7B to 100k context and Llama2-70B to 32k context on a single 8x A100 machine.
Long-range Autonomous Driving Perception
Fully sparse VoxelNet for 3D object detection and tracking; extends perception range by 4x without inference overhead and ranked 1st on nuScenes LiDAR 3D detection and tracking leaderboards.
🎓 Background
NVIDIA Research: Research Scientist, Efficient AI / Long AI Systems, Sep 2024 - Present
The Chinese University of Hong Kong: Ph.D., Computer Science, Aug 2020 - Jul 2024
🔥 News
💬 Invited Talks and Reports
arXiv 2026

LongLive 2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
[Paper]
[Code]
[Demo]
[Abstract]
We present LongLive-2.0, an NVFP4-based parallel infrastructure throughout the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training, the proportion of which increases as video length grows. Moreover, we show that a high-quality infrastructure and dataset enable a remarkably clean training pipeline. Unlike existing Self-Forcing series methods that rely on ODE initialization and subsequent distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive auto-regressive (AR) diffusion model. It can be further converted to real-time generation (4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed on Blackwell GPUs, while the quantized KV cache can lower inter-GPU communication of SP. Experiments show up to 2.15x speedup in training, and 1.84x in inference. LongLive-2.0-5B achieves 45.7 FPS inference while attaining strong performance on benchmarks. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.
Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, Song Han
- The first open-source FP4 Infra for Long Video Gen.
- Real-time Inference - 45.7 FPS on 5B model.
- Support real-video training, few-step distillation, multi-shot, sequence-parallel, NVFP4 KV cache, and async VAE decoding.
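As a rough illustration of what 4-bit block-scaled quantization involves, the sketch below rounds one block of values onto the FP4 (E2M1) magnitude grid with a shared per-block scale. It is a minimal sketch under stated assumptions, not LongLive-2.0's kernels: the function names are hypothetical, the scale is kept as a plain Python float (to my understanding, NVFP4 stores per-block scales in a narrower floating-point format), and blocks are shorter than in a real deployment.

```python
import math
from typing import List, Tuple

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block: List[float]) -> Tuple[List[float], float]:
    """Quantize one block to the E2M1 grid with a shared scale (hypothetical helper)."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        # Round the scaled magnitude to the nearest representable value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(math.copysign(mag, x))
    return codes, scale


def dequantize_block(codes: List[float], scale: float) -> List[float]:
    """Recover approximate values from grid codes and the block scale."""
    return [c * scale for c in codes]


codes, scale = quantize_block([0.0, 0.3, -0.6, 1.2])
restored = dequantize_block(codes, scale)
```

Because each block carries its own scale, an outlier only degrades precision within its block rather than across the whole tensor.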
ICLR 2026

LongLive: Real-time Interactive Long Video Generation
[Paper]
[Code]
[Demo]
[Abstract]
We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates three components: a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink (frame sink for short), preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench on both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.
Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, Yukang Chen
- Real-time Inference - 20.7 FPS generation on a single H100 GPU.
- Long Video Gen - Up to 240-second generation with interactive prompts.
- Efficient Fine-tuning - Extend Wan to minute-long in 32 H100 GPU-days.
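The combination of short window attention and a frame sink can be pictured as a rule for which cached frames a new frame attends to. The sketch below is a generic illustration of that attention-sink pattern, not LongLive's implementation; the function name and the `window` and `num_sink` parameters are hypothetical.

```python
from typing import List


def visible_frames(current: int, window: int, num_sink: int) -> List[int]:
    """Frames whose KV entries frame `current` attends to (hypothetical helper).

    The first `num_sink` frames stay visible permanently, and the most recent
    `window` frames (including the current one) form the local context.
    """
    sinks = list(range(min(num_sink, current + 1)))
    start = max(current - window + 1, 0)
    # Skip frames already counted as sinks to avoid duplicates.
    recent = [f for f in range(start, current + 1) if f >= num_sink]
    return sinks + recent


early = visible_frames(current=2, window=3, num_sink=1)
late = visible_frames(current=10, window=3, num_sink=1)
```

The number of visible frames never exceeds `num_sink + window`, so the KV cache stays bounded no matter how long the video grows.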
ICML 2026

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
[Paper]
[Code]
[Demo]
[Abstract]
Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position under RoPE, so few queries are representative, which leads to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions, a phenomenon we refer to as Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory with Full Attention.
Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, Yukang Chen
- High Efficiency - 2.5x higher throughput or 10.7x KV memory reduction in LLMs.
- OpenClaw - 32B LLM on a 24GB GPU.
- Long Video Gen - Reducing 50% KV Cache in AR Long Video Generation.
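The trigonometric series mentioned in the abstract can be motivated by a standard RoPE identity: for fixed pre-RoPE query and key vectors, the post-RoPE attention logit depends only on the query-key distance and can be written as a sum of cosines. The sketch below checks that identity numerically on toy data. It is not TriAttention's scoring rule; the function names, frequencies, and vectors are made up for illustration.

```python
import cmath
import math
from typing import List

Vec = List[complex]  # each complex entry stores one rotated 2-D pair


def rope(vec: Vec, pos: int, freqs: List[float]) -> Vec:
    """Apply RoPE: rotate pair f by the angle pos * freqs[f]."""
    return [v * cmath.exp(1j * pos * w) for v, w in zip(vec, freqs)]


def logit(q: Vec, k: Vec) -> float:
    """Real dot product of two vectors stored as complex pairs."""
    return sum((a * b.conjugate()).real for a, b in zip(q, k))


def trig_series(q: Vec, k: Vec, distance: int, freqs: List[float]) -> float:
    """The same logit written as a cosine series in the query-key distance."""
    total = 0.0
    for a, b, w in zip(q, k, freqs):
        amplitude = abs(a) * abs(b)
        phase = cmath.phase(a * b.conjugate())
        total += amplitude * math.cos(w * distance + phase)
    return total


freqs = [1.0, 0.1, 0.01]                      # toy rotation frequencies
q = [1 + 0.5j, -0.3 + 2j, 0.7 - 0.2j]         # toy pre-RoPE query
k = [0.2 - 1j, 1.1 + 0.4j, -0.6 + 0.9j]       # toy pre-RoPE key
direct = logit(rope(q, 100, freqs), rope(k, 93, freqs))
series = trig_series(q, k, 100 - 93, freqs)   # matches `direct` up to rounding
```

If queries and keys cluster around fixed pre-RoPE centers, substituting those centers for `q` and `k` yields a curve over distance, which is one way to read the abstract's claim that the centers determine which distances are preferred.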
NeurIPS 2025

Long-RL: Scaling RL to Long Sequences
[Paper]
[Code]
[Demo]
[Abstract]
We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.1% and 71.1% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-7B across multiple benchmarks. Moreover, LongVILA-R1-7B supports up to 8,192 frames per video and configurable FPS settings. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. In addition, we publicly release our training system, which supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames).
Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
- MR-SP System - RL on hour-long videos (3,600 frames), up to 2.1x speedup.
- LongVILA-R1-7B - 8,192 frames/video and 71.1% on VideoMME with sub.
- LongVideo-Reason Dataset - 104K long-video QA-reasoning pairs.
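The benefit of cached video embeddings can be sketched with plain memoization: when several rollouts share the same video, the expensive encoder runs once and later rollouts reuse the result. This is a minimal sketch of the idea, not MR-SP's implementation; `EmbeddingCache` and `fake_encoder` are hypothetical names, and the encoder is a stand-in.

```python
from typing import Callable, Dict, List


class EmbeddingCache:
    """Memoize video embeddings so repeated rollouts skip the encoder."""

    def __init__(self, encode: Callable[[str], List[float]]) -> None:
        self._encode = encode
        self._store: Dict[str, List[float]] = {}
        self.encoder_calls = 0  # counts how often the encoder actually ran

    def get(self, video_id: str) -> List[float]:
        if video_id not in self._store:
            self.encoder_calls += 1
            self._store[video_id] = self._encode(video_id)
        return self._store[video_id]


def fake_encoder(video_id: str) -> List[float]:
    """Stand-in for a vision encoder; returns a dummy embedding."""
    return [float(len(video_id))]


cache = EmbeddingCache(fake_encoder)
rollouts = [cache.get("video_a") for _ in range(8)]  # 8 rollouts, 1 encoder call
```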
ICLR 2025

LongVILA: Scaling Long-Context Visual Language Models for Long Videos
[Paper]
[Code]
[Abstract]
Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models by co-designing the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long context extension and long video supervised fine-tuning. However, training on long video is computationally and memory intensive. We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system that efficiently parallelizes long video training and inference, enabling 2M context length training on 256 GPUs without any gradient checkpointing. LongVILA efficiently extends the number of video frames of VILA from 8 to 2048, achieving 99.8% accuracy in 6,000-frame (more than 1 million tokens) video needle-in-a-haystack. LongVILA-7B demonstrates strong accuracy on 9 popular video benchmarks, e.g., 65.1% VideoMME with subtitle. Besides, MM-SP is 2.1x - 5.7x faster than ring style sequence parallelism and 1.1x - 1.4x faster than Megatron with a hybrid context and tensor parallelism. Moreover, it seamlessly integrates with Hugging Face Transformers.
Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han
- MM-SP System - 2M-token training on 256 GPUs, up to 1.4x faster than Megatron.
- LongVILA-7B - 99.8% on 6,000-frame (>1M tokens) needle-in-a-haystack.
- LongVILA-SFT Dataset - 54K high-quality long video QA pairs.
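To give a sense of scale for sequence parallelism, the sketch below splits a token sequence into contiguous shards, one per GPU. It is a simplified illustration, not the MM-SP sharding strategy: `shard_bounds` is a hypothetical helper, and reading "2M" as 2 x 1024 x 1024 tokens is an assumption.

```python
from typing import Tuple


def shard_bounds(num_tokens: int, world_size: int, rank: int) -> Tuple[int, int]:
    """Half-open [start, end) token range owned by `rank` (hypothetical helper).

    Shards are contiguous, and any remainder is spread over the lowest ranks.
    """
    base, extra = divmod(num_tokens, world_size)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


num_tokens = 2 * 1024 * 1024  # assumed meaning of "2M"
first = shard_bounds(num_tokens, 256, 0)
last = shard_bounds(num_tokens, 256, 255)
```

Under that assumption each of the 256 GPUs holds 8,192 tokens, so per-GPU activation memory scales with the shard rather than with the full sequence.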
ICLR 2024 Oral

LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
[Paper]
[Code]
[Abstract]
We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs), with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training with a context length of 8192 requires 16x the self-attention computation of a context length of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be effectively and efficiently done by sparse local attention. The proposed shifted sparse attention (S^2-Attn) effectively enables context extension, leading to non-trivial computation saving with similar performance to fine-tuning with vanilla attention. Particularly, it can be implemented with only two lines of code in training, while being optional in inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization. LongLoRA combines this improved LoRA with S^2-Attn. LongLoRA demonstrates strong empirical results on various tasks on Llama2 models from 7B/13B to 70B. LongLoRA extends Llama2 7B from 4k context to 100k, or Llama2 70B to 32k on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, like Flash-Attention2. In addition, we further conduct supervised fine-tuning with LongLoRA and our long instruction-following LongAlpaca dataset.
Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia
- Efficient Fine-tuning - 100k context on a single 8x A100, 1.8x speedup.
- Easy Implementation - Shifted sparse attention, compatible with Flash-Attn.
- LongAlpaca - The first open-source long instruction-following dataset.
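Two ideas from the abstract lend themselves to a small worked example: the quadratic growth of self-attention cost, and the grouping pattern behind shifted sparse attention. The sketch below is an index-level illustration only, not the LongLoRA code; the function names are hypothetical, and the shift of half a group is modeled as a roll of token indices.

```python
from typing import List


def attention_cost_ratio(long_ctx: int, short_ctx: int) -> float:
    """Self-attention cost grows with the square of the sequence length."""
    return (long_ctx / short_ctx) ** 2


def attention_groups(seq_len: int, group_size: int, shifted: bool) -> List[List[int]]:
    """Token indices that attend to each other within each local group.

    Unshifted heads use contiguous groups. Shifted heads roll the sequence by
    half a group first, so their groups straddle the unshifted boundaries.
    """
    assert seq_len % group_size == 0
    shift = group_size // 2 if shifted else 0
    # Position i holds token (i + shift) mod seq_len after the roll.
    order = [(i + shift) % seq_len for i in range(seq_len)]
    return [order[g:g + group_size] for g in range(0, seq_len, group_size)]


ratio = attention_cost_ratio(8192, 2048)             # (8192 / 2048) ** 2 = 16.0
plain = attention_groups(8, 4, shifted=False)        # [[0, 1, 2, 3], [4, 5, 6, 7]]
rolled = attention_groups(8, 4, shifted=True)        # [[2, 3, 4, 5], [6, 7, 0, 1]]
```

Tokens 3 and 4 never share a group in the unshifted heads but do in the shifted ones, which is how information can flow across group boundaries while each head still attends only locally.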
📋 Academic Services
Area Chair: AAAI 2026
Journal Reviewer: T-PAMI and T-TIP
Conference Reviewer: NeurIPS, ICLR, ICML, CVPR, ICCV, ECCV, and AAAI
🎖 Honors and Awards
2025: World's Top 2% Scientists.
2023: Final-list candidate of ByteDance Scholarship.
2022: 1st on nuScenes LiDAR 3D Object Detection leaderboard.
2022: 1st on nuScenes LiDAR Multi-Object Tracking leaderboard.
2023: Winner of ScanNet Indoor Scene Understanding (CVPR 2023 ScanNet Workshop).
2019: Winner of COCO Detection Challenge (ICCV 2019 COCO Workshop).