I am a fifth-year Ph.D. candidate (2020-) at the HAN Lab of MIT EECS, advised by Prof. Song Han. My research interest is systems and machine learning (SysML). I worked on full-stack efficient 3D deep learning for autonomous driving in the first half of my Ph.D., and have recently become interested in efficient foundation models for multimodal generation. I received the MLSys 2024 Best Paper Award as the system co-lead of the AWQ project.
During my Ph.D., I interned at NVIDIA, Alphabet (Waymo/Google), and OmniML (now part of NVIDIA). I received my Master of Science in EECS from MIT in 2022. Before that, I graduated with the highest honors from the Department of Computer Science and Engineering at Shanghai Jiao Tong University in 2020, where I was fortunate to be advised by Prof. Hongtao Lu. I was also affiliated with the IEEE Honor Class at SJTU.
News
May 14, 2024 | I delivered the oral presentation for our MLSys 2024 Best Paper, AWQ!
May 7, 2024 | We announced QServe, an inference engine for cloud LLM serving with W4A8KV4 precision. Running on L40S GPUs that are 3x cheaper, it matches TensorRT-LLM's throughput on A100. I led the system design with Shang Yang.
Jan 16, 2024 | Our new paper, LongLoRA, has been accepted to ICLR 2024 as an oral presentation.
Jul 24, 2023 | Our new paper, TorchSparse++, has been accepted to MICRO 2023. This is my first paper as a lead author at a computer architecture conference.
May 30, 2023 | I started my internship at Waymo Research with Kan Chen, Rami Al-Rfou, Charles R. Qi, and Kratarth Goel. I also work closely with Mingxing Tan, Yin Zhou, and their teams.
Recent Publications
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Yujun Lin*,
Haotian Tang*,
Shang Yang*,
Zhekai Zhang,
Guangxuan Xiao,
Chuang Gan,
and Song Han
arXiv
2024
[Abs]
[arXiv]
[Website]
[Code]
Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weights, 8-bit activations, and a 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented in the QServe inference library, which achieves measured speedups. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight, the QoQ algorithm introduces progressive quantization, which keeps the dequantization overhead in W4A8 GEMM low. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2× on A100 and 1.4× on L40S, and of Qwen1.5-72B by 2.4× on A100 and 3.5× on L40S, compared to TensorRT-LLM. Remarkably, QServe on an L40S GPU can achieve even higher throughput than TensorRT-LLM on A100. Thus, QServe effectively reduces the dollar cost of LLM serving by 3×.
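To make the two-level idea concrete, below is a minimal numpy sketch of progressive (two-level) weight quantization in the spirit of QoQ's W4A8 path. It is written from the abstract, not the released QServe kernels; the function names, the group size, and the floating-point (rather than 8-bit) group scales are my own simplifications.

```python
# Illustrative sketch only -- not the QServe implementation.
import numpy as np

def progressive_quantize(w, group_size=128):
    """Quantize FP weights to INT4 through an INT8 intermediate.

    Level 1: per-output-channel scale maps weights into the INT8 range.
    Level 2: per-group scales map INT8 to INT4, so dequantizing the 4-bit
    weights back to 8-bit at runtime needs only cheap integer arithmetic.
    (Real QoQ also rounds the group scales to 8 bits; omitted here.)
    """
    out_ch, in_ch = w.shape
    s_channel = np.abs(w).max(axis=1, keepdims=True) / 127.0
    w_int8 = np.clip(np.round(w / s_channel), -128, 127)
    w_groups = w_int8.reshape(out_ch, in_ch // group_size, group_size)
    s_group = np.abs(w_groups).max(axis=2, keepdims=True) / 7.0 + 1e-8
    w_int4 = np.clip(np.round(w_groups / s_group), -8, 7)
    return w_int4, s_group, s_channel

def dequantize(w_int4, s_group, s_channel):
    w_int8 = w_int4 * s_group                    # INT4 -> INT8 (integer-friendly)
    return w_int8.reshape(w_int8.shape[0], -1) * s_channel  # INT8 -> FP

w = np.random.randn(64, 256).astype(np.float32)
w4, sg, sc = progressive_quantize(w)
print("reconstruction error:", np.abs(dequantize(w4, sg, sc) - w).mean())
```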
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Ji Lin*,
Jiaming Tang*,
Haotian Tang+,
Shang Yang+,
Wei-Ming Chen,
Wei-Chen Wang,
Guangxuan Xiao,
Xingyu Dang,
Chuang Gan,
and Song Han
MLSys (Best Paper Award)
2024
[Abs]
[arXiv]
[Website]
[Code]
Large language models (LLMs) have shown excellent performance on various tasks, but the astronomical model size raises the hardware barrier for serving (memory size) and slows down token generation (memory bandwidth). In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activations, not the weights. AWQ does not rely on any backpropagation or reconstruction, so it preserves LLMs' generalization ability well across different domains and modalities, without overfitting to the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks. Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement an efficient and flexible inference framework tailored for LLMs on the edge, offering more than 3x speedup over the Hugging Face FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs (NVIDIA Jetson Orin 64GB).
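As a rough illustration of activation-aware scaling, the sketch below grid-searches a per-input-channel scale from calibration activations and keeps the one that minimizes output-space quantization error. It is written from the abstract's description, not AWQ's released code; the names, the search grid, and the INT4 group quantizer are illustrative.

```python
# Illustrative sketch only -- not the AWQ implementation.
import numpy as np

def quantize_int4(w, group_size=128):
    """Simulated INT4 group quantization (quantize-dequantize)."""
    g = w.reshape(w.shape[0], -1, group_size)
    s = np.abs(g).max(axis=2, keepdims=True) / 7.0 + 1e-8
    return (np.clip(np.round(g / s), -8, 7) * s).reshape(w.shape)

def awq_search_scale(w, x, n_grid=20, group_size=128):
    """Grid-search a per-input-channel scale s = act_mag ** alpha."""
    act_mag = np.abs(x).mean(axis=0)              # per-channel activation magnitude
    best_err, best_scale = np.inf, np.ones_like(act_mag)
    for alpha in np.linspace(0, 1, n_grid):
        s = act_mag ** alpha
        s = s / (s.max() * s.min()) ** 0.5        # normalize the scale range
        wq = quantize_int4(w * s, group_size) / s # quantize scaled weights, undo s
        err = ((x @ w.T - x @ wq.T) ** 2).mean()  # output-space quantization error
        if err < best_err:
            best_err, best_scale = err, s
    return best_scale

x = np.random.randn(512, 256).astype(np.float32)  # calibration activations
w = np.random.randn(64, 256).astype(np.float32)
scale = awq_search_scale(w, x)
```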
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
Yukang Chen,
Shengju Qian,
Haotian Tang,
Xin Lai,
Zhijian Liu,
Song Han,
and Jiaya Jia
ICLR (Oral)
2024
[Abs]
[arXiv]
[Website]
[Code]
We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training with a context length of 8192 incurs 16x the self-attention computation cost of training with 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be done effectively and efficiently with sparse local attention. The proposed shifted sparse attention effectively enables context extension, leading to non-trivial computation savings with performance similar to fine-tuning with vanilla attention. In particular, it can be implemented with only two lines of code in training, while being optional in inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization layers. LongLoRA combines this improved LoRA with S^2-Attn. LongLoRA demonstrates strong empirical results on various tasks with Llama2 models from 7B/13B to 70B. LongLoRA extends Llama2 7B from a 4k context to 100k, or Llama2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, such as FlashAttention-2. In addition, we further conduct supervised fine-tuning with LongLoRA and our long instruction-following LongAlpaca dataset.
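Here is a minimal PyTorch-flavored sketch of shifted sparse attention (S^2-Attn), written from the paper's description: tokens attend within local groups, and half of the heads are rolled by half a group so information flows across group boundaries. The names are illustrative, and the causal mask is omitted for brevity; this is not the released LongLoRA code.

```python
# Illustrative sketch only -- not the LongLoRA implementation.
import torch

def shifted_sparse_attention(qkv, group_size):
    """qkv: (batch, seq_len, 3, n_heads, head_dim); seq_len % group_size == 0."""
    qkv = qkv.clone()
    b, n, _, h, d = qkv.shape
    # The "two lines": roll half of the heads by half the group size.
    qkv[:, :, :, h // 2:] = qkv[:, :, :, h // 2:].roll(-group_size // 2, dims=1)
    # Fold each group into the batch dimension and run ordinary attention.
    q, k, v = qkv.reshape(b * n // group_size, group_size, 3, h, d).unbind(2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))   # (groups, heads, g, d)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    out = (attn @ v).transpose(1, 2).reshape(b, n, h, d)
    # Undo the shift on the shifted heads.
    out[:, :, h // 2:] = out[:, :, h // 2:].roll(group_size // 2, dims=1)
    return out

out = shifted_sparse_attention(torch.randn(2, 1024, 3, 8, 64), group_size=256)
```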
TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs
Haotian Tang*,
Shang Yang*,
Zhijian Liu,
Ke Hong,
Zhongming Yu,
Xiuyu Li,
Guohao Dai,
Yu Wang,
and Song Han
IEEE/ACM International Symposium on Microarchitecture (MICRO)
2023
[Abs]
[arXiv]
[Website]
[Code]
Point cloud computation is important for AR/VR and ADAS. It involves sparse and irregular computation patterns that require specialized high-performance kernels. Existing GPU libraries offer two dataflow types for sparse point cloud convolution. The gather-GEMM-scatter dataflow is easy to implement but not optimal in performance, while dataflows with overlapped computation and memory access (e.g., implicit GEMM) are highly performant but have very high engineering costs. In this work, we introduce TorchSparse++, a new GPU library that achieves the best of both worlds. We create a highly efficient Sparse Kernel Generator that generates performant sparse point cloud convolution kernels at less than one-tenth of the engineering cost of the current state-of-the-art system. On top of this, we design the Sparse Autotuner, which extends the design space of existing point cloud libraries and searches for the best dataflow configurations for training and inference workloads. Consequently, TorchSparse++ achieves 2.9x, 3.3x, 2.2x, and 1.7x measured end-to-end speedup on an NVIDIA A100 GPU over the state-of-the-art MinkowskiEngine, SpConv 1.2, TorchSparse, and SpConv v2 in inference; it is also 1.2-1.3x faster than SpConv v2 in mixed-precision training.
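For readers unfamiliar with the baseline dataflow, here is a minimal numpy sketch of gather-GEMM-scatter sparse convolution, the easy-to-implement dataflow the abstract mentions. The kernel maps are assumed precomputed, and all names are illustrative; this is not TorchSparse++ code.

```python
# Illustrative sketch of the gather-GEMM-scatter dataflow.
import numpy as np

def gather_gemm_scatter(features, weights, maps):
    """features: (N_in, C_in); weights: (K, C_in, C_out);
    maps[k]: (in_idx, out_idx) index arrays for kernel offset k."""
    n_out = max(int(out_idx.max()) + 1 for _, out_idx in maps if len(out_idx))
    out = np.zeros((n_out, weights.shape[2]), dtype=features.dtype)
    for k, (in_idx, out_idx) in enumerate(maps):
        if len(in_idx) == 0:
            continue
        gathered = features[in_idx]          # gather: irregular read
        partial = gathered @ weights[k]      # GEMM: dense, regular compute
        np.add.at(out, out_idx, partial)     # scatter: irregular accumulation
    return out

feats = np.random.randn(100, 16).astype(np.float32)
w = np.random.randn(27, 16, 32).astype(np.float32)  # 3x3x3 kernel
maps = [(np.random.randint(0, 100, 50), np.random.randint(0, 100, 50))
        for _ in range(27)]
out = gather_gemm_scatter(feats, w, maps)
```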
BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation
Zhijian Liu*,
Haotian Tang*,
Alexander Amini,
Xinyu Yang,
Huizi Mao,
Daniela Rus,
and Song Han
ICRA
2023
[Abs]
[arXiv]
[Website]
[Code]
Multi-sensor fusion is essential for an accurate and reliable autonomous driving system. Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation). In this paper, we break this deeply rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework. It unifies multi-modal features in a shared bird's-eye view (BEV) representation space, which nicely preserves both geometric and semantic information. To achieve this, we diagnose and lift key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than 40x. BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower computation cost.
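To illustrate what BEV pooling computes, here is a minimal numpy sketch: features lifted to metric coordinates are flattened onto a bird's-eye-view grid and reduced per cell. BEVFusion's contribution is an optimized CUDA kernel for this step; the sketch below shows only the semantics, with illustrative names and defaults.

```python
# Illustrative sketch of BEV pooling semantics -- not the optimized kernel.
import numpy as np

def bev_pool(points_xy, features, grid_size=128, cell=0.5):
    """points_xy: (N, 2) metric xy coordinates; features: (N, C)."""
    ij = np.floor(points_xy / cell).astype(int) + grid_size // 2
    valid = ((ij >= 0) & (ij < grid_size)).all(axis=1)   # drop out-of-grid points
    ij, feats = ij[valid], features[valid]
    flat = ij[:, 0] * grid_size + ij[:, 1]               # flatten cell index
    bev = np.zeros((grid_size * grid_size, features.shape[1]), dtype=features.dtype)
    np.add.at(bev, flat, feats)                          # sum-reduce points per cell
    return bev.reshape(grid_size, grid_size, -1)

bev = bev_pool(np.random.randn(10000, 2) * 20,
               np.random.randn(10000, 64).astype(np.float32))
```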
TorchSparse: Efficient Point Cloud Inference Engine
Haotian Tang*,
Zhijian Liu*,
Xiuyu Li*,
Yujun Lin,
and Song Han
MLSys
2022
[Abs]
[arXiv]
[Website]
[PDF]
[Code]
Deep learning on point clouds has received increased attention thanks to its wide applications in AR/VR and autonomous driving. These applications require low latency and high accuracy to provide a real-time user experience and ensure user safety. Unlike conventional dense workloads, the sparse and irregular nature of point clouds poses severe challenges to running sparse CNNs efficiently on general-purpose hardware, and existing sparse acceleration techniques for 2D images do not translate to 3D point clouds. In this paper, we introduce TorchSparse, a high-performance point cloud inference engine that accelerates sparse convolution computation on GPUs. TorchSparse directly optimizes the two bottlenecks of sparse convolution: data movement and irregular computation. It optimizes data orchestration with quantization and fused locality-aware memory access, reducing the memory movement cost by 2.7x. It also adopts adaptive matrix multiplication (MM) grouping to trade computation for better regularity, achieving 1.4-1.5x speedup for matrix multiplication. Evaluated on seven representative models across three benchmark datasets, TorchSparse achieves 1.6x and 1.5x measured end-to-end speedup over the state-of-the-art MinkowskiEngine and SpConv, respectively.
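As a rough sketch of the MM-grouping idea, the code below pads several small per-offset matmuls with similar workload sizes to a common size so they become one batched GEMM, trading a little redundant computation for regularity. The grouping policy and names are my own illustration, not TorchSparse's adaptive scheme.

```python
# Illustrative sketch of matmul grouping -- not the TorchSparse implementation.
import numpy as np

def grouped_gemm(gathered, weights, group):
    """gathered: list of (n_k, C_in) arrays, one per kernel offset;
    weights: (K, C_in, C_out); group: offsets to batch together."""
    n_max = max(gathered[k].shape[0] for k in group)
    # Pad every member to n_max rows (the traded redundant computation).
    padded = np.stack([
        np.vstack([gathered[k],
                   np.zeros((n_max - gathered[k].shape[0], gathered[k].shape[1]),
                            dtype=gathered[k].dtype)])
        for k in group
    ])                                    # (len(group), n_max, C_in)
    out = padded @ weights[group]         # one regular batched GEMM
    return [out[i, :gathered[k].shape[0]] for i, k in enumerate(group)]

gathered = [np.random.randn(n, 16).astype(np.float32) for n in (40, 37, 35)]
w = np.random.randn(3, 16, 32).astype(np.float32)
outs = grouped_gemm(gathered, w, group=[0, 1, 2])
```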
Enable Deep Learning on Mobile Devices: Methods, Systems, and Applications
Han Cai*,
Ji Lin*,
Yujun Lin*,
Zhijian Liu*,
Haotian Tang*,
Hanrui Wang*,
Ligeng Zhu*,
and Song Han
ACM Transactions on Design Automation of Electronic Systems (TODAES)
2022
[Abs]
[arXiv]
[PDF]
Deep neural networks (DNNs) have achieved unprecedented success in the field of artificial intelligence (AI), including computer vision, natural language processing, and speech recognition. However, their superior performance comes at the considerable cost of computational complexity, which greatly hinders their application on many resource-constrained devices, such as mobile phones and Internet of Things (IoT) devices. Therefore, methods and techniques that can lift the efficiency bottleneck while preserving the high accuracy of DNNs are in great demand for enabling numerous edge AI applications.
This paper provides an overview of efficient deep learning methods, systems, and applications. We start by introducing popular model compression methods, including pruning, factorization, and quantization, as well as compact model design. To reduce the large design cost of these manual solutions, we discuss the AutoML framework for each of them, such as neural architecture search (NAS) and automated pruning and quantization. We then cover efficient on-device training to enable user customization based on the local data on mobile devices. Apart from general acceleration techniques, we also showcase several task-specific accelerations for point cloud, video, and natural language processing by exploiting their spatial sparsity and temporal/token redundancy. Finally, to support all these algorithmic advancements, we introduce efficient deep learning system design from both software and hardware perspectives.
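As a toy illustration of two of the surveyed compression methods, magnitude pruning and uniform post-training quantization, consider the sketch below. The thresholds and bit-widths are illustrative defaults, not recommendations from the survey.

```python
# Illustrative sketch of two basic compression primitives.
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Zero out the smallest-magnitude weights until `sparsity` is reached."""
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= thresh, w, 0.0)

def uniform_quantize(w, n_bits=8):
    """Symmetric uniform quantization to n_bits (quantize-dequantize)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

w = np.random.randn(256, 256)
w_compressed = uniform_quantize(magnitude_prune(w))
```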
PVNAS: 3D Neural Architecture Search with Point-Voxel Convolution
Zhijian Liu*,
Haotian Tang*,
Shengyu Zhao,
Kevin Shao,
and Song Han
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
2021
[Abs]
[arXiv]
[PDF]
3D neural networks are widely used in real-world applications (e.g., AR/VR headsets, self-driving cars). They are required to be fast and accurate; however, limited hardware resources on edge devices make these requirements rather challenging. Previous work processes 3D data using either voxel-based or point-based neural networks, but neither type of 3D model is hardware-efficient due to the large memory footprint and random memory access. In this paper, we study 3D deep learning from the efficiency perspective. We first systematically analyze the bottlenecks of previous 3D methods. We then combine the best of point-based and voxel-based models and propose a novel hardware-efficient 3D primitive, Point-Voxel Convolution (PVConv). We further enhance this primitive with sparse convolution to make it more effective at processing large (outdoor) scenes. Based on our designed 3D primitive, we introduce 3D Neural Architecture Search (3D-NAS) to explore the best 3D network architecture given a resource constraint. We evaluate our proposed method on six representative benchmark datasets, achieving state-of-the-art performance with 1.8-23.7x measured speedup. Furthermore, our method has been deployed on the autonomous racing vehicle of MIT Driverless, achieving a larger detection range, higher accuracy, and lower latency.
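To sketch what resource-constrained search looks like in the spirit of 3D-NAS, here is a toy evolutionary skeleton: sample candidate depth/width settings, discard those over a compute budget, and evolve the best performers. The configuration space, cost model, and evaluation stub are all my own illustration, not the paper's search space.

```python
# Illustrative NAS skeleton only -- not the 3D-NAS implementation.
import random

def search(evaluate, budget_macs, n_rounds=10, population=16):
    def sample():
        return {"depth": random.choice([2, 3, 4]),
                "width": [random.choice([0.25, 0.5, 1.0]) for _ in range(4)]}
    def macs(cfg):  # toy cost model standing in for measured MACs
        return cfg["depth"] * sum(cfg["width"]) * 1e9
    pop = [sample() for _ in range(population)]
    for _ in range(n_rounds):
        pop = [c for c in pop if macs(c) <= budget_macs]  # hard resource constraint
        pop.sort(key=evaluate, reverse=True)
        parents = pop[: population // 4] or [sample()]
        # Mutate survivors to refill the population.
        pop = parents + [dict(random.choice(parents), depth=random.choice([2, 3, 4]))
                         for _ in range(population - len(parents))]
    return max(pop, key=evaluate)

best = search(evaluate=lambda cfg: -abs(cfg["depth"] - 3), budget_macs=8e9)
```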
PointAcc: Efficient Point Cloud Accelerator
Yujun Lin,
Zhekai Zhang,
Haotian Tang,
Hanrui Wang,
and Song Han
MICRO
2021
[Abs]
[arXiv]
[Website]
Deep learning on point clouds plays a vital role in a wide range of applications such as autonomous driving and AR/VR. These applications interact with people in real time on edge devices and thus require low latency and low energy. However, the intrinsic sparsity of point clouds poses challenges to hardware acceleration. Mapping operations are introduced to construct output point clouds and find neighbors for convolution, and they are unsupported by existing deep learning accelerators. Furthermore, since points are sparsely distributed in space, point cloud convolution requires explicit gather and scatter of features, which brings large data movement overhead. In this work, we first provide a comprehensive analysis of the performance of modern point cloud networks on CPUs/GPUs/TPUs. Then we present PointAcc, a novel point cloud deep learning accelerator. PointAcc introduces a configurable sorting-based mapping unit that efficiently supports diverse mapping operations, along with a data orchestration scheme that exploits simplified caching and layer fusion specialized for point cloud models, effectively reducing DRAM access by up to 6.3x. Evaluated on 8 point cloud models across 4 different applications, our accelerator achieves 3.7x, 53x, and 90x speedup and 22x, 210x, and 175x energy savings over GPU, TPU, and CPU, respectively. Co-designed with efficient point cloud networks, PointAcc outperforms the prior Mesorasi accelerator by 100x in speed with a 9.1 mIoU accuracy improvement on the S3DIS segmentation task. PointAcc paves the way for efficient point cloud recognition.
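To give a flavor of what the mapping unit computes, here is a minimal numpy sketch of sorting-based map construction for sparse convolution: for each kernel offset, find which input points land on existing points by sorting flattened coordinates and merging via binary search. The coordinate hashing and merge step below are my own software simplification of the idea, not the accelerator's design.

```python
# Illustrative software sketch of sorting-based mapping.
import numpy as np

def build_maps(coords, offsets, size=1024):
    """coords: (N, 3) integer voxel coordinates (assumed unique)."""
    key = (coords[:, 0] * size + coords[:, 1]) * size + coords[:, 2]
    order = np.argsort(key)                      # sort once, reuse per offset
    sorted_key = key[order]
    maps = []
    for off in offsets:
        shifted = coords + off
        q = (shifted[:, 0] * size + shifted[:, 1]) * size + shifted[:, 2]
        pos = np.clip(np.searchsorted(sorted_key, q), 0, len(sorted_key) - 1)
        hit = sorted_key[pos] == q               # merge step: does the neighbor exist?
        maps.append((np.nonzero(hit)[0], order[pos[hit]]))  # (query idx, neighbor idx)
    return maps

coords = np.unique(np.random.randint(0, 32, (200, 3)), axis=0)
offsets = np.array([[dx, dy, dz] for dx in (-1, 0, 1)
                    for dy in (-1, 0, 1) for dz in (-1, 0, 1)])
maps = build_maps(coords, offsets)
```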
SemAlign: Annotation-Free Camera-LiDAR Calibration with Semantic Alignment Loss
Zhijian Liu*,
Haotian Tang*,
Sibo Zhu*,
and Song Han
IROS
2021
[Abs]
[Website]
[PDF]
Multi-sensor solutions have been widely adopted in real-world robotics systems (e.g., self-driving vehicles) due to their better robustness. However, their performance is highly dependent on accurate calibration between different sensors, which is very time-consuming (i.e., hours of human effort) to acquire. Recent learning-based solutions partially address this issue yet still require costly ground-truth annotations as supervision. In this paper, we introduce a novel self-supervised semantic alignment loss to quantitatively measure the quality of a given calibration. It correlates well with conventional evaluation metrics while requiring no ground-truth calibration annotations as the reference. Based on this loss, we further propose an annotation-free, optimization-based calibration algorithm (SemAlign) that first estimates a coarse calibration with loss-guided initialization and then refines it with gradient-based optimization. SemAlign reduces the calibration time from hours of human effort to only seconds of GPU computation. It not only achieves performance comparable with existing supervised learning frameworks but also demonstrates much better generalization capability when transferred to a different dataset.
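To sketch the spirit of a semantic alignment loss, the code below projects LiDAR points through a candidate extrinsic calibration and measures how often their semantic labels agree with the image segmentation. The exact loss form, names, and intrinsics here are illustrative, not the paper's formulation.

```python
# Illustrative sketch of a semantic alignment loss -- not the SemAlign code.
import numpy as np

def semantic_alignment_loss(points, point_labels, seg_map, K, extrinsic):
    """points: (N, 3) LiDAR xyz; seg_map: (H, W) predicted image labels;
    K: (3, 3) intrinsics; extrinsic: (4, 4) candidate LiDAR-to-camera transform."""
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = (extrinsic @ pts_h.T).T[:, :3]                 # LiDAR -> camera frame
    front = cam[:, 2] > 0.1                              # keep points in front
    uv = (K @ cam[front].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)            # perspective projection
    h, w = seg_map.shape
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    matched = seg_map[uv[inside, 1], uv[inside, 0]] == point_labels[front][inside]
    return 1.0 - matched.mean()                          # lower = better aligned

K = np.array([[500., 0, 320], [0, 500., 240], [0, 0, 1]])
loss = semantic_alignment_loss(np.random.randn(1000, 3) + [0, 0, 5],
                               np.random.randint(0, 10, 1000),
                               np.random.randint(0, 10, (480, 640)), K, np.eye(4))
```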
Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution
Haotian Tang*,
Zhijian Liu*,
Shengyu Zhao,
Yujun Lin,
Ji Lin,
Hanrui Wang,
and Song Han
ECCV
2020
[Abs]
[arXiv]
[Website]
[Code]
Self-driving cars need to understand 3D scenes efficiently and accurately in order to drive safely. Given the limited hardware resources, existing 3D perception models are not able to recognize small instances (e.g., pedestrians, cyclists) very well due to the low-resolution voxelization and aggressive downsampling. To this end, we propose Sparse Point-Voxel Convolution (SPVConv), a lightweight 3D module that equips the vanilla sparse convolution with a high-resolution point-based branch. With negligible overhead, this point-based branch is able to preserve fine details even from large outdoor scenes. To explore the spectrum of efficient 3D models, we first define a flexible architecture design space based on SPVConv, and we then present 3D Neural Architecture Search (3D-NAS) to search for the optimal network architecture over this diverse design space efficiently and effectively. Experimental results validate that the resulting SPVNAS model is fast and accurate: it outperforms the state-of-the-art MinkowskiNet by 3.3%, ranking 1st on the competitive SemanticKITTI leaderboard upon publication. It also achieves 8x computation reduction and 3x measured speedup over MinkowskiNet while still achieving higher accuracy. Finally, we transfer our method to 3D object detection, and it achieves consistent improvements over the one-stage detection baseline on KITTI.
Point-Voxel CNN for Efficient 3D Deep Learning
Zhijian Liu*,
Haotian Tang*,
Yujun Lin,
and Song Han
NeurIPS (Spotlight)
2019
[Abs]
[arXiv]
[Website]
[Code]
We present Point-Voxel CNN (PVCNN) for efficient, fast 3D deep learning. Previous work processes 3D data using either voxel-based or point-based NN models. However, both approaches are computationally inefficient. The computation cost and memory footprint of voxel-based models grow cubically with the input resolution, making it memory-prohibitive to scale up the resolution. As for point-based networks, up to 80% of the time is wasted on structuring the sparse data, which has rather poor memory locality, rather than on the actual feature extraction. In this paper, we propose PVCNN, which represents the 3D input data as points to reduce the memory consumption, while performing the convolutions in voxels to reduce the irregular, sparse data access and improve locality. Our PVCNN model is both memory- and computation-efficient. Evaluated on semantic and part segmentation datasets, it achieves a much higher accuracy than the voxel-based baseline with 10x GPU memory reduction; it also outperforms the state-of-the-art point-based models with 7x measured speedup on average. Remarkably, the narrower version of PVCNN achieves 2x speedup over PointNet (an extremely efficient model) on part and scene segmentation benchmarks with much higher accuracy. We validate the general effectiveness of PVCNN on 3D object detection: by replacing the primitives in Frustum PointNet with PVConv, it outperforms Frustum PointNet++ by up to 2.4 mAP with 1.8x measured speedup and 1.4x GPU memory reduction.
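To illustrate the two-branch dataflow, here is a minimal numpy sketch of a PVConv-style layer: a coarse voxel branch captures neighborhood context cheaply, while a per-point branch keeps fine detail, and the two are fused by addition. The real PVConv uses learned 3D convolutions and trilinear devoxelization; this nearest-voxel version with illustrative names only shows the dataflow.

```python
# Illustrative sketch of the point-voxel dataflow -- not the PVCNN code.
import numpy as np

def pvconv(points, features, w_point, resolution=16):
    """points: (N, 3) in [0, 1); features: (N, C); w_point: (C, C)."""
    n, c = features.shape
    # Voxel branch: average features per voxel (voxelize).
    idx = np.clip((points * resolution).astype(int), 0, resolution - 1)
    flat = (idx[:, 0] * resolution + idx[:, 1]) * resolution + idx[:, 2]
    voxels = np.zeros((resolution ** 3, c), dtype=features.dtype)
    counts = np.zeros(resolution ** 3, dtype=np.int64)
    np.add.at(voxels, flat, features)
    np.add.at(counts, flat, 1)
    voxels /= np.maximum(counts, 1)[:, None]
    # (A learned 3D convolution over `voxels` would go here; omitted.)
    neighborhood = voxels[flat]                   # devoxelize: nearest-voxel lookup
    # Point branch: per-point MLP preserves fine detail.
    individual = np.maximum(features @ w_point, 0)
    return neighborhood + individual              # fuse the two branches

pts = np.random.rand(1000, 3).astype(np.float32)
feats = np.random.randn(1000, 32).astype(np.float32)
out = pvconv(pts, feats, np.random.randn(32, 32).astype(np.float32))
```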
Service
I regularly serve as a reviewer for ICML (outstanding reviewer, 2022), NeurIPS (top reviewer, 2022 and 2023), ICLR (highlighted reviewer, 2022), TPAMI, IJCV, CVPR, and ICCV.