Yueming Hao (郝岳明)
I am a Research Scientist at Meta on the Triton Accel team. I received my Ph.D. from North Carolina State University in 2024, advised by Prof. Xu Liu.
I'm interested in developing tools for performance analysis and optimization of GPGPU applications.
Education
North Carolina State University, US
College of William and Mary, US
Transferred to NCSU following Prof. Xu Liu
Shandong University, China
Shandong University, China
Publications
- [SC 2025] "RedSan: Redundant Memory Instruction Sanitizer for GPU Programs",
Yanbo Zhao, Yueming Hao, Zecheng Li, Shuyin Jiao, Xu Liu, Jiajia Li
The International Conference for High Performance Computing, Networking, Storage, and Analysis.
- [ASPLOS 2025] "DeepContext: A Context-aware, Cross-platform, and Cross-framework Tool for Performance Profiling and Analysis of Deep Learning Workloads",
Qidong Zhao, Hao Wu, Yueming Hao, Zilingfeng Ye, Jiajia Li, Xu Liu, Keren Zhou
Architectural Support for Programming Languages and Operating Systems.
- [CGO 2024] "DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications",
Jinku Cui, Qidong Zhao, Yueming Hao, Xu Liu
International Symposium on Code Generation and Optimization.
- [arXiv] "TorchBench: Benchmarking PyTorch with High API Surface Coverage",
Yueming Hao, Xu Zhao, Bin Bao, David Berard, Will Constable, Adnan Aziz, Xu Liu
- [ICPE 2023] "DrGPU: A Top-Down Profiler for GPU Applications",
Yueming Hao, Nikhil Jain, Rob Van der Wijngaart, Nirmal Saxena, Yuanbo Fan, Xu Liu
The International Conference on Performance Engineering.
Best Paper Finalist (Runner-Up Award)
- [ASPLOS 2022] "ValueExpert: Exploring Value Patterns in GPU-accelerated Applications",
Keren Zhou, Yueming Hao*, John Mellor-Crummey, Xiaozhu Meng, Xu Liu
Architectural Support for Programming Languages and Operating Systems.
(Keren and Yueming are co-first authors.)
Distinguished Artifact Award
- [SC 2020] "GVPROF: A Value Profiler for GPU-based Clusters",
Keren Zhou, Yueming Hao, John Mellor-Crummey, Xiaozhu Meng, Xu Liu
The International Conference for High Performance Computing, Networking, Storage, and Analysis, Nov 15-20, 2020, Atlanta, GA, USA.
Experience
Meta, California
- Research Scientist, Triton Accel (since May 2025)
- Software Engineer, PyTorch Compiler (May 2024 – April 2025)
Research Scientist Intern, Full time & Part time, Meta, California
Research Scientist Intern, Full time & Part time, Meta, California
Research Intern, NVIDIA, California
Projects
TritonParse is a comprehensive visualization and analysis tool for Triton kernels, designed to help developers analyze, debug, and understand Triton kernel compilation processes. It features an interactive kernel explorer with multi-format IR support (TTGIR, TTIR, LLIR, PTX, AMDGCN), side-by-side code comparison with synchronized highlighting, and structured logging for compilation tracing. The tool includes a React-based web interface with Monaco Editor integration and provides detailed source mapping extraction utilities. TritonParse helps visualize the entire Triton compilation pipeline from Python source to GPU assembly, making it invaluable for GPU kernel optimization and debugging.
We developed TorchBench, a novel benchmark suite for studying the performance of the PyTorch software stack. Unlike existing benchmark suites, TorchBench includes many representative models that together cover a large portion of the PyTorch API surface. TorchBench can comprehensively characterize the performance of the PyTorch software stack, guiding performance optimization across models, the PyTorch framework, and GPU libraries. We show two practical use cases built on TorchBench. (1) We profile TorchBench to identify GPU performance inefficiencies in PyTorch. We fixed many performance bugs and upstreamed the patches to the official PyTorch repository. (2) We integrate TorchBench into the PyTorch continuous integration system. This lets us identify performance regressions in daily code check-ins and keep performance bugs out of the PyTorch repository.
We implemented GVProf, the first value profiler that locates value redundancy problems in applications running on GPU-based clusters. Our experiments show that GVProf incurs acceptable overhead and scales to large executions. GVProf provides useful insights to guide performance optimization. Under the guidance of GVProf, we optimized several HPC and machine learning workloads.
DrGPU is a top-down profiler for GPU applications. More specifically, it is a trace analyzer for CUDA kernels that identifies performance bottlenecks and suggests optimizations.
Awards
2023 Best Paper Finalist, ICPE
2022 Distinguished Artifact Award, ASPLOS
2021 Runner Up, A-HUG Cloud HPC Hackathon
2021 Summer Graduate Merit Award, NCSU
2015 First Prize, China Undergraduate Mathematical Contest in Modeling, China
Professional Service
Conference Observer: ICPP 2020
Conference Reviewer: IPDPS 2023, ICPE 2023
Artifact Evaluation Committee: ASPLOS 2024, CGO 2024–2026, MICRO 2023, PPoPP 2021–2024, SC 2024
Web Chair: LCTES 2021
Journal Reviewer: TECS 2021, TPDS 2024, TACO 2025