Yueming Hao (郝岳明)

I am a Research Scientist at Meta on the Triton Accel team. I received my Ph.D. from North Carolina State University in 2024, advised by Prof. Xu Liu.

I'm interested in developing tools for performance analysis and optimization of GPGPU applications.

Education

North Carolina State University, US

Doctor of Philosophy, Computer Science, Advisor: Prof. Xu Liu

August 2020 - May 2024

College of William and Mary, US

Ph.D. program, Computer Science, Advisor: Prof. Xu Liu
Transferred to NCSU with Prof. Xu Liu

August 2019 - August 2020

Shandong University, China

M.S., Computer Science and Technology, Advisor: Prof. Lei Ju

August 2016 - June 2019

Shandong University, China

B.S., Computer Science and Technology

August 2012 - June 2016

Publications

[SC 2025] "RedSan: Redundant Memory Instruction Sanitizer for GPU Programs",
Yanbo Zhao, Yueming Hao, Zecheng Li, Shuyin Jiao, Xu Liu, Jiajia Li
The International Conference for High Performance Computing, Networking, Storage, and Analysis.
[ASPLOS 2025] "DeepContext: A Context-aware, Cross-platform, and Cross-framework Tool for Performance Profiling and Analysis of Deep Learning Workloads",
Qidong Zhao, Hao Wu, Yueming Hao, Zilingfeng Ye, Jiajia Li, Xu Liu, Keren Zhou
Architectural Support for Programming Languages and Operating Systems.
[CGO 2024] "DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications",
Jinku Cui, Qidong Zhao, Yueming Hao, Xu Liu
International Symposium on Code Generation and Optimization.
[arXiv] "TorchBench: Benchmarking PyTorch with High API Surface Coverage",
Yueming Hao, Xu Zhao, Bin Bao, David Berard, Will Constable, Adnan Aziz, Xu Liu
[ICPE 2023] "DrGPU: A Top-Down Profiler for GPU Applications",
Yueming Hao, Nikhil Jain, Rob Van der Wijngaart, Nirmal Saxena, Yuanbo Fan, Xu Liu
The International Conference on Performance Engineering.
Best Paper Finalist (Runner-Up Award)
[ASPLOS 2022] "ValueExpert: Exploring Value Patterns in GPU-accelerated Applications",
Keren Zhou, Yueming Hao*, John Mellor-Crummey, Xiaozhu Meng, Xu Liu
Architectural Support for Programming Languages and Operating Systems.
(Keren and Yueming are co-first authors.)
Distinguished Artifact Award
[SC 2020] "GVPROF: A Value Profiler for GPU-based Clusters",
Keren Zhou, Yueming Hao, John Mellor-Crummey, Xiaozhu Meng, Xu Liu
The International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 15-20, 2020, Atlanta, GA, USA.

Experience

Meta, California

Research Scientist, Triton Accel (since May 2025)
Software Engineer, PyTorch Compiler (May 2024 – April 2025)

May 2024 – Present

Research Scientist Intern, Full-time & Part-time, Meta, California

Intern Mentor: Bin Bao, PyTorch Compiler

May 2023 - December 2023

Research Scientist Intern, Full-time & Part-time, Meta, California

Intern Mentor: Xu Zhao, PyTorch Perf Infra

May 2022 - December 2022

Research Intern, NVIDIA, California

Intern Mentor: Nikhil Jain, HPC Arch

May 2020 - August 2020

Projects

TritonParse

TritonParse is a comprehensive visualization and analysis tool for Triton kernels, designed to help developers analyze, debug, and understand Triton kernel compilation processes. It features an interactive kernel explorer with multi-format IR support (TTGIR, TTIR, LLIR, PTX, AMDGCN), side-by-side code comparison with synchronized highlighting, and structured logging for compilation tracing. The tool includes a React-based web interface with Monaco Editor integration and provides detailed source mapping extraction utilities. TritonParse helps visualize the entire Triton compilation pipeline from Python source to GPU assembly, making it invaluable for GPU kernel optimization and debugging.

TorchBench

We developed TorchBench, a novel benchmark suite for studying the performance of the PyTorch software stack. Unlike existing benchmark suites, TorchBench includes many representative models that together cover a large PyTorch API surface. It comprehensively characterizes the performance of the PyTorch software stack, guiding performance optimization across models, the PyTorch framework, and GPU libraries. We show two practical use cases built on TorchBench. (1) We profile TorchBench to identify GPU performance inefficiencies in PyTorch; we fixed many performance bugs and upstreamed the patches to the official PyTorch repository. (2) We integrate TorchBench into the PyTorch continuous integration system, where it identifies performance regressions in daily code check-ins and keeps performance bugs out of the PyTorch repository.

GVProf & ValueExpert

We implemented GVProf, the first value profiler that locates value redundancy problems in applications running on GPU-based clusters. Our experiments show that GVProf incurs acceptable overhead and scales to large executions. GVProf provides useful insights to guide performance optimization. Under the guidance of GVProf, we optimized several HPC and machine learning workloads.

DrGPU

DrGPU is a top-down profiler for GPU applications. More specifically, it is a trace analyzer for CUDA kernels that identifies performance bottlenecks and suggests optimizations.

Awards

2023 Best Paper Finalist, ICPE
2022 Distinguished Artifact Award, ASPLOS
2021 Runner Up, A-HUG Cloud HPC Hackathon
2021 Summer Graduate Merit Award, NCSU
2015 First Prize, China Undergraduate Mathematical Contest in Modeling

Professional Service

Conference Observer: ICPP 2020
Conference Reviewer: IPDPS 2023, ICPE 2023
Artifact Evaluation Committee: ASPLOS 2024, CGO 2024–2026, MICRO 2023, PPoPP 2021–2024, SC 2024
Web Chair: LCTES 2021
Journal Reviewer: TECS 2021, TPDS 2024, TACO 2025