About me

My research interests are broadly in distributed systems. I am currently a Research Scientist on ByteDance's AI Networking team, working on systems for ML with a major focus on LLM reliability. My experience spans distributed systems, cloud systems, and AI infrastructure.

My research has received the ACM SIGMETRICS Kenneth C. Sevcik Outstanding Student Paper Award. I was a postdoctoral researcher at Princeton University, working with Prof. Ravi Netravali. Before that, I received my Ph.D. from Emory University in 2021 (working with Prof. Ymir Vigfusson), a Master's degree from Georgia Tech in 2017 (while in the Ph.D. program, working with the late Prof. Karsten Schwan), and a Bachelor's degree from Tsinghua University in 2015. I transferred to Emory in 2018 as a post-qualified Ph.D. student.

Experience

Research Scientist, ByteDance Inc.
2023-present

Postdoc, Princeton University
2022-2023

Research Assistant, Emory University
2018-2021

Teaching Assistant, Emory University
Fall 2020

CS 377: Database Systems

Ph.D. Intern, Facebook Inc.
Summer 2018

Research Assistant, Georgia Tech
2015-2018

Teaching Assistant, Georgia Tech
Fall 2016

CS 3210: Design of Operating Systems

Services & Awards

Program Committee
USENIX ATC'25

Program Committee
ACM EuroSys'25

External Program Committee
ACM SIGMETRICS'23

Program Committee
ACM SoCC'22, '23, '24

Kenneth C. Sevcik Outstanding Student Paper Award
ACM SIGMETRICS'20

Bronze medal
24th and 25th China Mathematical Olympiad

Publications

Thesis:

Measurement and Analysis Methods of Performance Problems in Distributed Systems
Emory University
  Thesis

Papers:

Minder: Faulty Machine Detection for Large-scale Distributed Model Training

Yangtao Deng, Xiang Shi, Zhuo Jiang, Xingjian Zhang, Lei Zhang, Zhang Zhang, Bo Li, Zuquan Song, Hang Zhu, Gaohong Liu, Fuliang Li, Shuguang Wang, Haibin Lin, Jianxi Ye, Minlan Yu
In USENIX NSDI 2025
  Paper     Slides

LatenSeer: Causal Modeling of End-to-End Latency Distributions by Harnessing Distributed Tracing

Yazhuo Zhang, Rebecca Isaacs, Yao Yue, Juncheng Yang, Lei Zhang, Ymir Vigfusson
In ACM SoCC 2023
  Paper     Code

The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems

Lei Zhang, Zhiqiang Xie, Vaastav Anand, Ymir Vigfusson, Jonathan Mace
In USENIX NSDI 2023
  Paper     Slides     Code     Benchmark Code     Video

When is the Cache Warm? Manufacturing a Rule of Thumb

Lei Zhang, Juncheng Yang, Anna Blasiak, Mike McCall, Ymir Vigfusson
In USENIX HotCloud 2020
  Paper     Slides     Video

Optimal Data Placement for Heterogeneous Cache, Memory, and Storage Systems

Lei Zhang, Reza Karimi, Irfan Ahmad, Ymir Vigfusson
In ACM SIGMETRICS 2020
  Paper     Video
  Kenneth C. Sevcik Outstanding Student Paper Award

Deceptive Secret Sharing

Lei Zhang, Douglas Blough
In IEEE International Conference on Dependable Systems and Networks (DSN) 2018
  Paper

Systematic Data Placement Optimization in Multi-Cloud Storage for Complex Requirements

Maomeng Su, Lei Zhang, Yongwei Wu, Kang Chen, Keqin Li
In IEEE Transactions on Computers 2016, 65(6): 1964-1977.
  Paper

Under Review:

Tracing Dependencies in Collective Communication Towards Reliable LLM Training

Towards Bandwidth-adaptive Live Volumetric Video Conferencing

A Lightweight Telemetry System with Service Tracing for Locating Network Slowdowns

Automatic Instrumentation for Fine-grained Observability in Distributed Systems