About me
My research interests are broadly in distributed systems. I'm currently a Research Scientist on ByteDance's AI Networking team, working on Systems for ML with a major focus on LLM reliability. My experience spans distributed systems, cloud systems, and AI infrastructure.
- Reliability and observability: distributed tracing, PL for systems towards root cause analysis
- Distributed caching: heterogeneous memory management, CDN caching, performance quantification
- Systems for ML: reliability, observability, performance analysis, RDMA-based AI infrastructure for LLMs, collective communication libraries
My research has received an ACM SIGMETRICS Kenneth C. Sevcik Outstanding Student Paper Award. I was a postdoctoral researcher at Princeton University, working with Prof. Ravi Netravali. Before that, I received my Ph.D. from Emory University in 2021 (working with Prof. Ymir Vigfusson), my Master's from Georgia Tech in 2017 (where I was in the Ph.D. program and worked with Prof. Karsten Schwan), and my Bachelor's from Tsinghua University in 2015. I transferred to Emory in 2018 as a post-qualified Ph.D. student.
Services & Awards
Publications
Thesis:
Measurement and Analysis Methods of Performance Problems in Distributed Systems
Emory University
  Thesis
Papers:
Minder: Faulty Machine Detection for Large-scale Distributed Model Training
Yangtao Deng, Xiang Shi, Zhuo Jiang, Xingjian Zhang, Lei Zhang, Zhang Zhang, Bo Li, Zuquan Song, Hang Zhu, Gaohong Liu, Fuliang Li, Shuguang Wang, Haibin Lin, Jianxi Ye, Minlan Yu
In USENIX NSDI 2025
  Paper
 
  Slides
Yazhuo Zhang, Rebecca Isaacs, Yao Yue, Juncheng Yang, Lei Zhang, Ymir Vigfusson
In ACM SoCC 2023
  Paper
 
  Code
Lei Zhang, Zhiqiang Xie, Vaastav Anand, Ymir Vigfusson, Jonathan Mace
In USENIX NSDI 2023
  Paper
 
  Slides
 
  Code
 
  Benchmark Code
 
  Video
Lei Zhang, Juncheng Yang, Anna Blasiak, Mike McCall, Ymir Vigfusson
In USENIX HotCloud 2020
  Paper
 
  Slides
 
  Video
Lei Zhang, Reza Karimi, Irfan Ahmad, Ymir Vigfusson
In ACM SIGMETRICS 2020
  Paper
 
  Video
  Kenneth C. Sevcik Outstanding Student Paper Award
Lei Zhang, Douglas Blough
In IEEE International Conference on Dependable Systems and Networks (DSN) 2018
  Paper
Maomeng Su, Lei Zhang, Yongwei Wu, Kang Chen, Keqin Li
In IEEE Transactions on Computers 2016, 65(6): 1964-1977.
  Paper
Under Review:
Tracing Dependencies in Collective Communication Towards Reliable LLM Training
Towards Bandwidth-adaptive Live Volumetric Video Conferencing
A Lightweight Telemetry System with Service Tracing for Locating Network Slowdowns
Automatic Instrumentation for Fine-grained Observability in Distributed Systems