[2023.9] I'm joining ByteDance as a research scientist.
About me
I'm a Research Scientist at ByteDance, working on monitoring and troubleshooting large-scale RDMA and GPU datacenter networks, especially for LLMs. My research interests span distributed systems, network systems, cloud systems, and datacenter management. I have worked on specific areas such as:
- Improve observability and reliability for large-scale systems through distributed tracing and root cause analysis.
- Quantify, understand, and analyze performance for data placement in distributed cache/memory systems.
- Develop other large-scale applications, e.g., video streaming and secure storage systems.
- Manage and troubleshoot RDMA and GPU datacenter networks for LLMs.
My research received an ACM SIGMETRICS Kenneth C. Sevcik Outstanding Student Paper Award. I was a postdoctoral researcher at Princeton University, working with Prof. Ravi Netravali. Before that, I received my Ph.D. from Emory University in 2021 (working with Prof. Ymir Vigfusson), my Master's from Georgia Tech in 2017 (I was in the Ph.D. program, working with Prof. Karsten Schwan [memorial page]), and my Bachelor's from Tsinghua University in 2015. I transferred to Emory in 2018 as a post-qualified Ph.D. student.
Publications
Thesis:
Measurement and Analysis Methods of Performance Problems in Distributed Systems
Emory University
  Thesis
Papers:
LatenSeer: Causal Modeling of End-to-End Latency Distributions by Harnessing Distributed Tracing
Yazhuo Zhang, Rebecca Isaacs, Yao Yue, Juncheng Yang, Lei Zhang, Ymir Vigfusson
In ACM SoCC 2023
  Paper
 
Lei Zhang, Vaastav Anand, Zhiqiang Xie, Ymir Vigfusson, Jonathan Mace
In USENIX NSDI 2023
  Paper
 
  Slides
 
  Code
 
  Benchmark Code
Lei Zhang, Juncheng Yang, Anna Blasiak, Mike McCall, Ymir Vigfusson
In USENIX HotCloud 2020
  Paper
 
  Slides
 
  Video
Lei Zhang, Reza Karimi, Irfan Ahmad, Ymir Vigfusson
In ACM SIGMETRICS 2020
  Paper
 
  Video
  Kenneth C. Sevcik Outstanding Student Paper Award
Lei Zhang, Douglas Blough
In IEEE International Conference on Dependable Systems and Networks (DSN) 2018
  Paper
Maomeng Su, Lei Zhang, Yongwei Wu, Kang Chen, Keqin Li
In IEEE Transactions on Computers 2016, 65(6): 1964-1977.
  Paper