First-Order Algorithms for Communication Efficient Distributed Learning

Abstract

Innovations in numerical optimization, statistics and high-performance computing have enabled tremendous advances in machine learning algorithms, fuelling applications from natural language processing to autonomous driving. To deal with increasing data volumes, and to keep the training times of increasingly complex machine learning models reasonable, modern optimization algorithms distribute both data and computations over a large number of machines. However, these algorithms face challenges, from hardware and data heterogeneity (as different machines have different processors and data) to privacy and security issues (where data can be extracted from the transmitted parameters). Among these challenges, communication overhead constitutes a major performance bottleneck: communicating millions of problem parameters between machines has been reported to consume up to 80% of the training time. To alleviate this communication bottleneck, this thesis develops theory and strategies for designing communication-efficient optimization algorithms. In the first part, we provide unified analysis frameworks for first-order algorithms using direct or error-compensated compression. We first identify key characteristics of the compression errors induced by many of the most popular compression strategies in the literature. We then perform a unified analysis of the convergence properties of first-order algorithms for general families of deterministic and stochastic gradient compression algorithms. Our results give explicit expressions for how compression accuracy and the amount of asynchrony affect the step-sizes and guaranteed convergence times.
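As a concrete illustration of direct gradient compression, a minimal sketch of the widely used top-k sparsification operator is shown below; the function name and the toy gradient are illustrative assumptions, not the specific operators analyzed in the thesis.

```python
import numpy as np

def top_k(grad, k):
    """Keep the k largest-magnitude entries of the gradient, zero the rest.

    A common direct compression operator: only k values (plus their
    indices) need to be communicated instead of the full vector.
    """
    compressed = np.zeros_like(grad)
    idx = np.argsort(np.abs(grad))[-k:]  # indices of the k largest magnitudes
    compressed[idx] = grad[idx]
    return compressed

# Toy example: compress a 6-dimensional gradient down to 2 coordinates.
g = np.array([0.1, -2.0, 0.3, 1.5, -0.05, 0.2])
print(top_k(g, 2))  # only the entries -2.0 and 1.5 survive
```

Dropping all but k coordinates introduces a compression error each round, which is exactly the quantity whose characteristics the unified analysis tracks.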
We next turn our attention to error-compensated compression algorithms. We develop theoretical explanations for why error-compensated compression algorithms attain solutions of arbitrarily high accuracy, in contrast to direct compression algorithms. Our results provide strong convergence guarantees of error-compensated compression algorithms for distributed and federated learning problems. In the second part, we provide flexible tuning frameworks to optimize the convergence performance of compression algorithms for a variety of system architectures. We start by analyzing a data-dependent complexity measure that explains why direct compression algorithms are more communication-efficient than full-precision algorithms in practice. This complexity measure leads to automatic tuning strategies that enable popular compression algorithms on different communication networks to maximize both the convergence progress towards the solution and the communication efficiency. We then turn our attention to diminishing step-size schedules that maximize the convergence speed of algorithms using noisy gradients. Our analysis framework is based on two classes of systems that characterize the impact of the step-sizes on the speed of noisy gradient algorithms. Our results show that such step-size schedules enable these algorithms to enjoy the optimal rate. Applications of the algorithms in the thesis to central machine learning problems on benchmark data validate our theoretical results.
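The error-compensation mechanism can be sketched in a few lines: the part of the gradient that compression discards is kept in a local memory and re-injected before the next compression, so no information is permanently lost. This is a minimal single-worker sketch under assumed details (top-k as the compressor, a fixed learning rate, and the function names), not the specific algorithms analyzed in the thesis.

```python
import numpy as np

def top_k(grad, k):
    """Direct top-k sparsification (the compression operator)."""
    out = np.zeros_like(grad)
    idx = np.argsort(np.abs(grad))[-k:]
    out[idx] = grad[idx]
    return out

def error_compensated_step(x, grad, memory, lr=0.1, k=2):
    """One step of error-compensated compressed gradient descent.

    The compression error from each round is accumulated in `memory`
    and added back before the next compression.
    """
    corrected = grad + memory          # re-inject the accumulated error
    compressed = top_k(corrected, k)   # what would actually be transmitted
    memory = corrected - compressed    # remember what was dropped
    x = x - lr * compressed            # apply the (compressed) update
    return x, memory
```

By construction, `compressed + memory` always equals the error-corrected gradient, which is the accounting identity behind the stronger accuracy guarantees compared with direct compression.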

Publication
PhD Thesis. Stockholm: KTH Royal Institute of Technology, 2022, p. 233.