Sarit Khirirat received the Bachelor's degree in Electrical Engineering from Chulalongkorn University, Thailand, in 2013; the Master's degree in Systems, Control and Robotics from KTH Royal Institute of Technology, Sweden, in 2017; and the PhD degree from the Division of Decision and Control Systems at the same institution in 2022. During his PhD, he did research on optimization for communication-efficient applications under the supervision of Professor Mikael Johansson. This research was funded by the Wallenberg AI, Autonomous Systems and Software Program (WASP), Sweden's largest individual research funding program. He is now a postdoctoral researcher at King Abdullah University of Science and Technology (KAUST), advised by Professor Peter Richtárik.
Interests
Numerical Optimization
Machine Learning
Federated Learning
Energy-aware IoT Systems
Education
PhD at the Division of Decision and Control Systems, EECS, 2022
KTH Royal Institute of Technology
MSc in Systems, Control and Robotics, 2017
KTH Royal Institute of Technology
BSc in Electrical Engineering, 2013
Chulalongkorn University
Experience
Summer Intern
Yokogawa, Thailand, Ltd.
Jun 2012 –
Jul 2012
Bangkok, Thailand
Distributed control and automation systems for large-scale chemical processes
Programmed with CENTUM VP, PLC, SCADA and AutoCAD
Installed communications networks and power lines of distributed control systems
Inspected electrical and field instrument panels, and their blueprints
Noisy gradient algorithms have emerged as some of the most popular methods for distributed optimization with massive data. Choosing proper step-size schedules is important for tuning these algorithms to achieve good performance. For the algorithms to attain fast convergence and high accuracy, it is intuitive to use large step-sizes in the initial iterations, when the gradient noise is typically small compared to the algorithm steps, and to reduce the step-sizes as the algorithm progresses. This intuition has been confirmed in theory and practice for stochastic gradient descent. However, similar results are lacking for other methods using approximate gradients. This paper shows that diminishing step-size strategies can indeed be applied to a broad class of noisy gradient algorithms. Our analysis framework is based on two classes of systems that characterize the impact of the step-sizes on the convergence performance of many algorithms. Our results show that such step-size schedules enable these algorithms to enjoy the optimal rate. We exemplify our results on stochastic compression algorithms. Our experiments validate the fast convergence of these algorithms with step-decay schedules.
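As a rough illustration of the step-decay idea described in this abstract, the Python sketch below runs a generic noisy-gradient method and cuts the step-size by a constant factor at regular intervals. The gradient oracle, decay factor, and epoch length are illustrative assumptions, not the tuned schedule analyzed in the paper.

```python
import numpy as np

def step_decay_sgd(grad_oracle, x0, gamma0=0.005, decay=0.5, epoch_len=100, n_iters=1000):
    """Gradient descent with a noisy gradient oracle and a step-decay schedule.

    grad_oracle(x) returns a noisy or approximate gradient at x.  The step-size
    starts at gamma0 and is cut by `decay` every `epoch_len` iterations
    (illustrative values, not the tuned schedule from the paper).
    """
    x = np.asarray(x0, dtype=float)
    gamma = gamma0
    for t in range(n_iters):
        if t > 0 and t % epoch_len == 0:
            gamma *= decay  # shrink the step-size once noise starts to dominate
        x = x - gamma * grad_oracle(x)
    return x

# Example: least squares with additive gradient noise (a stand-in for compression noise)
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
noisy_grad = lambda x: A.T @ (A @ x - b) + 0.1 * rng.standard_normal(10)
x_hat = step_decay_sgd(noisy_grad, np.zeros(10))
```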
Innovations in numerical optimization, statistics and high performance computing have enabled tremendous advances in machine learning algorithms, fuelling applications from natural language processing to autonomous driving. To deal with increasing data volumes, and to keep the training times of increasingly complex machine learning models reasonable, modern optimization algorithms distribute both data and computations over a large number of machines. However, these algorithms face challenges, from hardware and data heterogeneity (as different machines have different processors and data) to privacy and security issues (where data can be extracted from the transmitted parameters). Among these challenges, communication overhead constitutes a major performance bottleneck of the algorithms. Communicating millions of problem parameters between machines has been reported to consume up to 80% of the training time. To alleviate the communication bottleneck, we develop theory and strategies in this thesis to design communication-efficient optimization algorithms.
In the first part, we provide unified analysis frameworks for first-order algorithms using direct or error-compensated compression. We first identify key characteristics of the compression errors induced by many of the most popular compression strategies in the literature. We then perform a unified analysis of the convergence properties of first-order algorithms for general families of deterministic and stochastic gradient compression algorithms. Our results give explicit expressions for how compression accuracy and the amount of asynchrony affect the step-sizes and guaranteed convergence times. We next turn our attention to error-compensated compression algorithms. We develop theoretical explanations for why error-compensated compression algorithms attain solutions with arbitrarily high accuracy, in contrast to direct compression algorithms. Our results provide strong convergence guarantees for error-compensated compression algorithms on distributed and federated learning problems.
In the second part, we provide flexible tuning frameworks to optimize the convergence performance of compression algorithms for a variety of system architectures. We start by analyzing a data-dependent complexity measure that explains why direct compression algorithms are more communication-efficient than full-precision algorithms in practice. This complexity measure leads to automatic tuning strategies that enable popular compression algorithms on different communication networks to maximize both the convergence progress towards the solution and the communication efficiency. We then turn our attention to diminishing step-size schedules that maximize the convergence speed of algorithms using noisy gradients. Our analysis framework is based on two classes of systems that characterize the impact of the step-sizes on the speed of noisy gradient algorithms. Our results show that such step-size schedules enable these algorithms to enjoy the optimal rate.
Applications of the algorithms in the thesis to central machine learning problems on benchmark data validate our theoretical results.
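For concreteness, two standard ways of modeling the compressors discussed in the abstract above are sketched below; these contraction- and variance-type conditions are common in the compression literature, and the exact assumptions used in the thesis may differ.

```latex
% Two common models of a compressor C : \mathbb{R}^d \to \mathbb{R}^d
% (standard in the compression literature; the thesis' exact conditions may differ).
\begin{align}
  % (i) contractive compressors, e.g. deterministic top-k with \delta = k/d:
  \mathbb{E}\,\|C(x) - x\|^2 &\le (1-\delta)\,\|x\|^2, \qquad \delta \in (0,1], \\
  % (ii) unbiased randomized quantizers with relative variance \omega \ge 0:
  \mathbb{E}[C(x)] = x, \quad \mathbb{E}\,\|C(x) - x\|^2 &\le \omega\,\|x\|^2 .
\end{align}
```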
Federated learning is an emerging framework for collaborative machine learning on devices that do not share their local data. State-of-the-art methods in federated learning reduce the communication frequency, but are not guaranteed to converge to the optimal model parameters. These methods also experience a communication bottleneck, especially when the devices are power-constrained and communicate over a shared medium. This paper presents ECO-FedSplit, an algorithm that increases the communication efficiency of federated learning without sacrificing solution accuracy. The key is to compress inter-device communication and to compensate for information losses in a theoretically justified manner. We prove strong convergence properties of ECO-FedSplit on strongly convex optimization problems and show that the algorithm yields a highly accurate solution with dramatically reduced communication. Extensive numerical experiments validate our theoretical results on real data sets.
Zeroth-order methods have become important tools for solving problems where we have access only to function evaluations. However, zeroth-order methods that rely only on gradient approximations are n times slower than classical first-order methods on n-dimensional problems. To accelerate the convergence rate, this paper proposes the zeroth-order randomized subspace Newton (ZO-RSN) method, which estimates projections of the gradient and Hessian by random sketching and finite differences. This allows us to compute the Newton step in a lower-dimensional subspace at a small computational cost. We prove that ZO-RSN can attain lower iteration complexity than existing zeroth-order methods for strongly convex problems. Our numerical experiments show that ZO-RSN can perform black-box attacks under a more restrictive limit on the number of function queries than the state-of-the-art Hessian-aware zeroth-order method.
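The sketch below illustrates the general idea stated in the abstract: estimate a sketched gradient and Hessian from function values via finite differences, and take a Newton-type step restricted to the random subspace. The sampling distribution of the sketch, the difference step h, and the damping term are illustrative assumptions, not the exact ZO-RSN updates from the paper.

```python
import numpy as np

def zo_subspace_newton_step(f, x, sketch_dim=5, h=1e-4, damping=1e-6, rng=None):
    """One zeroth-order subspace Newton-type step (illustrative sketch).

    Estimates the sketched gradient S^T grad f(x) and sketched Hessian
    S^T Hess f(x) S from function values via finite differences along the
    columns of a random sketching matrix S, then takes a Newton step
    restricted to the subspace spanned by S.
    """
    rng = rng or np.random.default_rng()
    n = x.size
    S = rng.standard_normal((n, sketch_dim)) / np.sqrt(sketch_dim)
    f0 = f(x)
    # central differences approximate the directional derivatives S^T grad f(x)
    g = np.array([(f(x + h * S[:, i]) - f(x - h * S[:, i])) / (2 * h)
                  for i in range(sketch_dim)])
    # second-order differences approximate the sketched Hessian S^T Hess f(x) S
    H = np.empty((sketch_dim, sketch_dim))
    for i in range(sketch_dim):
        for j in range(sketch_dim):
            H[i, j] = (f(x + h * (S[:, i] + S[:, j])) - f(x + h * S[:, i])
                       - f(x + h * S[:, j]) + f0) / h**2
    H = 0.5 * (H + H.T) + damping * np.eye(sketch_dim)  # symmetrize and regularize
    return x - S @ np.linalg.solve(H, g)                # Newton step in the subspace

# Example: minimize a strongly convex quadratic using function values only
Q = np.diag(np.linspace(1.0, 10.0, 20))
f = lambda x: 0.5 * x @ Q @ x
x, rng = np.ones(20), np.random.default_rng(1)
for _ in range(30):
    x = zo_subspace_newton_step(f, x, rng=rng)
```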
With the increasing scale of machine learning tasks, it has become essential to reduce the communication between computing nodes. Early work on gradient compression focused on the bottleneck between CPUs and GPUs, but communication efficiency is now needed in a variety of different system architectures, from high-performance clusters to energy-constrained IoT devices. In current practice, compression levels are typically chosen before training, and settings that work well for one task may be vastly suboptimal for another dataset on another architecture. In this paper, we propose a flexible framework which adapts the compression level to the true gradient at each iteration, maximizing the improvement in the objective function that is achieved per communicated bit. Our framework is easy to adapt from one technology to the next by modeling how the communication cost depends on the compression level for the specific technology. Theoretical results and practical experiments indicate that the automatic tuning strategies significantly increase communication efficiency on several state-of-the-art compression schemes.
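As a toy illustration of the per-iteration idea above, the sketch below picks a top-k sparsification level that maximizes an estimated objective improvement per communicated bit. Both the improvement estimate and the bit-cost model are illustrative assumptions, not the framework's actual gain and cost functions.

```python
import numpy as np

def choose_topk_level(grad, gamma, bits_per_entry=32, overhead_bits=64):
    """Pick a top-k sparsification level that maximizes descent per communicated bit.

    For a gradient step with step-size gamma, sending top_k(grad) is credited with
    an estimated objective improvement proportional to gamma * ||top_k(grad)||^2,
    while communication is modeled as overhead_bits plus k * (value bits + index bits).
    Both the gain estimate and the cost model are illustrative assumptions.
    """
    d = grad.size
    sorted_sq = np.sort(grad**2)[::-1]        # largest squared entries first
    cum_gain = gamma * np.cumsum(sorted_sq)   # gain of sending the k largest entries
    index_bits = np.ceil(np.log2(d))          # bits to encode each kept index
    ks = np.arange(1, d + 1)
    cost = overhead_bits + ks * (bits_per_entry + index_bits)
    return int(ks[np.argmax(cum_gain / cost)])

# Example: gradients whose entries decay rapidly favor aggressive sparsification
rng = np.random.default_rng(0)
g = rng.standard_normal(1000) * np.exp(-np.linspace(0.0, 5.0, 1000))
k_star = choose_topk_level(g, gamma=0.1)
```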
The emergence of big data has caused a dramatic shift in the operating regime for optimization algorithms. The performance bottleneck, which used to be computations, is now often communications. Several gradient compression techniques have been proposed to reduce the communication load at the price of a loss in solution accuracy. Recently, it has been shown how compression errors can be compensated for in the optimization algorithm to improve the solution accuracy. Even though convergence guarantees for error-compensated algorithms have been established, there is very limited theoretical support for quantifying the observed improvements in solution accuracy. In this paper, we show that Hessian-aided error compensation, unlike other existing schemes, avoids accumulation of compression errors on quadratic problems. We also present strong convergence guarantees of Hessian-based error compensation for stochastic gradient descent. Our numerical experiments highlight the benefits of Hessian-based error compensation, and demonstrate that similar convergence improvements are attained when only a diagonal Hessian approximation is used.
Technological developments in devices and storage have made large volumes of data more accessible than ever. This transformation leads to optimization problems with data that are massive in both volume and dimension. In response to this trend, optimization on high-performance computing architectures has become more popular than ever. These scalable optimization solvers can achieve high efficiency by splitting computational loads among multiple machines, but they also incur large communication overhead. To solve optimization problems with millions of parameters, communication between machines has been reported to consume up to 80% of the training time. To alleviate this communication bottleneck, many optimization algorithms with data compression techniques have been studied. In practice, they have been reported to significantly reduce communication costs while exhibiting convergence almost comparable to that of the full-precision algorithms. To explain these observations, we develop theory and techniques in this thesis to design communication-efficient optimization algorithms.
In the first part, we analyze the convergence of optimization algorithms with direct compression. First, we outline definitions of compression techniques that cover many compressors of practical interest. Then, we provide a unified analysis framework for optimization algorithms with compressors that can be either deterministic or randomized. In particular, we show how the tuning parameters of compressed optimization algorithms must be chosen to guarantee performance. Our results show an explicit dependence on the compression accuracy and on the delay effects caused by asynchrony. This allows us to characterize the trade-off between iteration and communication complexity under gradient compression.
In the second part, we study how error compensation schemes can improve the performance of compressed optimization algorithms. Even though convergence guarantees for optimization algorithms with error compensation have been established, there is very limited theoretical support for the improvements in solution accuracy observed in practice. We therefore develop theoretical explanations showing that error compensation guarantees arbitrarily high solution accuracy from compressed information. In particular, error compensation helps remove accumulated compression errors, thus improving solution accuracy especially for ill-conditioned problems. We also provide a strong convergence analysis of error compensation for parallel stochastic gradient descent across multiple machines. In particular, the error-compensated algorithms, unlike direct compression, achieve a significant reduction of the accumulated compression error.
Applications of the algorithms in this thesis to real-world problems with benchmark data sets validate our theoretical results.
The sheer scale of modern data necessitates information compression in parallel/distributed big-data optimization. Compression schemes using memory-based error compensation have displayed superior performance in practice; however, to date there are no theoretical explanations for these observed advantages. This paper provides the first theoretical support for why such compression schemes yield higher-accuracy solutions in optimization. Our results cover both gradient and incremental gradient algorithms for quadratic optimization. Unlike previous works, our theoretical results explicitly quantify the accuracy gains from error compensation, especially for ill-conditioned problems. Finally, numerical results on linear least-squares problems validate the benefit of error compensation and demonstrate the tightness of our convergence guarantees.
Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace. Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed. To date, gradient sparsification methods, in which each node sorts its gradient components by magnitude and only communicates a subset of them while accumulating the rest locally, are known to yield some of the largest practical gains. Such methods can reduce the amount of communication per step by up to three orders of magnitude, while preserving model accuracy. Yet, this family of methods currently has no theoretical justification. This is the question we address in this paper. We prove that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD. The main insight is that sparsification methods implicitly maintain bounds on the maximum impact of stale updates, thanks to selection by magnitude. Our analysis and empirical validation also reveal that these methods do require analytical conditions to converge well, justifying existing heuristics.
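A minimal sketch of the magnitude-based sparsification with local error accumulation described above is given below; the class interface, dimension, and sparsity level are illustrative choices rather than the exact algorithm configuration analyzed in the paper.

```python
import numpy as np

class TopKWithMemory:
    """Magnitude-based gradient sparsification with local error accumulation.

    Each node adds its accumulated residual to the fresh gradient, transmits only
    the k largest-magnitude components of the sum, and keeps the untransmitted
    remainder in memory for later iterations.
    """

    def __init__(self, dim, k):
        self.memory = np.zeros(dim)
        self.k = k

    def compress(self, grad):
        corrected = grad + self.memory                # add back what was left out before
        idx = np.argpartition(np.abs(corrected), -self.k)[-self.k:]
        sparse = np.zeros_like(corrected)
        sparse[idx] = corrected[idx]                  # keep only the k largest components
        self.memory = corrected - sparse              # accumulate the rest locally
        return sparse

# Example: one worker's compressed message inside data-parallel SGD
rng = np.random.default_rng(0)
compressor = TopKWithMemory(dim=10_000, k=100)        # ~1% of the entries communicated
gradient = rng.standard_normal(10_000)
message = compressor.compress(gradient)               # sparse vector the node would send
```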
Our article titled “Improved Step-Size Schedules for Proximal Noisy Methods” has been accepted to the IEEE TSP journal in 2023.
Abstract: Noisy gradient algorithms have emerged as some of the most popular methods for distributed optimization with massive data. Choosing proper step-size schedules is important for tuning these algorithms to achieve good performance. For the algorithms to attain fast convergence and high accuracy, it is intuitive to use large step-sizes in the initial iterations, when the gradient noise is typically small compared to the algorithm steps, and to reduce the step-sizes as the algorithm progresses.
Our article titled “Improved Step-size Schedules for Noisy Gradient Methods” has been accepted to the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) in 2021.
Abstract: Noise is inherent in many optimization methods, such as stochastic gradient methods, zeroth-order methods and compressed gradient methods. For such methods to converge toward a global optimum, it is intuitive to use large step-sizes in the initial iterations, when the noise is typically small compared to the algorithm steps, and to reduce the step-sizes as the algorithm progresses.
Our article titled “A Flexible Framework for Communication-Efficient Machine Learning” has been accepted (in the poster session) to the 35th AAAI conference in 2021.
Abstract: With the increasing scale of machine learning tasks, it has become essential to reduce the communication between computing nodes. Early work on gradient compression focused on the bottleneck between CPUs and GPUs, but communication-efficiency is now needed in a variety of different system architectures, from high-performance clusters to energy-constrained IoT devices.