Journal Articles

Nonlinear acceleration of momentum and primal-dual algorithms. R. Bollapragada, D. Scieur, A. d’Aspremont. In Mathematical Programming (2022). We describe convergence acceleration schemes for multi-step optimization algorithms. The extrapolated solution is written as a nonlinear average of the iterates produced by the original optimization method. Our analysis does not need the underlying fixed-point operator to be symmetric, hence handles e.g. algorithms with momentum terms such as Nesterov’s accelerated method, or primal-dual methods. The weights are computed via a simple linear system, and we analyze performance in both online and offline modes. We use Crouzeix’s conjecture to show that acceleration performance is controlled by the solution of a Chebyshev problem on the numerical range of a nonsymmetric operator modeling the behavior of iterates near the optimum. Numerical experiments are detailed on logistic regression problems.

Regularized nonlinear acceleration. D. Scieur, A. d’Aspremont, F. Bach. In Mathematical Programming (2020). We describe a convergence acceleration technique for unconstrained optimization problems. Our scheme computes estimates of the optimum from a nonlinear average of the iterates produced by any optimization method. The weights in this average are computed via a simple linear system, whose solution can be updated online. This acceleration scheme runs in parallel to the base algorithm, providing improved estimates of the solution on the fly, while the original optimization method is running. Numerical experiments are detailed on classical classification problems.
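The extrapolation step described above lends itself to a compact numerical sketch. The following Python version is illustrative only: the weights come from a small regularized linear system built on differences of consecutive iterates, but the normalization and the tiny default regularization value are assumptions of this sketch, not the paper's reference implementation.

```python
import numpy as np

def regularized_nonlinear_acceleration(xs, lam=1e-10):
    """Extrapolate an optimum estimate from a list of iterates xs.

    Illustrative sketch: the weights c solve a small regularized linear
    system built from differences of consecutive iterates, and the
    estimate is the weighted average sum_i c[i] * xs[i].
    """
    X = np.array(xs, dtype=float)        # (k+1, d) stacked iterates
    R = np.diff(X, axis=0)               # residuals r_i = x_{i+1} - x_i
    RR = R @ R.T                         # small (k, k) Gram matrix
    RR = RR / np.linalg.norm(RR)         # normalize for conditioning
    k = RR.shape[0]
    c = np.linalg.solve(RR + lam * np.eye(k), np.ones(k))
    c = c / c.sum()                      # weights sum to one
    return c @ X[:-1]                    # nonlinear average of iterates
```

On gradient-descent iterates for a quadratic, the extrapolated point is typically much closer to the optimum than the last iterate, at the cost of one small linear solve.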

Application of Stochastic Dual Dynamic Programming to the Real-Time Dispatch of Storage under Renewable Supply Uncertainty. A. Papavasiliou, Y. Mou, L. Cambier, D. Scieur. In IEEE Transactions on Sustainable Energy (2017). This paper presents a multistage stochastic programming formulation of transmission-constrained economic dispatch subject to multi-area renewable production uncertainty, with a focus on optimizing the dispatch of storage in real-time operations. This problem is solved using stochastic dual dynamic programming. The applicability of the proposed approach is demonstrated on a realistic case study of the German power system calibrated against the solar and wind power integration levels of 2013-2014, with a 24-hour horizon and a 15-minute time step. The value of the stochastic solution relative to the cost of a deterministic policy amounts to 1.1%, while the value of perfect foresight relative to the cost of the stochastic programming policy amounts to 0.8%. The relative performance of various alternative real-time dispatch policies is analyzed, and the sensitivity of the results is explored.
Conference Articles

Super-acceleration with cyclical step-sizes. B. Goujaud, D. Scieur, A. Dieuleveut, A. B. Taylor, F. Pedregosa. In International Conference on Artificial Intelligence and Statistics (2022). We develop a convergence-rate analysis of momentum with cyclical step-sizes. We show that, under some assumptions on the spectral gap of Hessians in machine learning, cyclical step-sizes are provably faster than constant step-sizes. More precisely, we develop a convergence-rate analysis for quadratic objectives that provides optimal parameters and shows that cyclical learning rates can improve upon traditional lower complexity bounds. We further propose a systematic approach to designing optimal first-order methods for quadratic minimization with a given spectral structure. Finally, we provide a local convergence-rate analysis beyond quadratic minimization for the proposed methods and illustrate our findings through benchmarks on least squares and logistic regression problems.
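The spectral-gap intuition can be seen in a toy example. Below, plain gradient descent (not the paper's momentum method) with a two-step cycle of step sizes exactly annihilates a quadratic whose Hessian eigenvalues sit in two point clusters; the specific eigenvalues are illustrative assumptions of this sketch.

```python
import numpy as np

# Quadratic f(x) = 0.5 * sum(eigs * x**2) whose Hessian spectrum is
# concentrated on exactly two values, 1 and 100 (an illustrative choice).
eigs = np.array([1.0, 1.0, 100.0, 100.0])
x = np.ones_like(eigs)

# One cycle of two step sizes, each tuned to one cluster: the step 1/1
# annihilates the eigenvalue-1 components, then 1/100 kills the rest.
for h in (1.0, 1.0 / 100.0):
    x = x - h * eigs * x       # gradient step: grad f(x) = eigs * x
```

After the two-step cycle, x is exactly zero, whereas a single constant step size small enough to keep the eigenvalue-100 components stable contracts the eigenvalue-1 components only slowly.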

Only tails matter: Average-Case Universality and Robustness in the Convex Regime. L. Cunha, G. Gidel, F. Pedregosa, C. Paquette, D. Scieur. In International Conference on Machine Learning (2022). The recently developed average-case analysis of optimization methods allows a more fine-grained and representative convergence analysis than usual worst-case results. In exchange, this analysis requires a more precise hypothesis on the data-generating process, namely assuming knowledge of the expected spectral distribution (ESD) of the random matrix associated with the problem. This work shows that the concentration of eigenvalues near the edges of the ESD determines a problem’s asymptotic average complexity. A priori information on this concentration is a more grounded assumption than complete knowledge of the ESD. This approximate concentration is effectively a middle ground between the coarseness of worst-case convergence analysis and the more restrictive previous average-case analysis. We also introduce the Generalized Chebyshev method, asymptotically optimal under a hypothesis on this concentration and globally optimal when the ESD follows a Beta distribution. We compare its performance to classical optimization algorithms, such as gradient descent or Nesterov’s scheme, and we show that, in the average-case context, Nesterov’s method is universally nearly optimal asymptotically.

Affine invariant analysis of Frank-Wolfe on strongly convex sets. T. Kerdreux, L. Liu, S. Lacoste-Julien, D. Scieur. In International Conference on Machine Learning (2021). It is known that the Frank-Wolfe (FW) algorithm, which is affine-covariant, enjoys accelerated convergence rates when the constraint set is strongly convex. However, these results rely on norm-dependent assumptions, usually incurring non-affine-invariant bounds, in contradiction with FW’s affine-covariant property. In this work, we introduce new structural assumptions on the problem (such as directional smoothness) and derive an affine invariant, norm-independent analysis of Frank-Wolfe. Based on our analysis, we propose an affine invariant backtracking line-search. Interestingly, we show that typical backtracking line-searches using smoothness of the objective function converge to an affine invariant step size, despite using affine-dependent norms in the step size’s computation. This indicates that we do not necessarily need to know the set’s structure in advance to enjoy the affine-invariant accelerated rate.

Generalization of quasi-Newton methods: application to robust symmetric multisecant updates. D. Scieur, L. Liu, T. Pumir, N. Boumal. In International Conference on Artificial Intelligence and Statistics (2021). Quasi-Newton techniques approximate the Newton step by estimating the Hessian using the so-called secant equations. Some of these methods compute the Hessian using several secant equations but produce non-symmetric updates. Other quasi-Newton schemes, such as BFGS, enforce symmetry but cannot satisfy more than one secant equation. We propose a new type of quasi-Newton symmetric update using several secant equations in a least-squares sense. Our approach generalizes and unifies the design of quasi-Newton updates and satisfies provable robustness guarantees.
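A minimal sketch of the multisecant least-squares idea follows; the function name and the crude symmetrization are assumptions of this sketch, not the paper's robust update.

```python
import numpy as np

def multisecant_hessian_estimate(S, Y):
    """Fit a Hessian approximation H from several secant equations.

    Columns of S are steps s_i and columns of Y are the corresponding
    gradient differences y_i; the secant equations ask H @ S = Y. The
    least-squares fit below minimizes ||H @ S - Y||_F over all H, then
    symmetrizes naively (a robust update would do this more carefully).
    """
    H = Y @ np.linalg.pinv(S)   # least-squares solution of H @ S = Y
    return 0.5 * (H + H.T)      # crude symmetrization
```

When S contains d linearly independent steps on a quadratic with Hessian A, the fit recovers A exactly, since Y = A S and pinv(S) is then a true inverse.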

Connecting Sphere Manifolds Hierarchically for Regularization. D. Scieur, Y. Kim. In International Conference on Machine Learning (2021). This paper considers classification problems with hierarchically organized classes. We force the classifier (hyperplane) of each class to belong to a sphere manifold, whose center is the classifier of its superclass. Then, individual sphere manifolds are connected based on their hierarchical relations. Our technique replaces the last layer of a neural network by combining a spherical fully-connected layer with a hierarchical layer. This regularization is shown to improve the performance of widely used deep neural network architectures (ResNet and DenseNet) on publicly available datasets (CIFAR-100, CUB-200, Stanford Dogs, Stanford Cars, and Tiny ImageNet).

Accelerating smooth games by manipulating spectral shapes. W. Azizian, D. Scieur, I. Mitliagkas, S. Lacoste-Julien, G. Gidel. In International Conference on Artificial Intelligence and Statistics (2020). We use matrix iteration theory to characterize acceleration in smooth games. We define the spectral shape of a family of games as the set containing all eigenvalues of the Jacobians of standard gradient dynamics in the family. Shapes restricted to the real line represent well-understood classes of problems, like minimization. Shapes spanning the complex plane capture the added numerical challenges in solving smooth games. In this framework, we describe gradient-based methods, such as extragradient, as transformations on the spectral shape. Using this perspective, we propose an optimal algorithm for bilinear games. For smooth and strongly monotone operators, we identify a continuum between convex minimization, where acceleration is possible using Polyak’s momentum, and the worst case, where gradient descent is optimal. Finally, going beyond first-order methods, we propose an accelerated version of consensus optimization.

Extragradient with player sampling for faster convergence in n-player games. S. Jelassi, C. Domingo-Enrich, D. Scieur, A. Mensch, J. Bruna. In International Conference on Machine Learning (2020). Data-driven modeling increasingly requires finding a Nash equilibrium in multi-player games, e.g. when training GANs. In this paper, we analyze a new extragradient method for Nash equilibrium finding that performs gradient extrapolations and updates on a random subset of players at each iteration. This approach provably exhibits a better rate of convergence than full extragradient for non-smooth convex games with a noisy gradient oracle. We propose an additional variance reduction mechanism to obtain speedups in smooth convex games. Our approach makes extrapolation amenable to massive multi-player settings and brings empirical speedups, in particular when using a heuristic cyclic sampling scheme. Most importantly, it allows training GANs and mixtures of GANs faster and better.

Average-case acceleration for bilinear games and normal matrices. C. Domingo-Enrich, F. Pedregosa, D. Scieur. In International Conference on Learning Representations (2020). Advances in generative modeling and adversarial learning have given rise to renewed interest in smooth games. However, the absence of symmetry in the matrix of second derivatives poses challenges that are not present in the classical minimization framework. While a rich theory of average-case analysis has been developed for minimization problems, little is known in the context of smooth games. In this work we take a first step towards closing this gap by developing average-case optimal first-order methods for a subset of smooth games. We make the following three main contributions. First, we show that for zero-sum bilinear games the average-case optimal method is the optimal method for the minimization of the Hamiltonian. Second, we provide an explicit expression for the optimal method corresponding to normal matrices, potentially non-symmetric. Finally, we specialize it to matrices with eigenvalues located in a disk and show a provable speedup compared to worst-case optimal algorithms. We illustrate our findings through benchmarks with a varying degree of mismatch with our assumptions.

Universal average-case optimality of Polyak momentum. D. Scieur, F. Pedregosa. In International Conference on Machine Learning (2020). Polyak momentum (PM), also known as the heavy-ball method, is a widely used optimization method that enjoys asymptotically optimal worst-case complexity on quadratic objectives. However, its remarkable empirical success is not fully explained by this optimality, as the worst-case analysis, contrary to the average-case, is not representative of the expected complexity of an algorithm. In this work we establish a novel link between PM and average-case analysis. Our main contribution is to prove that, under mild assumptions, any optimal average-case method converges to PM as the number of iterations grows. This brings a new perspective on this classical method, showing that PM is asymptotically both worst-case and average-case optimal.
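For reference, a minimal heavy-ball sketch with the classical parameter tuning for mu-strongly convex, L-smooth quadratics; the tuning and function signature are conventional choices of this sketch, not anything specific to this paper.

```python
import numpy as np

def polyak_momentum(grad, x0, L, mu, n_iter=100):
    """Heavy-ball iteration x_{k+1} = x_k - a*grad(x_k) + b*(x_k - x_{k-1}),
    with the classical optimal (a, b) for quadratics with spectrum in [mu, L]."""
    a = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
    b = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
    x_prev = x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        x, x_prev = x - a * grad(x) + b * (x - x_prev), x
    return x
```

On an ill-conditioned quadratic, this converges at a rate governed by the square root of the condition number, rather than the condition number itself as for plain gradient descent.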

Acceleration through spectral density estimation. F. Pedregosa, D. Scieur. In International Conference on Machine Learning (2020). We develop a framework for the average-case analysis of random quadratic problems and derive algorithms that are optimal under this analysis. This yields a new class of methods that achieve acceleration given a model of the Hessian’s eigenvalue distribution. We develop explicit algorithms for the uniform, Marchenko-Pastur, and exponential distributions. These methods are momentum-based algorithms, whose hyperparameters can be estimated without knowledge of the Hessian’s smallest singular value, in contrast with classical accelerated methods like Nesterov acceleration and Polyak momentum. Through empirical benchmarks on quadratic and logistic regression problems, we identify regimes in which the proposed methods improve over classical (worst-case) accelerated methods.

Nonlinear Acceleration of Primal-Dual Algorithms. R. Bollapragada, D. Scieur, A. d’Aspremont. In The 22nd International Conference on Artificial Intelligence and Statistics (2019). We describe a convergence acceleration scheme for multi-step optimization algorithms. The extrapolated solution is written as a nonlinear average of the iterates produced by the original optimization algorithm. Our scheme does not need the underlying fixed-point operator to be symmetric, hence handles e.g. algorithms with momentum terms such as Nesterov’s accelerated method, or primal-dual methods such as Chambolle-Pock. The weights are computed via a simple linear system and we analyze performance in both online and offline modes. We use Crouzeix’s conjecture to show that acceleration is controlled by the solution of a Chebyshev problem on the numerical range of a nonsymmetric operator modeling the behavior of iterates near the optimum. Numerical experiments are detailed on image processing and logistic regression problems.

Optimal Management of Storage for Offsetting Solar Power Uncertainty using Multistage Stochastic Programming. T. Kaneda, B. Losseau, A. Papavasiliou, D. Scieur, L. Cambier, P. Henneaux, N. Leemput. In Proceedings of the 20th Power Systems Computation Conference (2018). Africa has recently engaged in implementing an aggressive renewable energy integration plan. A major challenge in the deployment of renewable power is the management of excess energy. The use of battery storage has been considered as a technically attractive solution. This paper tackles this operational problem using stochastic dual dynamic programming. We present an open-source MATLAB toolbox for multistage stochastic programming which employs stochastic dual dynamic programming. We use the toolbox to compare the stochastic solution against a benchmark greedy policy that operates batteries without foresight. We consider a case study of storage management in Burkina Faso. We quantify the benefits of the stochastic solution and test the sensitivity of our results to the optimization horizon of the stochastic program.

Integration Methods and Accelerated Optimization Algorithms. D. Scieur, V. Roulet, F. Bach, A. d’Aspremont. In Advances in Neural Information Processing Systems (2017). We show that accelerated optimization methods can be seen as particular instances of multi-step integration schemes from numerical analysis, applied to the gradient flow equation. In comparison with recent advances in this vein, the differential equation considered here is the basic gradient flow, and we show that multi-step schemes allow integration of this differential equation using larger step sizes, thus intuitively explaining acceleration results.
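The correspondence above is easy to state concretely: one explicit Euler step on the gradient flow is exactly one gradient-descent step. A minimal sketch (names are illustrative):

```python
import numpy as np

def euler_gradient_flow(grad, x0, h, n_steps):
    """Explicit Euler discretization of the gradient flow x'(t) = -grad(x(t)).

    Each Euler step x <- x - h * grad(x) coincides with a gradient-descent
    step of size h; multi-step integration schemes instead combine several
    past points and gradients, which is where acceleration enters."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - h * grad(x)
    return x
```

On a diagonal quadratic, the Euler trajectory reproduces the familiar per-coordinate gradient-descent contraction (1 - h * eig)^k.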

Nonlinear Acceleration of Stochastic Algorithms. D. Scieur, A. d’Aspremont, F. Bach. In Advances in Neural Information Processing Systems (2017). Extrapolation methods use the last few iterates of an optimization algorithm to produce a better estimate of the optimum. They were shown to achieve optimal convergence rates in a deterministic setting using simple gradient iterates. Here, we study extrapolation methods in a stochastic setting, where the iterates are produced by either a simple or an accelerated stochastic gradient algorithm. We first derive convergence bounds for arbitrary, potentially biased perturbations, then produce asymptotic bounds using the ratio between the variance of the noise and the accuracy of the current point. Finally, we apply this acceleration technique to stochastic algorithms such as SGD, SAGA, SVRG and Katyusha in different settings, and show significant performance gains.

Regularized Nonlinear Acceleration. D. Scieur, A. d’Aspremont, F. Bach. In Advances in Neural Information Processing Systems (2016). We describe a convergence acceleration technique for generic optimization problems. Our scheme computes estimates of the optimum from a nonlinear average of the iterates produced by any optimization method. The weights in this average are computed via a simple and small linear system, whose solution can be updated online. This acceleration scheme runs in parallel to the base algorithm, providing improved estimates of the solution on the fly, while the original optimization method is running. Numerical experiments are detailed on classical classification problems.
Books

Acceleration methods. A. d’Aspremont, D. Scieur, A. Taylor. In Foundations and Trends in Optimization (2021). This monograph covers some recent advances in a range of acceleration techniques frequently used in convex optimization. We first use quadratic optimization problems to introduce two key families of methods, namely momentum and nested optimization schemes; in the quadratic case, they coincide to form the Chebyshev method. We discuss momentum methods in detail, starting with the seminal work of Nesterov, and structure convergence proofs using a few master templates, such as that for optimized gradient methods, which provide the key benefit of showing how momentum methods optimize convergence guarantees. We further cover proximal acceleration, at the heart of the Catalyst and Accelerated Hybrid Proximal Extragradient frameworks, using similar algorithmic patterns. Common acceleration techniques rely directly on the knowledge of some of the regularity parameters of the problem at hand. We conclude by discussing restart schemes, a set of simple techniques for reaching nearly optimal convergence rates while adapting to unobserved regularity parameters.
Theses

Acceleration in Optimization. D. Scieur (2018). In many fields, such as optimization, the performance of a method is often characterized by its rate of convergence. However, accelerating an algorithm requires a lot of knowledge about the problem’s structure, and such improvements are done on a case-by-case basis. Many accelerated schemes have been developed in the past few decades and are massively used in practice. Despite their simplicity, such methods are usually based on purely algebraic arguments and often lack an intuitive explanation. Recently, substantial work has been done to link accelerated algorithms with other fields of science, such as control theory or differential equations. However, these explanations often rely on complex arguments, usually using non-conventional tools in their analysis. One of the contributions of this thesis is a tentative explanation of optimization algorithms using the theory of integration methods, which has been well studied and enjoys a solid theoretical analysis. In particular, we show that optimization schemes are special instances of integration methods when integrating the classical gradient flow. With standard arguments, we intuitively explain the origin of acceleration. On the other hand, accelerated methods usually need additional parameters compared to slower ones, which are in most cases difficult to estimate. In addition, these schemes are designed for one particular setting and cannot be used elsewhere. In this thesis, we explore a new approach for accelerating optimization algorithms, which uses generic acceleration arguments. In numerical analysis, these tools have been developed for accelerating sequences of scalars or vectors by building, on the side, another sequence with a better convergence rate. These methods can be combined with an iterative algorithm, speeding it up in most cases.
In practice, extrapolation schemes are not widely used due to their lack of theoretical guarantees and their instability. We will extend these methods by regularizing them, allowing a deeper theoretical analysis and stronger convergence results, especially when applied to optimization methods.
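A classic instance of the generic sequence-acceleration tools referred to above is Aitken's Delta-squared process, shown here as a small illustrative sketch.

```python
def aitken_delta_squared(s):
    """Aitken's Delta^2 process: build a faster-converging sequence from s.

    For a linearly converging sequence s_k = s* + c * r**k, each
    extrapolated term below is exactly the limit s*."""
    return [s[k] - (s[k + 1] - s[k]) ** 2 / (s[k + 2] - 2 * s[k + 1] + s[k])
            for k in range(len(s) - 2)]
```

For example, for s_k = 1 + 0.5**k, which converges linearly to 1, every extrapolated term equals the limit exactly. The instability mentioned above appears when the denominator, a second difference, approaches zero.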

Global Complexity Analysis For The Second-Order Methods. D. Scieur (2015). The goal of this master’s thesis is first to analyze precisely the behavior of the cubic regularization of Newton’s method on strongly convex functions. During this analysis, we find that the regularization does not work as well as expected when the function is also smooth, unlike the gradient method. We therefore propose some variants of the original algorithm in order to obtain better global performance.
arXiv Preprints

Convergence Rates for the MAP of an Exponential Family and Stochastic Mirror Descent – an Open Problem. R. Le Priol, F. Kunstner, D. Scieur, S. Lacoste-Julien. In arXiv preprint arXiv:2111.06826 (2021). We consider the problem of upper bounding the expected log-likelihood suboptimality of the maximum likelihood estimate (MLE), or a conjugate maximum a posteriori (MAP) for an exponential family, in a non-asymptotic way. Surprisingly, we found no general solution to this problem in the literature. In particular, current theories do not hold for a Gaussian or in the interesting few-samples regime. After exhibiting various facets of the problem, we show we can interpret the MAP as running stochastic mirror descent (SMD) on the log-likelihood. However, modern convergence results do not apply to standard examples of the exponential family, highlighting holes in the convergence literature. We believe solving this very fundamental problem may bring progress to both the statistics and optimization communities.

Generalized Framework for Nonlinear Acceleration. D. Scieur. In arXiv preprint arXiv:1903.08764 (2019). Nonlinear acceleration algorithms improve the performance of iterative methods, such as gradient descent, using the information contained in past iterates. However, their efficiency is still not entirely understood, even in the quadratic case. In this paper, we clarify the convergence analysis by giving general properties shared by several classes of nonlinear acceleration: Anderson acceleration (and variants), quasi-Newton methods (such as Broyden Type-I or Type-II, SR1, DFP, and BFGS) and Krylov methods (Conjugate Gradient, MINRES, GMRES). In particular, we propose a generic family of algorithms that contains all the previous methods and prove its optimal rate of convergence when minimizing quadratic functions. We also propose multi-secant updates for the quasi-Newton methods listed above. We provide MATLAB code implementing the algorithm.

Online Regularized Nonlinear Acceleration. D. Scieur, E. Oyallon, A. d’Aspremont, F. Bach. In arXiv preprint arXiv:1805.09639 (2018). Regularized nonlinear acceleration (RNA) estimates the minimum of a function by post-processing iterates from an algorithm such as the gradient method. It can be seen as a regularized version of Anderson acceleration, a classical acceleration scheme from numerical analysis. The new scheme provably improves the rate of convergence of fixed-step gradient descent, and its empirical performance is comparable to that of quasi-Newton methods. However, RNA cannot accelerate faster multi-step algorithms like Nesterov’s method and often diverges in this context. Here, we adapt RNA to overcome these issues, so that our scheme can be used on fast algorithms such as gradient methods with momentum. We show optimal complexity bounds for quadratics and asymptotically optimal rates on general convex minimization problems. Moreover, this new scheme works online, i.e., extrapolated solution estimates can be reinjected at each iteration, significantly improving numerical performance over classical accelerated methods.

Nonlinear Acceleration of CNNs. D. Scieur, E. Oyallon, A. d’Aspremont, F. Bach. In arXiv preprint arXiv:1806.00370 (2018). The Regularized Nonlinear Acceleration (RNA) algorithm is an acceleration method capable of improving the rate of convergence of many optimization schemes such as gradient descent, SAGA or SVRG. Until now, its analysis has been limited to convex problems, but empirical observations show that RNA may be extended to wider settings. In this paper, we further investigate the benefits of RNA when applied to neural networks, in particular for the task of image recognition on CIFAR-10 and ImageNet. With very few modifications to existing frameworks, RNA slightly improves the optimization process of CNNs after training.