Scalable Learning to Optimize: A Learned Optimizer Can Train Big Models
Xuxi Chen, Tianlong Chen, Yu Cheng, Weizhu Chen, Ahmed Awadallah, Zhangyang Wang
"Learning to optimize (L2O) has gained increasing attention since it demonstrates a promising path to automating and accelerating the optimization of complicated problems. Unlike manually crafted classical optimizers, L2O parameterizes and learns optimization rules in a data-driven fashion. However, the primary barrier, scalability, persists for this paradigm: as the typical L2O models create massive memory overhead due to unrolled computational graphs, it disables L2O’s applicability to large-scale tasks. To overcome this core challenge, we propose a new scalable learning to optimize (SL2O) framework which (i) first constrains the network updates in a tiny subspace and (ii) then explores learning rules on top of it. Thanks to substantially reduced trainable parameters, learning optimizers for large-scale networks with a single GPU become feasible for the first time, showing that the scalability roadblock of applying L2O to training large models is now removed. Comprehensive experiments on various network architectures (i.e., ResNets, VGGs, ViTs) and datasets (i.e., CIFAR, ImageNet, E2E) across vision and language tasks, consistently validate that SL2O can achieve significantly faster convergence speed and competitive performance compared to analytical optimizers. For example, our approach converges 3.41 4.60 times faster on CIFAR-10/100 with ResNet-18, and 1.24 times faster on ViTs, at nearly no performance loss. Codes are included in the supplement."