Gradient Centralization: A New Optimization Technique for Deep Neural Networks
Optimization techniques are of great importance to effectively and efficiently train a deep neural network (DNN). It has been shown that using the first and second order statistics (e.g., mean and variance) to perform Z-score standardization on network activations or weight vectors, such as batch normalization (BN) and weight standardization (WS), can improve the training performance. Different from those previous methods that mostly operate on activations or weights, we present a new optimization technique, namely gradient centralization (GC), which operates directly on gradients by centralizing the gradient vectors to have zero mean. GC can be viewed as a projected gradient descent method with a constrained loss function. We show that GC can regularize both the weight space and output feature space so that it can boost the generalization performance of DNNs. Moreover, GC improves the Lipschitzness of the loss function and its gradient so that the training process becomes more efficient and stable. GC is very simple to implement and can be easily embedded into existing gradient based DNN optimizers with only one line of code. It can also be directly used to fine-tune the pre-trained DNNs. Our experiments on various applications, including general image classification, fine-grained image classification, detection and segmentation, demonstrate that GC can consistently improve the performance of DNN learning.
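The centralization step described above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea rather than the authors' released code; the function name `centralize_gradient` is our own. For a weight tensor whose first axis indexes output channels, each gradient slice is shifted to have zero mean:

```python
import numpy as np

def centralize_gradient(grad):
    """Subtract the per-slice mean from a gradient tensor so that each
    slice (one per output channel, i.e. along axis 0) has zero mean.
    1-D gradients (e.g. biases) are returned unchanged."""
    if grad.ndim > 1:
        # Average over all axes except the output-channel axis.
        axes = tuple(range(1, grad.ndim))
        grad = grad - grad.mean(axis=axes, keepdims=True)
    return grad
```

In an optimizer, this amounts to applying `centralize_gradient` to each weight gradient just before the parameter update, which is the "one line of code" the abstract refers to.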