Momentum Batch Normalization for Deep Learning with Small Batch Size
Normalization layers play an important role in deep network training. As one of the most popular normalization techniques, batch normalization (BN) has proven effective in accelerating model training and improving model generalization. The success of BN has been explained from different perspectives, such as reducing internal covariate shift, allowing the use of a large learning rate, and smoothing the optimization landscape. To gain a deeper understanding of BN, in this work we prove that BN introduces a certain level of noise into the sample mean and variance during training, and that this noise level depends only on the batch size. This noise generation mechanism regularizes the training process, and we present an explicit regularizer formulation of BN. Since the regularization strength of BN is determined by the batch size, a small batch size may cause under-fitting, resulting in a less effective model. To reduce the dependency of BN on batch size, we propose a momentum BN (MBN) scheme that averages the mean and variance of the current mini-batch with the historical means and variances. With a dynamic momentum parameter, we can automatically control the noise level during training. As a result, MBN works very well even when the batch size is very small (e.g., 2), which is hard to achieve with traditional BN.
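The averaging idea can be illustrated with a minimal sketch. It is not the MBN algorithm itself. It assumes a fixed momentum (the abstract describes a dynamic momentum whose schedule is not given here), scalar features, and forward-pass statistics only. The names `batch_stats`, `momentum_stats`, and `normalize` are hypothetical helpers introduced for illustration.

```python
import random
import statistics


def batch_stats(batch):
    """Return the (biased) sample mean and variance of one mini-batch."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return mean, var


def momentum_stats(batches, momentum):
    """Blend each mini-batch's statistics with a running average.

    `momentum` is an assumed fixed weight on the history; a dynamic
    schedule, as described in the abstract, is not modeled here.
    """
    run_mean, run_var = None, None
    history = []
    for batch in batches:
        mean, var = batch_stats(batch)
        if run_mean is None:
            # First mini-batch: no history yet, use its statistics directly.
            run_mean, run_var = mean, var
        else:
            run_mean = momentum * run_mean + (1.0 - momentum) * mean
            run_var = momentum * run_var + (1.0 - momentum) * var
        history.append((run_mean, run_var))
    return history


def normalize(batch, mean, var, eps=1e-5):
    """Standardize a mini-batch with the supplied mean and variance."""
    return [(x - mean) / (var + eps) ** 0.5 for x in batch]


# Mini-batches of size 2 drawn from a standard normal distribution.
random.seed(0)
batches = [[random.gauss(0.0, 1.0) for _ in range(2)] for _ in range(2000)]

plain_means = [batch_stats(b)[0] for b in batches]
smoothed = momentum_stats(batches, momentum=0.9)
smoothed_means = [m for m, _ in smoothed]

# Spread of the mean estimate across iterations, used as a proxy for noise.
plain_spread = statistics.pstdev(plain_means)
smoothed_spread = statistics.pstdev(smoothed_means)

# Normalize the last mini-batch with the momentum-averaged statistics.
last_mean, last_var = smoothed[-1]
normalized_last = normalize(batches[-1], last_mean, last_var)
```

With a batch size of 2, the per-batch means fluctuate strongly from one iteration to the next, while the momentum-averaged means fluctuate much less. This is the sense in which averaging with history lowers the noise level in the statistics used for normalization.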