A key component of any neural network is a nonlinear activation function. Due to its simplicity and reliability, the current standard is the rectified linear unit, or ReLU (Hahnloser et al., 2000; Jarrett et al., 2009; Nair & Hinton, 2010), defined as \(f(x) = \max(x, 0).\) Many variations on and alternatives to ReLU have been proposed, but none has seen wide adoption.
Recent search-based approaches have been shown to produce novel activation functions (Zoph & Le, 2016; Bello et al., 2017; Zoph et al., 2017). One such function, a variation on the sigmoid linear unit, or SiLU (Hendrycks & Gimpel, 2016), has been found to consistently outperform ReLU and other popular activation functions, improving accuracy by up to one percent (Ramachandran et al., 2017) on learning tasks such as image classification and machine translation.
In this work we analyze and validate a number of activation functions discovered by such search-based approaches. In particular, we take a closer look at the SiLU function, comparing it and other newly discovered activation functions to ReLU, as well as to popular alternatives such as Leaky ReLU (Maas et al., 2013), Parametric ReLU (He et al., 2015), and Softplus (Nair & Hinton, 2010).
Ramachandran et al. employ automated search techniques to discover and evaluate activation functions. They design a search space consisting of different compositions of unary and binary functions, and use an RNN controller to generate sequences of component functions to compose together. Each candidate activation function is then used to train a child neural network, whose performance is evaluated on a learning task; in this case, image classification on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009) using a ResNet-20 model (He et al., 2016).
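As a rough illustration of this kind of search space (our own toy sketch, using hypothetical primitive sets rather than the full lists in the paper), each candidate can be viewed as a composition of the form binary(unary1(x), unary2(x)):

```python
import torch

# Hypothetical primitive sets, for illustration only; the actual search space
# used by Ramachandran et al. contains many more unary and binary functions.
UNARY = {
    "identity": lambda x: x,
    "sigmoid": torch.sigmoid,
    "cos": torch.cos,
    "neg": lambda x: -x,
}
BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "max": torch.maximum,
}

def make_candidate(u1: str, u2: str, b: str):
    """Compose a candidate activation of the form binary(unary1(x), unary2(x))."""
    return lambda x: BINARY[b](UNARY[u1](x), UNARY[u2](x))

# mul(identity(x), sigmoid(x)) recovers SiLU; add(cos(x), neg(x)) gives cos x - x.
silu_like = make_candidate("identity", "sigmoid", "mul")
cos_minus_x = make_candidate("cos", "neg", "add")
```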
Among the wide variety of novel activation functions discovered by Ramachandran et al., the strongest performer was a variation on SiLU, defined as \(f(x) = x \sigma(x),\) where \(\sigma\) denotes the logistic sigmoid. It is structurally very similar to ReLU and can directly replace it without adjustments to the model architecture. Subsequent empirical validation showed that SiLU outperformed both a wide variety of newly discovered activation functions and popular activation functions presented in other papers.
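Because SiLU maps a tensor to a tensor of the same shape, it can be swapped in anywhere ReLU is used. A minimal PyTorch sketch (recent PyTorch versions also ship this activation as `torch.nn.SiLU`):

```python
import torch
import torch.nn as nn

class SiLU(nn.Module):
    """f(x) = x * sigmoid(x); a drop-in replacement for nn.ReLU."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(x)

# Example: swapping ReLU for SiLU in a small block needs no architectural changes.
block = nn.Sequential(nn.Linear(64, 64), SiLU(), nn.Linear(64, 10))
```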
We reproduce selected experiments discussed by Ramachandran et al. to empirically validate the presented results. We train several ResNet models on an image classification task using the CIFAR-10 dataset, varying both the activation function and the depth of the model. For each experiment we record training loss, test loss, test accuracy, and total training time. Throughout, we use the model with ReLU as a baseline, the Adam optimizer with PyTorch defaults, and cross-entropy as the loss function.
We adapt Alan Martyn's implementation of the ResNet architecture (Martyn, 2019), and use the Fashion-MNIST and CIFAR-10 datasets from PyTorch's built-in datasets library (torchvision) for initial exploratory analysis and the main empirical validations, respectively.
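The core training setup can be sketched as follows; this is a simplified illustration with a stand-in linear model in place of the adapted ResNet, which we omit here for brevity:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# CIFAR-10 from torchvision's built-in datasets.
transform = transforms.ToTensor()
train_set = datasets.CIFAR10("data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10("data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = DataLoader(test_set, batch_size=128)

# Stand-in for the ResNet models used in our experiments.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.Adam(model.parameters())  # PyTorch defaults (lr=1e-3, ...)
criterion = nn.CrossEntropyLoss()                 # cross-entropy loss

for epoch in range(10):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```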
From the novel activation functions discovered by the search algorithm used by Ramachandran et al., we select three to validate. These are SiLU, \(\max(x, \sigma(x)),\) and \(\cos x - x.\)
From the other popular activation functions presented alongside them, we select three more to validate: Leaky ReLU (LReLU), Parametric ReLU (PReLU), and Softplus.
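The two remaining discovered functions can be implemented directly as modules (the class names below are ours, for illustration), while the ReLU-like baselines are available out of the box in `torch.nn`:

```python
import torch
import torch.nn as nn

class MaxXSigmoid(nn.Module):
    """f(x) = max(x, sigmoid(x))."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.maximum(x, torch.sigmoid(x))

class CosMinusX(nn.Module):
    """f(x) = cos(x) - x."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cos(x) - x

# The ReLU-like baselines come directly from torch.nn.
baselines = [nn.LeakyReLU(), nn.PReLU(), nn.Softplus()]
```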
Ramachandran et al. additionally propose a modification to SiLU that includes a tunable or learnable parameter \(\beta\), used to scale the preactivation value, resulting in the function \(f(x)=x \sigma(\beta x).\) We examine \(\beta\) values of 0.5, 1.0 (as regular SiLU), 1.5, and 2.0 using a ResNet-20 model, and additionally allow the network to learn an optimal \(\beta\) value.
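A sketch of this parameterised variant (the module name is ours; in the learnable case \(\beta\) is registered as a trainable parameter so the optimizer updates it along with the network weights):

```python
import torch
import torch.nn as nn

class SiLUBeta(nn.Module):
    """f(x) = x * sigmoid(beta * x), with beta either fixed or learned."""
    def __init__(self, beta: float = 1.0, learnable: bool = False):
        super().__init__()
        if learnable:
            self.beta = nn.Parameter(torch.tensor(beta))
        else:
            self.register_buffer("beta", torch.tensor(beta))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)

# Fixed beta = 2.0, and a learnable beta initialised at 1.0.
silu2 = SiLUBeta(beta=2.0)
lsilu = SiLUBeta(learnable=True)
```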
We vary the depth of the model to explore how the performance of each activation function scales with model depth. Specifically, we train and evaluate each activation function on ResNet models of 20, 32, 44, and 56 layers. Due to limited training resources and time, we were unable to validate on larger models.
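For the CIFAR-style ResNets of He et al. (2016), depth follows \(6n + 2\), where \(n\) is the number of residual blocks per stage; assuming the adapted implementation uses the same parameterisation, the four depths correspond to \(n = 3, 5, 7, 9\):

```python
# CIFAR-style ResNets (He et al., 2016) have depth 6n + 2,
# where n is the number of residual blocks per stage.
def blocks_per_stage(depth: int) -> int:
    assert (depth - 2) % 6 == 0, "depth must be of the form 6n + 2"
    return (depth - 2) // 6

# ResNet-20/32/44/56 correspond to n = 3, 5, 7, 9 blocks per stage.
print({d: blocks_per_stage(d) for d in (20, 32, 44, 56)})
```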
All experimental code and analysis are available on GitHub.
The results of our empirical validation broadly support those presented by Ramachandran et al. On ResNet-20, ResNet-32, and ResNet-44, SiLU achieved the highest test accuracy after 10 epochs, outperforming the other activation functions in most cases. We also observe lower test loss than with the other activation functions, potentially indicating that SiLU generalizes better.
| Activation Function | Test Accuracy | Train Loss | Test Loss |
|---|---|---|---|
| ReLU | 0.8047 | 0.505950 | 0.595131 |
| SiLU | 0.8143 | 0.485753 | 0.553022 |
| \(\cos x - x\) | 0.7959 | 0.641438 | 0.587057 |
| \(\max(x, \sigma(x))\) | 0.7955 | 0.581675 | 0.614394 |
| PReLU | 0.8085 | 0.449701 | 0.598796 |
| LReLU | 0.8033 | 0.471236 | 0.604913 |
| Softplus | 0.7758 | 0.683996 | 0.671707 |
We observe that the two other activation functions discovered by the search algorithm, \(\cos x - x\) and \(\max(x, \sigma(x)),\) underperformed in all experiments regardless of model size, suggesting that they are poor choices of activation function in practice.
Speed of convergence is an important performance metric for any neural network. We visualize test accuracy, training loss, and test loss for each activation function over all training epochs.
Figure 1: SiLU test accuracy, training loss, and test loss compared to ReLU (above), discovered activation functions (center), and ReLU-like functions (below) over all training epochs, on ResNet-44.
SiLU remains consistently ahead on almost all metrics. In future work, with additional time and resources, it may be fruitful to investigate performance over additional epochs to analyze these trends further.
Total training time is also an important metric to explore. Here we visualize differences in training time between the different activation functions for each ResNet model.
Figure 2: Comparison of total training times for all ResNet models and activation functions. Time is measured in seconds.
Interestingly, SiLU networks were consistently faster to train than their ReLU counterparts; in fact, SiLU placed among the two fastest activation functions in every experiment. This suggests that replacing ReLU with SiLU can yield higher accuracy in less training time.
ResNet models using \(\cos x - x\) and \(\max(x, \sigma(x))\) as activation functions consistently took the longest to train. This is likely due to the cost of evaluating periodic and exponential functions, which require far more floating-point operations than ReLU and ReLU-like functions.
Our results suggest that varying \(\beta\) as a fixed or learnable parameter does not significantly influence test accuracy. While leaving \(\beta\) as a learnable parameter started with the highest accuracy, SiLU with fixed \(\beta = 2\) achieved the highest final test accuracy.
Figure 3: Test accuracy and loss metrics for different values of \(\beta.\)
| Configuration | Test Accuracy | Train Loss | Test Loss |
|---|---|---|---|
| Learnable \(\beta\) (lsilu) | 0.8226 | 0.487016 | 0.533784 |
| \(\beta = 0.5\) (silu0.5) | 0.8151 | 0.564699 | 0.535829 |
| \(\beta = 1.5\) (silu1.5) | 0.8308 | 0.487803 | 0.508667 |
| \(\beta = 2.0\) (silu2.0) | 0.8317 | 0.477034 | 0.517551 |
We conclude that in most cases it suffices to use SiLU with the default value of \(\beta = 1.\)
Our results broadly confirm the conclusions drawn by Ramachandran et al.: SiLU consistently outperforms ReLU and other popular activation functions across a variety of contexts and metrics. Most notably, directly replacing ReLU with SiLU in ResNet models of different sizes yielded test accuracy improvements of almost two percent, which is significant for what is essentially a one-line code change.
Future directions include extending training beyond 10 epochs, validating a wider variety of novel activation functions discovered by other search-based algorithms, and exploring the reasons for the decline in SiLU performance as model size increases. Furthermore, in agreement with Ramachandran et al., we note that further gains are likely when networks are designed specifically with SiLU in mind.