ImageNet Classification with Deep Convolutional Neural Networks

The development of deep learning techniques has significantly advanced the field of image recognition, with the AlexNet architecture marking a major milestone in this progress. In this article, we will explore the groundbreaking research paper “ImageNet Classification with Deep Convolutional Neural Networks” by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, one of the first works to show that GPUs, programmed through CUDA, could be used to train deep neural networks efficiently, and a paper that helped kickstart the deep learning revolution.

The need for larger datasets

Researchers struggled to train deep neural networks because the labelled image datasets available at the time were small, far too small to capture the complexity and variability of real-world objects.

The introduction of ImageNet, with its vast collection of over 15 million labelled high-resolution images spanning more than 22,000 categories, addressed this limitation. Its sheer size and diversity enabled researchers to train deep convolutional neural networks (CNNs) with a significantly larger learning capacity, better generalisation, and richer representations.

Larger datasets on their own were not enough to achieve high accuracy: CNNs emerged as a powerful tool for image classification because they exploit prior knowledge about the structure and characteristics of images. Unlike classic fully-connected networks, which treat the input as a flat vector, CNNs are specifically designed to process grid-like data, such as images, in a far more efficient and effective manner.

The Architecture

Training deep learning models requires not only sizable datasets but also a substantial amount of computing power. The authors trained on the GTX 580 GPU, a state-of-the-art option at the time, yet 1.2 million training examples are enough to train networks that are too big to fit into its 3 GB of memory.

In response, the researchers had another very smart intuition: cross-GPU parallelization. Instead of relying on a single GPU, they employed two, placing half of the kernels (neurons) on each. Moreover, the GPUs were allowed to communicate only at certain layers. In particular, communication happened at layer 3, where kernels take input from all kernel maps of layer 2; in the remaining convolutional layers, kernels receive input only from the kernel maps of the preceding layer that reside on the same GPU.

The AlexNet architecture consists of eight learned layers: five convolutional layers and three fully-connected layers. This depth allows the network to learn hierarchical representations of features, capturing both low-level and high-level visual patterns, and its breadth, around 62 million parameters, enables it to model complex relationships in the image data.

The convolutional layers in AlexNet play a crucial role in extracting meaningful features from input images. The first convolutional layer filters the input image with 96 kernels of size 11x11x3, followed by a ReLU activation function, and is responsible for capturing low-level features such as edges and textures. Subsequent convolutional layers refine these features using smaller kernels (5x5 and then 3x3) applied to increasingly deep stacks of feature maps, so the representation grows more abstract layer by layer.

The fully-connected layers serve as the final layers of the network, responsible for classification. These layers take the output of the preceding convolutional layers and transform it into a vector of class probabilities using a softmax activation function.
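
To make the structure concrete, here is a minimal, single-machine PyTorch sketch of an AlexNet-style network, not the authors' original two-GPU CUDA implementation. The groups=2 arguments stand in for the two-GPU split described above (those layers' kernels only see the kernel maps that lived on their own GPU), the layer sizes follow the paper, and the 227x227 input is the commonly used value that makes the spatial arithmetic work out.

```python
# AlexNet-style sketch in PyTorch (illustrative, not the original implementation).
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),              # conv1: 96 kernels, 11x11x3
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # overlapping pooling
            nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2),  # conv2: stays on its own "GPU"
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),           # conv3: sees all kernel maps
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=2), # conv4
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1, groups=2), # conv5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),                            # fc6
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),                                   # fc7
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),                            # fc8 -> 1000-way softmax
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

# 227x227 crops make the spatial sizes come out to 6x6x256 before fc6.
model = AlexNetSketch()
logits = model(torch.randn(1, 3, 227, 227))
print(logits.shape)  # torch.Size([1, 1000])
```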

ReLU Nonlinearity

Although researchers had long used saturating non-linear activation functions for neurons, such as the sigmoid or tanh, the creators of the AlexNet architecture found that a non-saturating non-linearity, the Rectified Linear Unit (ReLU), trains about six times faster.

Both families of functions are non-linear, a property that allows successive layers to build on one another and increases the overall predictive power of the network. The problem lies with the word “saturating”: saturating functions suffer from the vanishing-gradient problem. When activations sit in the flat regions of a sigmoid or tanh, the gradients flowing backwards shrink towards zero, so each update carries less and less information and the network takes far longer to converge.
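
As a quick numerical illustration (not taken from the paper), compare the gradient of tanh with that of ReLU for increasingly large inputs:

```python
# Why saturating activations slow learning: the tanh gradient collapses towards
# zero for large inputs, while the ReLU gradient stays at 1 for any positive input.
import numpy as np

x = np.array([0.5, 2.0, 5.0, 10.0])

tanh_grad = 1.0 - np.tanh(x) ** 2      # derivative of tanh
relu_grad = (x > 0).astype(float)      # derivative of ReLU (1 for x > 0, 0 otherwise)

print(tanh_grad)  # roughly [0.79, 0.07, 0.0002, 8e-09] -> vanishes as x grows
print(relu_grad)  # [1. 1. 1. 1.]                       -> gradient is preserved
```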

Local Response Normalisation

Although ReLU helps with the vanishing-gradient problem, its unbounded nature means activations can grow arbitrarily large as training proceeds. To keep them in check, the authors normalise the responses rather than letting them run away. The specific scheme used in the AlexNet architecture is inter-channel “Local Response Normalisation” (LRN): each activation is divided by a term that grows with the squared activations of neighbouring kernel maps at the same spatial position, amplifying a strongly excited neuron while dampening its neighbours. LRN was heavily influenced by the lateral inhibition scheme present in our own brain, and, although revolutionary, it has since been replaced in modern models by other normalisation techniques, notably Batch Normalisation, Group Normalisation, and the Layer Normalisation used in Transformers.
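
Concretely, the paper divides each activation a[i,x,y] by (k + alpha * sum of a[j,x,y]^2)^beta, where the sum runs over n neighbouring kernel maps j at the same position. A small NumPy sketch using the paper's hyperparameters (n = 5, k = 2, alpha = 1e-4, beta = 0.75) might look like this:

```python
# Inter-channel Local Response Normalisation, sketched in NumPy.
import numpy as np

def local_response_norm(a: np.ndarray, n: int = 5, k: float = 2.0,
                        alpha: float = 1e-4, beta: float = 0.75) -> np.ndarray:
    """a has shape (channels, height, width); each activation is divided by a term
    built from the squared activations of up to n neighbouring channels."""
    C = a.shape[0]
    b = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

activations = np.random.rand(96, 55, 55).astype(np.float32)  # e.g. conv1 output
print(local_response_norm(activations).shape)                # (96, 55, 55)
```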

Kernel Activation

The convolution operation is the defining operation of convolutional neural networks; applying a kernel to the input is often called kernel activation. A kernel is a small tensor used for feature extraction: typically a small grid of numbers initialised at random and then learned during training. The convolution operation multiplies the kernel's values element-wise with a small region of the input and sums the results, producing a single value that represents the response, or activation, of the kernel at that position. Because the same kernel is shared by all neurons in a layer, and because convolution shrinks the input as it passes through, CNNs need far fewer parameters than fully-connected networks, where every neuron carries its own set of weights, which gives them greater learning capability for the same budget.
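
A bare-bones NumPy sketch of this sliding multiply-and-sum follows; the hand-made edge kernel is purely illustrative, since in a CNN the values are learned.

```python
# Plain 2D convolution (technically cross-correlation) over a single-channel input.
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise multiply, then sum
    return out

image = np.random.rand(8, 8)
edge_kernel = np.array([[1.0, 0.0, -1.0],        # a hand-made vertical-edge detector;
                        [1.0, 0.0, -1.0],        # in a CNN these values are learned
                        [1.0, 0.0, -1.0]])
print(convolve2d(image, edge_kernel).shape)      # (6, 6): the output is smaller than the input
```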

Overlapping Pooling

Pooling layers in CNNs summarise the outputs of neighbouring groups of neurons in the same kernel map.

A pooling layer consists of pooling units arranged in a grid, spaced s pixels apart, each summarising a z × z neighbourhood centred at its location. Traditional local pooling is achieved when setting s = z, while Overlapping Pooling is obtained with s < z. Notably, AlexNet utilises Overlapping Pooling with s = 2 and z = 3.
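
In PyTorch terms the two settings can be compared directly; the feature-map size below simply assumes conv1-sized outputs.

```python
# Overlapping pooling (s = 2, z = 3) versus traditional non-overlapping pooling (s = z = 2).
import torch
import torch.nn as nn

overlapping = nn.MaxPool2d(kernel_size=3, stride=2)   # s < z, as in AlexNet
traditional = nn.MaxPool2d(kernel_size=2, stride=2)   # s = z

x = torch.randn(1, 96, 55, 55)                        # e.g. conv1 feature maps
print(overlapping(x).shape)   # torch.Size([1, 96, 27, 27])
print(traditional(x).shape)   # torch.Size([1, 96, 27, 27])
# Both settings happen to give 27x27 maps here, but with s < z neighbouring
# windows overlap by one row or column of inputs.
```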

Pooling also contributes to the network's robustness to small shifts in the input: because each pooling unit summarises a whole neighbourhood, a local pattern can move a few pixels and still produce a similar response. The authors additionally observed that the overlapping variant slightly reduced the error rates and made the model marginally harder to overfit.

Reducing Overfitting

Overfitting is a common challenge in deep learning models, where the model performs well on the training data but fails to generalise to unseen data. The AlexNet architecture incorporates several techniques to mitigate overfitting and improve the model’s ability to generalise.

Data Augmentation

The first powerful technique is Data Augmentation, used to artificially increase the size of the training set by applying various transformations to the existing data. In AlexNet, two forms of data augmentation are employed:

The first method involves generating image translations and horizontal reflections. This is achieved by extracting random 224×224 patches from the 256×256 input images and training the network on these extracted patches. This augmentation increases the size of the training set by a factor of 2048, introducing variations in the position and orientation of objects.
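
A hedged sketch of this first scheme using today's torchvision transforms is shown below; the original pipeline generated the patches in Python on the CPU while the GPU trained on the previous batch.

```python
# Random 224x224 crops plus horizontal reflections, expressed with torchvision.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(224),          # random 224x224 patch from a 256x256 image
    transforms.RandomHorizontalFlip(),   # horizontal reflection with probability 0.5
    transforms.ToTensor(),
])
# At test time the paper instead averages predictions over ten fixed patches:
# the four corner crops, the centre crop, and their horizontal reflections.
```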

The second form of data augmentation involves altering the intensities of the RGB channels in training images. This is done by performing Principal Component Analysis (PCA) on the set of RGB pixel values throughout the ImageNet training set. Multiples of the principal components, scaled by the corresponding eigenvalues and by a random variable drawn from a Gaussian with mean zero and standard deviation 0.1, are then added to each training image, introducing variations in colour and intensity. This scheme helps the network learn more robust features that are invariant to changes in illumination and colour.
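
Here is a NumPy sketch of the idea; the helper names (fit_rgb_pca, pca_colour_augment) are illustrative rather than taken from the paper.

```python
# PCA colour augmentation: fit the principal components of RGB space once,
# then perturb each image along those components with small random magnitudes.
import numpy as np

def fit_rgb_pca(pixels: np.ndarray):
    """pixels: (N, 3) array of RGB values gathered from the training set."""
    cov = np.cov(pixels, rowvar=False)        # 3x3 covariance of the colour channels
    eigvals, eigvecs = np.linalg.eigh(cov)    # principal components of RGB space
    return eigvals, eigvecs

def pca_colour_augment(image: np.ndarray, eigvals, eigvecs, rng=np.random) -> np.ndarray:
    """image: (H, W, 3). Adds a single RGB offset eigvecs @ (alpha * eigvals) to every pixel."""
    alpha = rng.normal(0.0, 0.1, size=3)      # one Gaussian draw per image (std 0.1)
    shift = eigvecs @ (alpha * eigvals)
    return image + shift                      # broadcast over all pixels

pixels = np.random.rand(10000, 3)             # stand-in for ImageNet RGB values
eigvals, eigvecs = fit_rgb_pca(pixels)
augmented = pca_colour_augment(np.random.rand(256, 256, 3), eigvals, eigvecs)
```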

Dropout Regularisation

Another powerful technique employed by the creators of AlexNet is Dropout Regularisation, used to prevent overfitting by reducing complex co-adaptations of neurons. During training, each neuron in the first two fully-connected layers is set to zero with probability 0.5, effectively dropping out half of the neurons for each batch. These “dropped out” neurons contribute to neither the forward nor the backward pass, which forces the network to learn robust features that remain useful in conjunction with many different subsets of the other neurons. At test time all neurons are used, but their outputs are multiplied by 0.5. Dropout regularisation improves the generalisation ability of the network by preventing over-reliance on specific neurons.
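
A minimal NumPy sketch of this scheme follows; note that modern frameworks (e.g. torch.nn.Dropout) use “inverted” dropout, scaling by 1/(1-p) during training instead of halving outputs at test time.

```python
# Dropout as described in the paper: zero activations with probability p while
# training, rescale outputs by (1 - p) at test time.
import numpy as np

def dropout(activations: np.ndarray, p: float = 0.5, training: bool = True) -> np.ndarray:
    if training:
        mask = np.random.rand(*activations.shape) >= p   # dropped units take part in
        return activations * mask                        # neither forward nor backward pass
    return activations * (1.0 - p)                       # test-time rescaling

h = np.random.rand(4096)            # e.g. the output of one fully-connected layer
print(dropout(h, training=True)[:5])
```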

Weight Decay Regularisation

A further technique is Weight Decay Regularisation. It involves adding a penalty term to the loss function during training to encourage smaller weights. This regularisation term helps prevent the model from overemphasising certain features or parameters, leading to a more balanced and generalised model.
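
In a modern framework this penalty is usually handed to the optimiser. The sketch below uses PyTorch's SGD with the hyperparameters reported in the paper (initial learning rate 0.01, momentum 0.9, weight decay 0.0005) and a stand-in layer rather than the full network.

```python
# Weight decay expressed through the optimiser rather than an explicit loss term.
import torch
import torch.nn as nn

layer = nn.Linear(9216, 4096)   # stand-in for one of the fully-connected layers
optimizer = torch.optim.SGD(
    layer.parameters(),
    lr=0.01,                    # initial learning rate used in the paper
    momentum=0.9,
    weight_decay=5e-4,          # the 0.0005 weight decay reported in the paper
)
# weight_decay adds lambda * w to each gradient, which for plain gradient descent
# is equivalent to minimising loss + (lambda / 2) * ||w||^2.
```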

The combination of data augmentation, dropout regularisation, and weight decay helps the network learn more robust and discriminative features, ensuring improved performance on unseen data.

Results

(Figure from the paper: five ILSVRC-2010 test images in the first column; the remaining columns show the six training images that the model considers most similar to each of them.)

The performance of the AlexNet model is evaluated based on its top-1 and top-5 error rates. The top-1 error rate measures the percentage of test images for which the correct label is not the model’s top prediction. The top-5 error rate measures the percentage of test images for which the correct label is not among the top five predictions.
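
A small helper (not from the paper) makes these definitions concrete:

```python
# Top-k error rate from a matrix of class scores.
import numpy as np

def top_k_error(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """scores: (N, num_classes); labels: (N,). Fraction of examples whose true
    label is NOT among the k highest-scoring classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]            # indices of the k best guesses
    hits = np.any(top_k == labels[:, None], axis=1)
    return 1.0 - hits.mean()

scores = np.random.rand(100, 1000)                        # fake scores for 100 images
labels = np.random.randint(0, 1000, size=100)
print(top_k_error(scores, labels, k=1), top_k_error(scores, labels, k=5))
```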

The top-1 error rate achieved by AlexNet was reported to be 37.5%, while the top-5 error rate was 17.0%. These results represented a substantial improvement over the previous state of the art, which achieved 47.1% top-1 and 28.2% top-5 error rates.

It is also interesting to see the effect of an additional convolutional layer in the AlexNet architecture: although it accounts for less than 1% of the network's neurons, adding one convolutional layer decreases the top-1 and top-5 error rates by roughly 2% and 1% respectively, underlining how much depth, and the prior knowledge built into convolutional layers, matters for image classification.

Further Development

The introduction of AlexNet sparked a wave of research and development in the field of image recognition. Researchers began exploring more sophisticated architectures, such as VGGNet, GoogLeNet, and ResNet, which built upon the principles and techniques introduced by AlexNet. These architectures pushed the boundaries of accuracy and performance, achieving even lower error rates on the ImageNet dataset.

VGGNet introduced the notion of small fixed-size kernels (3×3) to reduce the number of trainable parameters by 44.9%, leading to faster learning and a structure more robust against overfitting.

Inception, or GoogLeNet, essentially went “wider” instead of deeper by introducing Inception Modules consisting of four parallel operations (three convolutions and one pooling), alongside 1×1 convolutional layers used to reduce the channel depth and keep the computation manageable.

The creators of ResNet looked for a way to stop very deep models from overcomplicating the task by introducing two kinds of shortcut (skip) connections. Counterintuitive as it sounds, simply adding more layers to a plain network can hurt its performance; letting the network “skip” layers through these shortcuts makes much deeper models trainable and leads to higher accuracy.
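
As a minimal sketch of the idea (a simplified block, not the exact design from the ResNet paper), the layers learn a residual F(x) and the block outputs F(x) + x, so the shortcut lets information bypass the stacked layers when that is the easier thing to learn.

```python
# A tiny residual block with an identity shortcut.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.body(x) + x)   # identity shortcut around two conv layers

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)             # torch.Size([1, 64, 56, 56])
```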

In conclusion, AlexNet’s impact on the field of deep learning and image recognition cannot be overstated. Its groundbreaking results and innovative design principles have paved the way for numerous advancements and inspired further research in the field. The success of AlexNet continues to drive progress in deep learning and shape the future of artificial intelligence.

Authors: Agnese Adorante, Leonardo Vanni