Deep learning and computer vision, explained – AI for Dummies (3/4)
The Unexpected Journey
Welcome to part three of our 4-part series on the latest trends and applications of Computer Vision. In case you missed the previous episodes, click here for Part 1 and Part 2!
Last week, we learned that deep learning algorithms are always made up of the same elemental bricks: input, hidden, and output layers, as well as computing units that we call neurons. When these layers are assembled together, just like LEGO® bricks, they can be used to answer a specific question. Before the algorithm can start answering questions, however, we saw that it has to be trained using either supervised or unsupervised learning. This week we’re gonna start by taking a closer look at supervised learning, which, as I’m sure you all remember, entails providing the algorithm with labelled training data to learn from. But how exactly does our algorithm learn from these examples?
Let’s go back to the neurons.
There and Back Again: A Neuron’s Tale
Each neuron in an algorithm is unique, but the role of each neuron is essentially the same: it receives, transforms, and passes information from the previous layer of connected neurons to the next. Each successive layer of neurons can handle more and more complex information. In this way, the input features are fed forward from one layer to the next until they reach the output layer, where the result, i.e. the answer to your question, is passed on to the outside world.
In practice, a neuron combines the signals it receives from the connected neurons in the previous layer by weighting them, i.e. controlling the strength of their influence, and adding them up. This weighted sum of inputs is then transformed in a non-linear fashion into an output signal that the next neuron can process. The non-linearity is important: without it, stacking layers would only ever produce a linear function of the inputs, and learning complex features would be impossible.
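To make this a bit more concrete, here is a minimal sketch of what a single neuron computes. The sigmoid activation, the bias term, and all the numbers are our own illustrative assumptions; many other choices are possible.

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs, followed by a non-linear activation."""
    # Each input is scaled by its weight, i.e. the strength of its influence.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Sigmoid squashes the sum into the range (0, 1); assumed here for illustration.
    return 1.0 / (1.0 + math.exp(-weighted_sum))

# Hypothetical signals coming from three neurons in the previous layer.
signals = [0.5, -1.0, 2.0]
weights = [0.8, 0.2, -0.5]
output = neuron_output(signals, weights, bias=0.1)
print(round(output, 3))
```

The value in `output` is what the neurons in the next layer would receive as one of their inputs.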
But how does the neuron know which weight to apply? Just like a child, it learns which features are important by making mistakes. In the beginning, the weights are chosen randomly; they are then adjusted as the algorithm sees more and more training examples. This is done in two steps:
- the mismatch between the algorithm’s prediction and the ground truth is evaluated, i.e. the output of a defined loss function is calculated, and
- the error is then propagated backwards from the output layer to the first hidden layer through the entire network and the corresponding weight correction is calculated for each neuron.
The entire process is repeated over the whole training dataset for several iterations, called epochs. In order to quickly find the optimal weights for each neuron, the backward propagation of errors is used in conjunction with an optimisation method to minimise the loss function, i.e. the error. The most common optimisation methods are based on a gradient descent approach, where the gradient (slope) of the defined loss function is estimated with respect to all the current weights in the network. These gradients are then used to update the weights.
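Here is a rough sketch of that loop for the simplest possible "network": a single neuron with one weight and no activation function. The toy data, the mean squared error loss, and the learning rate are assumptions chosen for illustration. With only one weight, "propagating the error backwards" boils down to a single derivative.

```python
import random

random.seed(0)  # make the random starting weight reproducible

# Toy labelled data: the ground truth follows y = 2 * x.
inputs = [1.0, 2.0, 3.0, 4.0]
targets = [2.0, 4.0, 6.0, 8.0]

weight = random.uniform(-1.0, 1.0)  # weights start out random
learning_rate = 0.01                # assumed value
loss_history = []

for epoch in range(200):
    # Step 1: evaluate the mismatch between prediction and ground truth.
    predictions = [weight * x for x in inputs]
    errors = [p - t for p, t in zip(predictions, targets)]
    loss = sum(e ** 2 for e in errors) / len(inputs)  # mean squared error
    loss_history.append(loss)

    # Step 2: gradient (slope) of the loss with respect to the weight.
    gradient = sum(2 * e * x for e, x in zip(errors, inputs)) / len(inputs)

    # Gradient descent: move the weight a small step against the slope.
    weight -= learning_rate * gradient

print(weight)
```

After enough epochs the weight ends up very close to 2, the value hidden in the training data.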
The Desolation of a Data Scientist
We now have everything to build and train a deep learning algorithm. But the success of your training depends on how well your algorithm can generalise, i.e. whether or not it can provide the correct answer when presented with unseen examples. You can control the algorithm’s behaviour by playing around with its ingredients, the so-called hyperparameters. On the one hand, you have the number of bricks you can stack: the number of hidden layers and of neurons in each hidden layer. On the other hand, you have to carefully choose the type of activation function for each neuron in order to learn the complex features. Finally, you can also influence the learning process itself by slowing it down or speeding it up, i.e. by changing the learning rate. But be aware: if the algorithm learns too fast, it might overshoot the minimum of the loss function and will therefore be less accurate, and if it learns too slowly, it might never find the optimal weights.
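The effect of the learning speed is easy to try out on a toy problem. The sketch below runs gradient descent on a made-up loss function, (w - 3)², whose minimum sits at w = 3. The loss and the three learning rates are assumptions picked purely to show the three behaviours.

```python
def minimise(learning_rate, steps=50, start=0.0):
    """Run plain gradient descent on the toy loss (w - 3) ** 2."""
    weight = start
    for _ in range(steps):
        gradient = 2 * (weight - 3.0)  # slope of the toy loss at `weight`
        weight -= learning_rate * gradient
    return weight

good = minimise(learning_rate=0.1)        # settles close to the minimum
too_fast = minimise(learning_rate=1.1)    # overshoots further at every step
too_slow = minimise(learning_rate=0.001)  # has barely moved after 50 steps

print(good, too_fast, too_slow)
```

With the same number of steps, only the middle-of-the-road learning rate gets anywhere near 3.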
Tweaking your deep learning algorithm is an iterative process. You start with some configuration values and you use them to train a model. Depending on the initial performance of your model, you change the values to train a new model, and depending on the new performance, you change the values again… This cycle can be quite long. To efficiently find the best working values for your algorithm, the best approach is to split your dataset into three independent sets: a training dataset (for training your algorithm), a validation or development dataset (which the algorithm does not see during training, and which you use to evaluate your trained algorithm in the tweaking phase), and a test dataset of unseen examples (for the evaluation of the final algorithm). The percentage of the different splits is entirely up to you and your data. You will need sufficient data in your validation and test datasets for statistically representative results, and enough training examples to teach your complex problem to the algorithm.
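A three-way split can be sketched in a few lines. The helper name `split_dataset` and the 70/15/15 proportions are our own assumptions; as mentioned, the right proportions depend on you and your data.

```python
import random

def split_dataset(examples, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle the examples and cut them into train, validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split

    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)

    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]  # everything that is left over
    return train, val, test

# Stand-in dataset: 1000 example ids instead of real images.
train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before cutting matters: if your examples are sorted (all cats first, then all dogs), an unshuffled split would give you very unrepresentative sets.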
Once you have trained a model, you can compare the error on your training dataset to pure chance and to human performance to understand whether the algorithm learned the task correctly or not. Let’s assume that we trained an algorithm to classify cats and dogs and that 15% of the training dataset was misclassified. Compared to a human with an error rate of 0%, the algorithm does not seem to have learnt to distinguish cats from dogs correctly. It has under-fitted the data. If this occurs, you can try a bigger network, i.e. add more hidden layers and/or neurons in each layer. Increasing the number of layers, and therefore the number of neurons, allows your algorithm to capture more complexity. But sometimes the problem is simply that you haven’t given the algorithm enough time to correctly learn the features. So by increasing the number of epochs, you just might get lucky!
If your training error is sufficiently small, you can then compare it to the error on the validation set to see if your algorithm is a victim of overfitting, meaning that it has learned every quirk and bit of noise in the training dataset and become overspecialised. Let’s assume that the algorithm has an error of only 1% on the training dataset, whereas on the validation dataset we have an error of 10%. This large discrepancy is most likely due to the algorithm’s overspecialisation. When the algorithm sees unknown examples from the validation dataset, it becomes confused and misclassifies the images. In this case, you can try to increase, diversify, and clean your training dataset. This allows your algorithm to learn more general features instead of becoming too specialised to outliers. Another option would be to reduce the number of neurons used and/or regularise the loss function.
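The two checks above can be written down as a rough rule of thumb. The function name `diagnose` and the 5% tolerance are arbitrary assumptions for illustration, not an established threshold, and the 16% validation error in the first call is made up since the example above only gives the training error.

```python
def diagnose(train_error, val_error, target_error=0.0, tolerance=0.05):
    """Crude diagnosis from error rates; `tolerance` is an arbitrary threshold."""
    # Training error far above the target (e.g. human performance): underfitting.
    if train_error - target_error > tolerance:
        return "underfitting"
    # Validation error far above the training error: overfitting.
    if val_error - train_error > tolerance:
        return "overfitting"
    return "looks reasonable"

print(diagnose(train_error=0.15, val_error=0.16))  # the 15% cats-and-dogs case
print(diagnose(train_error=0.01, val_error=0.10))  # the 1% vs 10% case
```

In a real project you would pick the tolerance based on your task and on how noisy your error estimates are.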
The Mithril of Deep Learning: Convolutional Neural Networks
Artificial Neural Networks and deep learning techniques have been around for some time now. So why is there so much fuss about deep learning at the moment?
Convolutional Neural Networks!
Deep neural networks with fully connected layers are computationally expensive. In the case of image related problems, this is a HUGE problem. Imagine you want to input 200 x 200 pixel colour images: every single neuron in the first fully connected layer would need 120,000 weights (200 x 200 pixels x 3 colour channels)!
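The arithmetic behind that number is quickly checked. The layer size of 1000 neurons is a hypothetical value we added to show how fast the total grows.

```python
# A 200 x 200 colour image has three values (red, green, blue) per pixel.
width, height, channels = 200, 200, 3
weights_per_neuron = width * height * channels  # one weight per input value

hidden_neurons = 1000  # hypothetical size of the first fully connected layer
total_weights = weights_per_neuron * hidden_neurons

print(weights_per_neuron)  # 120000
print(total_weights)       # 120000000
```

And that is only the first layer, before counting any of the layers that follow.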
What is different then with convolutional neural networks? First of all, they arrange their neurons in 3-dimensional layers (width, height, and depth), and transform a 3-dimensional input into a 3-dimensional output. They are composed of different types of layers, the core of which is the convolutional layer. Instead of connecting to every neuron of the previous layer, as in fully connected layers, the neurons of a convolutional layer only process input from a local region of the input volume. The spatial extent of this connectivity is called the receptive field of the neuron.
The convolutional layer hence reduces the number of free parameters, allowing the network to be deeper with fewer parameters. For instance, regardless of the image size, tiling the image with regions of 5 x 5 pixels that all share the same weights (the filter) requires only 25 learnable weights per filter for a single-channel input. The output of a convolutional layer is a so-called feature map. Stacking a lot of such layers together leads to a network that first creates representations of small parts of the input, then assembles representations of larger areas from them.
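To see weight sharing in action, here is a bare-bones sketch of a single filter sliding over a single-channel image. It assumes a square filter, a stride of 1 and no padding, and the 5 x 5 averaging filter is a hand-made stand-in for a filter that would normally be learned. Like most deep learning libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide a square `kernel` over `image` (stride 1, no padding)."""
    k = len(kernel)
    out_height = len(image) - k + 1
    out_width = len(image[0]) - k + 1
    feature_map = []
    for row in range(out_height):
        out_row = []
        for col in range(out_width):
            # The same k x k weights are reused at every position.
            value = sum(
                kernel[i][j] * image[row + i][col + j]
                for i in range(k)
                for j in range(k)
            )
            out_row.append(value)
        feature_map.append(out_row)
    return feature_map

kernel = [[1.0 / 25] * 5 for _ in range(5)]  # 5 x 5 averaging filter
n_params = sum(len(row) for row in kernel)   # 25, whatever the image size

small_image = [[1.0] * 8 for _ in range(8)]
large_image = [[1.0] * 32 for _ in range(32)]
small_map = convolve2d(small_image, kernel)
large_map = convolve2d(large_image, kernel)

print(n_params, len(small_map), len(large_map))
```

The feature map grows with the image, but the number of weights stays at 25.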
In practice, a convolutional neural network learns the values of these filters on its own during the training process. The more filters we have, the more image features get extracted and, in general, the better our network becomes at recognizing patterns in unseen images. The size of each feature map is controlled by three hyperparameters:
- depth: number of stacked filters
- stride: number of pixels by which we slide our filter matrix over the input matrix
- zero-padding: sometimes it is convenient to pad the input matrix with zeros around the border, so that we can apply the filter to the bordering elements of our input image matrix.
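Stride and zero-padding determine the width and height of a feature map through the commonly used formula (input size - filter size + 2 x padding) / stride + 1, while the depth of the output equals the number of filters. Here is a small sketch of that formula. The function name is our own, and raising an error when the filter does not fit evenly is a design choice we made here; some frameworks round down instead.

```python
def feature_map_size(input_size, filter_size, stride=1, padding=0):
    """Width (or height) of a feature map along one dimension."""
    span = input_size - filter_size + 2 * padding
    if span < 0 or span % stride != 0:
        # Assumed behaviour for this sketch: refuse settings that do not fit.
        raise ValueError("filter does not tile the padded input evenly")
    return span // stride + 1

print(feature_map_size(200, 5))             # no padding: the map shrinks to 196
print(feature_map_size(200, 5, padding=2))  # padding of 2 keeps the size at 200
print(feature_map_size(7, 3, stride=2))     # bigger stride, smaller map: 3
```

Padding by 2 pixels with a 5 x 5 filter is exactly the case mentioned above: the filter can now be centred on the border pixels, so the output keeps the size of the input.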
Do you want to shed some light on the deep learning black box? Then have a look at the visualisation tool of Adam Harley.
Woah, that’s a lot of information for you to digest, so we’re gonna leave it at that for this week. We have seen the inner workings of deep learning algorithms and we hope that you now feel confident in building your own network. If you have any questions or comments, don’t hesitate to write to us in the comments section. In part 4, we will have a look at how deep learning is used in practice, including the computing resources it needs, dataset creation, model training and finally deployment.
Deepomatic wishes you a Merry, Merry Christmas!