Hands-on Transfer Learning with Keras and the VGG16 Model
In a previous article, we introduced the fundamentals of image classification with Keras, where we built a CNN to classify food images. Our model didn't perform that well, but we can make significant improvements in accuracy without much more training time by using a concept called Transfer Learning.
By the end of this article, you should be able to:
- Download a pre-trained model from Keras for Transfer Learning
- Fine-tune the pre-trained model on a custom dataset
Let's get started.
What Is Transfer Learning?
In the previous article, we defined our own Convolutional Neural Network and trained it on a food image dataset. We saw that the performance of this from-scratch model was drastically limited.
This model had to first learn how to detect generic features in the images, such as edges and blobs of color, before detecting more complex features.
In real-world applications, this can take days of training and millions of images to achieve high performance. It would be easier for us to download a generic pretrained model and retrain it on our own dataset. This is what Transfer Learning entails.
Transfer Learning, then, is an approach where we take a model trained on one machine learning task and reuse it as the starting point for a different task. Multiple deep learning domains use this approach, including Image Classification, Natural Language Processing, and even Gaming! The ability to adapt a trained model to another task is incredibly valuable.
This tutorial expects that you have an understanding of Convolutional Neural Networks. If you want an in-depth look into these networks, feel free to read our previous article.
In this section, we'll review CNN building blocks. Feel free to skip ahead for the Python implementation.
Convolutional Neural Network Architecture
Recall that CNN architecture contains some essential building blocks such as:
1. Convolutional Layer:
- Conv. Layers will compute the output of nodes that are connected to local regions of the input matrix.
- Dot products are calculated between a set of weights (commonly called a filter) and the values associated with a local region of the input.
2. ReLU (Activation) Layer:
- The output volume of the Conv. Layer is fed to an elementwise activation function, commonly a Rectified-Linear Unit (ReLU).
- The ReLU layer will determine whether an input node will 'fire' given the input data. This 'firing' signals whether the convolution layer's filters have detected a visual feature.
- A ReLU layer applies the function $\max(0, x)$, thresholding at 0.
- The dimensions of the volume are left unchanged.
3. Pooling Layer:
- A down-sampling strategy is applied to reduce the width and height of the output volume.
4. Fully-Connected Layer:
- The output volume, i.e. the 'convolved features', is passed to a Fully-Connected Layer of nodes.
- Like conventional neural-networks, every node in this layer is connected to every node in the volume of features being fed-forward.
- The class probabilities are computed and output in a 3D array (the Output Layer) with dimensions [1 x 1 x K], where K is the number of classes.
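These four building blocks map directly onto Keras layers. Here is a minimal sketch (the layer counts and sizes are arbitrary illustrations, not our actual model):

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def build_simple_cnn(input_shape=(64, 64, 3), n_classes=10):
    """A toy CNN showing the four building blocks described above."""
    model = Sequential([
        Input(shape=input_shape),
        # 1 & 2. Convolutional layer with an elementwise ReLU activation
        Conv2D(32, kernel_size=(3, 3), activation='relu'),
        # 3. Pooling layer down-samples the width and height of the volume
        MaxPooling2D(pool_size=(2, 2)),
        # 4. Fully-Connected layer: every node connects to every feature
        Flatten(),
        Dense(64, activation='relu'),
        # Output layer: one probability per class (K outputs)
        Dense(n_classes, activation='softmax'),
    ])
    return model
```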
Writing these types of models from scratch can be incredibly tricky, especially if we don't have a dataset of sufficient size. The CNN model that we'll discuss later in this article has been pre-trained on millions of photos! We'll explore how we can use the pre-trained architecture to solve our custom classification problem.
Predicting Food Labels with a Keras CNN
Initially, we wrote a simple CNN from scratch. We'll load the same model as before to generate some predictions and calculate its accuracy, which will be used to compare the performance of the new model using Transfer Learning.
Our from-scratch CNN has a relatively simple architecture: 7 convolutional layers, followed by a single densely-connected layer.
Using the old CNN to calculate an accuracy score (details of which you can find in the previous article) we found that we had an accuracy score of ~58%.
With such an accuracy score, the from-scratch CNN performs moderately well, at best. We could improve the accuracy with a sufficiently-sized training dataset, which we do not have.
In practice, you should write a CNN from scratch only if you have a large dataset. In this tutorial, we'll download a pretrained model and re-train it on our own dataset to generate a better model.
How Does Transfer Learning Work?
Transfer Learning partially resolves the limitations of the isolated learning paradigm:
"The current dominant paradigm for ML is to run an ML algorithm on a given dataset to generate a model. The model is then applied in real-life tasks. We call this paradigm isolated learning because it does not consider any other related information or the knowledge learned in the past." (Liu, 2016)
Transfer Learning gives us the ability to share learned features across different learning tasks.
Domains and Tasks
We can understand Transfer Learning in terms of Domains and Tasks. In our case, the domain is image classification, and our task is to classify food images. Like we did previously, starting from scratch would require many optimizations, more data, and longer training to improve performance. If we use a CNN that's already been optimized and trained for a similar domain and task, we could convert it to work with our task. This is what transfer learning accomplishes.
We will utilize the pre-trained VGG16 model, which is a convolutional neural network trained on 1.2 million images to classify 1000 different categories. Since the domain and task for VGG16 are similar to our domain and task, we can use its pre-trained network to do the job.
For details on a more mathematical definition, see the paper Improving EEG-Based Emotion Classification Using Conditional Transfer Learning.
Using Pretrained Convolutional Layers
Our Transfer Learning approach will involve using layers that have been pre-trained on a source task to solve a target task. We would typically download some pre-trained model and "cut off" its top portion (the fully-connected layer), leaving us with only the convolutional and pooling layers.
Using the pre-trained layers, we'll extract visual features from our target task/dataset.
When using these pre-trained layers, we can freeze specific layers to exclude them from training: their pre-trained weights are used as they come and are not updated via backpropagation.
Alternatively, we can freeze most of the pre-trained layers but allow other layers to update their weights to improve target data classification.
How to Utilize the VGG16 Model
VGG16 is a convolutional neural network trained on a subset of the ImageNet dataset, a collection of over 14 million images belonging to 22,000 categories. K. Simonyan and A. Zisserman proposed this model in the 2015 paper, Very Deep Convolutional Networks for Large-Scale Image Recognition.
In the 2014 ImageNet Classification Challenge, VGG16 achieved a 92.7% top-5 classification accuracy. But more importantly, it has been trained on millions of images. Its pre-trained architecture can detect generic visual features present in our Food dataset.
Now suppose we have many images of two kinds of cars: Ferrari sports cars and Audi passenger cars. We want to generate a model that can classify an image as one of the two classes. Writing our own CNN is not an option since we do not have a dataset sufficient in size. Here's where Transfer Learning comes to the rescue!
We know that the ImageNet dataset contains images of different vehicles (sports cars, pick-up trucks, minivans, etc.). We can import a model that has been pre-trained on the ImageNet dataset and use its pre-trained layers for feature extraction.
Now we can't use the entirety of the pre-trained model's architecture. The Fully-Connected layer generates 1,000 different output labels, whereas our Target Dataset has only two classes for prediction. So we'll import a pre-trained model like VGG16, but "cut off" the Fully-Connected layer - also called the "top" model.
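In Keras, "cutting off" the top is a single flag. A minimal sketch (the input shape is an assumption, and the weights parameter is added so you can pass weights=None to skip the ImageNet download while experimenting):

```python
from tensorflow.keras.applications import VGG16

def load_conv_base(input_shape=(224, 224, 3), weights='imagenet'):
    """Load VGG16's convolutional and pooling layers only.

    include_top=False removes the Fully-Connected ("top") layers,
    leaving just the feature-extraction portion of the network.
    """
    conv_base = VGG16(weights=weights, include_top=False,
                      input_shape=input_shape)
    return conv_base
```

With include_top=False, the network outputs a feature volume (e.g. 7x7x512 for 224x224 inputs) rather than the original 1,000-way prediction.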
Once the pre-trained layers have been imported, excluding the "top" of the model, we can take one of two Transfer Learning approaches.
1. Feature Extraction Approach
In this approach, we use the pre-trained model's architecture to create a new dataset of features from our input images. We'll import the Convolutional and Pooling layers but leave out the "top portion" of the model (the Fully-Connected layer).
Recall that our example model, VGG16, has been trained on millions of images - including vehicle images. Its convolutional layers and trained weights can detect generic features such as edges, colors, wheels, windshields, etc.
We'll pass our images through VGG16's convolutional layers, which will output a Feature Stack of the detected visual features. From here, it's easy to flatten the 3-Dimensional feature stack into a NumPy array - ready for whatever modeling you'd prefer to conduct.
We can do feature extraction in the following manner:
- Download the pre-trained model. Ensure that the "top" portion of the model - the Fully-Connected layer - is not included.
- Pass the image data through the pre-trained layers to extract convolved visual features
- The outputted feature stack will be 3-Dimensional, and for it to be used for prediction by other machine learning classifiers, it will need to be flattened.
- At this point, you have two options:
- Stand-Alone Extractor: In this scenario, you can use the pre-trained layers to extract image features once. The extracted features would then create a new dataset that doesn't require any image processing.
- Bootstrap Extractor: Write your own Fully-Connected layer, and integrate it with the pre-trained layers. In this sense, you are bootstrapping your own "top model" onto the pre-trained layers. Initialize this Fully-Connected layer with random weights, which will update via backpropagation during training.
This article will show how to implement a "bootstrapped" extraction of image data with the VGG16 CNN. Pre-trained layers will convolve the image data according to ImageNet weights. We will bootstrap a Fully-Connected layer to generate predictions.
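The stand-alone extractor described above can be sketched in a few lines. For a lightweight, runnable example we load VGG16 with random weights; in practice you would pass weights='imagenet':

```python
import numpy as np
from tensorflow.keras.applications import VGG16

def extract_features(images, conv_base):
    """Run images through the pre-trained layers and flatten the
    resulting 3-D feature stack into a 2-D array of feature vectors."""
    feature_stack = conv_base.predict(images)  # shape: (n, h, w, channels)
    return feature_stack.reshape(feature_stack.shape[0], -1)

# Demo with random weights and random "images" (assumptions for illustration)
conv_base = VGG16(weights=None, include_top=False, input_shape=(64, 64, 3))
images = np.random.rand(2, 64, 64, 3).astype('float32')
features = extract_features(images, conv_base)
```

The flattened array can then be fed to any downstream classifier, such as logistic regression or a new Fully-Connected network.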
2. Fine-Tuning Approach
In this approach, we employ a strategy called Fine-Tuning. The goal of fine-tuning is to allow a portion of the pre-trained layers to retrain.
In the previous approach, we used the pre-trained layers of VGG16 to extract features. We passed our image dataset through the convolutional layers and weights, outputting the transformed visual features. There was no actual training on these pre-trained layers.
Fine-tuning a Pre-trained Model entails:
- Bootstrapping a new "top" portion of the model (i.e., Fully-Connected and Output layers)
- Freezing pre-trained convolutional layers
- Un-freezing the last few pre-trained layers so they can continue training
The frozen pre-trained layers will convolve visual features as usual. The non-frozen (i.e., the 'trainable') pre-trained layers will be trained on our custom dataset, updating their weights via backpropagation alongside the Fully-Connected layer.
In this article, we will demonstrate how to implement Fine-tuning on the VGG16 CNN. We will load some of the pre-trained layers as 'trainable', pass image data through the pre-trained layers, and 'fine-tune' the trainable layers alongside our Fully-Connected layer.
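The freezing logic amounts to toggling each layer's trainable flag. A sketch (the fine_tune argument name mirrors the one used later in this article):

```python
from tensorflow.keras.applications import VGG16

def freeze_layers(conv_base, fine_tune=0):
    """Freeze pre-trained layers, optionally leaving the last
    `fine_tune` layers trainable for fine-tuning."""
    if fine_tune > 0:
        for layer in conv_base.layers[:-fine_tune]:
            layer.trainable = False
    else:
        for layer in conv_base.layers:
            layer.trainable = False
    return conv_base
```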
Downloading the Dataset
Before we demonstrate either of these approaches, ensure you've downloaded the data for this tutorial.
To access the data used in this tutorial, check out the Image Classification with Keras article. You can find the terminal commands and functions for splitting the data in this section. If you're starting from scratch, make sure to run the
split_dataset function after downloading the dataset so that the images are in the correct directories for this tutorial.
Using Transfer Learning for Food Classification
Pre-trained models, such as VGG16, are easily downloaded using the Keras API. We'll go ahead and use VGG16 for the tutorial, but you should explore the other models available! Many of them have been trained on the ImageNet dataset and come with their advantages and disadvantages. You can find a list of the available models here.
We've also imported something called a preprocess_function alongside the VGG16 model. Recall that image data must be normalized before training. Images are composed of 3-Dimensional matrices containing numerical values in a range of [0, 255]. Not all CNNs have the same normalization scheme, however.
The VGG16 model was trained on data wherein pixel values ranged from [0, 255], and the mean pixel values of the dataset were subtracted from each image channel.
Other models have different normalization schemes, details of which are in their documentation. Some models require scaling the numerical values to be between (-1, +1).
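VGG16's scheme is exposed as vgg16.preprocess_input, which converts RGB to BGR and subtracts the ImageNet channel means. A quick check on a dummy batch (the pixel values here are arbitrary):

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import preprocess_input

# A dummy batch of one 2x2 RGB "image" with pixel values in [0, 255]
batch = np.array([[[[255.0, 0.0, 0.0], [0.0, 255.0, 0.0]],
                   [[0.0, 0.0, 255.0], [255.0, 255.0, 255.0]]]])

processed = preprocess_input(batch.copy())

# The shape is unchanged, but the values are mean-centred (and the
# channels flipped to BGR), so they are no longer confined to [0, 255]
print(processed.shape)      # (1, 2, 2, 3)
print(processed.min() < 0)  # True
```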
Preparing the training and testing data
Let's first import some necessary libraries.
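A typical import block for this workflow might look like the following (the exact set is an assumption based on what the rest of the tutorial uses):

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
```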
In the previous article, we defined image generators (see here) for our particular use case. Now, we'll need to utilize the VGG16 preprocessing function on our image data.
With our ImageDataGenerators, we can now flow_from_directory using the same image directory as the last article:
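A sketch of those generators (the augmentation settings, batch sizes, and directory arguments are assumptions; the key change from the previous article is swapping rescale=1./255 for VGG16's own preprocess_input):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.vgg16 import preprocess_input

# VGG16's preprocessing replaces the rescale=1./255 used previously
train_generator = ImageDataGenerator(rotation_range=90,
                                     horizontal_flip=True,
                                     vertical_flip=True,
                                     preprocessing_function=preprocess_input)

test_generator = ImageDataGenerator(preprocessing_function=preprocess_input)

def make_flows(train_dir, test_dir, batch_size=32):
    """Stream batches of preprocessed images from the directories
    created by split_dataset (paths are placeholders)."""
    traingen = train_generator.flow_from_directory(
        train_dir, target_size=(224, 224),
        class_mode='categorical', batch_size=batch_size, shuffle=True)
    testgen = test_generator.flow_from_directory(
        test_dir, target_size=(224, 224),
        class_mode='categorical', batch_size=1, shuffle=False)
    return traingen, testgen
```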
Using Pre-trained Layers for Feature Extraction
In this section, we'll demonstrate how to perform Transfer Learning without fine-tuning the pre-trained layers. Instead, we'll first use pre-trained layers to process our image dataset and extract visual features for prediction. Then we'll create a Fully-Connected layer and Output layer for our image dataset. Finally, we will train these layers with backpropagation.
You'll see in the create_model function the different components of our Transfer Learning model:
- We assign the stack of pre-trained model layers to the variable conv_base, setting include_top=False to exclude VGG16's pre-trained Fully-Connected layer.
- If the fine_tune argument is set to 0, all pre-trained layers will be frozen and left un-trainable. Otherwise, the last n layers will be made available for training.
- We set up a new "top" portion of the model by grabbing the conv_base outputs and flattening them.
- We define the new Fully-Connected layer, which we'll train with backpropagation, and include dropout regularization to reduce over-fitting.
- Finally, we define the model's output layer, where the total number of outputs is equal to the number of classes.
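Putting those pieces together, create_model might be sketched as follows (the Dense layer width, dropout rate, default optimizer, and the extra weights parameter are assumptions added for illustration):

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Model

def create_model(input_shape, n_classes, optimizer='adam',
                 fine_tune=0, weights='imagenet'):
    """Build a VGG16-based Transfer Learning model.

    fine_tune=0 freezes every pre-trained layer; fine_tune=n leaves
    the last n pre-trained layers trainable.
    """
    # Pre-trained convolutional/pooling layers, minus the "top"
    conv_base = VGG16(include_top=False, weights=weights,
                      input_shape=input_shape)

    # Freeze all (or all but the last n) pre-trained layers
    if fine_tune > 0:
        for layer in conv_base.layers[:-fine_tune]:
            layer.trainable = False
    else:
        for layer in conv_base.layers:
            layer.trainable = False

    # New "top": flatten the convolved features, then a Fully-Connected
    # layer with dropout regularization
    top_model = Flatten(name='flatten')(conv_base.output)
    top_model = Dense(256, activation='relu')(top_model)
    top_model = Dropout(0.2)(top_model)

    # Output layer: one node per class
    output_layer = Dense(n_classes, activation='softmax')(top_model)

    model = Model(inputs=conv_base.input, outputs=output_layer)
    model.compile(optimizer=optimizer,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```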
Training Without Fine-Tuning
Now we'll define parameters similar to those in the first article, but with a larger input shape. Then we'll create the model without fine-tuning:
Our compiled model contains the pre-trained weights and layers of VGG16. In this case, we chose to set fine_tune=0, which will freeze all pre-trained layers.
This model will perform feature extraction using the frozen pre-trained layers and train a Fully-Connected layer for predictions. For more info on the callbacks used and the fit parameters, see this section of the previous article.
We can now train the model defined above:
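The training call follows the usual Keras fit pattern with the callbacks mentioned above. Here is a runnable sketch using a tiny stand-in model and random data; in the tutorial, the model is the VGG16-based one and the data comes from the traingen/testgen generators:

```python
import os
import tempfile
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Stand-in model and data (assumptions, purely to show the fit pattern)
model = Sequential([Input(shape=(8,)),
                    Dense(4, activation='relu'),
                    Dense(2, activation='softmax')])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

x = np.random.rand(32, 8)
y = np.eye(2)[np.random.randint(0, 2, 32)]

checkpoint_path = os.path.join(tempfile.mkdtemp(), 'best.weights.h5')
callbacks = [
    # Stop once validation loss stops improving
    EarlyStopping(monitor='val_loss', patience=3,
                  restore_best_weights=True),
    # Keep only the best weights seen during training
    ModelCheckpoint(checkpoint_path, monitor='val_loss',
                    save_best_only=True, save_weights_only=True),
]

history = model.fit(x, y, validation_split=0.25,
                    epochs=5, callbacks=callbacks, verbose=0)
```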
Using Pre-trained Layers for Fine-Tuning
Wow! What an improvement from our custom CNN! Integrating VGG16's pre-trained layers with an initialized Fully-Connected layer achieved an accuracy of 73%! But how can we do better?
In this next section, we will re-compile the model but allow for backpropagation to update the last two pre-trained layers.
You'll notice that we compile this Fine-tuning model with a lower learning rate, which helps the newly initialized Fully-Connected layer "warm up" without destroying the robust features the pre-trained layers have already learned, before the network starts picking apart more minute image details.
Just as before, we'll initialize our Fully-Connected layer and its weights for training.
An accuracy of 81%! Amazing what unfreezing the last convolutional layers can do for model performance. Let's get a better idea of how our different models have performed in classifying the data.
In addition to comparing the models created in this article, we will also want to compare the last article's custom model. At the beginning of this article, we loaded the from-scratch model's learned weights, so we need to make predictions to compare against the transfer learning models.
Since our last model had a different image size target, we first need to make a new ImageDataGenerator to make predictions. Here's that code:
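A sketch of that generator (the target size and directory argument are placeholders for the values from the previous article; the point is that the custom CNN expects smaller, [0, 1]-scaled images rather than VGG16 preprocessing):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Placeholder: substitute your custom CNN's actual input size
OLD_TARGET_SIZE = (128, 128)

# The from-scratch model used simple [0, 1] rescaling, not preprocess_input
old_datagen = ImageDataGenerator(rescale=1./255)

def make_old_flow(test_dir, batch_size=1):
    """Stream test images at the custom CNN's original input size."""
    return old_datagen.flow_from_directory(
        test_dir, target_size=OLD_TARGET_SIZE,
        class_mode='categorical', batch_size=batch_size, shuffle=False)
```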
We now have predictions for all three models we want to compare. Below is a function for visualizing class-wise predictions in a confusion matrix using the heatmap method of Seaborn, a visualization library. Confusion matrices are NxN matrices, where N is the number of classes, with predicted and true labels plotted along the X- and Y-axes, respectively. Essentially, this tells us how many correct and incorrect classifications each model made by comparing the predicted class against the true class. Naturally, the larger the values down the diagonal, the better the model did.
Here's our visualization code:
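One way to write that plotting function (the styling choices are assumptions; the function draws one model's confusion matrix onto a supplied axis so three models can share a figure):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_heatmap(y_true, y_pred, class_names, ax, title):
    """Draw one model's class-wise predictions as a Seaborn heatmap."""
    cm = confusion_matrix(y_true, y_pred)
    sns.heatmap(cm, annot=True, square=True,
                xticklabels=class_names, yticklabels=class_names,
                fmt='d', cmap=plt.cm.Blues, cbar=False, ax=ax)
    ax.set_title(title, fontsize=12)
    ax.set_xlabel('Predicted Label')
    ax.set_ylabel('True Label')
```

To compare models, create a figure with one axis per model and call plot_heatmap once for each model's predictions.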
The transfer learning model with fine-tuning is the best, evident from the stronger diagonal and lighter cells everywhere else. We can also see from the confusion matrix that this model most commonly misclassifies apple pie as bread pudding. Overall, though, it's a clear winner.
Recall that the accuracies of our Custom CNN, Transfer Learning Model with Feature Extraction, and Fine-Tuned Transfer Learning Model were 58%, 73%, and 81%, respectively.
We saw improved performance on our dataset once we introduced fine-tuning. Selecting the appropriate number of layers to unfreeze can require careful experimentation.
Other parameters to consider when training your network include:
- Optimizers: in this article, we used the Adam optimizer to update our weights during training. When training your network, you should experiment with other optimizers and their learning rate.
- Dropout: recall that Dropout is a form of regularization to prevent overfitting of the network. We introduced a single dropout layer in our Fully-Connected layer to constrain the network from over-learning certain features.
- Fully-Connected Layer: if you are taking a bootstrapped approach to Transfer Learning, ensure that your Fully-Connected layer is structured appropriately for the classification task. Is the number of input nodes correct for the outputted features? Do we have too many densely-connected layers?
In this article, we solved an image classification problem on a custom dataset using Transfer Learning. We saw that by employing various Transfer Learning strategies, such as Fine-Tuning, we can generate a model that outperforms a custom-written CNN. Some key takeaways:
- Transfer learning can be a great starting point for training a model when you do not possess a large amount of data.
- Transfer learning requires that a model has been pre-trained on a robust source task which can be easily adapted to solve a smaller target task.
- Transfer learning is easily accessible through the Keras API. You can find available pre-trained models here.
- Fine-Tuning a portion of pre-trained layers can boost model performance significantly.
Further Reading
Convolutional Neural Networks – Andrew Ng, Coursera
Andrew Ng's Deep Learning course on CNNs contains videos that offer detailed explanations of CNN concepts.
CS231n Convolutional Neural Networks for Visual Recognition – Stanford University
These notes accompany the Stanford University course and are updated regularly.