# Transfer Learning
[[Convolutional Neural Networks (CNN)]] are well suited for transfer learning: even if our target dataset T is not large, we can still train a CNN for it by first pre-training a network on a (large) source dataset S.
Then, there are two approaches:
- Fine-tuning layers
- CNN as feature extractor
Note: if the weights cannot be copied directly, transfer learning is still possible via methods such as distillation.
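A minimal sketch of how such a soft-target distillation loss could look in PyTorch; the temperature `T` and mixing weight `alpha` are illustrative defaults, not values from the lecture:
```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target distillation: mimic the teacher's softened outputs
    while also fitting the true labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient magnitudes comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```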
## Fine-tuning layers
![[transfer-learning-matrix.jpg]]
Assume the parameters trained on S are already a good initial solution; use them as the initial parameters of the new CNN for the target dataset T:
$$
w_{l,\text{init}}^{T} = w_{l}^{S} \quad \text{for layers } l = 1, 2, \ldots
$$
This works best when the source dataset S is large and the target dataset T is (relatively) small, e.g. reusing parameters from ImageNet models for smaller datasets.
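A minimal PyTorch/torchvision sketch of this initialization; ResNet-18, the ImageNet weights and the 10-class target task are just illustrative choices:
```python
import torch.nn as nn
import torchvision.models as models

# Network pre-trained on the source dataset S (here: ImageNet weights from torchvision)
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Replace the classifier head to match the target dataset T (hypothetical 10 classes)
model.fc = nn.Linear(model.fc.in_features, 10)

# All remaining layers keep w_l^S as their initialization and are then fine-tuned on T
```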
### But what layers to tune and how?
Classifier layer to loss
- The loss layer essentially is the "classifier"
- Same labels -> keep the weights from $h_{S}$
- Different labels -> delete the layer and start over
- When there is too little data, fine-tune only this layer
Fully connected layers
- Very important for fine-tuning
- Maybe delete the last layer before the classification layer if datasets are very different
- They combine spatial features, so more semantic information is present
- If you have more data, fine-tune these layers first
Upper convolutional layers (conv4, conv5)
- Mid-level spatial features (face, wheel detectors ...)
- Can be different from dataset to dataset
- Capture more generic information
- Fine-tuning pays off
- Fine-tune if dataset is big enough
Lower convolutional layers (conv1, conv2)
- They capture low level information
- This information usually does not change across datasets
- Probably no need to fine-tune at this level, though there is no harm in trying (see the freezing sketch below)
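A sketch of one possible split, mapping the advice above onto torchvision's ResNet-18 block names; the exact freezing boundary and the 10-class head are assumptions for illustration:
```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Lower conv layers capture low-level, generic features -> freeze them
for block in [model.conv1, model.bn1, model.layer1, model.layer2]:
    for p in block.parameters():
        p.requires_grad = False

# Upper conv blocks (layer3, layer4) and the fully connected part stay trainable.
# Different labels -> delete the old classifier layer and start over:
model.fc = nn.Linear(model.fc.in_features, 10)  # hypothetical 10-class target task
```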
### How to fine-tune?
For layers initialized from $h_{S}$, use a mild learning rate
- Your network is already close to a near optimum
- If too aggressive, learning might diverge
- A learning rate of 0.001 is a good starting choice (assuming 0.01 was the original learning rate)
For completely new layers (e.g. the classifier/loss layer), use an aggressive learning rate
- If too small, the training will converge very slowly
- The rest of the network is near a solution, this layer is very far from one
- A learning rate of 0.01 is a good starting choice (see the parameter-group sketch below)
If the datasets are very similar, fine-tune only the fully connected layers
If the datasets are different and you have enough data, fine-tune all layers
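A sketch of these two learning rates expressed as optimizer parameter groups in PyTorch; the ResNet-18 backbone and the `fc`-head split are assumptions carried over from the examples above:
```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 10)  # new, randomly initialized classifier

# Layers initialized from h_S are already near an optimum -> mild learning rate;
# the brand-new classifier layer is far from a solution -> aggressive learning rate.
pretrained_params = [p for n, p in model.named_parameters() if not n.startswith("fc.")]
optimizer = torch.optim.SGD(
    [
        {"params": pretrained_params, "lr": 0.001},
        {"params": model.fc.parameters(), "lr": 0.01},
    ],
    lr=0.001,
    momentum=0.9,
)
```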
## CNN as feature extractor
Similar to the fine-tuning above where you train only the loss layer, but here the CNN is used to extract a vector of abstract features and an external classifier is trained on top of them. Essentially, the network is used as a pretrained feature extractor.
When to do this instead? When the target dataset T is very small:
- Any fine-tuning of layers might cause overfitting
- Or when we don't have the resources to train a deep net
- Or when we don't need the best possible accuracy; the extracted features still give good performance (see the sketch below)
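A sketch of this setup, with a torchvision ResNet-18 as the frozen feature extractor and a scikit-learn logistic regression as the external classifier; both choices (and the data variables in the comments) are illustrative:
```python
import torch
import torch.nn as nn
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

# Pretrained CNN with its classifier removed -> it now outputs a 512-d feature vector
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_features(images):
    # images: (N, 3, 224, 224) tensor, already resized and normalized for ImageNet models
    return backbone(images).cpu().numpy()

# Train an external classifier on the extracted features of the small target dataset T,
# e.g. (X_train, y_train and X_test are hypothetical tensors from T):
# clf = LogisticRegression(max_iter=1000).fit(extract_features(X_train), y_train)
# preds = clf.predict(extract_features(X_test))
```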
---
## References
1. Lecture 4.5, UvA DL course 2020
2. Guest Lecture, Thomas Mensink, Google Amsterdam