# Transfer Learning
[[Convolutional Neural Networks (CNN)]] are well suited for transfer learning: even if our target dataset T is not large, we can still train a CNN for it by first pre-training a network on a (large) source dataset S.
Then, there are two approaches:
- Fine-tuning layers
- CNN as feature extractor
Note: if the weights cannot be copied directly, transfer learning is still possible via methods such as distillation.
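A minimal sketch of how such a soft-target distillation loss could look in PyTorch; the temperature `T` and mixing weight `alpha` are illustrative defaults, not values from the lecture:
```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target distillation: mimic the teacher's softened outputs
    while also fitting the true labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient magnitudes comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```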
## Fine-tuning layers
![[transfer-learning-matrix.jpg]]
Assume the parameters trained on S are already a good initial solution; use them as the initial parameters of the new CNN for the target dataset T:
$$
w_{l,\text{init}}^{T} = w_{l}^{S} \quad \text{for layers } l = 1, 2, \ldots
$$
This works best when the source dataset S is large and the target dataset T is (relatively) small, e.g. reusing parameters from ImageNet models for smaller datasets.
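A minimal PyTorch/torchvision sketch of this initialization; ResNet-18, the ImageNet weights and the 10-class target task are just illustrative choices:
```python
import torch.nn as nn
import torchvision.models as models

# Network pre-trained on the source dataset S (here: ImageNet weights from torchvision)
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Replace the classifier head to match the target dataset T (hypothetical 10 classes)
model.fc = nn.Linear(model.fc.in_features, 10)

# All remaining layers keep w_l^S as their initialization and are then fine-tuned on T
```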
### But what layers to tune and how?
Classifier layer to loss
- The loss layer essentially is the "classifier"
- Same labels -> keep the weights from $h_{S}$
- Different labels -> delete the layer and start over
- When there is too little data, fine-tune only this layer
Fully connected layers
- Very important for fine-tuning
- Maybe delete the last layer before the classification layer if datasets are very different
- They combine spatial features, so more semantic information is present
- If you have more data, fine-tune these layers first
Upper convolutional layers (conv4, conv5)
- Mid-level spatial features (face, wheel detectors ...)
- Can be different from dataset to dataset
- Capture more generic information
- Fine-tuning pays off
- Fine-tune if dataset is big enough
Lower convolutional layers (conv1, conv2)
- They capture low level information
- This information usually does not change across datasets
- Probably no need to fine-tune at this level, though there is no harm in trying (see the freezing sketch below)
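A sketch of one possible split, mapping the advice above onto torchvision's ResNet-18 block names; the exact freezing boundary and the 10-class head are assumptions for illustration:
```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Lower conv layers capture low-level, generic features -> freeze them
for block in [model.conv1, model.bn1, model.layer1, model.layer2]:
    for p in block.parameters():
        p.requires_grad = False

# Upper conv blocks (layer3, layer4) and the fully connected part stay trainable.
# Different labels -> delete the old classifier layer and start over:
model.fc = nn.Linear(model.fc.in_features, 10)  # hypothetical 10-class target task
```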
### How to fine-tune?
For layers initialized from $h_{S}$, use a mild learning rate
- Your network is already close to a near optimum
- If too aggressive, learning might diverge
- A learning rate of 0.001 is a good starting choice (assuming 0.01 was the original learning rate)
For completely new layers (e.g. the classifier/loss layer), use an aggressive learning rate
- If too small, the training will converge very slowly
- The rest of the network is near a solution, this layer is very far from one
- A learning rate of 0.01 is a good starting choice (see the parameter-group sketch below)
If the datasets are very similar, fine-tune only the fully connected layers
If the datasets are different and you have enough data, fine-tune all layers
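A sketch of these two learning rates expressed as optimizer parameter groups in PyTorch; the ResNet-18 backbone and the `fc`-head split are assumptions carried over from the examples above:
```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 10)  # new, randomly initialized classifier

# Layers initialized from h_S are already near an optimum -> mild learning rate;
# the brand-new classifier layer is far from a solution -> aggressive learning rate.
pretrained_params = [p for n, p in model.named_parameters() if not n.startswith("fc.")]
optimizer = torch.optim.SGD(
    [
        {"params": pretrained_params, "lr": 0.001},
        {"params": model.fc.parameters(), "lr": 0.01},
    ],
    lr=0.001,
    momentum=0.9,
)
```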
## CNN as feature extractor
Similar to the fine-tuning above where you train only the loss layer, but here the CNN is used to extract a vector of abstract features and an external classifier is trained on top of them. Essentially, the network is used as a pretrained feature extractor.
When to do this instead? When the target dataset T is very small:
- Any fine-tuning of layers might cause overfitting
- Or when we don't have the resources to train a deep net
- Or when we don't need the best possible accuracy; the extracted features still give good performance (see the sketch below)
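A sketch of this setup, with a torchvision ResNet-18 as the frozen feature extractor and a scikit-learn logistic regression as the external classifier; both choices (and the data variables in the comments) are illustrative:
```python
import torch
import torch.nn as nn
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

# Pretrained CNN with its classifier removed -> it now outputs a 512-d feature vector
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_features(images):
    # images: (N, 3, 224, 224) tensor, already resized and normalized for ImageNet models
    return backbone(images).cpu().numpy()

# Train an external classifier on the extracted features of the small target dataset T,
# e.g. (X_train, y_train and X_test are hypothetical tensors from T):
# clf = LogisticRegression(max_iter=1000).fit(extract_features(X_train), y_train)
# preds = clf.predict(extract_features(X_test))
```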
---
## References
1. Lecture 4.5, UvA DL course 2020
2. Guest Lecture, Thomas Mensink, Google Amsterdam