Curriculum Learning

Inspired by, and reposted from, a gist and Quora.

1. Curriculum Learning

Introduction

  • Curriculum Learning - When training machine learning models, start with easier subtasks and gradually increase the difficulty level of the tasks.
  • The motivation comes from the observation that humans and animals seem to learn better when trained with a curriculum-like strategy.
  • Link to the paper.

Contributions of the paper

  • Explore cases that show that curriculum learning benefits machine learning.
  • Offer hypotheses about when and why it helps.
  • Explore the relation of curriculum learning to other machine learning approaches.

Experiments with convex criteria

  • Training a perceptron where some inputs are irrelevant (not predictive of the target class).
  • Difficulty can be defined in terms of the number of irrelevant inputs or the margin from the separating hyperplane.
  • The curriculum-trained model outperforms the no-curriculum approach.
  • Surprisingly, when difficulty is defined in terms of the number of irrelevant inputs, the anti-curriculum strategy also outperforms the no-curriculum strategy.
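
The three orderings compared above can be sketched roughly as follows. This is a minimal illustration and not the paper's setup: the synthetic data, the target rule (sign of the sum of the relevant inputs), and the helper names `make_example`, `order_examples`, `train_perceptron`, and `error_rate` are all assumptions made for this example. Difficulty is taken to be the number of irrelevant inputs that carry noise.

```python
import random

def make_example(rng, n_relevant=5, n_irrelevant=5):
    """One synthetic example: the label depends only on the relevant inputs."""
    relevant = [rng.uniform(-1, 1) for _ in range(n_relevant)]
    # Assumed difficulty knob: how many irrelevant inputs carry noise.
    n_noisy = rng.randint(0, n_irrelevant)
    irrelevant = [rng.uniform(-1, 1) if i < n_noisy else 0.0
                  for i in range(n_irrelevant)]
    label = 1 if sum(relevant) >= 0 else -1
    return relevant + irrelevant, label, n_noisy

def order_examples(examples, strategy, rng):
    """Return the examples in curriculum, anti-curriculum, or random order."""
    if strategy == "curriculum":          # easiest first
        return sorted(examples, key=lambda ex: ex[2])
    if strategy == "anti-curriculum":     # hardest first
        return sorted(examples, key=lambda ex: ex[2], reverse=True)
    shuffled = list(examples)             # no curriculum: random order
    rng.shuffle(shuffled)
    return shuffled

def train_perceptron(ordered_examples, n_inputs):
    """Single online pass of the classic perceptron update rule."""
    w = [0.0] * n_inputs
    b = 0.0
    for x, y, _ in ordered_examples:
        activation = sum(wi * xi for wi, xi in zip(w, x)) + b
        if y * activation <= 0:           # misclassified: move the hyperplane
            w = [wi + y * xi for wi, xi in zip(w, x)]
            b += y
    return w, b

def error_rate(w, b, examples):
    """Fraction of examples the perceptron gets wrong."""
    wrong = sum(1 for x, y, _ in examples
                if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0)
    return wrong / len(examples)

rng = random.Random(0)
train = [make_example(rng) for _ in range(500)]
test = [make_example(rng) for _ in range(500)]
results = {}
for strategy in ("curriculum", "anti-curriculum", "none"):
    w, b = train_perceptron(order_examples(train, strategy, rng), n_inputs=10)
    results[strategy] = error_rate(w, b, test)
```

Only the presentation order differs between the three runs; the model, the data, and the update rule are identical. Whatever numbers this toy setup produces should not be read as a reproduction of the paper's results.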

Experiments on shape recognition with datasets having different variability in shapes

  • Standard (target) dataset - Images of rectangles, ellipses, and triangles.
  • Easy dataset - Images of squares, circles, and equilateral triangles.
  • Start performing gradient descent on the easy dataset and switch to the target dataset at a particular epoch (called the switch epoch).
  • For no-curriculum learning, the first epoch is the switch epoch.
  • As the switch epoch increases, the classification error decreases, with the best performance when the switch epoch is half the total number of epochs.
  • The paper does not report results for higher values of the switch epoch.
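
The switch-epoch schedule can be written down in a few lines. This is a sketch under an assumed convention: epochs are 1-indexed and the target dataset is used from the switch epoch onwards. The function names `dataset_for_epoch` and `training_schedule` are made up for this example.

```python
def dataset_for_epoch(epoch, switch_epoch):
    """Return which dataset to train on for a 1-indexed epoch."""
    return "easy" if epoch < switch_epoch else "target"

def training_schedule(total_epochs, switch_epoch):
    """List the dataset used at each epoch, from epoch 1 to total_epochs."""
    return [dataset_for_epoch(epoch, switch_epoch)
            for epoch in range(1, total_epochs + 1)]

# No curriculum: the switch happens at the first epoch, so only target data is used.
no_curriculum = training_schedule(total_epochs=6, switch_epoch=1)

# Curriculum: two epochs on the easy dataset, then the target dataset.
with_curriculum = training_schedule(total_epochs=6, switch_epoch=3)
print(with_curriculum)
# ['easy', 'easy', 'target', 'target', 'target', 'target']
```

Under this convention the no-curriculum baseline is just the special case `switch_epoch=1`, which makes the two settings directly comparable for a fixed total number of epochs.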

Experiments on language modeling

  • The standard dataset is the set of all possible windows of 5 words of Wikipedia text in which every word is among the 20,000 most frequent words.
  • The easy dataset considers only those windows in which every word is among the 5,000 most frequent words in the vocabulary.
  • Each word from the vocabulary is embedded into a d-dimensional feature space using a matrix W (to be learnt).
  • The model predicts a score for the next word, given a window of words.
  • The expected value of a ranking loss function is minimized to learn W.
  • The curriculum-trained model overtakes the no-curriculum model soon after switching to the target vocabulary, indicating that the curriculum-based model quickly learns new words.
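
A rough sketch of the two ingredients above: filtering windows by vocabulary size and a ranking loss over the embedding matrix W. Several things here are assumptions for illustration: word ids are taken to be frequency ranks (0 is the most frequent word), the scorer is a simple linear function with a hypothetical weight vector `v` rather than the paper's network, and the loss is a margin-based (hinge) ranking loss that compares the true window against copies whose last word is replaced.

```python
import random

def in_vocabulary(window, vocab_limit):
    """True if every word id in the window is among the vocab_limit most frequent words."""
    return all(word < vocab_limit for word in window)

def score(window, W, v):
    """Assumed linear scorer: dot product of v with the concatenated word embeddings."""
    features = [x for word in window for x in W[word]]
    return sum(vi * xi for vi, xi in zip(v, features))

def ranking_loss(window, W, v, margin=1.0):
    """Average hinge loss over all replacements of the last word in the window."""
    vocab_size = len(W)
    true_score = score(window, W, v)
    total = 0.0
    for w in range(vocab_size):
        corrupted = window[:-1] + [w]
        # When w is the true last word, this term is exactly `margin`.
        total += max(0.0, margin - true_score + score(corrupted, W, v))
    return total / vocab_size

# Tiny toy setup: 6 words, 2-dimensional embeddings, windows of 5 words.
rng = random.Random(0)
vocab_size, d, window_size = 6, 2, 5
W = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(vocab_size)]
v = [rng.uniform(-0.1, 0.1) for _ in range(d * window_size)]
window = [0, 3, 1, 2, 4]
loss = ranking_loss(window, W, v)
```

In a curriculum run, `in_vocabulary(window, 5000)` would select the easy windows and `in_vocabulary(window, 20000)` the target ones; the loss itself stays the same in both phases.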

Curriculum as a continuation method

  • Continuation methods start with a smoothed objective function and gradually move to less smoothed functions.
  • Useful when the objective function is non-convex.
  • Consider a family of cost functions C_λ(θ) such that C_0(θ) can be easily optimized and C_1(θ) is the actual objective function.
  • Start with C_0(θ) and increase λ, keeping θ at a local minimum of C_λ(θ).
  • The idea is to move θ towards a dominant (if not global) minimum of C_1(θ).
  • Curriculum learning can be seen as a sequence of training criteria starting with an easy-to-optimise objective and moving all the way to the actual objective.
  • The paper provides a mathematical formulation of curriculum learning in terms of a target training distribution and a weight function (to model the probability of selecting any one training example at any step).
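
One way to read the last point: at stage λ, training examples z are drawn from a reweighted distribution Q_λ(z) ∝ W_λ(z) P(z), where P is the target training distribution and the weights satisfy W_1(z) = 1, so that the final stage samples from the target distribution itself. The sketch below assumes a uniform P over a finite training set and a hard-threshold weight function; that weight function, the difficulty scores, and the names `weight` and `sample_batch` are hypothetical choices for illustration.

```python
import random

def weight(difficulty, lam, max_difficulty):
    """Assumed weight function: 1 for examples easy enough at stage lam, else 0."""
    return 1.0 if difficulty <= lam * max_difficulty else 0.0

def sample_batch(examples, difficulties, lam, batch_size, rng):
    """Draw a batch from the reweighted distribution at stage lam (0 <= lam <= 1).

    At least one example needs a non-zero weight at the chosen lam.
    """
    max_d = max(difficulties)
    weights = [weight(d, lam, max_d) for d in difficulties]
    return rng.choices(examples, weights=weights, k=batch_size)

examples = ["a", "b", "c", "d", "e"]
difficulties = [0, 1, 2, 3, 4]   # made-up difficulty scores
rng = random.Random(0)

# Early stage: only the two easiest examples can be drawn.
early = sample_batch(examples, difficulties, lam=0.25, batch_size=8, rng=rng)

# Final stage: every example has weight 1, i.e. the target distribution.
late = sample_batch(examples, difficulties, lam=1.0, batch_size=8, rng=rng)
```

A switch-epoch schedule is the special case with only two stages; a smoother weight function would add harder examples gradually instead of all at once.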

Advantages of Curriculum Learning

  • Faster training in the online setting, as the learner does not try to learn difficult examples before it is ready for them.
  • Guiding training towards better local minima in parameter space, specifically useful for non-convex methods.

Relation to other machine learning approaches

  • Unsupervised pre-training - Like curriculum learning, it has a regularizing effect and lowers the generalization error for the same training error.
  • Active learning - The learner would benefit most from the examples that are close to the learner’s frontier of knowledge and are neither too hard nor too easy.
  • Boosting algorithms - Difficult examples are gradually emphasised, though the curriculum starts with a focus on easier examples and the training criteria do not change.
  • Transfer learning and lifelong learning - Initial tasks are used to guide the optimisation problem.

Criticism

  • Curriculum Learning is not well understood, making it difficult to define the curriculum.
  • In one of the examples, anti-curriculum performs better than no-curriculum. Given that curriculum learning is modeled on the idea that learning benefits when examples are presented in order of increasing difficulty, one would expect anti-curriculum to perform worse.

2. Some Thoughts

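There is a lot of discussion online about the difference between transfer learning and fine-tuning. The basic distinction is that transfer learning is a concept, while fine-tuning is a concrete method for implementing it.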

The difference between curriculum learning and transfer learning lies mainly in the domain being learned.

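  • Curriculum learning focuses on how to split a hard task into stages ordered by difficulty, so that learning the easy lessons first lays a foundation for what follows and makes it easier for the model to converge. By analogy, a student who wants to understand advanced mathematics starts with elementary mathematics and increases the difficulty step by step.
  • Transfer learning, by contrast, focuses on how to transfer a model learned in one domain to a new domain. By analogy, a student who has learned mathematics then works out how to apply it within physics.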

Reference:
[1] http://valser.org/thread-513-1-1.html
[2] https://www.quora.com/What-is-the-difference-between-transfer-learning-and-fine-tuning