In this post, I focus on the theoretical knowledge you need to follow the latest trends in NLP. I built this reading list as I learned new concepts. The resources include papers, blogs, and videos.
You do not need to read most of this material in full. For each resource, your main goal should be to understand what the paper introduced, how it works, and how it compares with the state of the art.
Trend: Use bigger transformer-based models and solve multi-task learning.
fastai:- I had already watched the videos, so I thought I should add it to the top of the list.
LSTM:- Although transformers dominate nowadays, you can still use an LSTM in some cases, and it was the first successful model to get good results. If you want to use one now, use AWD_LSTM.
AWD_LSTM:- It was proposed to overcome the shortcomings of LSTM by introducing dropout between hidden layers, embedding dropout, and weight tying. You should use AWD_LSTM instead of LSTM.
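To make weight tying concrete, here is a minimal sketch in plain Python (no deep learning library, so it runs anywhere). It is not the AWD_LSTM code. The class name `TiedLanguageModel` and its methods are hypothetical, made up for this illustration. The only point is that one matrix serves both as the input embedding table and as the output projection, which assumes the hidden size equals the embedding size.

```python
import random


class TiedLanguageModel:
    """Toy illustration of weight tying (hypothetical class, not a library API)."""

    def __init__(self, vocab_size, emb_dim, seed=0):
        rng = random.Random(seed)
        # One shared parameter matrix of shape (vocab_size, emb_dim).
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(emb_dim)]
            for _ in range(vocab_size)
        ]

    def embed(self, token_id):
        # Input side: look up the row that belongs to this token.
        return self.weight[token_id]

    def logits(self, hidden):
        # Output side: score every vocabulary word with the same rows,
        # so no separate output matrix has to be learned.
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.weight]


model = TiedLanguageModel(vocab_size=5, emb_dim=3)
scores = model.logits(model.embed(2))  # one score per vocabulary word
```

In PyTorch the same idea is usually expressed by letting the output linear layer share its weight parameter with the embedding layer.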
Pointer Models:- Although not necessary, it is a good read. You can think of it as pre-attention theory.
Attention:- Remember Attention is not all you need.
- CS224n video explaining attention. The attention part starts at 1:00:55.
- Attention Is All You Need paper. This paper also introduces the Transformer, which is nothing but a stack of encoder and decoder blocks. The magic is in how these blocks are made and connected.
- Read an annotated version of the above paper in PyTorch.
- Official video explaining Attention
- Google blog for Transformer
- If you prefer video, you can check link1 and link2.
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context paper. A better version of the Transformer, but BERT does not use it.
- Google blog for Transformer-XL
- Transformer-XL — Combining Transformers and RNNs Into a State-of-the-art Language Model blog
- For video check this link.
- The Illustrated Transformer blog
- Attention and Memory in Deep Learning and NLP blog.
- Attention and Augmented Recurrent Neural Networks blog.
- Building the Mighty Transformer for Sequence Tagging in PyTorch: Part 1 blog.
- Building the Mighty Transformer for Sequence Tagging in PyTorch: Part 2 blog.
- Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) [blog](http://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)
- Character-Level Language Modeling with Deeper Self-Attention paper.
- Using the Output Embedding to Improve Language Models paper.
- Quasi-Recurrent Neural Networks paper. A very fast version of LSTM. It uses convolution layers to make LSTM computations parallel. Code can be found in the fastai_library or official_code.
- Deep Learning for NLP Best Practices blog by Sebastian Ruder. A collection of best practices to be used when training LSTM models.
- Notes on the state of the art techniques for language modeling blog. A quick summary in which Jeremy Howard covers some of the tricks he uses in the fastai library.
- Language Models and Contextualized Word Embeddings blog. Gives a quick overview of ELMo, BERT, and other models.
- The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) blog.
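The core operation behind the attention resources above is scaled dot-product attention, as defined in the Attention is all you need paper: softmax(QK^T / sqrt(d_k)) V. Below is a minimal sketch in plain Python with nested lists, written for readability rather than speed. The function names are my own, and it leaves out masking, multiple heads, and the learned projections that a real Transformer layer adds.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of query, key and value vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


outputs, weights = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
)
```

Here the query matches the first key more closely, so the first value vector receives the larger weight.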
Multi-task Learning:- I am really excited about this. Here you train a single model for multiple tasks (more than 10 if you want). Your data looks like “translate to english some_text_in_german”, and the model learns to use that initial information to choose which task it should perform.
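As a rough sketch of that data format, the snippet below prefixes each input with a plain-text task description and mixes examples from several tasks into one training set. The helper names (`format_example`, `build_mixed_dataset`) and the "summarize" task are hypothetical, chosen only for illustration. Real multi-task setups differ in how they word the prefix and how they sample across tasks.

```python
import random


def format_example(task_prefix, source_text, target_text):
    # The task description is simply prepended to the input text.
    return {"input": f"{task_prefix} {source_text}", "target": target_text}


def build_mixed_dataset(task_data, seed=0):
    """task_data maps a task prefix to a list of (source, target) pairs."""
    examples = [
        format_example(prefix, src, tgt)
        for prefix, pairs in task_data.items()
        for src, tgt in pairs
    ]
    # Shuffle so that a single model sees all tasks interleaved.
    random.Random(seed).shuffle(examples)
    return examples


dataset = build_mixed_dataset({
    "translate to english": [("some_text_in_german", "some_text_in_english")],
    "summarize": [("some_long_article", "some_short_summary")],
})
```

Because the task lives in the input text itself, adding a new task only means adding new prefixed examples, not a new output head.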
PyTorch:- PyTorch provides good tutorials that give you solid references on how to code up most of the stuff in NLP.
ELMo:- The first prominent work that moved from pretrained word embeddings to pretrained models for producing the word embeddings. You feed in the input sentence and get embeddings for the tokens in that sentence.
ULMFiT:- Is this better than BERT? Maybe not, but ULMFiT still takes first place in Kaggle competitions and external competitions.
OpenAI GPT:- I have not compared BERT with GPT2, but you can work on some kind of ensemble if you want. Do not use GPT1, as BERT was made to overcome the limitations of GPT1.
BERT:- The most successful language model right now (as of May 2019).
To use all these models in PyTorch/TensorFlow, you can use huggingface/transformers, which gives complete implementations along with pretrained models for BERT, GPT1, GPT2, and TransformerXL.
Congrats, you made it to the end. You now have most of the theoretical knowledge needed to practice NLP using the latest models and techniques.
What to do now? You have only learned the theory, so now practice as much as you can.