Main contributions
GPT-3 is an autoregressive language model with 175B parameters, 10x larger than any previous non-sparse language model. It was evaluated purely through in-context learning, with no fine-tuning, and showed competitive performance across many NLP benchmarks.

💡 Zero-shot learning: the model performs a task given only a natural-language instruction, with no demonstrations and no gradient updates.
Key concepts:
Example (zero-shot prompt):
Prompt: “Summarize the following article: [Insert article text here].”
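A minimal sketch of how such prompts are built, contrasting zero-shot with few-shot (in-context) prompting; the function names and prompt templates below are hypothetical illustrations, not from the paper:

```python
# Sketch: zero-shot vs. few-shot (in-context) prompt construction.
# Function names and prompt templates are hypothetical.

def zero_shot_prompt(article: str) -> str:
    # Zero-shot: a natural-language instruction only, no demonstrations.
    return f"Summarize the following article: {article}"

def few_shot_prompt(demos: list[tuple[str, str]], article: str) -> str:
    # Few-shot: (article, summary) demonstrations are placed in the
    # context window; the model infers the task with no gradient updates.
    shots = "\n\n".join(f"Article: {a}\nSummary: {s}" for a, s in demos)
    return f"{shots}\n\nArticle: {article}\nSummary:"

print(zero_shot_prompt("NASA launched a new space telescope today..."))
```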
Task-agnostic: the model's ability to perform a wide variety of tasks without being specifically trained or fine-tuned on those tasks.
Initially, word vectors (from word2vec, GloVe) were used to build single-layer representations, which were then fed to task-specific architectures.
Later, RNNs with multiple layers and contextual states enhanced these representations (though they were still fed to task-specific architectures).
More recently, pre-trained recurrent or transformer models, such as BERT, have been fine-tuned directly for downstream tasks, eliminating the need for task-specific architectures (see the sketch below).
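A minimal sketch of this fine-tuning paradigm, assuming the Hugging Face transformers and datasets libraries; the model (bert-base-uncased) and dataset (IMDB sentiment) are illustrative stand-ins, not choices from the text:

```python
# Sketch: the fine-tuning paradigm, i.e. one task-specific dataset and
# one training run per task. Model/dataset names are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")  # a task-specific labeled dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # pre-trained body + small new head

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

# Every new task repeats this step with its own dataset and training run.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=tokenized["train"],
)
trainer.train()
```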
This fine-tuning paradigm has led to remarkable progress in NLP (question answering, textual entailment, reading comprehension, etc.).
However, a major limitation of this approach is that while the architecture is task-agnostic, it still requires task-specific datasets and task-specific fine-tuning.
This process must be repeated for every new task.
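By contrast, with in-context learning a single frozen model covers new tasks through the prompt alone. A rough sketch, using GPT-2 from transformers as a publicly available stand-in for GPT-3 (the prompts are illustrative):

```python
# Sketch: one frozen model, many tasks, specified entirely in the prompt.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompts = {
    "summarization": "Summarize: The market fell sharply today...\nSummary:",
    "translation": "English: cheese\nFrench:",
    "qa": "Q: What is the capital of France?\nA:",
}

# No per-task dataset and no per-task training run.
for task, prompt in prompts.items():
    out = generator(prompt, max_new_tokens=20)[0]["generated_text"]
    print(task, "->", out)
```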