Pre-trained models

What is a pre-trained model?

A pre-trained model is a saved network that was previously trained on a large dataset, typically for a large-scale task. You can use the pre-trained model as it is, or customize it for a new task through transfer learning.
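A minimal sketch of both options, assuming the Hugging Face transformers library (not prescribed by these notes); the distilbert-base-uncased checkpoint and the two-label classification head are illustrative choices:

```python
# Two ways to use a pre-trained model (sketch; checkpoint name is illustrative).
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased"           # network pre-trained on a large corpus
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# 1) Use the pre-trained network as it is, e.g. as a frozen feature extractor.
encoder = AutoModel.from_pretrained(checkpoint)
inputs = tokenizer("Pre-trained models save a lot of compute.", return_tensors="pt")
features = encoder(**inputs).last_hidden_state   # contextual embeddings, no training needed

# 2) Transfer learning: reuse the pre-trained body and fine-tune it on a new task
#    (here, a 2-class classification head is added on top and trained on your data).
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
for param in model.base_model.parameters():      # optionally freeze the pre-trained encoder
    param.requires_grad = False
# ...then train `model` on labelled examples with a standard PyTorch loop
# or the transformers Trainer.
```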

Why use a pre-trained model?

Training big models (with huge numbers of parameters) from scratch is expensive:

  • It requires a lot of computing power (a rough estimate follows this list)

  • It takes a long time to train

  • It needs very large training corpora
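To make the computing-power point concrete, here is a back-of-the-envelope sketch using the common scaling-law rule of thumb that training a dense transformer costs roughly 6 × parameters × tokens floating-point operations; the GPT-3 and Chinchilla figures are taken from the table further down, and the helper function is purely illustrative:

```python
# Back-of-the-envelope training-compute estimate using the common
# scaling-law approximation: total FLOPs ≈ 6 * parameters * tokens.
def training_flops(params_billion: float, tokens_billion: float) -> float:
    """Approximate training compute in FLOPs for a dense transformer."""
    return 6 * (params_billion * 1e9) * (tokens_billion * 1e9)

# Parameter and token counts from the table below.
for name, params_b, tokens_b in [("GPT-3", 175, 300), ("Chinchilla", 70, 1400)]:
    print(f"{name}: ~{training_flops(params_b, tokens_b):.2e} FLOPs")
# GPT-3: ~3.15e+23 FLOPs, Chinchilla: ~5.88e+23 FLOPs
```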

Let's take a look at the parameters of some of the most powerful language models...

Deep learning neural network parameters until June 2021

Neural network parameters until December 2022

Some of the biggest language models available today:

| Model | Lab | Selected playgrounds | Parameters (B) | Tokens trained (B) | Ratio T:P (Chinchilla scaling) | Training dataset | Announced | Public? | Released | Paper/Repo | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4 | OpenAI | | TBA | | | πŸ†† πŸ“š ⬆ πŸ•Έ πŸŒ‹ | | | | | |
| - | Inflection | | TBA | | | πŸ•Έ | | | | | Devs from DeepMind |
| LaMDA 2 | Google AI | YouTube (video only) | | | | ⬆ πŸ•Έ πŸ‘₯ | May/2022 | 🟑 | TBA | | Chatbot with tiny walled garden demo TBA |
| Fairseq | Meta AI | | 13 & 1100 | | | πŸ†† πŸ“š ⬆ πŸ•Έ πŸ•Έ πŸ•Έ | Dec/2021 | 🟒 | Dec/2021 | | |
| GLaM | Google Inc | | 1200 | | | πŸ†† πŸ“š ⬆ πŸ•Έ πŸ‘₯ | Dec/2021 | πŸ”΄ | N/A | | |
| PaLM | Google Research | | 540 | 780 | 2:1 | πŸ†† πŸ“š ⬆ πŸ•Έ πŸ‘₯ | Apr/2022 | πŸ”΄ | N/A | | |
| MT-NLG | Microsoft/NVIDIA | | 530 | 270 | 1:1 | πŸ†† πŸ“š ⬆ πŸŒ‹ πŸ•Έ πŸ•Έ | Oct/2021 | πŸ”΄ | N/A | | |
| BERT-480 | Google Research | | 480 | | | πŸ†† πŸ“š πŸ•Έ | Nov/2021 | πŸ”΄ | N/A | | Submission to benchmarks. Original dataset was BookCorpus + Wikipedia: https://arxiv.org/pdf/1810.04805.pdf |
| Gopher | DeepMind | | 280 | 300 | 2:1 | πŸ†† πŸ“š ⬆ πŸ•Έ πŸŒ‹ | Dec/2021 | πŸ”΄ | N/A | | |
| Luminous | Aleph Alpha | | | | | πŸ•Έ | Nov/2021 | 🟒 | Apr/2022 | | Devs from EleutherAI |
| Jurassic-1 | AI21 | | 178 | 300 | 2:1 | πŸ†† πŸ“š ⬆ πŸ•Έ | Aug/2021 | 🟒 | Aug/2021 | | Emulated GPT-3 dataset |
| BLOOMZ | BigScience | | 176 | 366 | 3:1 | ⬆ πŸ•Έ | Nov/2022 | 🟒 | Nov/2022 | | Fine-tuned |
| OPT-IML | Meta AI | | 175 | 300 | 2:1 | πŸ†† πŸ“š ⬆ πŸ•Έ | Dec/2022 | 🟒 | Dec/2022 | | Instruct |
| ChatGPT | OpenAI | | 175 | 300 | 2:1 | πŸ†† πŸ“š ⬆ πŸ•Έ | Nov/2022 | 🟒 | Nov/2022 | | Instruct with strict policies ("extremely limited") |
| BlenderBot 3 | Meta AI | blenderbot.ai (US only) | 175 | | | πŸ†† πŸ“š ⬆ πŸ•Έ | Aug/2022 | 🟒 | Aug/2022 | | |
| GPT-3 | OpenAI | | 175 | 300 | 2:1 | πŸ†† πŸ“š ⬆ πŸ•Έ | May/2020 | 🟒 | Nov/2021 | | Popular: 3.1M wpm |
| FLAN | Google | | 137 | | | ⬆ πŸ•Έ πŸ‘₯ | Sep/2021 | πŸ”΄ | N/A | | Fine-tuned LaMDA |
| LaMDA | Google AI | YouTube (video only) | 137 | 168 | 2:1 | ⬆ πŸ•Έ πŸ‘₯ | Jun/2021 | πŸ”΄ | N/A | | Chatbot |
| GLM-130B | Tsinghua & Zhipu | | 130 | 400 | 4:1 | πŸ†† πŸ“š ⬆ πŸ•Έ | Aug/2022 | 🟒 | Aug/2022 | | 50% English (200B tokens), so included here |
| Galactica | Meta AI | | 120 | 450 | 4:1 | πŸ“š | Nov/2022 | 🟒 | Nov/2022 | | Scientific only |
| YaLM 100B | Yandex | GitHub (train/deploy) | 100 | 300 | 3:1 | πŸ†† πŸ“š ⬆ πŸ•Έ | Jun/2022 | 🟒 | Jun/2022 | | |
| Sparrow | DeepMind | | 70 | 1400 | 20:1 | πŸ†† πŸ“š ⬆ πŸ•Έ πŸŒ‹ | Sep/2022 | πŸ”΄ | N/A | | Chatbot as a fine-tuned version of Chinchilla 70B |
| Chinchilla | DeepMind | | 70 | 1400 | 20:1 | πŸ†† πŸ“š ⬆ πŸ•Έ πŸŒ‹ | Mar/2022 | πŸ”΄ | N/A | | First to double tokens per size increase |
| NLLB | Meta AI | GitHub (train/deploy) | 54.5 | | | πŸŒ‹ | Jul/2022 | 🟒 | Jul/2022 | | 54.5B MoE, 3.3B dense. 200+ languages |
| xlarge | Cohere | | 52.4 | | | πŸ“š πŸ•Έ | Sep/2021 | 🟒 | Nov/2021 | | Stealth 'ebooks and webpages'. 52B: https://crfm.stanford.edu/helm/v1.0/?models=1 |
| RL-CAI | Anthropic | | 52 | | | πŸ†† πŸ“š ⬆ πŸ•Έ πŸ‘₯ | Dec/2022 | πŸ”΄ | N/A | | RLAIF = reinforcement learning with AI feedback |
| AlexaTM 20B | Amazon Alexa AI | GitHub (train/deploy) | 20 | 1000 | 50:1 | πŸ†† πŸ•Έ | Aug/2022 | 🟒 | TBA | | Wikipedia and mC4 only. seq2seq |

Key:

πŸ†† Wikipedia

πŸ‘₯ Dialogue

πŸ“š Books

πŸ†€πŸ…° Questions and answers

⬆ Reddit outbound

πŸŒ‹ Special

πŸ•Έ Common Crawl

πŸ‡«πŸ‡· French
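The "Ratio T:P (Chinchilla scaling)" column in the table simply divides the tokens trained by the parameter count; the Chinchilla result suggests roughly 20 training tokens per parameter for compute-optimal training. A minimal sketch of that calculation, using three rows from the table above (the ratios in the table are rounded):

```python
# Tokens-to-parameters ratio, as used in the "Ratio T:P (Chinchilla scaling)"
# column. Chinchilla's result suggests ~20 tokens per parameter is compute-optimal.
def tokens_per_parameter(tokens_billion: float, params_billion: float) -> float:
    return tokens_billion / params_billion

# (tokens trained B, parameters B), taken from the table above.
examples = {
    "GPT-3":      (300, 175),
    "PaLM":       (780, 540),
    "Chinchilla": (1400, 70),
}
for name, (tokens_b, params_b) in examples.items():
    print(f"{name}: {tokens_per_parameter(tokens_b, params_b):.1f} tokens per parameter")
# GPT-3 ~1.7, PaLM ~1.4, Chinchilla 20.0 -- of these three, only Chinchilla
# reaches the ~20:1 Chinchilla-optimal ratio.
```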
