
Chinchilla

Since 2020, AI labs have been releasing ever-larger language models, such as GPT-3 (175B parameters), LaMDA (137B), Jurassic-1 (178B), Megatron-Turing NLG (530B), and Gopher (280B). According to Kaplan’s scaling laws, these models improve on their predecessors (GPT-2, BERT), but they still fall short of their full potential. In their most recent paper, researchers at DeepMind challenge the conventional wisdom that larger models always mean better performance. The company has uncovered a pre