11 Building a generative pretrained Transformer from scratch

 

This chapter covers

  • Building a generative pretrained Transformer from scratch
  • Causal self-attention
  • Extracting and loading weights from a pretrained model
  • Generating coherent text with GPT-2, the predecessor of ChatGPT and GPT-4

Generative Pretrained Transformer 2 (GPT-2) is an advanced large language model (LLM) developed by OpenAI and announced in February 2019. It represents a significant milestone in the field of natural language processing (NLP) and has paved the way for the development of even more sophisticated models, including its successors, ChatGPT and GPT-4.

GPT-2, an improvement over its predecessor, GPT-1, was designed to generate coherent and contextually relevant text from a given prompt, demonstrating a remarkable ability to mimic human-like writing across a wide range of styles and topics. Upon its announcement, OpenAI initially decided not to release the most powerful version of GPT-2 to the public (the 1.5-billion-parameter version, which is also the one you’ll build from scratch in this chapter). The main concern was potential misuse, such as generating misleading news articles, impersonating individuals online, or automating the production of abusive or fake content. This decision sparked a significant debate within the AI and tech communities about the ethics of AI development and the balance between innovation and safety.
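To make the chapter's goal concrete, here is a minimal preview sketch, assuming the Hugging Face transformers library is installed: it loads the pretrained 1.5-billion-parameter GPT-2 XL weights and generates a continuation of a prompt. The chapter itself reproduces this result from scratch in PyTorch by building the model, loading the pretrained weights, and writing a generate() function by hand.

```python
# Preview sketch only (not the book's from-scratch implementation):
# load pretrained GPT-2 XL via the Hugging Face transformers library
# and generate text from a prompt.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")   # ~1.5B parameters
model.eval()

prompt = "The future of artificial intelligence is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Sample a continuation; top-k sampling keeps the output varied but coherent.
output_ids = model.generate(
    input_ids,
    max_length=50,
    do_sample=True,
    top_k=50,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

By the end of the chapter, you'll understand every component hidden behind these few library calls: the BPE tokenizer, the causal self-attention layers, the GELU activation, and the text-generation loop.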

11.1 GPT-2 architecture and causal self-attention

11.1.1 The architecture of GPT-2

11.1.2 Word embedding and positional encoding in GPT-2

11.1.3 Causal self-attention in GPT-2

11.2 Building GPT-2XL from scratch

11.2.1 BPE tokenization

11.2.2 The Gaussian error linear unit activation function

11.2.3 Causal self-attention

11.2.4 Constructing the GPT-2XL model

11.3 Loading up pretrained weights and generating text

11.3.1 Loading up pretrained parameters in GPT-2XL

11.3.2 Defining a generate() function to produce text

11.3.3 Text generation with GPT-2XL

Summary