
4 Data engineering for large language models: Setting up for success
This chapter covers
- Common foundation models used in the industry
- How to evaluate and compare large language models
- Different data sources and how to prepare your own
- Creating your own custom tokenizers and embeddings
- Preparing a Slack dataset to be used in future chapters
Data is like garbage. You’d better know what you are going to do with it before you collect it.
Creating our own LLM is no different from any other ML project in that we start by preparing our assets, and there isn't a more valuable asset than your data. All successful AI and ML initiatives are built on a solid data engineering foundation. It's important, then, that we acquire, clean, prepare, and curate our data carefully.
Unlike with other ML models, you generally won't be starting from scratch when creating an LLM customized for your specific task. Even if you do start from scratch, you'll likely do it only once; from then on, it's best to tweak and polish that model to further refine it for your specific needs. Selecting the right base model can therefore make or break your project. Figure 4.1 gives a high-level overview of the different pieces and assets you'll need to prepare before training or fine-tuning a new model.
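To make the "start from a base model, not from scratch" idea concrete, here is a minimal sketch using Hugging Face's transformers library. The model name gpt2 is only a stand-in for whichever foundation model you end up selecting; the prompt is likewise illustrative.

```python
# A minimal sketch: load a pretrained base model as your starting point
# rather than training from random weights. "gpt2" is a stand-in for
# whichever foundation model you select.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "gpt2"  # hypothetical choice; swap in your own base model

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# From here, you'd fine-tune `model` on your curated dataset instead of
# building a new model from scratch. A quick sanity check that the base
# model loads and generates:
prompt = "Data engineering for LLMs starts with"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because every downstream step (tokenization, fine-tuning, evaluation) inherits the base model's vocabulary and weights, swapping `base_model` later is expensive, which is why choosing it well up front matters so much.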