
4 Data engineering for large language models: Setting up for success
This chapter covers
- Common foundation models used in the industry
- How to evaluate and compare large language models
- Different data sources and how to prepare your own
- Creating your own custom tokenizers and embeddings
- Preparing a Slack dataset to be used in future chapters
Data is like garbage. You’d better know what you are going to do with it before you collect it.
Creating our own LLM is no different from any other ML project in that we start by preparing our assets, and there isn't a more valuable asset than your data. All successful AI and ML initiatives are built on a solid data engineering foundation. It's important, then, that we acquire, clean, prepare, and curate our data carefully.
Unlike with other ML models, you generally won't be starting from scratch when creating an LLM customized for your specific task. Even if you do start from scratch, you'll likely do it only once; from then on, it's best to tweak and polish that model to further refine it for your specific needs. Selecting the right base model can therefore make or break your project. Figure 4.1 gives a high-level overview of the different pieces and assets you'll need to prepare before training or fine-tuning a new model.
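To make the "start from a base model, not from scratch" idea concrete, here is a minimal sketch using Hugging Face's transformers library. The model name gpt2 is only a stand-in for whichever foundation model you end up selecting; the prompt is likewise illustrative.

```python
# A minimal sketch: load a pretrained base model as your starting point
# rather than training from random weights. "gpt2" is a stand-in for
# whichever foundation model you select.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "gpt2"  # hypothetical choice; swap in your own base model

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# From here, you'd fine-tune `model` on your curated dataset instead of
# building a new model from scratch. A quick sanity check that the base
# model loads and generates:
prompt = "Data engineering for LLMs starts with"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because every downstream step (tokenization, fine-tuning, evaluation) inherits the base model's vocabulary and weights, swapping `base_model` later is expensive, which is why choosing it well up front matters so much.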