21 Processing data files

This chapter covers

  • Using extract-transform-load
  • Reading text data files (plain text and CSV)
  • Reading spreadsheet files
  • Normalizing, cleaning, and sorting data
  • Writing data files

Much of the data available is contained in text files. This data can range from unstructured text, such as a corpus of tweets or literary texts, to more structured data in which each row is a record and the fields are delimited by a special character, such as a comma, a tab, or a pipe (|). Text files can be huge; a dataset can be spread over tens or even hundreds of files, and the data in it can be incomplete or horribly dirty. With all the variations, it’s almost inevitable that you’ll need to read and use data from text files. This chapter gives you strategies for using Python to do exactly that.
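
As a small illustration of the structured case described above, where each row is a record and the fields are separated by a delimiter, the following sketch reads pipe-delimited text with the standard library's csv module. The sample data and field names are made up for this example, and an in-memory string stands in for a real file.

```python
import csv
import io

# Hypothetical sample data: one record per row, fields separated by a pipe (|).
sample = "city|state|temp_f\nChicago|IL|72\nKaneohe|HI|81\n"

# io.StringIO stands in for an open text file so the example is self-contained.
with io.StringIO(sample) as data_file:
    reader = csv.reader(data_file, delimiter="|")
    header = next(reader)   # the first row holds the field names
    records = list(reader)  # the remaining rows are the records

print(header)
print(records)
```

Every field comes back as a string, so converting values such as the temperature to numbers is left to a later cleaning step.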

21.1 Welcome to ETL

21.2 Reading text files

21.2.1 Text encoding: ASCII, Unicode, and others

21.2.2 Unstructured text

21.2.3 Delimited flat files

21.2.4 The csv module

21.2.5 Reading a csv file as a list of dictionaries

21.3 Excel files

21.4 Data cleaning

21.4.1 Cleaning

21.4.2 Sorting

21.4.3 Data cleaning problems and pitfalls

21.5 Writing data files

21.5.1 CSV and other delimited files

21.5.2 Writing Excel files

21.5.3 Packaging data files

21.6 Weather observations

21.6.1 Solving the problem with AI-generated code

21.6.2 Solutions and discussion

Summary