7 Unsupervised learning for text data

This chapter covers

Text data analysis: use cases and challenges
Preprocessing and cleaning text data
Vector representation methods for text data
Sentiment analysis and text clustering using Python
Generative AI applications for text data

Everybody smiles in the same language.
—George Carlin

Our world has many languages, and language is the most common medium we use to express our thoughts and emotions. When those words are written down, they become text data. In this chapter, we explore the kinds of analysis we can perform on text data. Text is unstructured data, yet it carries a great deal of useful information, which makes it a valuable source of insights for businesses. We use natural language processing (NLP) to analyze it.

Before we can analyze text data, however, we have to make it analysis-ready. In simple terms, because our algorithms and processors work only with numbers, we have to represent text as numbers or vectors. We will explore all of these steps in this chapter. Text data underpins several important use cases, such as sentiment analysis, document categorization, and language translation. We will work through the use cases with a case study and develop a Python solution for it.
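To make the idea of representing text as numbers concrete, here is a minimal sketch that turns two short sentences into word-count vectors using only the Python standard library. The sample sentences and the helper name `to_count_vector` are illustrative assumptions, not part of the chapter's case study, and the whitespace split stands in for the fuller tokenization and cleaning steps covered later in the chapter.

```python
from collections import Counter

# Two made-up example documents (assumed purely for illustration).
docs = [
    "the service was great",
    "the food was not great",
]

# Lowercase each document and split on whitespace to get word tokens.
tokenized = [doc.lower().split() for doc in docs]

# Build a sorted vocabulary so every word maps to a fixed vector position.
vocabulary = sorted({word for tokens in tokenized for word in tokens})


def to_count_vector(tokens, vocabulary):
    """Hypothetical helper: count how often each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# One numeric vector per document, all with the same length.
vectors = [to_count_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for doc, vector in zip(docs, vectors):
    print(vector, "<-", doc)
```

Each document becomes a list of integers of the same length, which is the kind of input a numerical algorithm can work with. Word order is lost in this representation, a limitation that later sections of the chapter address.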

7.1 Technical toolkit

7.2 Text data is everywhere

7.3 Use cases of text data

7.4 Challenges with text data

7.5 Preprocessing the text data

7.6 Data cleaning

7.7 Extracting features from the text dataset

7.8 Tokenization

7.9 BOW approach

7.10 Term frequency and inverse document frequency

7.11 Language models

7.12 Text cleaning using Python

7.13 Word embeddings

7.14 Word2Vec and GloVe

7.15 Sentiment analysis case study with Python implementation

7.16 Text clustering using Python