Welcome to this free extract from an online version of the Manning book Data Without Labels.

7 Unsupervised learning for text data

 

This chapter covers

  • Text data analysis: use cases and challenges
  • Preprocessing and cleaning text data
  • Vector representation methods for text data
  • Sentiment analysis and text clustering using Python
  • Generative AI applications for text data

Everybody smiles in the same language.
—George Carlin

Our world has many languages, and language is our most common medium for expressing thoughts and emotions. Those words can be written down as text. In this chapter, we explore the kinds of analysis we can perform on text data. Text is unstructured data that carries a great deal of information, which makes it a valuable source of insights for businesses. We use natural language processing (NLP) to analyze it.

Before we can analyze text data, we have to make it analysis-ready. Put simply, because our algorithms and processors work only with numbers, we have to represent the text as numbers or vectors. We will explore each of these steps in this chapter. Text data is central to several important use cases, such as sentiment analysis, document categorization, and language translation. We will cover these use cases through a case study and develop a Python solution for it.
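To make the idea of representing text as numbers concrete, here is a minimal sketch that turns two short sentences into word-count vectors using only the Python standard library. It is an illustration under simple assumptions, not the implementation developed later in the chapter: the helper names (`tokenize`, `build_vocabulary`, `to_count_vector`) and the sample sentences are hypothetical, and the tokenizer is deliberately naive.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive, assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Map every distinct token in the corpus to a column index, in sorted order."""
    tokens = {token for doc in documents for token in tokenize(doc)}
    return {token: index for index, token in enumerate(sorted(tokens))}


def to_count_vector(text, vocabulary):
    """Represent one document as a list of counts, one entry per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts.get(token, 0) for token in vocabulary]


# Hypothetical sample documents, used only for illustration.
documents = [
    "The service was great",
    "The service was slow and the food was cold",
]

vocabulary = build_vocabulary(documents)
vectors = [to_count_vector(doc, vocabulary) for doc in documents]

print(list(vocabulary))
for doc, vector in zip(documents, vectors):
    print(vector, "<-", doc)
```

Each document becomes a vector with one position per vocabulary word, so documents of different lengths end up with the same number of dimensions. In practice a library vectorizer would likely replace these helpers, but the principle is the same.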

7.1 Technical toolkit

7.2 Text data is everywhere

7.3 Use cases of text data

7.4 Challenges with text data

7.5 Preprocessing the text data

7.6 Data cleaning

7.7 Extracting features from the text dataset

7.8 Tokenization

7.9 BOW approach

7.10 Term frequency and inverse document frequency

7.11 Language models

7.12 Text cleaning using Python

7.13 Word embeddings

7.14 Word2Vec and GloVe

7.15 Sentiment analysis case study with Python implementation

7.16 Text clustering using Python