Classical approaches to text representation#


Importing libraries and packages#

# Modelling
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# NLP
import nltk
from nltk import tokenize
from string import punctuation
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

from IPython.display import display, HTML

display(HTML("<style>.container {width:80% !important;}</style>"))

Text preprocessing#

raw_txt = (
    "Most people in Lancre, as the saying goes, went to bed "
    "with the chickens and got up with the cows. [footnote: Er. "
    "That is to say, they went to bed at the same time as the "
    "chickens went to bed, and got up at the same time as the "
    "cows got up. Loosely worded sayings can really cause "
    "misunderstandings.]"
)

Tokenization#

A critical task for both query processing and further parsing is breaking a text up into a sequence of words. Tokenisation converts a string of characters into a sequence of tokens. It is not as simple as splitting the text on whitespace, as the split method in Java or Python does.

Tokenisation involves a set of precision and recall trade-offs. A highly literal tokenisation of the query is likely to be good for precision but bad for recall, while a more aggressive approach that produces multiple alternative tokenisations improves recall at the expense of precision. A naive whitespace split and NLTK's tokeniser are contrasted in the sketch below.
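As a minimal illustration (assuming the punkt model used by NLTK's tokenizer is already available; it is downloaded in the next cell), a plain whitespace split leaves punctuation glued to the neighbouring word, while word_tokenize separates it into its own tokens:

sample = "Loosely worded sayings can really cause misunderstandings."

# Naive whitespace split: the full stop stays attached to the last word
print(sample.split())

# NLTK's word tokenizer treats the full stop as a separate token
print(tokenize.word_tokenize(sample))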

nltk.download("punkt")
True
tokenize.sent_tokenize(raw_txt)
['Most people in Lancre, as the saying goes, went to bed with the chickens and got up with the cows.',
 '[footnote: Er.',
 'That is to say, they went to bed at the same time as the chickens went to bed, and got up at the same time as the cows got up.',
 'Loosely worded sayings can really cause misunderstandings.]']
txt_sents = tokenize.sent_tokenize(raw_txt)
print(txt_sents)
['Most people in Lancre, as the saying goes, went to bed with the chickens and got up with the cows.', '[footnote: Er.', 'That is to say, they went to bed at the same time as the chickens went to bed, and got up at the same time as the cows got up.', 'Loosely worded sayings can really cause misunderstandings.]']
type(txt_sents), len(txt_sents)
(list, 4)
txt_words = [tokenize.word_tokenize(sent) for sent in txt_sents]
type(txt_words), type(txt_words[0])
(list, list)
print(txt_words[:2])
[['Most', 'people', 'in', 'Lancre', ',', 'as', 'the', 'saying', 'goes', ',', 'went', 'to', 'bed', 'with', 'the', 'chickens', 'and', 'got', 'up', 'with', 'the', 'cows', '.'], ['[', 'footnote', ':', 'Er', '.']]

Case folding#

Case folding is often part of text normalisation: since proper capitalisation cannot be relied upon, it is usually best to lower-case everything. Python's .lower() string method does the trick. For example:

# You needn't run this
raw_txt = raw_txt.lower()
raw_txt
'most people in lancre, as the saying goes, went to bed with the chickens and got up with the cows. [footnote: er. that is to say, they went to bed at the same time as the chickens went to bed, and got up at the same time as the cows got up. loosely worded sayings can really cause misunderstandings.]'
txt_sents = [sent.lower() for sent in txt_sents]
txt_sents
['most people in lancre, as the saying goes, went to bed with the chickens and got up with the cows.',
 '[footnote: er.',
 'that is to say, they went to bed at the same time as the chickens went to bed, and got up at the same time as the cows got up.',
 'loosely worded sayings can really cause misunderstandings.]']
txt_words = [tokenize.word_tokenize(sent) for sent in txt_sents]
print(txt_words[:2])
[['most', 'people', 'in', 'lancre', ',', 'as', 'the', 'saying', 'goes', ',', 'went', 'to', 'bed', 'with', 'the', 'chickens', 'and', 'got', 'up', 'with', 'the', 'cows', '.'], ['[', 'footnote', ':', 'er', '.']]

Removing punctuation#

list_punct = list(punctuation)
print(list_punct)
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
def drop_punct(input_tokens):
    return [token for token in input_tokens if token not in list_punct]
drop_punct(["footnote", ":", "er", "."])
['footnote', 'er']
txt_words_nopunct = [drop_punct(sent) for sent in txt_words]
print(txt_words_nopunct)
[['most', 'people', 'in', 'lancre', 'as', 'the', 'saying', 'goes', 'went', 'to', 'bed', 'with', 'the', 'chickens', 'and', 'got', 'up', 'with', 'the', 'cows'], ['footnote', 'er'], ['that', 'is', 'to', 'say', 'they', 'went', 'to', 'bed', 'at', 'the', 'same', 'time', 'as', 'the', 'chickens', 'went', 'to', 'bed', 'and', 'got', 'up', 'at', 'the', 'same', 'time', 'as', 'the', 'cows', 'got', 'up'], ['loosely', 'worded', 'sayings', 'can', 'really', 'cause', 'misunderstandings']]

There are cases where punctuation is important. For example, in sentiment analysis an exclamation mark, a question mark, or even a comma adds value.

Removing stop words#

Stop words like “a”, “and”, “to”, and “be” have little semantic content, and there are a lot of them. They can be excluded outright with a stop list, or they can be kept and handled with good compression techniques (to reduce index space) and query-optimisation techniques. The latter approach matters because some stop words are important in phrases, such as “To be or not to be” and “flights to Amsterdam”, where keeping them greatly enhances precision. The sketch after the stop list below shows how completely a stop list can strip such a phrase.

nltk.download("stopwords")
True
list_stop = stopwords.words("english")
len(list_stop)
179
print(list_stop[:50])
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be']
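As a quick illustration of why dropping stop words can hurt phrase queries, filtering a famous phrase against this stop list removes every token (a minimal sketch):

phrase = tokenize.word_tokenize("to be or not to be")
print([token for token in phrase if token not in list_stop])

The result is an empty list: every word in the phrase is a stop word, so nothing of the query survives.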

Tokenizing, Case Normalization, Punctuation and Stop Word Removal#

raw_txt = (
    "He did of course sometimes have people horribly tortured "
    "to death, but this was considered to be perfectly acceptable "
    "behaviour for a civic ruler and generally approved of by "
    "the overhelming majority of citizens. [footnote: The "
    "overhelming majority of citizens being defined in this case "
    "as everyone not currently hanging upside down over a scorpion "
    "pit]"
)
txt_sents = tokenize.sent_tokenize(raw_txt.lower())
txt_words = [tokenize.word_tokenize(sent) for sent in txt_sents]

stop_punct = list(punctuation)

stop_nltk = stopwords.words("english")

stop_final = stop_punct + stop_nltk
def drop_stop(input_tokens):
    return [token for token in input_tokens if token not in stop_final]
txt_words_nostop = [drop_stop(sent) for sent in txt_words]
print(txt_words_nostop[0])
['course', 'sometimes', 'people', 'horribly', 'tortured', 'death', 'considered', 'perfectly', 'acceptable', 'behaviour', 'civic', 'ruler', 'generally', 'approved', 'overhelming', 'majority', 'citizens']

Stemming#

Stemming reduces terms to their “roots”, quite aggressively at that. The most common stemming approach for English is the Porter stemming algorithm (the Porter stemmer): a collection of rules (heuristics) designed to reflect how English handles inflections. Stemming can increase recall (while sacrificing precision) by matching a word against its other inflected forms. It is included in all major natural language processing libraries.

stemmer_p = PorterStemmer()
print(stemmer_p.stem("going"))
go
txt = (
    "It's going to look pretty good, then, isn't it,' said "
    "War testily, 'the One Horseman and Three Pedestrians of "
    "the Apocralypse."
)
tokens = tokenize.word_tokenize(txt)
print([stemmer_p.stem(word) for word in tokens])
['It', "'s", 'go', 'to', 'look', 'pretti', 'good', ',', 'then', ',', 'is', "n't", 'it', ',', "'", 'said', 'war', 'testili', ',', "'the", 'one', 'horseman', 'and', 'three', 'pedestrian', 'of', 'the', 'apocralyps', '.']

Lemmatization#

Lemmatisation relies on a lexical knowledge base such as WordNet to obtain the correct base forms of words. Its disadvantages are that it cannot handle unknown words and that it requires the word's part of speech to be specified; otherwise, it assumes the word is a noun.

nltk.download("wordnet")
nltk.download("omw-1.4")
True
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("ponies")
'pony'

The WordNet lemmatizer in nltk works only for English.
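Because the part of speech defaults to noun, supplying pos explicitly can change the result; a minimal sketch:

# Treated as a noun by default, "going" is returned unchanged
print(lemmatizer.lemmatize("going"))

# Declared as a verb, it is reduced to its base form "go"
print(lemmatizer.lemmatize("going", pos="v"))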

Stemming data#

The Porter stemmer uses five phases of reduction that are applied sequentially. Each phase consists of a set of rules. Some typical rules are:

  • sses -> ss

  • ies -> i

  • ational -> ate

  • tional -> tion

There are also rules that are sensitive to the measure of a word (the number of vowel-consonant sequences in it), for example:

  • replacement -> replac

  • cement -> cement
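The effect of these rules can be checked directly on the stemmer; a small illustrative sketch (the example words are standard illustrations, not from the text above):

# sses -> ss and ies -> i
print(stemmer_p.stem("caresses"))    # caress
print(stemmer_p.stem("ponies"))      # poni

# measure-sensitive rules: "replacement" is long enough to lose "-ement",
# "cement" is not
print(stemmer_p.stem("replacement"))  # replac
print(stemmer_p.stem("cement"))       # cement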

stemmer_p = PorterStemmer()
print([stemmer_p.stem(token) for token in txt_words_nostop[0]])
['cours', 'sometim', 'peopl', 'horribl', 'tortur', 'death', 'consid', 'perfectli', 'accept', 'behaviour', 'civic', 'ruler', 'gener', 'approv', 'overhelm', 'major', 'citizen']

Applying the stemmer to all the sentences:

txt_words_stem = [
    [stemmer_p.stem(token) for token in sent] for sent in txt_words_nostop
]
txt_words_stem
[['cours',
  'sometim',
  'peopl',
  'horribl',
  'tortur',
  'death',
  'consid',
  'perfectli',
  'accept',
  'behaviour',
  'civic',
  'ruler',
  'gener',
  'approv',
  'overhelm',
  'major',
  'citizen'],
 ['footnot',
  'overhelm',
  'major',
  'citizen',
  'defin',
  'case',
  'everyon',
  'current',
  'hang',
  'upsid',
  'scorpion',
  'pit']]

Both stemming and lemmatisation are effective techniques for expanding recall, with lemmatisation giving up some of that recall to increase precision. For English, this kind of morphological analysis brings only modest benefits for retrieval: it helps recall for some queries but harms precision for others, for example:

  • operative (dentistry), operating (system), operational (research) -> oper

In languages with richer morphology, such as Spanish, German and Finnish, where the variant forms of words are built more systematically, the benefits are greater; for Finnish, performance gains of around 30% have been reported.
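The “oper” example can be reproduced directly; a small sketch contrasting the stemmer with the lemmatizer (which, given no part-of-speech information, leaves these forms untouched):

for word in ["operative", "operating", "operational"]:
    print(word, "->", stemmer_p.stem(word), "/", lemmatizer.lemmatize(word))

All three forms stem to 'oper', while the lemmatizer keeps them distinct, illustrating the recall/precision difference between the two techniques.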

One-Hot encoding#

txt_words_nostop
[['course',
  'sometimes',
  'people',
  'horribly',
  'tortured',
  'death',
  'considered',
  'perfectly',
  'acceptable',
  'behaviour',
  'civic',
  'ruler',
  'generally',
  'approved',
  'overhelming',
  'majority',
  'citizens'],
 ['footnote',
  'overhelming',
  'majority',
  'citizens',
  'defined',
  'case',
  'everyone',
  'currently',
  'hanging',
  'upside',
  'scorpion',
  'pit']]
print(txt_words_nostop)
[['course', 'sometimes', 'people', 'horribly', 'tortured', 'death', 'considered', 'perfectly', 'acceptable', 'behaviour', 'civic', 'ruler', 'generally', 'approved', 'overhelming', 'majority', 'citizens'], ['footnote', 'overhelming', 'majority', 'citizens', 'defined', 'case', 'everyone', 'currently', 'hanging', 'upside', 'scorpion', 'pit']]
target_terms = ["behaviour", "footnote", "upside"]
def get_onehot(sent):
    return [1 if term in sent else 0 for term in target_terms]


one_hot_mat = [get_onehot(sent) for sent in txt_words_nostop]
print(one_hot_mat)
[[1, 0, 0], [0, 1, 1]]

Document-Term Matrix (DTM)#

txt_sents
['he did of course sometimes have people horribly tortured to death, but this was considered to be perfectly acceptable behaviour for a civic ruler and generally approved of by the overhelming majority of citizens.',
 '[footnote: the overhelming majority of citizens being defined in this case as everyone not currently hanging upside down over a scorpion pit]']
# Using the top 5 terms from the data for creating the matrix
vectorizer = CountVectorizer(max_features=5)
# Training ('fit') the vectorizer on the data
vectorizer.fit(txt_sents)
CountVectorizer(max_features=5)
# Looking at the vocabulary
vectorizer.vocabulary_
{'of': 1, 'to': 4, 'this': 3, 'the': 2, 'citizens': 0}
# Applying the vectorizer to the data to create the DTM and convert to an array
txt_dtm = vectorizer.fit_transform(txt_sents).toarray()
print(txt_dtm)
[[1 3 1 1 2]
 [1 1 1 1 0]]
# For verification
txt_sents
['he did of course sometimes have people horribly tortured to death, but this was considered to be perfectly acceptable behaviour for a civic ruler and generally approved of by the overhelming majority of citizens.',
 '[footnote: the overhelming majority of citizens being defined in this case as everyone not currently hanging upside down over a scorpion pit]']

The vectorizer also tokenizes the sentences itself. To use it on already preprocessed tokens instead (txt_words_stem), pass a dummy tokenizer and preprocessor to CountVectorizer.

# Creating a function that does nothing and simply returns the
# tokenized sentence/document
def do_nothing(doc):
    return doc
# Instantiating vectorizer to use this dummy function as the
# preprocessor and tokenizer
vectorizer = CountVectorizer(
    max_features=5, preprocessor=do_nothing, tokenizer=do_nothing
)
# Fitting and transforming the data and converting to array in one step
txt_dtm = vectorizer.fit_transform(txt_words_stem).toarray()
print(txt_dtm)
[[1 1 1 1 1]
 [0 1 1 1 0]]
# For verification
vectorizer.vocabulary_
{'sometim': 4, 'accept': 0, 'overhelm': 3, 'major': 2, 'citizen': 1}
txt_words_stem
[['cours',
  'sometim',
  'peopl',
  'horribl',
  'tortur',
  'death',
  'consid',
  'perfectli',
  'accept',
  'behaviour',
  'civic',
  'ruler',
  'gener',
  'approv',
  'overhelm',
  'major',
  'citizen'],
 ['footnot',
  'overhelm',
  'major',
  'citizen',
  'defin',
  'case',
  'everyon',
  'current',
  'hang',
  'upsid',
  'scorpion',
  'pit']]

Document term matrix with TF-IDF#

In term frequency-inverse document frequency (TF-IDF) weighting, the importance of a term in a document increases proportionally to the number of times the term appears in that document, but is offset by how frequent the term is across the corpus. Variations of this scheme are often used by search engines as a central tool in scoring and ranking a document's relevance to a user query.

  • The number of times a term appears in a document is known as the term frequency.

  • Every term in a document has a weight associated with it.

  • The weight is determined by the frequency of appearance of the term in a document.

Term Frequency (TF)#

tf(t, d) = n / N

  • ‘tf’ is the term frequency function

  • ‘t’ is a term/word

  • ‘d’ is a document

  • ‘n’ is the number of occurrences of t in d

  • ‘N’ is the total number of terms in d

Inverse Document Frequency (IDF)#

idf(t, D) = log(|D| / |{d ∈ D : t ∈ d}|)

  • ‘idf’ is the inverse document frequency function

  • ‘t’ is a term/word

  • ‘d’ is a document

  • ‘D’ is the corpus of documents, so |D| is the total number of documents

  • ‘|{d ∈ D : t ∈ d}|’ is the number of documents in which t occurs

The tf-idf value of a term is the product of its tf and idf:

tfidf(t, d, D) = tf(t, d) * idf(t, D)

The logarithm in idf approaches zero as a term appears in more and more documents, so the tf-idf value of a very common term approaches zero.
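These formulas can be applied by hand to a toy corpus; a minimal sketch using the plain definitions above (scikit-learn's implementation below adds smoothing and normalisation on top of this):

import math

toy_docs = [
    ["the", "cows", "got", "up"],
    ["the", "chickens", "went", "to", "bed"],
]


def tf(term, doc):
    # occurrences of the term in the document / total terms in the document
    return doc.count(term) / len(doc)


def idf(term, docs):
    # log of (number of documents / number of documents containing the term)
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)


def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


print(tfidf("the", toy_docs[0], toy_docs))   # 0.0   -- "the" occurs in every document
print(tfidf("cows", toy_docs[0], toy_docs))  # ~0.17 -- "cows" occurs in only one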

Like CountVectorizer, the TF-IDF vectorizer tokenizes the sentences and learns the vocabulary itself, then returns IDF-weighted counts. Note that scikit-learn's TfidfVectorizer deviates slightly from the plain formulas above: by default it smooths the IDF, adds one to it, and L2-normalises each row, which is why the values below are not the raw tf * idf products.

txt_sents
['he did of course sometimes have people horribly tortured to death, but this was considered to be perfectly acceptable behaviour for a civic ruler and generally approved of by the overhelming majority of citizens.',
 '[footnote: the overhelming majority of citizens being defined in this case as everyone not currently hanging upside down over a scorpion pit]']
# Instantiating the vectorizer with a vocabulary size of 5
vectorizer_tfidf = TfidfVectorizer(max_features=5)
# Fitting the vectorizer on the raw data of txt_sents
vectorizer_tfidf.fit(txt_sents)
TfidfVectorizer(max_features=5)
# Printing the vocabulary learned by the vectorizer
vectorizer_tfidf.vocabulary_
{'of': 1, 'to': 4, 'this': 3, 'the': 2, 'citizens': 0}
# Transforming the data using the trained vectorizer
txt_tfidf = vectorizer_tfidf.transform(txt_sents).toarray()
print(txt_tfidf)
[[0.22416044 0.67248131 0.22416044 0.22416044 0.63009934]
 [0.5        0.5        0.5        0.5        0.        ]]
# Printing the IDF values for the terms using the idf_ attribute
vectorizer_tfidf.idf_
array([1.        , 1.        , 1.        , 1.        , 1.40546511])