Natural Language Processing (NLP)

Use the Natural Language Toolkit (NLTK) to preprocess raw text: tokenize it, then remove punctuation and stop words. Then implement classical approaches to text representation, such as one-hot encoding and TF-IDF.
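The three steps above can be sketched in plain Python. This is a minimal illustration, not the notebooks' implementation: the toy corpus `DOCS`, the tiny `STOP_WORDS` set, and the helper names are all assumptions made up for this example. In practice NLTK's `word_tokenize` and `stopwords.words("english")` would typically replace the hand-written tokenizer and stop-word list. The TF-IDF variant used here (term frequency divided by document length, `idf = log(N / df)`) is only one of several common definitions.

```python
import math
import string
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
DOCS = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
    "Dogs and cats are pets.",
]
STOP_WORDS = {"the", "on", "and", "are", "a", "an", "is"}


def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in STOP_WORDS]


def one_hot(token, vocab):
    """Return a vector with a single 1 at the token's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(token)] = 1
    return vec


def tf_idf(tokenized_docs):
    """Per-document weights with tf = count / doc length, idf = log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        scores.append(
            {tok: (c / len(doc)) * math.log(n_docs / df[tok]) for tok, c in counts.items()}
        )
    return scores


tokens = [preprocess(d) for d in DOCS]
vocab = sorted({tok for doc in tokens for tok in doc})
weights = tf_idf(tokens)

print(tokens[0])              # tokens of the first document
print(one_hot("cat", vocab))  # one-hot vector for "cat"
print(weights[0])             # TF-IDF weights of the first document
```

Because "cat" occurs in two of the three documents and "sat" in only one, "sat" receives the higher TF-IDF weight in the first document. Note also that without stemming or lemmatization, "cat" and "cats" end up as separate vocabulary entries.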

These notebooks let us practice some deep learning (DL) techniques and serve as snippets to build on. This is only a first introduction; more follows in Natural language processing.

Notes

  • Bag-of-words representations cannot easily capture semantic associations: word order and context are discarded, so synonyms are treated as unrelated features.

  • Each language needs its own extensive dictionary of words on file, which takes a relatively long time to search through.

  • Bag-of-words fails if none of the words in the test set appear in the training set: every test document then maps to an all-zero vector.
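The last point can be made concrete with a small sketch. The helper names `build_vocab` and `bow_vector` and the example sentences are hypothetical, chosen only to show what happens when a test document shares no words with the training vocabulary.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct training token to a column index (sorted order)."""
    distinct = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(distinct)}


def bow_vector(doc, vocab):
    """Count vocabulary words in doc; words outside the vocabulary are ignored."""
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocab]


# Hypothetical training and test sentences, for illustration only.
train_docs = ["good movie", "bad movie"]
vocab = build_vocab(train_docs)

seen = bow_vector("good good movie", vocab)
unseen_a = bow_vector("great film", vocab)
unseen_b = bow_vector("terrible film", vocab)

print(seen)      # counts for a document that overlaps the vocabulary
print(unseen_a)  # all zeros: no word is in the vocabulary
print(unseen_b)  # all zeros as well
```

Both unseen documents collapse to the same all-zero vector, so a downstream classifier cannot tell "great film" from "terrible film", even though their meanings are opposite.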