Documentation: https://www.nltk.org/api/nltk.html

Tokenization

Punkt Sentence Tokenizer

This tokenizer divides a text into a list of sentences, using an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences. For example, it knows that the dot in Johann S. Bach does not mark the end of a sentence.

Tagging

A “tag” is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (token, tag). For example, the following tagged token combines the word 'fly' with a noun part-of-speech tag: ('fly', 'NN').
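Since a tagged token is just a plain tuple, it can be built by hand; nltk.tag.str2tuple parses the conventional word/TAG string notation into the same shape:

```python
from nltk.tag import str2tuple

# A tagged token is simply a (token, tag) tuple.
tagged_token = ('fly', 'NN')

# str2tuple parses the conventional "word/TAG" notation.
parsed = str2tuple('fly/NN')
print(parsed)  # ('fly', 'NN')
```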

Utils

Search in text

text.concordance(word)

Find similar words

text.similar(word)

You can visualize the places where a given word appears in a text with a dispersion plot.

text.dispersion_plot([word1, word2, word3])
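The three utilities above work on any nltk.Text built from a token list, so no corpus download is needed; the toy token list below is an assumption for illustration (dispersion_plot additionally needs matplotlib, so it is left commented out):

```python
import nltk

# Build a Text object directly from tokens.
tokens = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat ran after the dog .").split()
text = nltk.Text(tokens)

# Every occurrence of a word, with surrounding context:
text.concordance('sat')

# Words that appear in similar contexts:
text.similar('cat')

# Dispersion plot (requires matplotlib and opens a window):
# text.dispersion_plot(['cat', 'dog'])
```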

Corpus

Useful methods

Statistics

nltk.FreqDist

https://www.nltk.org/api/nltk.probability.FreqDist.html

The most_common method is very useful for visualization.
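FreqDist is a Counter-like class, so it can be built straight from a token list; the sample tokens here are illustrative:

```python
import nltk

words = "the cat sat on the mat with the cat".split()
fdist = nltk.FreqDist(words)

print(fdist.most_common(2))  # [('the', 3), ('cat', 2)]

# fdist.plot(10) would draw a frequency curve (requires matplotlib).
```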

Stop words

import nltk
from nltk.corpus import stopwords
 
nltk.download('stopwords')
print(stopwords.words('english'))