Documentation: https://www.nltk.org/api/nltk.html
Tokenization
Punkt Sentence Tokenizer
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences. For example, it knows that the dot in Johann S. Bach does not mark the end of a sentence.
Tagging
A “tag” is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (token, tag). For example, the following tagged token combines the word 'fly' with a noun part-of-speech tag ('NN'):
tagged_tok = ('fly', 'NN')
Utils
Search in text
text.concordance(word)
Find similar words
text.similar(word)
You can visualize the places where a given word appears in a text with a dispersion plot:
text.dispersion_plot([word1, word2, word3])
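These methods live on nltk.Text, which wraps any list of tokens; the toy token list below is made up for illustration:

```python
import nltk

# nltk.Text wraps a plain token list; corpus texts expose the same methods.
tokens = ["the", "cat", "sat", "on", "the", "mat", ".",
          "the", "dog", "sat", "on", "the", "rug", "."]
text = nltk.Text(tokens)

text.concordance("sat")   # prints every occurrence of "sat" in context
text.similar("cat")       # prints words used in similar contexts (here: "dog")
# text.dispersion_plot(["cat", "dog"])  # requires matplotlib; opens a window
```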
Corpus
Useful methods
- words(): list of str
- sents(): list of (list of str)
- paras(): list of (list of (list of str))
- tagged_words(): list of (str,str) tuple
- tagged_sents(): list of (list of (str,str))
- tagged_paras(): list of (list of (list of (str,str)))
- chunked_sents(): list of (Tree w/ (str,str) leaves)
- parsed_sents(): list of (Tree with str leaves)
- parsed_paras(): list of (list of (Tree with str leaves))
- xml(): A single xml ElementTree
- raw(): unprocessed corpus contents
Statistics
nltk.FreqDist
https://www.nltk.org/api/nltk.probability.FreqDist.html
The most_common(n) method, which returns the n most frequent samples with their counts, is very useful for visualization.
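A minimal sketch: FreqDist counts any iterable of tokens, and most_common(n) returns the n highest-frequency (sample, count) pairs (the toy token list is made up):

```python
from nltk import FreqDist

tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "dog"]
fd = FreqDist(tokens)

print(fd.most_common(2))  # highest counts first; ties keep insertion order
print(fd["the"])          # FreqDist also acts as a counter dict
# fd.plot(10) draws the frequency curve (requires matplotlib)
```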
Stop words
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
print(stopwords.words('english'))