#machinelearning #nlp

Machine learning

A token is the technical name for a sequence of characters that we want to treat as a group.
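As a rough illustration of grouping characters into tokens, here is a minimal sketch using only Python's standard library. The `tokenize` function and its regular expression are assumptions made for this example; real tokenizers (NLTK's, or the subword tokenizers used by modern models) apply more elaborate rules.

```python
import re

def tokenize(text):
    # A token here is either a run of letters/digits (optionally followed by
    # an apostrophe part, as in "Don't") or a single punctuation character.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", text)

tokens = tokenize("Don't panic: tokens are just groups of characters.")
print(tokens)
# ["Don't", 'panic', ':', 'tokens', 'are', 'just', 'groups', 'of', 'characters', '.']
```

Note that punctuation marks come out as tokens of their own; whether to keep them depends on the analysis.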

A corpus (plural: corpora) is a body of text: medical journals, presidential speeches, the English language.

A lexicon is a set of words and their meanings. There is no single lexicon: in investment-speak "bull" means one thing, and in everyday English it means another.
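One way to picture domain-specific lexicons is a dictionary per domain. This is only a sketch: the `LEXICONS` table, its two domain names, and the `lookup` helper are hypothetical names chosen for the example.

```python
# Hypothetical per-domain lexicons: the same word maps to different meanings.
LEXICONS = {
    "finance": {"bull": "an investor who expects prices to rise"},
    "general": {"bull": "an adult male bovine"},
}

def lookup(word, domain):
    # Returns None when the domain or the word is unknown.
    return LEXICONS.get(domain, {}).get(word.lower())

print(lookup("bull", "finance"))  # an investor who expects prices to rise
print(lookup("bull", "general"))  # an adult male bovine
```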

Stop words are words filtered out of a sentence before doing NLP analysis because they are deemed insignificant.
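A minimal sketch of stop-word filtering, assuming a tiny hand-written stop list. The `STOP_WORDS` set and `remove_stop_words` function are made up for this example; libraries such as NLTK ship much longer, language-specific lists.

```python
# A deliberately tiny stop list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}

def remove_stop_words(tokens):
    # Compare in lowercase so that "The" and "the" are both filtered.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

filtered = remove_stop_words("The cat is on the mat".split())
print(filtered)  # ['cat', 'mat']
```

Which words count as insignificant depends on the task, so the stop list is usually treated as a tunable choice rather than a fixed one.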

Stemming is the act of reducing a word to its root form (stem). For example, "run" and "running" share the same stem, "run". Stems are not necessarily words. It is used less often now, since word embeddings such as word2vec tend to place related word forms close together without explicit stemming.
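To make the idea concrete, here is a toy suffix-stripping stemmer. It is not the Porter algorithm or any library's implementation, just a sketch under simple assumptions; `SUFFIXES` and `toy_stem` are hypothetical names.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant: "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

print(toy_stem("run"))      # run
print(toy_stem("running"))  # run
print(toy_stem("ponies"))   # poni
```

The last line shows a stem that is not a word: "poni" is enough to group "ponies" with related forms, even though it is not itself in the dictionary.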

A tag is a case-sensitive string that specifies some property of a token, such as its part of speech.
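A tagged token can be sketched as a `(word, tag)` pair. The example below parses the `word/TAG` notation often seen in tagged corpora; `parse_tagged` is a hypothetical helper written for this note, and the tags are hand-assigned rather than produced by a tagger.

```python
def parse_tagged(s, sep="/"):
    # Split on the last separator so words containing "/" still work.
    word, found, tag = s.rpartition(sep)
    if not found:
        return (s, None)  # no tag present
    return (word, tag)

tagged = [parse_tagged(t) for t in "The/DT dog/NN barks/VBZ".split()]
print(tagged)  # [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]

# Tags are case-sensitive strings: "NN" matches, "nn" does not.
nouns = [word for word, tag in tagged if tag == "NN"]
print(nouns)  # ['dog']
```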

NLTK

Books

A book for going deeper:

https://web.stanford.edu/~jurafsky/slp3/

Courses

https://huggingface.co/learn/nlp-course/chapter1/1