In this post we will start with NLP (Natural Language Processing), using NLTK, the Natural Language Toolkit, to perform some preliminary processing.
We will read input, tokenize sentences and words, find the frequencies of words, remove common stop words from the input, find synonyms and antonyms, and draw a few plots.
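Before running the examples, the data files that NLTK relies on have to be downloaded once. A minimal setup sketch (the resource names here assume a recent NLTK release):

import nltk
nltk.download('punkt')      # tokenizer models used by sent_tokenize and word_tokenize
nltk.download('stopwords')  # the stop word corpus
nltk.download('wordnet')    # used by the lemmatizer and the synonym/antonym lookup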
Let us start with tokenizing some text for sentences and for words. This is the text we are going to use.
The ABCD lesson. The apple was placed in a ball and the cat was playing with it.
The dog hated the cat but loved the ball, he did not care for the apple. An apple is different from the apple, a apple is wrong.
To tokenize into sentences we use sent_tokenize, and for words we use word_tokenize.
from nltk.tokenize import sent_tokenize as st
from nltk.tokenize import word_tokenize as wt

text = '''
The ABCD lesson. The apple was placed in a ball and the cat was playing with it.
The dog hated the cat but loved the ball, he did not care for the apple. An apple is different from the apple, a apple is wrong.
'''
sentences = st(text)
words = wt(text)
print("Sentences\n\n")
print(sentences)
print("\n\nWords\n\n")
print(words)
Sentences
[' The ABCD lesson.', 'The apple was placed in a ball and the cat was playing with it.', 'The dog hated the cat but loved the ball, he did not care for the apple.', 'An apple is different from the apple, a apple is wrong.']
Words
['The', 'ABCD', 'lesson', '.', 'The', 'apple', 'was', 'placed', 'in', 'a', 'ball', 'and', 'the', 'cat', 'was', 'playing', 'with', 'it', '.', 'The', 'dog', 'hated', 'the', 'cat', 'but', 'loved', 'the', 'ball', ',', 'he', 'did', 'not', 'care', 'for', 'the', 'apple', '.', 'An', 'apple', 'is', 'different', 'from', 'the', 'apple', ',', 'a', 'apple', 'is', 'wrong', '.']
You can see that it works properly. Let us get a frequency count of the words being used. To do that we use nltk.FreqDist. This function returns a dictionary-like object in which the words are the keys and the frequencies are the values. The following code will print a list of the words and their frequencies.
import nltk
from nltk.tokenize import word_tokenize as wt

text = '''
The ABCD lesson. The apple was placed in a ball and the cat was playing with it.
The dog hated the cat but loved the ball, he did not care for the apple. An apple is different from the apple, a apple is wrong.
'''
words = wt(text)
print("\n\nWords\n\n")
wordswithfrequencies = nltk.FreqDist(words)
keyvaluepairs = wordswithfrequencies.items()
print(keyvaluepairs)
for key, val in keyvaluepairs:
    print(str(key) + ':' + str(val))
Words
dict_items([('The', 3), ('ABCD', 1), ('lesson', 1), ('.', 4), ('apple', 5), ('was', 2), ('placed', 1), ('in', 1), ('a', 2), ('ball', 2), ('and', 1), ('the', 5), ('cat', 2), ('playing', 1), ('with', 1), ('it', 1), ('dog', 1), ('hated', 1), ('but', 1), ('loved', 1), (',', 2), ('he', 1), ('did', 1), ('not', 1), ('care', 1), ('for', 1), ('An', 1), ('is', 2), ('different', 1), ('from', 1), ('wrong', 1)])
The:3
ABCD:1
lesson:1
.:4
apple:5
was:2
placed:1
in:1
a:2
ball:2
and:1
the:5
cat:2
playing:1
with:1
it:1
dog:1
hated:1
but:1
loved:1
,:2
he:1
did:1
not:1
care:1
for:1
An:1
is:2
different:1
from:1
wrong:1
nltk.FreqDist can be used as a dictionary in the following manner; a word that does not occur in the text simply gets a count of 0.

print(wordswithfrequencies['balloon'])  # prints 0
print(wordswithfrequencies['ball'])     # prints 2
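Because FreqDist behaves like a Counter, the usual helpers are also available. For example, most_common returns the top entries (the tie-breaking order between equal counts may vary):

print(wordswithfrequencies.most_common(3))
# e.g. [('apple', 5), ('the', 5), ('.', 4)]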
To plot the frequency distribution we will use:
n=len(wordswithfrequencies)
wordswithfrequencies.plot(n, cumulative=False)
The cumulative distribution can be plotted using:
wordswithfrequencies.plot(n, cumulative=True)
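Note that FreqDist.plot draws with matplotlib, so matplotlib must be installed (for example, pip install matplotlib). Plotting every word can get crowded; restricting the plot to the most frequent words, as in this sketch, is often more readable:

wordswithfrequencies.plot(10, cumulative=False)  # top 10 words only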
NLTK maintains a number of corpora, that is, word lists of different types. Among them is a corpus of stop words. Here is a listing that I accessed via this code:
from nltk.corpus import stopwords as sw

stops = sw.words('english')
print(stops)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
These are common words that have no specific contextual meaning and should be removed from the word tokens we are analyzing.
Let us do it.
from nltk.tokenize import word_tokenize as wt
from nltk.corpus import stopwords as sw

text = '''
The ABCD lesson. The apple was placed in a ball and the cat was playing with it.
The dog hated the cat but loved the ball, he did not care for the apple. An apple is different from the apple, a apple is wrong.
'''
stops = sw.words('english')
tokens = wt(text)
print("Before removal\n")
print(tokens)
for token in tokens:
    if token in stops:
        tokens.remove(token)
print("\nAfter removal\n")
print(tokens)
Before removal
['The', 'ABCD', 'lesson', '.', 'The', 'apple', 'was', 'placed', 'in', 'a', 'ball', 'and', 'the', 'cat', 'was', 'playing', 'with', 'it', '.', 'The', 'dog', 'hated', 'the', 'cat', 'but', 'loved', 'the', 'ball', ',', 'he', 'did', 'not', 'care', 'for', 'the', 'apple', '.', 'An', 'apple', 'is', 'different', 'from', 'the', 'apple', ',', 'a', 'apple', 'is', 'wrong', '.']
After removal
['The', 'ABCD', 'lesson', '.', 'The', 'apple', 'placed', 'ball', 'cat', 'playing', 'it', '.', 'The', 'dog', 'hated', 'cat', 'loved', 'the', 'ball', ',', 'did', 'care', 'the', 'apple', '.', 'An', 'apple', 'different', 'the', 'apple', ',', 'a', 'apple', 'wrong', '.']
Notice that a word at the beginning of a sentence, such as 'The' or 'An', is never removed: the stop word list is all lowercase, so capitalized tokens do not match it. A few lowercase stop words ('it', 'did', 'the', 'a') also survive, because calling remove on a list while iterating over it shifts the remaining elements and skips tokens.
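A more robust version, given here as a sketch, builds a new list instead of mutating the one being iterated over, and lowercases each token before the comparison so that sentence-initial words are caught as well:

# Keep only tokens whose lowercase form is not a stop word
filtered = [token for token in tokens if token.lower() not in stops]
print(filtered)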
Lemmatize
To lemmatize is to reduce a word to its root, or dictionary form.
Check the following code and output.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("modelling", pos='v'))
print(lemmatizer.lemmatize("programs"))
print(lemmatizer.lemmatize("diving", pos='v'))
print(lemmatizer.lemmatize("diving", pos='n'))
print(lemmatizer.lemmatize("better", pos='a'))
print(lemmatizer.lemmatize("worse", pos='a'))
print(lemmatizer.lemmatize("worst", pos='a'))
model
program
dive
diving
good
bad
bad
The lemmatize function takes two arguments: the word itself and pos, the part of speech.
'v' is for verb, 'n' for noun, 'a' for adjective, and 'r' for adverb. pos='n' is the default.
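The default matters: with the default pos='n' the adjective "better" passes through unchanged, while pos='a' maps it to its lemma, as this quick check shows:

print(lemmatizer.lemmatize("better"))           # better (treated as a noun)
print(lemmatizer.lemmatize("better", pos='a'))  # good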
Finding Antonyms and Synonyms
from nltk.corpus import wordnet

synonyms = []
antonyms = []
word = "fresh"

for syn in wordnet.synsets(word):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print("Synonyms of Fresh\n", synonyms)

for syn in wordnet.synsets(word):
    for lemma in syn.lemmas():
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())
print("\nAntonyms of Fresh\n", antonyms)
Synonyms of Fresh
['fresh', 'fresh', 'bracing', 'brisk', 'fresh', 'refreshing', 'refreshful', 'tonic', 'fresh', 'new', 'novel', 'fresh', 'fresh', 'sweet', 'fresh', 'fresh', 'invigorated', 'refreshed', 'reinvigorated', 'fresh', 'sweet', 'unfermented', 'clean', 'fresh', 'fresh', 'unused', 'fresh', 'impertinent', 'impudent', 'overbold', 'smart', 'saucy', 'sassy', 'wise', 'newly', 'freshly', 'fresh', 'new']
Antonyms of Fresh
['stale', 'preserved', 'salty']
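The same lemma name can appear in several synsets, which is why 'fresh' shows up so many times. Converting the lists to sets removes the duplicates:

print(set(synonyms))  # unique synonyms, in arbitrary order
print(set(antonyms))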
end