
Beginning NLP Natural Language Processing— Tokenize Words and Sentences, Find Frequencies, Lemmatize, Get Synonyms

In this post we will start with NLP (Natural Language Processing), using NLTK, the Natural Language Toolkit, to perform some preliminary processing.
We will read input, tokenize sentences and words, find word frequencies, remove common/stop words from the input, find synonyms, and draw a few plots.

Let us start by tokenizing some text into sentences and into words. This is the text we are going to use.

The ABCD lesson. The apple was placed in a ball and the cat was playing with it.
The dog hated the cat but loved the ball, he did not care for the apple. An apple is different from the apple, a apple is wrong.

To tokenize into sentences we use

sent_tokenize

and for words we use

word_tokenize


from nltk.tokenize import sent_tokenize as st
from nltk.tokenize import word_tokenize as wt

text=''' The ABCD lesson. The apple was placed in a ball and the cat was playing with it.
The dog hated the cat but loved the ball, he did not care for the apple. An apple is different from the apple, a apple is wrong.
'''
sentences = st(text)
words = wt(text)
print("Sentences\n\n")
print(sentences)
print("\n\nWords\n\n")
print(words)

Sentences

[' The ABCD lesson.', 'The apple was placed in a ball and the cat was playing with it.', 'The dog hated the cat but loved the ball, he did not care for the apple.', 'An apple is different from the apple, a apple is wrong.']

Words

['The', 'ABCD', 'lesson', '.', 'The', 'apple', 'was', 'placed', 'in', 'a', 'ball', 'and', 'the', 'cat', 'was', 'playing', 'with', 'it', '.', 'The', 'dog', 'hated', 'the', 'cat', 'but', 'loved', 'the', 'ball', ',', 'he', 'did', 'not', 'care', 'for', 'the', 'apple', '.', 'An', 'apple', 'is', 'different', 'from', 'the', 'apple', ',', 'a', 'apple', 'is', 'wrong', '.']

You can see that it works properly.
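If the tokenizers raise a LookupError, the NLTK data they depend on has not been installed yet. A one-time setup sketch, downloading the three resources used in this post:

```python
import nltk

# One-time downloads of the NLTK data used in this post;
# without them the tokenizers raise a LookupError.
nltk.download("punkt", quiet=True)      # sentence/word tokenizer models
nltk.download("stopwords", quiet=True)  # stop-word corpus
nltk.download("wordnet", quiet=True)    # data for the lemmatizer and synonyms
```

These calls are no-ops when the data is already present, so the snippet is safe to keep at the top of a script.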

Let us get a frequency count of the words used. For this we use FreqDist from nltk.

FreqDist is a dictionary-like object whose keys are the words and whose values are their frequencies; its items() method returns the (word, frequency) pairs.
The following code will print the words and their frequencies.
import nltk
from nltk.tokenize import word_tokenize as wt

text=''' The ABCD lesson. The apple was placed in a ball and the cat was playing with it.
The dog hated the cat but loved the ball, he did not care for the apple. An apple is different from the apple, a apple is wrong.
'''

words =wt(text)
print("\n\nWords\n\n")
wordswithfrequencies = nltk.FreqDist(words)
keyvaluepairs = wordswithfrequencies.items()
print(keyvaluepairs)
for key, val in keyvaluepairs:
    print(str(key) + ':' + str(val))



Words

dict_items([('The', 3), ('ABCD', 1), ('lesson', 1), ('.', 4), ('apple', 5), ('was', 2), ('placed', 1), ('in', 1), ('a', 2), ('ball', 2), ('and', 1), ('the', 5), ('cat', 2), ('playing', 1), ('with', 1), ('it', 1), ('dog', 1), ('hated', 1), ('but', 1), ('loved', 1), (',', 2), ('he', 1), ('did', 1), ('not', 1), ('care', 1), ('for', 1), ('An', 1), ('is', 2), ('different', 1), ('from', 1), ('wrong', 1)])
The:3
ABCD:1
lesson:1
.:4
apple:5
was:2
placed:1
in:1
a:2
ball:2
and:1
the:5
cat:2
playing:1
with:1
it:1
dog:1
hated:1
but:1
loved:1
,:2
he:1
did:1
not:1
care:1
for:1
An:1
is:2
different:1
from:1
wrong:1
nltk.FreqDist can be used as a dictionary; a word that does not occur in the text gets a count of 0 instead of raising a KeyError.

print(wordswithfrequencies['balloon'])  # 0
print(wordswithfrequencies['ball'])     # 2
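A short self-contained sketch of FreqDist used as a dictionary, including its most_common method for a sorted frequency table:

```python
import nltk

words = ["the", "apple", "the", "cat", "apple", "apple"]
freq = nltk.FreqDist(words)

# Dictionary-style lookup; missing keys return 0 rather than raising.
print(freq["apple"])    # 3
print(freq["balloon"])  # 0

# most_common returns (word, count) pairs sorted by descending frequency.
print(freq.most_common(2))  # [('apple', 3), ('the', 2)]
```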

 

To plot the frequency distribution we use the plot method of FreqDist:

n = len(wordswithfrequencies)
wordswithfrequencies.plot(n, cumulative=False)

 

The cumulative distribution can be plotted using

wordswithfrequencies.plot(n, cumulative=True)

NLTK maintains certain corpora, that is, word lists of different types. Among them is a corpus of stop words. Here is the listing, which I accessed via this code.


import nltk
from nltk.corpus import stopwords as sw
stops = sw.words('english')
print(stops)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

These are common words that have no specific contextual meaning and should be removed from the word tokens we are analyzing.

Let us do it.


import nltk
from nltk.tokenize import word_tokenize as wt
from nltk.corpus import stopwords as sw
text=''' The ABCD lesson. The apple was placed in a ball and the cat was playing with it.
The dog hated the cat but loved the ball, he did not care for the apple. An apple is different from the apple, a apple is wrong.
'''
stops = sw.words('english')
tokens=wt(text)
print("Before removal\n")
print(tokens)
# Removing items from a list while iterating over it skips elements,
# so we build a new filtered list instead.
tokens = [token for token in tokens if token not in stops]
print("\nAfter removal\n")
print(tokens)

Before removal

['The', 'ABCD', 'lesson', '.', 'The', 'apple', 'was', 'placed', 'in', 'a', 'ball', 'and', 'the', 'cat', 'was', 'playing', 'with', 'it', '.', 'The', 'dog', 'hated', 'the', 'cat', 'but', 'loved', 'the', 'ball', ',', 'he', 'did', 'not', 'care', 'for', 'the', 'apple', '.', 'An', 'apple', 'is', 'different', 'from', 'the', 'apple', ',', 'a', 'apple', 'is', 'wrong', '.']

After removal

['The', 'ABCD', 'lesson', '.', 'The', 'apple', 'placed', 'ball', 'cat', 'playing', '.', 'The', 'dog', 'hated', 'cat', 'loved', 'ball', ',', 'care', 'apple', '.', 'An', 'apple', 'different', 'apple', ',', 'apple', 'wrong', '.']

Note that capitalized words such as 'The' and 'An' at the beginning of a sentence are never removed: the stop-word list is all lowercase and the membership test is case sensitive.
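If capitalized stop words at the start of a sentence should be dropped as well, lowercase each token before the membership test. A minimal sketch (the stop set here is a small illustrative subset of NLTK's English list):

```python
# Case-insensitive stop-word filtering: compare the lowercased token.
# The stop set below is an illustrative subset of NLTK's stop-word list.
stops = {"the", "a", "an", "was", "is", "it", "and", "in"}
tokens = ["The", "apple", "was", "placed", "in", "a", "ball"]
filtered = [t for t in tokens if t.lower() not in stops]
print(filtered)  # ['apple', 'placed', 'ball']
```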

 

Lemmatize

To lemmatize a word is to reduce it to its root form, its lemma.

Check the following code and its output.


from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("modelling",pos='v'))
print(lemmatizer.lemmatize("programs"))
print(lemmatizer.lemmatize("diving",pos='v'))
print(lemmatizer.lemmatize("diving",pos='n'))
print(lemmatizer.lemmatize("better",pos='a'))
print(lemmatizer.lemmatize("worse",pos='a'))
print(lemmatizer.lemmatize("worst",pos='a'))

model
program
dive
diving
good
bad
bad

The lemmatize function takes two inputs: the word, and pos, the part of speech.

Use 'v' for verb, 'n' for noun, 'a' for adjective, and 'r' for adverb; pos='n' is the default.
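In practice the pos argument usually comes from a POS tagger, whose Penn Treebank tags must be mapped to these single letters. A small sketch of that mapping (treebank_to_wordnet is a hypothetical helper name):

```python
def treebank_to_wordnet(tag):
    # Map a Penn Treebank tag prefix to the pos letter expected
    # by WordNetLemmatizer.lemmatize.
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, ...
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("R"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # default to noun, as lemmatize itself does

print(treebank_to_wordnet("VBD"))  # v
print(treebank_to_wordnet("NNS"))  # n
```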

Finding Antonyms and Synonyms


from nltk.corpus import wordnet

synonyms = []
antonyms = []
word = "fresh"
for syn in wordnet.synsets(word):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print("Synonyms of Fresh\n", synonyms)
for syn in wordnet.synsets(word):
    for lemma in syn.lemmas():
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())
print("\nAntonyms of Fresh\n", antonyms)

Synonyms of Fresh
['fresh', 'fresh', 'bracing', 'brisk', 'fresh', 'refreshing', 'refreshful', 'tonic', 'fresh', 'new', 'novel', 'fresh', 'fresh', 'sweet', 'fresh', 'fresh', 'invigorated', 'refreshed', 'reinvigorated', 'fresh', 'sweet', 'unfermented', 'clean', 'fresh', 'fresh', 'unused', 'fresh', 'impertinent', 'impudent', 'overbold', 'smart', 'saucy', 'sassy', 'wise', 'newly', 'freshly', 'fresh', 'new']

Antonyms of Fresh
['stale', 'preserved', 'salty']
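The synonym list repeats 'fresh' once for every synset it belongs to; a set collapses the duplicates. A sketch, shown on a shortened copy of the list above:

```python
# Deduplicate the lemma names collected from the synsets.
synonyms = ["fresh", "fresh", "bracing", "brisk", "fresh",
            "refreshing", "new", "novel", "fresh", "sweet"]
unique = sorted(set(synonyms))
print(unique)  # ['bracing', 'brisk', 'fresh', 'new', 'novel', 'refreshing', 'sweet']
```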

 

end
