Lemmatization is the process of converting a word to its meaningful root form.
Example: the lemma of the word “writing” is “write”.
In lemmatization, the root word is called the lemma. A lemma is the canonical (dictionary or citation) form of a set of words.
For example, good, better, and best are all forms of the word good. Therefore, good is the lemma of all these words.
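To make the idea of a lemma concrete, here is a minimal sketch that treats lemmatization as a plain lookup from an inflected form to its dictionary form. The table `LEMMA_TABLE` and the function `toy_lemma` are hypothetical names made up for this illustration; a real lemmatizer relies on a full lexical database rather than a hand-written table.

```python
# Toy lookup table (hypothetical data, hand-written for this example):
# each inflected form maps to its lemma.
LEMMA_TABLE = {
    "good": "good",
    "better": "good",
    "best": "good",
    "writing": "write",
}

def toy_lemma(word):
    """Return the lemma from the table, or the word unchanged if it is unknown."""
    return LEMMA_TABLE.get(word.lower(), word)

for w in ["good", "better", "best", "writing"]:
    print(w, "->", toy_lemma(w))
```

Note that irregular forms such as better and best cannot be reached by trimming characters, which is why a dictionary is needed.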
Lemmatization is an important text preprocessing technique. Both stemming and lemmatization are used in preprocessing, but lemmatization is often preferred because it returns a valid dictionary form, whereas stemming only strips characters from the end of the word.
Stemming vs. lemmatization example:
dancing – Lemmatization – dance
dancing – stemming – danc
Here you can see that lemmatization returns a meaningful root word, while stemming returns a truncated form that is not a real word.
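The difference in the dancing example can be sketched in a few lines of plain Python. This is only a rough illustration, not how NLTK implements either technique: the suffix list `SUFFIXES`, the mini-dictionary `DICTIONARY`, and both function names are assumptions made up for this example.

```python
# Hypothetical suffix list and mini-dictionary, chosen only for this illustration.
SUFFIXES = ("ing", "ed", "es", "s")
DICTIONARY = {"dance", "write", "love"}

def naive_stem(word):
    """Strip the first matching suffix; the result may not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def naive_lemma(word):
    """Strip a suffix, then accept only a candidate found in the dictionary."""
    stem = naive_stem(word)
    for candidate in (word, stem, stem + "e"):
        if candidate in DICTIONARY:
            return candidate
    return word  # fall back to the input when nothing matches

print("dancing - stemming -", naive_stem("dancing"))
print("dancing - lemmatization -", naive_lemma("dancing"))
```

The stemmer stops as soon as the suffix is removed, while the lemmatizer keeps going until it finds a form that exists in its dictionary.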
To implement lemmatization in Python, we will use the WordNet lemmatizer from the NLTK package. Below is the code.
I am using Google Colab to write the code because in Google Colab you don’t have to install the packages, which saves a lot of time. You can use any editor.
# Import NLTK and download the data it needs
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# Create a WordNetLemmatizer object
lemmatizer = WordNetLemmatizer()

# Without a pos argument, the word is treated as a noun
print("loves :", lemmatizer.lemmatize("loves"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# The pos argument sets the part of speech: "a" is adjective, "v" is verb
print("running :", lemmatizer.lemmatize("running", pos="v"))
print("better :", lemmatizer.lemmatize("better", pos="a"))
print("lesser :", lemmatizer.lemmatize("lesser", pos="a"))
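The output lists each word followed by its lemma: for example, loves becomes love, corpora becomes corpus, running (as a verb) becomes run, and better (as an adjective) becomes good.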
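In the code above, the pos value is typed in by hand for each word. If the tags instead come from a part-of-speech tagger that uses Penn Treebank tags (an assumption, since this article does not tag any text), a small helper can translate them into the single-letter codes the lemmatizer expects. The function name `penn_to_wordnet` is hypothetical, and this is only a minimal sketch of one common mapping.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (assumed input format) to a WordNet pos code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the default pos of lemmatize()

for tag in ["JJR", "VBG", "RB", "NNS"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The returned code can then be passed as the pos argument, for example lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)).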
I hope this article helps you. It is just a short overview of lemmatization. If you face any difficulty implementing lemmatization, please post in the comments section.
Happy Coding!