– Unstructured Data Analytics – An Introduction to Stemming and Lemmatization Techniques –
Stemming and lemmatization functions are an important part of text mining. These functions examine text and remove the inflectional endings of a word, and sometimes its derivationally related forms, to return the base or dictionary form of the word. For example, a stemmer may reduce the words discover, discovering, and discovered to the common base discov, while a lemmatizer would return the dictionary form discover. Once a stemming or lemmatization process has been applied and the algorithm consistently produces a base for every word in a document, you can apply additional text mining functions with more accurate results.
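As a rough illustration of the idea, the sketch below strips a handful of common suffixes so that the three example words collapse to one base. The function name `toy_stem` and the suffix list are assumptions made for this example. It is not the Porter or Snowball algorithm, both of which apply many more rules.

```python
# Toy suffix stripper: an illustrative sketch, not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "er", "s")  # assumed, deliberately tiny rule set


def toy_stem(word: str) -> str:
    """Lowercase the word and strip listed suffixes until none apply."""
    stem = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            # Keep at least three characters so short words survive intact.
            if stem.endswith(suffix) and len(stem) - len(suffix) >= 3:
                stem = stem[: -len(suffix)]
                changed = True
                break
    return stem


words = ["discover", "discovering", "discovered"]
stems = {word: toy_stem(word) for word in words}
print(stems)
# {'discover': 'discov', 'discovering': 'discov', 'discovered': 'discov'}
```

Because every variant maps to the same string, a later step such as word counting would treat the three words as one term.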
In language, words are modified with several types of inflections to give the form of the word added meaning. For example, an inflection can be in the form of gender, person, quantity, tense, etc. Here are some examples of word inflections:
|Gender|host (male), hostess (female)|
|Verb Tense|walk, walked, walking|
|Person|I sing (1st person), she sings (3rd person)|
|Case Variation|I, my, me|
|Quantity|goose (singular), geese (plural)|
Stemming vs. Lemmatization
You can apply several stemming and lemmatization variations to text data, and it is important to know the difference before deciding on a technique. Here is a brief definition of each to help distinguish the two options:
Stemming: A process for reducing inflected (or sometimes derived) words to their word stem, base or root form. Stemming is more of a crude process that “chops” the ends of words.
Lemmatization: A process that uses vocabulary and morphological analysis of words, aiming to remove inflectional endings only and to return the base or dictionary form of a word. The base is called a “lemma.”
Lemmatization is more formal or “proper” because, instead of just chopping the ends of words to try to reach a base, it performs more analysis to arrive at the true lemma of the word. For example, “be” is the lemma of the words “am”, “is”, “being” and “was”. Stemming does not always make the same distinction. Lemmatization is also much more complicated than stemming because it uses vocabulary and morphological analysis to understand context before deriving the root word or lemma. For example, lemmatization would use context to decide whether a word like “drawer” means a storage compartment (lemma=drawer) or a person who draws (lemma=draw). Because of this added complexity, lemmatization generally takes a computer program longer to process than stemming. In practice, however, stemming is often effective enough for many text mining tasks and is faster than lemmatization.
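To make the contrast concrete, here is a minimal sketch of lemmatization as a dictionary lookup that depends on both the word and its role in the sentence. The table contents, the function name `toy_lemmatize`, and the tags `NOUN_OBJECT` and `NOUN_AGENT` are all assumptions invented for this example. A real lemmatizer would rely on a full vocabulary and would infer the role from the surrounding text.

```python
# Toy lemmatizer: a lookup table keyed by (word, role in the sentence).
# The table and the role tags are assumptions made for this sketch.
LEMMA_TABLE = {
    ("am", "VERB"): "be",
    ("is", "VERB"): "be",
    ("being", "VERB"): "be",
    ("was", "VERB"): "be",
    ("drawer", "NOUN_OBJECT"): "drawer",  # a storage compartment
    ("drawer", "NOUN_AGENT"): "draw",     # a person who draws
}


def toy_lemmatize(word: str, role: str) -> str:
    """Return the lemma for (word, role), or the lowercased word if unknown."""
    return LEMMA_TABLE.get((word.lower(), role), word.lower())


print(toy_lemmatize("was", "VERB"))            # be
print(toy_lemmatize("drawer", "NOUN_OBJECT"))  # drawer
print(toy_lemmatize("drawer", "NOUN_AGENT"))   # draw
```

No suffix-chopping rule could map “was” to “be”, which is why lemmatization needs a vocabulary and why it costs more to run.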
Use Stemming with Caution
Because stemming is less complicated and does not require the context analysis of lemmatization, text mining results may be less accurate if you apply the wrong stemming algorithm to your text data.
Some stemming programs can over-stem or under-stem your text data. Under-stemming algorithms fail to generate the same root for two words that have the same meaning. For example, the word “climbed” will return “climb” while “climbing” may return “climbi.” In this case, a text mining algorithm would not treat these two words as the same even though they have the same meaning. Over-stemming algorithms create the same root for two words that have different meanings. For example, the words “standard” and “standing”, if over-stemmed, may both return the stem “stand.” In this case, a text mining algorithm would assume the words are the same even though their meanings are completely different.
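One way to check a stemmer for these two problems is to compare its output against small, hand-labelled groups of words that share a meaning. The sketch below does this with two deliberately flawed stemmers. The names `light_stem`, `aggressive_stem`, and `stemming_errors`, along with the word groups, are assumptions made for illustration and do not correspond to any published algorithm.

```python
from itertools import combinations


def light_stem(word: str) -> str:
    """Assumed overly cautious stemmer: only strips a trailing 'ed'."""
    return word[:-2] if word.endswith("ed") else word


def aggressive_stem(word: str) -> str:
    """Assumed overly eager stemmer: keeps only the first five letters."""
    return word[:5]


def stemming_errors(stem, groups):
    """Compare stems against hand-labelled groups of same-meaning words."""
    labelled = [(word, gid) for gid, group in enumerate(groups) for word in group]
    under, over = [], []
    for (w1, g1), (w2, g2) in combinations(labelled, 2):
        same_stem = stem(w1) == stem(w2)
        if g1 == g2 and not same_stem:
            under.append((w1, w2))  # same meaning, different stems
        elif g1 != g2 and same_stem:
            over.append((w1, w2))   # different meanings, same stem
    return under, over


# Each inner list holds words that should share one stem.
groups = [["climbed", "climbing"], ["standard"], ["standing"]]

print(stemming_errors(light_stem, groups))
# ([('climbed', 'climbing')], [])
print(stemming_errors(aggressive_stem, groups))
# ([], [('standard', 'standing')])
```

The cautious stemmer under-stems the “climb” pair, while the eager one merges “standard” and “standing”, mirroring the two failure modes described above.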
Variations of Stemming Algorithms
Because there are several variations of stemming, and because of the issues that arise when documents are over- or under-stemmed, it is important to understand the different stemming techniques. The two most commonly used algorithms are the Porter Stemmer and the Snowball Stemmer. In 1980, Dr. Martin Porter wrote and published the Porter Stemmer algorithm, which quickly became the standard algorithm for English stemming. Other people wrote and distributed many implementations of the Porter Stemmer, but many of these contained errors. To address this, in the early 2000s Dr. Porter released Snowball, a framework for defining stemming algorithms precisely. Snowball provided an improved English stemmer and extended stemming to other languages as well. For more information on Porter and Snowball stemming, please visit these sites:
Stemming and lemmatization algorithms provide a similar function but can produce drastically different data output. When text mining, depending on the desired outcome, you will need to determine which variation of stemming or lemmatization to use to get the most desirable results.