At CSLD we strive to create blog content on how to provide superior customer service. We also create blogs to provide more technical information on data analytics and unstructured text data. This week’s blog is more technical with a focus on the effective use of stop words to optimize unstructured text data analytics.
Effectively Using Stop Words
The effective use of stop words is an essential concept to understand before working with unstructured text data. Stop words are words that text-processing algorithms remove from a text string as part of natural language processing. They add little value to the overall data processing objective, so removing them usually produces better results. Typical stop words are determiners, coordinating conjunctions, and prepositions.
- Determiners are words that usually mark and precede a noun
Examples of determiners: “a”, “the”, “an”
- Coordinating conjunctions are words that connect other words or phrases
Examples of coordinating conjunctions: “and”, “but”, “yet”
- Prepositions are words that govern a noun or pronoun and express a relation to another word or element
Examples of prepositions: “in”, “above”, “after”
It is important to understand when to apply a stop word list and when to avoid using stop words. It is also important to ensure that you are using a stop word list to facilitate (not hinder) your data processing objectives. When working with stop words, data processing tools reference a file that contains a list of the stop words to remove as a step in the data analysis process. While there are some commonly used stop word files, there is not a universal list of stop words. Therefore, any group of words can become a stop word list for a particular data processing purpose.
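As a rough illustration of that step, here is a minimal Python sketch. The function names (`load_stop_words`, `remove_stop_words`), the tiny word list, and the one-word-per-line file format are assumptions made for this example, not the behavior of any particular tool.

```python
import re

def load_stop_words(lines):
    """Build a stop word set from an iterable of lines (assumed: one word per line)."""
    return {line.strip().lower() for line in lines if line.strip()}

def remove_stop_words(text, stop_words):
    """Lowercase the text, split it into word tokens, and drop any stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stop_words]

# Hypothetical, deliberately tiny list; a real one would come from a file,
# e.g. load_stop_words(open("stop_words.txt", encoding="utf-8")).
stop_words = load_stop_words(["a", "an", "the", "and", "but", "yet", "in", "above", "after"])

sentence = "The agent was friendly and resolved the issue in a day"
print(remove_stop_words(sentence, stop_words))
# ['agent', 'was', 'friendly', 'resolved', 'issue', 'day']
```

Because the list is just a set of words, swapping in a different stop word file changes the result without changing any of the processing code.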
Deciding When to Use Stop Words
Different data processing algorithms will leverage stop words for various functions. For example:
- Supervised machine learning will remove stop words from the feature space
- Information retrieval algorithms will not index stop words
- Clustering algorithms will eliminate stop words to generate more accurate clusters
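To make the first bullet concrete, the sketch below builds a simple bag-of-words vocabulary (the feature space) with and without stop words. The `vocabulary` helper, the sample sentences, and the three-word stop list are hypothetical and chosen only for illustration.

```python
def vocabulary(documents, stop_words=frozenset()):
    """Return the sorted set of distinct words across all documents, minus stop words."""
    vocab = set()
    for doc in documents:
        vocab.update(w for w in doc.lower().split() if w not in stop_words)
    return sorted(vocab)

# Hypothetical sample data and stop word list.
docs = ["the agent was friendly", "the wait was long and the line was slow"]
stop_words = {"the", "was", "and"}

full = vocabulary(docs)
reduced = vocabulary(docs, stop_words)
print(len(full), len(reduced))  # 9 6
print(reduced)  # ['agent', 'friendly', 'line', 'long', 'slow', 'wait']
```

The features that remain are the words that actually distinguish one document from another.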
There are, however, times when an algorithm should intentionally keep stop words when processing text data. For example, stop words can be important when searching for phrases. If stop words are left out of the index, searches for some phrases may return poor results. Searching for song lyrics and searching for famous quotes are examples where every word in the sentence is necessary to ensure effective search results.
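The famous-quote case is easy to demonstrate. In this sketch, the `index_terms` function and the stop word list are hypothetical; the point is only that a quote made up entirely of common words leaves nothing behind to index.

```python
def index_terms(text, stop_words):
    """Return the words that would be indexed once stop words are dropped."""
    return [w for w in text.lower().split() if w not in stop_words]

# Hypothetical stop word list containing very common English words.
stop_words = {"to", "be", "or", "not", "the", "a", "is", "that"}

quote = "to be or not to be"
print(index_terms(quote, stop_words))  # []
print(index_terms(quote, set()))       # ['to', 'be', 'or', 'not', 'to', 'be']
```

With the stop words removed, the index has no terms for this quote, so a phrase search has nothing to match.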
Deciding What Type of Stop Word File to Use
When choosing a stop word list, start from the goal of your data analysis. For example, when clustering a data set, you may choose to leverage a large stop word file to remove all determiners, coordinating conjunctions, prepositions, and even some adjectives (words like “bad” and “friendly”). If you applied the same stop word list to a sentiment analysis algorithm, excluding the adjectives from your data would severely impact the sentiment analysis results.
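The effect on sentiment analysis can be sketched with a toy word-count scorer. Everything here is an assumption made for illustration: the two stop word lists, the two-word sentiment lexicon, and the sample review. Real sentiment analysis is far more involved.

```python
# Hypothetical stop word lists: the large one also removes some adjectives.
SMALL_STOP_WORDS = {"the", "a", "and", "but", "was", "in"}
LARGE_STOP_WORDS = SMALL_STOP_WORDS | {"bad", "friendly"}

# Toy sentiment lexicon: +1 for a positive word, -1 for a negative word.
SENTIMENT = {"friendly": 1, "bad": -1}

def sentiment_score(text, stop_words):
    """Sum the lexicon scores of the words that survive stop word removal."""
    tokens = [w for w in text.lower().split() if w not in stop_words]
    return sum(SENTIMENT.get(w, 0) for w in tokens)

review = "the agent was friendly but the wait was bad and the hold music was bad"
print(sentiment_score(review, SMALL_STOP_WORDS))  # -1
print(sentiment_score(review, LARGE_STOP_WORDS))  # 0
```

With the large list, the words that carry the opinion are removed before scoring, so a mostly negative review looks neutral.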
The following are a few general English stop word resources to use as a starting point for your stop word list when performing generic data analytics:
- Snowball: http://snowball.tartarus.org/algorithms/english/stop.txt
- Terrier: http://terrier.org/docs/v2.2.1/javadoc/uk/ac/gla/terrier/terms/Stopwords.html
- Ranks NL: http://www.ranks.nl/stopwords
Building a Custom Stop Word List
Using a generic stop word file for basic text data analytics may be sufficient. However, building a custom stop word file may help you achieve better results, especially in industry-specific or domain-specific scenarios. Custom stop word files contain words that are common and overused in that industry or domain. For example, if you are working with data for a particular company, you could add the name of the company to your stop word file if it is not needed for your data analytics.
A general strategy for building a custom stop word list is to sort all the words in your text based on term frequency. Term frequency is the total number of times each word appears in the complete data set. Sort the words with the most frequently used words at the top of the list. Then, consider adding the most frequently used words to your stop word file.
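A minimal sketch of this strategy, using Python's standard `collections.Counter`, might look like the following. The `stop_word_candidates` function, the sample documents, and the company name "Acme" are hypothetical.

```python
import re
from collections import Counter

def stop_word_candidates(documents, top_n=3):
    """Count every word across the data set and return the top_n most frequent."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    return counts.most_common(top_n)

# Hypothetical customer comments about a made-up company named Acme.
docs = [
    "Acme support resolved the billing issue",
    "The Acme app crashed after the update",
    "Acme billing charged the card twice",
]

print(stop_word_candidates(docs))
# [('the', 4), ('acme', 3), ('billing', 2)]
```

The output is a list of candidates to review, not a finished stop word list. Here "the" and the company name are reasonable additions, while "billing" may be exactly the kind of word the analysis needs to keep.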
Regardless of how you develop your stop word list, it is important to understand how to use the list. Stop words can either help or hurt your results. Effective use of stop words will help you achieve the best outcome.