– Unstructured Data Analytics – Complexity in Similarity Matching –

When working with unstructured data analytics, one of the more complex problems is handling comparable words when running similarity matching algorithms.  Consider using unstructured data analytics to find two sentences that are similar.  With unstructured customer service data, there are several applications for finding similar sentences to cluster concepts: for example, clustering customer feedback from customer comments, or discovering redundancy in customer requests for process automation.

One of the key opportunities to improve similarity matching results is to ensure you have a good understanding of the difference between synonymy, homonymy, and polysemy so that you can better determine how to program similarity matching to address each situation.

To begin, consider the following simple definition of each term:

  • Synonymy [synonyms (noun), synonymous (adjective)]: Two words that have equivalent or similar meanings – Examples: definition and meaning, someone and somebody, change and alter
  • Homonymy [homonyms (noun), homonymous (adjective)]: Two words that either sound the same when spoken, have the same spelling, or both and have different meanings – Examples: bass, accent, bear and bare, steel and steal
  • Polysemy [polysemic word (noun), polysemous (adjective)]: One word that can be used to express different meanings and the difference can be obvious or subtle – Examples: newspaper, crane, milk

There is danger in building similarity models on programming assumptions that do not take each word comparison situation into consideration.  Unfortunately, at this point, few algorithms deal accurately with similarity matching and word comparison in every scenario.  Consider the following more detailed explanation of the complexity of similarity matching in each situation:


Complexity with Synonyms

Dealing with synonyms as part of unstructured data analytics may at first seem to be a simple task.  The definition of “synonym” itself is widely known and understood by most people.  But the degree to which two words are synonymous adds several layers of complexity when trying to find two documents with comparable meaning.  A “true synonym” is defined as one of two words that have the exact same meaning.  Many linguistic scholars argue that in modern language, true synonyms do not exist and that as humans (or as part of our social climate), we have an aversion to synonymy.  If two words evolve that appear to be true synonyms, we systematically try to find the difference between the two words, even if it is just a subtle difference or connotation that distinguishes one word from the other.  It is our human brain (or our social climate) that recognizes the redundancy in “true synonyms” and will fight to find the difference to remove the redundancy.

If “true synonyms” don’t exist, then our definition of synonyms should be clarified as two words that have a similar meaning.  The challenge with synonyms then becomes determining, on a spectrum/index, how similar two synonyms are to each other.  Some synonyms may be remarkably similar, say 99%, like the words solve and resolve, while other synonyms like help and support may be only, say, 65% synonymous.  So, to conduct similarity matching of short sentences, in theory, there are two ways to consider dealing with synonyms:

  1. The first would be to generate a synonym replacer and replace all synonyms with just one of the synonymous words before running a similarity matching algorithm. To build the synonym replacer, you would need to determine how closely on the index the synonyms must match before executing the replacement (e.g. replace all synonyms that match at a threshold of 90% or more).  If you use this method, you should program in the flexibility to adjust the threshold to tune results.
  2. The second method is to leverage the actual index score in the similarity matching and use the index as part of the calculation instead of replacing the synonym. This method will provide better results, but the programming is much more complicated and introduces more opportunity for error.
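
The two methods above can be sketched roughly as follows.  This is a minimal illustration, not a production approach: the synonym index, its scores (taken from the illustrative 99% and 65% figures above), and all function names are assumptions made for the example, and Jaccard overlap stands in for whatever similarity measure you actually use.

```python
import re

# Hypothetical synonym index: word pair -> similarity score in [0, 1].
# The scores echo the illustrative percentages in the text; they are not measured.
SYNONYM_INDEX = {
    frozenset(("solve", "resolve")): 0.99,
    frozenset(("help", "support")): 0.65,
}


def tokenize(sentence):
    """Lowercase the sentence and split it into simple word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def word_similarity(a, b):
    """Identical words score 1.0; otherwise look the pair up in the index."""
    if a == b:
        return 1.0
    return SYNONYM_INDEX.get(frozenset((a, b)), 0.0)


# Method 1: replace synonyms above a tunable threshold, then compare.
def build_replacer(threshold):
    """Map one word of each qualifying pair to the other (alphabetically first)."""
    mapping = {}
    for pair, score in SYNONYM_INDEX.items():
        if score >= threshold:
            canonical, other = sorted(pair)
            mapping[other] = canonical
    return mapping


def jaccard_with_replacement(s1, s2, threshold=0.9):
    """Jaccard overlap of the two token sets after synonym replacement."""
    mapping = build_replacer(threshold)
    a = {mapping.get(t, t) for t in tokenize(s1)}
    b = {mapping.get(t, t) for t in tokenize(s2)}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Method 2: use the index score itself as part of the calculation.
def soft_similarity(s1, s2):
    """Average, in both directions, of each word's best match in the other sentence."""
    a, b = tokenize(s1), tokenize(s2)
    if not a or not b:
        return 0.0

    def directed(xs, ys):
        return sum(max(word_similarity(x, y) for y in ys) for x in xs) / len(xs)

    return (directed(a, b) + directed(b, a)) / 2
```

With a threshold of 0.9, only solve/resolve are merged, so “please help me solve this” and “please support me resolve this” still differ on help/support; lowering the threshold to 0.6 merges both pairs and the sentences become identical.  The second method avoids that all-or-nothing choice by letting help/support contribute a partial score.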

Additionally, there is not a reliable way to establish an accurate similarity index because the level of similarity between two synonyms may differ depending on context.  For example, consider the synonyms “link” and “bridge”.  In some cases, sentences using the two words can be closely comparable, as demonstrated in this example:

“There is a link between the concepts of atheism and agnosticism.”

“I can bridge the gap between what an atheist believes and what an agnostic believes.”

However, depending on context, sentences using the same two words may have completely different meanings, and treating the words as synonyms would produce less accurate results, as these two sentences show:

“This chain has 62 links.”

“I crossed several bridges while walking beside the river.”
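
To make the risk concrete, here is a small sketch of what a context-free synonym replacement does to the two sentences above.  The synonym map (including the hand-listed plural forms) and the use of Jaccard overlap are assumptions chosen for illustration only.

```python
import re

# Hypothetical context-free synonym map; inflected forms are listed by hand.
SYNONYMS = {"link": "bridge", "links": "bridge", "bridges": "bridge"}


def tokens(sentence, mapping=None):
    """Return the set of lowercase tokens, optionally replacing synonyms."""
    mapping = mapping or {}
    return {mapping.get(w, w) for w in re.findall(r"[a-z0-9]+", sentence.lower())}


def jaccard(s1, s2, mapping=None):
    """Jaccard overlap between the token sets of two sentences."""
    a, b = tokens(s1, mapping), tokens(s2, mapping)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


chain = "This chain has 62 links."
river = "I crossed several bridges while walking beside the river."

before = jaccard(chain, river)            # no shared words
after = jaccard(chain, river, SYNONYMS)   # "links" and "bridges" now match
```

Before replacement the sentences share no words; after replacement they share one, so the score rises even though the sentences are unrelated.  A replacer with no notion of context cannot tell this case apart from the atheism and agnosticism example.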

Keeping this in mind, it becomes necessary to understand the difference between synonymy, homonymy, and polysemy.


Complexity with Homonyms

Generally, homonyms are defined as any two words that have different meanings but either sound the same when spoken, have the same spelling, or both.  The following distinguishes the various types of homonyms:

  • Homographs: Two words with the same spelling but different pronunciation and different meanings:
    Example: dove
    “I dove into the lake.”
    “I saw a lovely dove in a tree today.”
  • Homophones: Two words with different spelling but the same pronunciation and different meanings:
    Example: aunt and ant
    “I invited my aunt to dinner.”
    “There was an ant on the dinner table.”
  • Homonyms: Two words with the same spelling and the same pronunciation but different meanings (just referred to as homonyms):
    Example: duck
    “I had to duck to avoid hitting my head.”
    “We fed a duck some bread by the pond.”

When dealing with unstructured data analytics, homographs and homonyms are more complicated because the distinction between the two words can only be made based on the context in a sentence.  Homophones, on the other hand, are not as complicated because of the difference in spelling, unless speech recognition software is used as part of the technology solution.

It is evident why homonyms create complexity when performing similarity matching in unstructured data analytics.  A computer program is less likely to be able to distinguish two words if the spelling is the same but the meaning is different (regardless of whether the pronunciation is the same).  Thus, complex programming may be required to deal effectively with homonyms in similarity matching algorithms, and it is also important to understand the difference between homonymy and polysemy.
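
One common way to use context is an overlap heuristic in the spirit of the simplified Lesk algorithm: pick the sense whose associated words overlap most with the rest of the sentence.  The sketch below assumes a tiny, hand-made sense inventory for “duck”; the sense labels and signature words are hypothetical and were chosen to fit the example sentences, so a real system would need a proper lexical resource.

```python
import re

# Hypothetical sense inventory: word -> sense label -> signature words.
SENSES = {
    "duck": {
        "bird": {"fed", "feed", "bread", "pond", "feathers", "quack"},
        "lower_head": {"avoid", "hitting", "head", "low", "ceiling"},
    }
}


def disambiguate(word, sentence):
    """Return the sense whose signature overlaps most with the context.

    Returns None when no signature word appears in the sentence.
    """
    context = set(re.findall(r"[a-z]+", sentence.lower())) - {word}
    best_sense, best_overlap = None, 0
    for sense, signature in SENSES.get(word, {}).items():
        overlap = len(context & signature)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

For “I had to duck to avoid hitting my head.” the context shares three words with the `lower_head` signature and none with `bird`, so the two uses of “duck” can be given different keys before similarity matching.  When the context offers no signature words, the function returns None rather than guessing.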

As mentioned above, homonyms are two words that sound the same when spoken, have the same spelling, or both, but have different meanings.  A polysemous word, by contrast, is one word that can carry different meanings, and the difference can be obvious or subtle.  A polysemic word may even be distinguished based on the part of speech of the word (e.g. used as a noun as opposed to a verb).  At times, if the distinction is subtle, it may be difficult to tell whether a word is a homonym or polysemic.  When you consider the difference, try to determine whether you are looking at two separate words with different meanings (a homonym) or one word with related meanings used in different ways (polysemic).  Here are some examples:

  • Obvious Homonymy: down
    “The cat climbed down the tree.”
    “The down blanket is warm.”
  • Obvious Polysemy: wood
    “The wood on that tree is burnt.”
    “Be careful not to go walking at night alone in the wood.”
  • Polysemy that is also Homonymous: bank
    “I made a deposit at the bank.”
    Polysemy Example: “You can bank on me.” (as in “depend on”)
    Homonymy Example: “Let’s sit on the bank of the river.”
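
The part-of-speech distinction in the “bank” examples can be sketched by matching on (word, tag) pairs instead of bare words.  The tags below are assigned by hand purely for illustration; in practice they would come from a part-of-speech tagger, and the tag names and the `shared` helper are assumptions.

```python
# Hand-tagged sentences (hypothetical tags; a real pipeline would use a tagger).
deposit = [("i", "PRON"), ("made", "VERB"), ("a", "DET"), ("deposit", "NOUN"),
           ("at", "ADP"), ("the", "DET"), ("bank", "NOUN")]
depend = [("you", "PRON"), ("can", "AUX"), ("bank", "VERB"),
          ("on", "ADP"), ("me", "PRON")]
river = [("let's", "VERB"), ("sit", "VERB"), ("on", "ADP"), ("the", "DET"),
         ("bank", "NOUN"), ("of", "ADP"), ("the", "DET"), ("river", "NOUN")]


def shared(a, b, use_pos=False):
    """Return the terms two tagged sentences share.

    With use_pos=False, terms are bare words; with use_pos=True, a term
    is a (word, tag) pair, so the same spelling with a different part of
    speech no longer counts as a match.
    """
    def key(token):
        return token if use_pos else token[0]

    return {key(t) for t in a} & {key(t) for t in b}


words_only = shared(deposit, depend)                # {"bank"}
with_pos = shared(deposit, depend, use_pos=True)    # empty: NOUN vs VERB
same_pos = shared(deposit, river, use_pos=True)     # still matches on ("bank", "NOUN")
```

Tagging separates “bank” the noun from “bank” the verb, but the financial bank and the river bank are both nouns and still match.  Part of speech helps with some polysemy, while same-tag homonyms still need context.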



When programming similarity matching for unstructured data analytics, it is important to understand these distinctions, even if just to determine why an algorithm is not producing expected results.  There are several ways to adjust the algorithm to take synonymy, homonymy, and polysemy into consideration, but it is nearly impossible to account for every scenario given the complexity of human language.  The goal of understanding this complexity is to take different scenarios into consideration and program your unstructured data analytics algorithms to be as accurate as possible.