Fuzzy matching comes into play when simple comparison operators cannot recognize that two slightly different strings refer to the same thing, for instance while removing duplicates. For example, "Microsoft Inc." and "Microsoft" represent the same organization, but the equality operator treats them as two different strings.
>> Str1 = "Microsoft Inc."
>> Str2 = "Microsoft Inc."
>> Result = Str1 == Str2
>> print(Result)
True
The simple comparison above returns True because the two strings match exactly, character for character. If one of them is misspelled or differs in case, the result is False.
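As a small illustration of that point, the sketch below compares one reference string against a few variants with `==`. The variant strings are made up for this example.

```python
# Exact comparison treats any difference in case, spelling, or suffix as a mismatch.
reference = "Microsoft Inc."
variants = ["Microsoft Inc.", "microsoft inc.", "Microsft Inc.", "Microsoft"]

# Map each variant to the result of an exact comparison with the reference.
results = {variant: variant == reference for variant in variants}

for variant, is_equal in results.items():
    print(f"{variant!r} == {reference!r}: {is_equal}")
```

Only the first variant compares equal; the lowercase, misspelled, and shortened forms all come back False.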
To handle such cases, we can use the FuzzyWuzzy package in Python to compare strings and measure how similar they are.
Fuzzy logic is a methodology to evaluate the "degrees of truth" rather than the usual Boolean logic "true or false" approach. The values of truth may vary between 0 and 1.
Fuzzy string matching applies approximate string matching rather than exact string matching.
The FuzzyWuzzy package in Python uses the well-known Levenshtein distance to compute a similarity ratio between two strings.
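To make the idea of Levenshtein distance concrete, here is a minimal pure-Python sketch. The function names `levenshtein` and `similarity` are hypothetical, and the normalization by the longer string's length is only one common choice, not necessarily the exact formula FuzzyWuzzy applies internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the row we store as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Illustrative normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("Microsoft Inc.", "Microsoft"))  # 5: the trailing " Inc." is deleted
print(similarity("Microsoft Inc.", "Microsoft"))
```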
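Let's see how it works in a Jupyter notebook:
Launch a blank Jupyter notebook for Python and install the required packages, for example with pip install fuzzywuzzy (the optional python-Levenshtein package speeds up the matching).
Import the fuzz module with: from fuzzywuzzy import fuzz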
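We can now start applying the algorithm to get similarity scores between strings. fuzz.ratio() compares two whole strings and returns a score from 0 to 100, where 100 means they are identical.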
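A basic score for the two example strings might be computed as follows. This is a sketch: the fallback branch is an assumption added so the snippet still runs where fuzzywuzzy is not installed, and it approximates the score with the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

try:
    from fuzzywuzzy import fuzz  # third-party package
    ratio = fuzz.ratio
except ImportError:
    # Fallback for this sketch only: approximate the score with the standard library.
    def ratio(a, b):
        return round(100 * SequenceMatcher(None, a, b).ratio())

score = ratio("Microsoft Inc.", "Microsoft")
print(score)                                   # 78
print(ratio("Microsoft Inc.", "Microsoft Inc."))  # 100 for identical strings
```

The score is well below 100 because the whole strings are compared, so the extra " Inc." counts against the match.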
fuzz.partial_ratio() scores the best partial match between two strings, which helps when one string is contained in the other. Given two strings X and Y, let the shorter string (X) be of length p. The function computes the fuzz.ratio() score between the shorter string and every substring of length p of the longer string, and returns the maximum of those scores.
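The sliding-window procedure described above can be sketched with the standard library alone. `simple_ratio` and `partial_ratio_sketch` are hypothetical names, and this brute-force version follows the description rather than FuzzyWuzzy's actual implementation, which may pick candidate windows differently.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale (stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio_sketch(x: str, y: str) -> int:
    """Best score of the shorter string against every same-length
    window of the longer string."""
    shorter, longer = (x, y) if len(x) <= len(y) else (y, x)
    p = len(shorter)
    if p == 0:
        return 0
    return max(
        simple_ratio(shorter, longer[i:i + p])
        for i in range(len(longer) - p + 1)
    )


print(simple_ratio("Microsoft Inc.", "Microsoft"))          # 78
print(partial_ratio_sketch("Microsoft Inc.", "Microsoft"))  # 100
```

The partial score reaches 100 because "Microsoft" appears unchanged as a window of the longer string.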
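fuzz.token_sort_ratio() ignores word order: it splits each string into tokens, sorts the tokens, and compares the sorted results, so strings containing the same words in a different order get the same score.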
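A rough sketch of that idea, again using only the standard library. The helper names are hypothetical, and the tokenization rule (lowercase, keep letters and digits) is an assumption that may differ in detail from FuzzyWuzzy's own preprocessing.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale (stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def sorted_tokens(text: str) -> str:
    """Lowercase, drop punctuation, and return the tokens in sorted order."""
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    return " ".join(sorted(tokens))


def token_sort_ratio_sketch(a: str, b: str) -> int:
    """Compare the two strings after sorting their tokens."""
    return simple_ratio(sorted_tokens(a), sorted_tokens(b))


print(sorted_tokens("Microsoft Inc."))                              # inc microsoft
print(token_sort_ratio_sketch("Microsoft Inc.", "Inc. Microsoft"))  # 100
```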
fuzz.token_set_ratio() works on the set of distinct tokens in each string, so a token repeated within a string does not change the score.
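One way to picture the set-based comparison is the sketch below: it separates the tokens both strings share from the tokens unique to each, then takes the best score among the resulting combinations. The names are hypothetical and the details are an approximation of the approach, not FuzzyWuzzy's exact code.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale (stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set(text: str) -> set:
    """Distinct lowercase tokens, punctuation removed."""
    return set(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def token_set_ratio_sketch(a: str, b: str) -> int:
    tokens_a, tokens_b = token_set(a), token_set(b)
    shared = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = f"{shared} {only_a}".strip()  # shared tokens + extras from a
    combined_b = f"{shared} {only_b}".strip()  # shared tokens + extras from b
    return max(
        simple_ratio(shared, combined_a),
        simple_ratio(shared, combined_b),
        simple_ratio(combined_a, combined_b),
    )


print(token_set_ratio_sketch("Microsoft Inc. Microsoft", "Microsoft Inc."))  # 100
print(token_set_ratio_sketch("Microsoft Inc.", "Microsoft"))                 # 100
```

In this sketch the second pair also scores 100, because every token of "Microsoft" is contained in "Microsoft Inc.".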
Based on the type of data set you are working with, you can choose the option that fits best and apply it to your data. Python offers flexible and simple options for handling text similarity problems.