Fuzzy String
Matching: It is also referred as approximate string matching. It is useful
where we want to search for approximate match between two sequences. There are
number of ways we can do this. In this blog we will consider some JAVA
libraries and code to use approximate string match. Practically it is useful in
providing spelling suggestions, correcting human errors or searching over database
when exact match is not available.
Levenshtein
Distance (LD): It is also known as edit distance in computer science. It is
referred as minimum number of insertion, deletion or replacement of characters
needed to convert one string into another.
Example:
Beat – Best (LD
= 1, replacement of 1 character e->s)
Beat – eat (LD
= 1, deletion of 1 character ‘B’)
In Java Apache
Commons Lang contains method to get Levenshtein Distance. First we need to
download it’s jar file by this link
Java code to
get LD :
import org.apache.commons.lang3.StringUtils;
public class FuzzyDistance {
public
static void main(String[] args)
{
String
str1 = “Skynet”;
String
str2 = “SkyHigh”;
int
distance = StringUtils.getLevenshteinDistance(str1, str2);
System.out.println(“Distance
:” + distance);
}
}
Output: Distance : 4
Jaro-Winker Distance: It is also a
measure of similarity between two strings. The higher value of JWD between two
strings indicate more similarity. It is based on the fact that difference at
the start of the string sequences are more significant than end of the sequences.
This algorithm is also available in Apache Common Lang.
String str1 = “Skynet”;
String
str2 = “SkyHigh”;
int distance
= StringUtils.getJaroWinklerDistance(str1, str2);
System.out.println(“Distance
:” + distance);
Output : Distance : 0.75
Soundex : Phonetic algorithm for
indexing names based on their sounds. It gives you difference between their
encoding. It is available in Apache Commons Codec. This library can be
downloaded from :
Java code to get Soundex difference:
import org.apache.commons.codec.language.Soundex;
public class FuzzyLogic {
public static void
main(String[] args)
{
Soundex sd
= new Soundex();
System.out.println(“Distance : “sd.difference(str1,
str2));
}
Output: Distance : 2
FREJ : Fuzzy Regular Expression for
Java. It is a grep-like utility to get approximate string matching. This
library can be downloaded from :
Example code to run FREJ code:
import net.java.frej.fuzzy.*;
public class FuzzyLogic {
public static void
main(String[] args)
{
String str1 =
"Skynet";
String
str2 = "SkyHigh";
Fuzzy ds =
new Fuzzy();
System.out.println("Distance :"+ds.similarity(str1, str2));
}
Output: Distance : :0.6153846153846154