Wednesday, October 7, 2015

Fuzzy String Matching in Java

Fuzzy String Matching: It is also referred as approximate string matching. It is useful where we want to search for approximate match between two sequences. There are number of ways we can do this. In this blog we will consider some JAVA libraries and code to use approximate string match. Practically it is useful in providing spelling suggestions, correcting human errors or searching over database when exact match is not available.

Levenshtein Distance (LD): It is also known as edit distance in computer science. It is referred as minimum number of insertion, deletion or replacement of characters needed to convert one string into another.
Example:
Beat – Best (LD = 1, replacement of 1 character e->s)
Beat – eat (LD = 1, deletion of 1 character ‘B’)

In Java Apache Commons Lang contains method to get Levenshtein Distance. First we need to download it’s jar file by this link
Java code to get LD :

import org.apache.commons.lang3.StringUtils;
public class FuzzyDistance {
          public static void main(String[] args)
          {
                   String str1 = “Skynet”;
                   String str2 = “SkyHigh”;
                   int distance = StringUtils.getLevenshteinDistance(str1, str2);
                   System.out.println(“Distance :” + distance);
          }
}

Output: Distance : 4

Jaro-Winker Distance: It is also a measure of similarity between two strings. The higher value of JWD between two strings indicate more similarity. It is based on the fact that difference at the start of the string sequences are more significant than end of the sequences. This algorithm is also available in Apache Common Lang.

String str1 = “Skynet”;
          String str2 = “SkyHigh”;
          int distance = StringUtils.getJaroWinklerDistance(str1, str2);
          System.out.println(“Distance :” + distance);

Output : Distance : 0.75

Soundex : Phonetic algorithm for indexing names based on their sounds. It gives you difference between their encoding. It is available in Apache Commons Codec. This library can be downloaded from :
Java code to get Soundex difference:
import org.apache.commons.codec.language.Soundex;
public class FuzzyLogic {
          public static void main(String[] args)
          {
                   Soundex sd = new Soundex();
                   System.out.println(“Distance : “sd.difference(str1, str2));
          }
Output: Distance : 2

FREJ : Fuzzy Regular Expression for Java. It is a grep-like utility to get approximate string matching. This library can be downloaded from :
Example code to run FREJ code:

import net.java.frej.fuzzy.*;
public class FuzzyLogic {
          public static void main(String[] args)
          {
String str1 = "Skynet";
                   String str2 = "SkyHigh";
                   Fuzzy ds = new Fuzzy();
                   System.out.println("Distance  :"+ds.similarity(str1, str2));
          }

Output: Distance : :0.6153846153846154