Search code examples
javastringsimilarity

Efficient Way to find most similar value


I have a value such as Color, and a list of String : {Colour,Color,Main Color, Main Colour, Theme, Brand, Subject ..... etc}

I would like to get the most similar string , except the searched string itself. In this example would expect to get Colour. (NOT Color)

I am sorting the list I am using the following rules and ranked the rules :

  1. Filter the same value
  2. check upper lower cases
  3. delete whitespaces. trim
  4. using Levenshtein distance
  5. String order : Main color = Color Main
  6. check for acronym : HP - Hewlett Packard

It takes a lot of time to go over a list of 1000 relevant candidates. Moreover I have lots of candidates to check.

Any other efficient way?

Original Code:

public static List findSimilarity(String word, List candidates) {
    List recommendations = new ArrayList();
    if (!word.equals("")) {
        for (String candidate : candidates) {
            if (!word.equals(candidate)) { //1. same token , lower/upper cases , ignore white spaces
                if (StringUtils.deleteWhitespace(word).equalsIgnoreCase(StringUtils.deleteWhitespace(candidate))) {
                    recommendations.add(candidate);
                }
                //2. same tokens diff order
                else if (candidate.split(" ").length == word.split("     ").length) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");
                    boolean status = true;
                    SortIgnoreCase icc = new SortIgnoreCase();
                    Arrays.sort(candidatearr, icc);
                    Arrays.sort(wordarr, icc);
                    for (int i = 0; i < candidatearr.length; i++) {
                        if (!(candidatearr[i] == null ? wordarr[i] == null : wordarr[i].equalsIgnoreCase(candidatearr[i])))
                            status = false;
                    }

                    if (status) {
                        recommendations.add(candidate);
                    }
                }
            }
        }
        //3. distance between words
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");
                    //check for acronym
                    if ((wordarr.length == 1 && candidatearr.length > 1) || (wordarr.length > 1 && candidatearr.length == 1)) {
                        String acronym = "";
                        if (wordarr.length > candidatearr.length) {
                            for (String tmp : wordarr) {
                                if (!tmp.equals("")) {
                                    acronym = acronym + tmp.substring(0, 1);
                                }
                            }

                            if (acronym.equalsIgnoreCase(candidatearr[0])) {
                                recommendations.add(candidate);
                            }
                        } else {
                            for (String tmp : candidatearr) {
                                if (!tmp.equals("")) {
                                    acronym = acronym + tmp.substring(0, 1);
                                }
                            }

                            if (acronym.equalsIgnoreCase(wordarr[0])) {
                                recommendations.add(candidate);
                            }
                        }
                    }
                }
            }
        }

        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    int dist = 0;
                    String check = "";
                    if (word.length() > candidate.length()) {
                        check = candidate;
                    } else {
                        check = word;
                    }
                    if (check.length() <= 3) {
                        dist = 0;
                    } else if (check.length() > 3 && check.length() <= 5) {
                        dist = 1;
                    } else if (check.length() > 5) {
                        dist = 2;
                    }

                    if (StringUtils.getLevenshteinDistance(word, candidate) <= dist) {
                        //if(Levenshtein.distance(word,candidate) <= dist){
                        recommendations.add(candidate);
                    }
                }
            }
        }

        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");

                    for (String cand : candidatearr) {
                        for (String wor : wordarr) {
                            if (cand.equals(wor) && cand.length() > 4) {
                                recommendations.add(candidate);

                            }
                        }
                    }
                }
            }//for
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }

        //4. low priority - starts with
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    if (candidate.startsWith(word) || word.startsWith(candidate)) {
                        recommendations.add(candidate);
                    }
                }
            }
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }

        //5. low priority - contain word
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    if (candidate.contains(word) || word.contains(candidate)) {
                        recommendations.add(candidate);
                    }
                }
            }
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }
    }
    return recommendations;
}

Thanks, M.


Solution

  • Edited

    I wrapped the answer given by Bohemian into the context of your original code for your better understanding.

    The line .map(term -> Arrays.stream(term.split(" ")).sorted().collect(Collectors.joining(" "))) splits multi-word terms, sorts, and joins again to eliminate permutations of same words. This is an answer to the permutational equality challenge of terms like "main color" and "color main".

    However, it does not make sense to catch all the business requirements of your task in the context of this question. By this answer you have got an outline of the solution. The problem of efficiency is addressed. You might need more stages in your pipeline, but that's a different story. The strength of the approach is that all stages are detached, so you can ask questions and seek help for each stage independently.

    public static String findSimilarity(String word, List<String> candidatesList) {
    
        // Populating the set with distinct values of the input terms
        Set<String> candidates = candidatesList.stream()
                .map(String::toLowerCase)
                .map(term -> Arrays.stream(term.split(" ")).sorted().collect(Collectors.joining(" "))) // eliminates permutations
                .collect(Collectors.toSet());
    
        Map<String, Integer> cache = new ConcurrentHashMap<>();
    
        return candidates.parallelStream()
                .map(String::trim)
                        // add more mappers if needed
                .filter(s -> !s.equalsIgnoreCase(word))
                        // add more filters if needed
                .min((a, b) -> Integer.compare(
                        cache.computeIfAbsent(a, k -> getLevenshteinDistance(word, k)),
                        cache.computeIfAbsent(b, k -> getLevenshteinDistance(word, k))))
                .get(); // get the closest match
    }