Search code examples
javahashhashmapanagram

What should be the logic of hashfunction() in order to check that two strings are anagrams or not?


I want to write a function that takes string as a parameter and returns a number corresponding to that string.

Integer hashfunction(String a)
{    
    //logic    
}

Actually the question im solving is as follows :

Given an array of strings, return all groups of strings that are anagrams. Represent a group by a list of integers representing the index in the original list.

Input : cat dog god tca

Output : [[1, 4], [2, 3]]

Here is my implementation :-

public class Solution {
    Integer hashfunction(String a)
    {
        int i=0;int ans=0;
        for(i=0;i<a.length();i++)
        {
            ans+=(int)(a.charAt(i));//Adding all ASCII values    
        }

        return new Integer(ans);
    }
    **Obviously this approach is incorrect**
    public ArrayList<ArrayList<Integer>> anagrams(final List<String> a) {
        int i=0;
        HashMap<String,Integer> hashtable=new HashMap<String,Integer>();
        ArrayList<Integer> mylist=new ArrayList<Integer>();
        ArrayList<ArrayList<Integer>> answer=new  ArrayList<ArrayList<Integer>>(); 
        if(a.size()==1)
        {
            mylist.add(new Integer(1));
            answer.add(mylist);
            return answer;
        }

        int j=1;
        for(i=0;i<a.size()-1;i++)
        {   

            hashtable.put(a.get(i),hashfunction(a.get(i)));
            for(j=i+1;j<a.size();j++)
            {

                if(hashtable.containsValue(hashfunction(a.get(j))))
                {
                    mylist.add(new Integer(i+1));
                    mylist.add(new Integer(j+1));
                    answer.add(mylist);
                    mylist.clear();


                }
            }
        }
        return answer;
    }
}

Solution

  • Oh boy... there's quite a bit of stuff that's open for interpretation here. Case-sensitivity, locales, characters allowed/blacklisted... There are going to be a lot of ways to answer the general question. So, first, let me lay down a few assumptions:

    1. Case doesn't matter. ("Rat" is an anagram of "Tar", even with the capital lettering.)
    2. Locale is American English when it comes to the alphabet. (26 letters from A-Z. Compare this to Spanish, which has 28 IIRC, among which 'll' is considered a single letter and a potential consideration for Spanish anagrams!)
    3. Whitespace is ignored in our definition of an anagram. ("arthas menethil" is an anagram of "trash in a helmet" even though the number of whitespaces is different.)
    4. An empty string (null, 0-length, all white-space) has a "hash" (I prefer the term "digest", but a name is a name) of 1.

    If you don't like any of those assumptions, you can modify them as you wish. Of course, that will result in the following algorithm being slightly different, but they're a good set of guidelines that will make the general algorithm relatively easy to understand and refactor if you wish.

    Two strings are anagrams if they are exhaustively composed of the same set of characters and the same number of each included character. There's a lot of tools available in Java that makes this task fairly simple. We have String methods, Lists, Comparators, boxed primitives, and existing hashCode methods for... well, all of those. And we're going to use them to make our "hash" method.

    private static int hashString(String s) {
        if (s == null) return 0; // An empty/null string will return 0.
    
        List<Character> charList = new ArrayList<>(); 
        String lowercase = s.toLowerCase(); // This gets us around case sensitivity
    
        for (int i = 0; i < lowercase.length(); i++) {
            Character c = Character.valueOf(lowercase.charAt(i));
            if (Character.isWhitespace(c)) continue; // spaces don't count
            charList.add(c); // Note the character for future processing...
        }
    
        // Now we have a list of Characters... Sort it!
        Collections.sort(charList);
        return charList.hashCode(); // See contract of java.util.List#haschCode
    }
    

    And voila; you have a method that can digest a string and produce an integer representing it, regardless of the order of the characters within. You can use this as the basis for determining whether two strings are anagrams of each other... but I wouldn't. You asked for a digest function that produces an Integer, but keep in mind that in java, an Integer is merely a 32-bit value. This method can only produce about 4.2-billion unique values, and there are a whole lot more than 4.2-billion strings you can throw at it. This method can produce collisions and give you nonsensical results. If that's a problem, you might want to consider using BigInteger instead.

    private static BigInteger hashString(String s) {
        BigInteger THIRTY_ONE = BigInteger.valueOf(31); // You should promote this to a class constant!
    
        if (s == null) return BigInteger.ONE; // An empty/null string will return 1.
    
        BigInteger r = BigInteger.ONE; // The value of r will be returned by this method
        List<Character> charList = new ArrayList<>(); 
        String lowercase = s.toLowerCase(); // This gets us around case sensitivity
    
        for (int i = 0; i < lowercase.length(); i++) {
            Character c = Character.valueOf(lowercase.charAt(i));
            if (Character.isWhitespace(c)) continue; // spaces don't count
            charList.add(c); // Note the character for future processing...
        }
    
        // Now we have a list of Characters... Sort it!
        Collections.sort(charList);
    
        // Calculate our bighash, similar to how java's List interface does.
        for (Character c : charList) {
            int charHash = c.hashCode();
            r=r.multiply(THIRTY_ONE).add(BigInteger.valueOf(charHash));
        }
    
        return r;
    }