Search code examples
javadata-miningapriori

(Java) Generate all antecedents of an association rule


For example I may have a frequent itemset of characters {A, B, C, G}. I need to generate all possible antecendents of association rules. In this case: ABC, ABG, ACG, AB, AC, AG, BC, BG, CG, A, B, C, G. I have no idea where to start on doing this. Hours of research has taught me about the terminology and concept, but nothing has explained how to do this particular step. This is what I have so far for the method. The itemsets are all kept in the form of Strings, and stored together as an ArrayList. I have already made a working Apriori algorithm for the generation of frequent itemsets.

public static ArrayList<String> associationRules(ArrayList<String> data, ArrayList<String> freqItemsets, int minConf){
        ArrayList<String> generatedRules = new ArrayList<String>();
        for(int i = 0; i < freqItemsets.size(); i++) {
            String currentItemset = freqItemsets.get(i);
            if(currentItemset.length() < 2) {
                continue;
            }
            
        }
        
        
        return null; // temporary return statement to avoid compile error
    }

While code, feedback, and advice on this and later steps would of course be a huge help, all I really need is an English explanation of how to do this one step (as opposed to pseudocode or another working method using different data types). Everything else seems manageable.


Solution

  • Assuming you nailed the definition of what you actually need (all subsets that are sorted as the original list), you can do this by thinking about it as that and using those properties:

    • sorted like in your list
    • finite
    • dividable

    All you need to do is go through your character list multiple times and each time decide per chraracter, whether to include it this time or drop it. If you go through and capture all possibilities, then you are done. To do this you should find a solid way to count through the possible result strings.

    an iterative solution

    Think about possible bit-states. You have n characters and assign each character a bit (in your case 4). Then each possible bit-state defines a legal permutation of a subset, e.g. for {A, B, C, G}:

    1001 would be AG

    As we know, all possible states of a bit set are 'countable', or in other words, you can just count through them by counting from the least state to the highest by adding 1.

    Make a loop counting from 1 to 2^n - 1 (where n is the number of chars you have) and then build your String by adding (in correct sequence) all characters for which you have a 1 as their representing bit and leave out the characters with a 0. Then you 'count' through all possible legal permutations.

    Such an implementation highly depends on the programmer and their style, but for me it would look about like this:

    public static List<String> associationRules(List<String> elements) {
      List<String> result = new ArrayList<>();
      long limit = 1 << elements.size(); // thanks to saka1029 for this correction. My code was n^2 not 2^n.
    
      // count from 1 to n^2 - 1
      for (long i = 1; i < limit; ++i) {
        StringBuilder seq = new StringBuilder();
    
        // for each position (character) decide, whether to include it based on the state of the bit.
        for (int pos = 0; pos < elements.size(); ++pos) {
          boolean include = ((i >> pos) % 2) == 1; // this line will give you true, if the in 'i' the bit at 'pos' (from behind) is 1, and false otherwise.
          if (include) {
            seq.append(elements.get(pos));
          }
        }
    
        // add to the final result the newly generated String.
        result.add(seq.toString());
      }
    
      return result;
    }
    

    and the result looks like this: [A, B, AB, C, AC, BC, ABC, G, AG, BG, ABG, CG, ACG, BCG, ABCG]

    This is an iterative (non-recursive) solution, but there is also a recursive one that may (or may not) be easier to implement still.

    a recursive solution

    A recursive solution could just simply work by creating a method that takes as arguments the a set of sorted characters and a boolean state (included or not included) and returns a list of all possible sorted subpermutations. You would then call this with a public method that passes the characters and 0 as position and either true or false as initial state (the other comes later).

    The method then works with divide and conquer. You include the character at the defined position (based on whether or not the include flag is set) and call the own method again with a cloned character (sub)set that does not include the first character.

    Let's assume for the moment, that you start by not including the first character of each sequence (but later include it). If you pass to such a method the character set {A, B, C, G} then the method would (start) to operate like this:

    A: recurse on {B, C, G}
      B: recurse on {C, G}
        C: recurse on {G}
          G: set is empty,
          G: Add to the result all Strings with 'G' prefixed and without.
          G: return {"G", ""}
        C: Add to the result all Strings with 'C' prefixed and without.
        C: {"CG", "C", "G", ""}
        ...
    

    This way, you will collect all sorted subset permutations recursively. Depending on whether the empty String is allowed, you can remove that at the end, or not add it at all.

    I implemented it like this, but there are other correct ways:

    public static List<String> associationRules2(List<String> elements) {
        List<String> result = new ArrayList<>();
        String thisElement = elements.get(0);
        
        // build the subset list (leaving out the first element
        List<String> remaining = new ArrayList<>();
        boolean first = true;
        for (String s : elements) {
            if (first) {
                first = false;
            } else {
                remaining.add(s);
            }
        }
        
        // if the subset is not empty, we recurse.
        if (! remaining.isEmpty()) {
            List<String> subPermutations = associationRules2(remaining);
            
            // add all permutations without thisElement.
            result.addAll(subPermutations);
            
            // add all permutations *with* thisElement.
            for (String s : subPermutations) {
                result.add(thisElement + s);
            }
        }
        
        // finally add thisElement on it's own.
        result.add(thisElement);
        
        return result;
    }
    

    Result: [G, CG, C, BG, BCG, BC, B, AG, ACG, AC, ABG, ABCG, ABC, AB, A]