Search code examples
arraysalgorithmlanguage-agnosticpattern-recognitionsub-array

identify recurring/duplicate patterns as sub-arrays from a parent array


I have a typical pattern searching problem where I need to identify where multiple patterns are appearing within an array and single them out.

ex: ['horse', 'camel', 'horse', 'camel', 'tiger', 'horse', 'camel', 'horse', 'camel']

function should return

['horse', 'camel'], 
['horse', 'camel', 'horse'],
['camel', 'horse', 'camel'],
['horse', 'camel', 'horse', 'camel']

i.e. finding patterns that are repeating within an array which can become a sub-array,

Or the other way of defining is -> Find all the sub-arrays which are occurring more than 1 times in main array.

i.e. resulting arrays should have length > 1 ->

[1, 2, 3, 1, 2, 1, 4, 5] => [1,2,3] and [1,4,5] both are sub-arrays but [1,2,3] is recurring/repeating sub-array NOT [1,4,5]

Looking for a suitable efficient algorithm instead of brute-force looping solutions.


Solution

  • This probably isn't what you want but I don't know what you have tried yet so maybe it could be useful. Here's my direct approach which probably falls under your "brute-force looping solutions" but I figured give it a try since nobody has posted full answer.

    In java:

    // use this to not add duplicates to list
    static boolean contains (List<String[]> patterns, String[] pattern){
        for(String[] s: patterns)
            if (Arrays.equals(pattern,s)) return true;
        return false;
    }
    
    
    /**
     *
     * @param str String array containing all elements in your set
     * @param start index of subarray
     * @param end index of subarray
     * @return if subarray is a recurring pattern
     */
    static boolean search (String[] str,int start,int end) {
        // length of pattern
        int len = end - start + 1;
    
        // how many times you want pattern to
        // appear in text
        int n = 1;
    
        // increment m if pattern is matched
        int m = 0;
    
        // shift pattern down the array
        for (int i = end+1; i <= str.length - len; i++) {
            int j;
            for (j = 0; j < len; j++) {
                if (!str[i + j].equals(str[start + j]))
                    break;
            }
    
            // if pattern is matched at [i to i+len]
            if (j == len) {
                m++;
                if (m == n) return true;
            }
        }
        return false;
    }
    
    
    /**
     *
     * @param str String array containing all elements in your set
     * @return a list of subsets of input set which are a recurring pattern
     */
    static List<String[]> g (String[] str) {
        // put patterns in here
        List<String[]> patterns = new ArrayList<>();
    
        // iterate through all possible subarrays in str
        for(int i = 0; i < str.length-1; i++){
            for(int j = i + 1; j < str.length; j++){
    
                // if a pattern is found
                if (search(str,i,j)) {
                    int len = j-i+1;
                    String[] subarray = new String[len];
                    System.arraycopy(str,i,subarray,0,len);
                    if (!contains(patterns,subarray))
                        patterns.add(subarray);
    
                }
            }
        }
        return patterns;
    }
    
    public static void main(String[] args) {
    
        String[] str = {"horse", "camel", "horse", "camel", "tiger",
                        "horse", "camel", "horse", "camel"};
        // print out
        List<String[]> patterns = g(str);
        for (String[] s: patterns)
            System.out.println(Arrays.toString(s));
    }
    

    Output:

    [horse, camel]
    [horse, camel, horse]
    [horse, camel, horse, camel]
    [camel, horse]
    [camel, horse, camel]
    

    As mentioned in a comment i posted:

    "would [camel, horse] be included in the output?"

    The output I have goes with this as there are 2 instances of [camel, horse] at indices [1-2] and [6-7]. But maybe I am completely misunderstanding your question and I'm not understanding the constraints.

    As for optimizing, the search(...) method for example is just a simple substring search there are some more optimized ways of doing this e.g. Knuth–Morris–Pratt. Sorry if this was exactly what you didn't want but maybe there's some use