algorithm substring time-complexity binary-indexed-tree

Quick Way of Finding How many Substrings has first and last character repeated inside

This is a problem about substrings that I created. I am wondering how to implement an O(nlog(n)) solution to this problem because the naive approach is pretty easy. Here is how it goes. You have a string S. S has many substrings. In some substrings, the first character and last character are there more than once. Find how many substrings where the first and last character are there more than once.

Input: "ABCDCBE"
Expected output: 2
Explanation: "BCDCB" and "CDC" are two such substrings

That test case explanation only has "BCDCB" and "CDC" where first and last char are same.

There can be another case aside from the sample case with "ABABCAC" being the substring where the first character "A" appears 3 times and the last character "C" appears twice. "AAAABB" is also another substring.

"AAAAB" does not satisfy.

What I have learned that is O(nlog(n)) that might or might not contribute to solution is Binary Indexed Trees. Binary Indexed Trees can somehow be used to solve this. There is also sorting and binary search, but first I want to focus especially on Binary Indexed Trees.

I am looking for a space complexity of O(n log(n)) or better.

Also Characters are in UTF-16

Solution

The gist of my solution is as follows:

Iterate over the input array, and, for each position, compute the amount of 'valid' substrings that end on that position. The sum of these values is the total amount of valid substrings. We achieve this by counting the amount of valid starts to a substring, that come before the current position, using a Binary Indexed Tree.

Now for the full detail:

As we iterate over the array we think of the current element as the end of a substring, and we say that the positions that are a valid start are those such that its value appears again between it, and the position we are currently iterating over. (i.e. if the value at the start of a substring appears at least twice in it)

For example:

current index              V
data  = [1, 2, 3, 4, 1, 4, 3, 2]
valid = [1, 0, 1, 1, 0, 0, 0, 0]
         0  1  2  3  4  5  6  7

The first 1 (at index 0) is a valid start, because there is another 1 (at index 4) after it, but before the current index (index 6).

Now, counting the amount of valid starts that come before the current index gives us something pretty close to what we wanted, except that we may grab some substrings that don't have two appearances of the last value of the substring (i.e. the one we are currently iterating over)

For example:

current index              V
data  = [1, 2, 3, 4, 1, 4, 3, 2]
valid = [1, 0, 1, 1, 0, 0, 0, 0]
         0  1  2  3  4  5  6  7
                  ^--------^

Here, the 4 is marked as a valid start (because there is another 4 that comes after it), but the corresponding substring does not have two 3s.

To fix this, we shall only consider valid starts up to the previous appearance of the current value. (this means that the substring will contain both the current value, and its previous appearance, thus, the last element will be in the substring at least twice)

The pseudocode goes as follows:

fn solve(arr) {
  answer := 0
  for i from 1 to length(arr) {
    previous_index := find_previous(arr, i)

    if there is a previous_index {
      arr[previous_index].is_valid_start = true
      answer += count_valid_starts_up_to_and_including(arr, previous_index)
    }
  }
  return answer
}

To implement these operations efficiently, we use a hash table for looking up the previous position of a value, and a Binary Indexed Tree (BIT) to keep track of and count the valid positions.

Thus, a more fleshed out pseudocode would look like

fn solve(arr) {
  n := length(arr)

  prev := hash_table{}
  bit  := bit_indexed_tree{length = n}

  answer := 0
  for i from 1 to length(arr) {
    value := arr[i]
    previous_index := prev[value]

    if there is a previous_index {
      bit.update(previous_index, 1)
      answer += bit.query(previous_index)
    }

    prev[value] = i
  }
  return answer
}

Finally, since a pseudocode is not always enough, here is an implementation in C++, where the control flow is a bit munged, to ensure efficient usage of std::unordered_map (C++'s built-in hash table)

class Bit { 
    std::vector<int> m_data;
public:
    // initialize BIT of size `n` with all 0s
    Bit(int n);

    // add `value` to index `i`
    void update(int i, int value);

    // sum from index 0 to index `i` (inclusive)
    int query(int i);
};

long long solve (std::vector<int> const& arr) {
    int const n = arr.size();

    std::unordered_map<int, int> prev_index;
    Bit bit(n);

    long long answer = 0;
    int i = 0;
    for (int value : arr) {

        auto insert_result = prev_index.insert({value, i});
        if (!insert_result.second) { // there is a previous index
            int j = insert_result.first->second;

            bit.update(j, 1);
            answer += bit.query(j);

            insert_result.first->second = i;
        }

        ++i;
    }

    return answer;
}

EDIT: For transparency, here is the Fenwick tree implementation i used to test this code

struct Bit {
    std::vector<int> m_data;
    Bit(int n) : m_data(n+2, 0) { }
    int query(int i) {
        int res = 0;
        for(++i; i > 0; i -= i&-i) res += m_data[i];
        return res;
    }
    void update(int i, int x) {
        for(++i; i < m_data.size(); i += i&-i) m_data[i] += x;
    }
};