Search code examples
matlabmismatchsubsequence

How to implement a mismatch kernel function in MATLAB?


Does anyone know how to find subsequence patterns that occurs with mismatches?

For mismatch kernel function, it allows m mismatches. For example, 'tool' has three 3-grams ('too', 'ool'), and mismatch kernel function will count 'aoo','boo',...,'zoo','tao'...'tzo','toa'...'toz',.... when the m is 1. For specific explanation, please see http://www.cs.columbia.edu/~cleslie/cs4761/papers/string-kernel-slides.pdf Just see page 12-17

How can I write a MATLAB function that could calculate this metric?

Thanks so much.


Solution

  • The solution I'm going to propose is actually based on a bruteforce approach (no fancy mismatch trees).
    If you're dealing with nucleotides you have just 4 symbols and that's fine, however for large alphabets and/or large values for k it might be quite memory-inefficient.
    The good news is though that it does not contain any loops and every statement is vectorized, so it works pretty fast.

    First of all, I suppose you know how to select k-mers; if not, the solution proposed here (function n_gram() by gnovice) works brilliantly.
    I don't like "n-gram", I prefer "k-mers" so I'll use that hereinafter to indicate substrings of length k.

    Second, let's declare some variables:

    m=1; k=3;
    alphabet='abcdefghijklmnopqrstuvwxyz';
    kmerUT='too';
    

    where m is the distance (as in the slides), alphabet is rather self-explanatory (is the set of all the possible values), kmerUT is the k-mer under test (i.e. the k-mer whose distance we want to calculate) and k is the length of the k-mer.

    Third, let's calculate all the possible combinations of k symbols from alphabet:

    C = cell(k, 1);                     %// Preallocate a cell array
    [C{:}] = ndgrid(alphabet);          %// Create K grids of values
    combs = cellfun(@(x){x(:)}, C);     %// Convert grids to column vectors
    combs = sortrows([combs{:}]);       %// Obtain all permutations
    

    This snippet, in order, preallocates a cell-array with k cells. After the second expression, in these 3 cells there will be the values from alphabet with "different period" (1st cell: abcdefh... ; 2nd cell has 26 times a, 26 times b and so on ; 3rd cell has 2*26 times a, 2*26 times b and so on...) with "26" because they are the letters of the alphabet. The final line unwraps the cell array C into a matrix and then sort the combinations in alphabetical order. Credits to Eithan T.
    Note: if you're running low on memory after this step (which is the most expensive, in terms of memory) you can as well remove the cell array C from main memory thanks to the command clear C. We won't need such variable from now on.

    Fourth, compare every row in combs with the target k-mer:

    fun=@(a,b) (a~=b);
    comparisons=bsxfun(fun,combs,kmerUT);
    

    so we declare a function fun that simply returns a logical vector with 1 in position j if a and b are different in position j and then apply (thanks to bsxfun()) such function to all the rows in combs. In other words we compare every row with the k-mer under test. comparisons will therefore be a logical matrix where ever row is the result of such comparisons.
    Note: combs is another variable which can potentially take a lot of memory. Since we won't need it from now on you can as well clear combs.

    Fifth, count the number of not-equal symbols for every tested combination:

    counter=sum(comparisons,2);
    

    having comparisons as a matrix of 1s and 0s, one can simply sum every row to get the number of 1s per row (i.e. the number of different symbols per row).
    Note: counteris a vector whose size is card(alphabet)^k where card(alphabet) is the number of values in alphabet, because it must contain values for all possible combinations. Such vector might as well be huge in some cases and taking into account that every item in such vector takes 8 Bytes, the whole vector might take a good amount of memory. Now if you've cleared C and combs it shouldn't be a problem, but just for the sake of completeness you might want to cast counter using an integer type that takes less then 8 Bytes per item. More infos about numeric types in Matlab can be found here.

    Sixth, count the number of combinations within m mismatches from kmerUT:

    result=sum(counter<=m);