Does anyone know how to find subsequence patterns that occurs with mismatches?
For mismatch kernel function, it allows m mismatches. For example, 'tool' has three 3-grams ('too', 'ool'), and mismatch kernel function will count 'aoo','boo',...,'zoo','tao'...'tzo','toa'...'toz',.... when the m is 1. For specific explanation, please see http://www.cs.columbia.edu/~cleslie/cs4761/papers/string-kernel-slides.pdf Just see page 12-17
How can I write a MATLAB function that could calculate this metric?
Thanks so much.
The solution I'm going to propose is actually based on a bruteforce approach (no fancy mismatch trees).
If you're dealing with nucleotides you have just 4 symbols and that's fine, however for large alphabets and/or large values for k
it might be quite memory-inefficient.
The good news is though that it does not contain any loops and every statement is vectorized, so it works pretty fast.
First of all, I suppose you know how to select k-mers; if not, the solution proposed here (function n_gram()
by gnovice) works brilliantly.
I don't like "n-gram", I prefer "k-mers" so I'll use that hereinafter to indicate substrings of length k.
Second, let's declare some variables:
m=1; k=3;
alphabet='abcdefghijklmnopqrstuvwxyz';
kmerUT='too';
where m
is the distance (as in the slides), alphabet
is rather self-explanatory (is the set of all the possible values), kmerUT
is the k-mer under test (i.e. the k-mer whose distance we want to calculate) and k
is the length of the k-mer.
Third, let's calculate all the possible combinations of k
symbols from alphabet
:
C = cell(k, 1); %// Preallocate a cell array
[C{:}] = ndgrid(alphabet); %// Create K grids of values
combs = cellfun(@(x){x(:)}, C); %// Convert grids to column vectors
combs = sortrows([combs{:}]); %// Obtain all permutations
This snippet, in order, preallocates a cell-array with k
cells. After the second expression, in these 3 cells there will be the values from alphabet
with "different period" (1st cell: abcdefh... ; 2nd cell has 26 times a, 26 times b and so on ; 3rd cell has 2*26 times a, 2*26 times b and so on...) with "26" because they are the letters of the alphabet. The final line unwraps the cell array C
into a matrix and then sort the combinations in alphabetical order. Credits to Eithan T.
Note: if you're running low on memory after this step (which is the most expensive, in terms of memory) you can as well remove the cell array C
from main memory thanks to the command clear C
. We won't need such variable from now on.
Fourth, compare every row in combs
with the target k-mer:
fun=@(a,b) (a~=b);
comparisons=bsxfun(fun,combs,kmerUT);
so we declare a function fun
that simply returns a logical vector with 1
in position j if a
and b
are different in position j and then apply (thanks to bsxfun()
) such function to all the rows in combs
. In other words we compare every row with the k-mer under test. comparisons
will therefore be a logical matrix where ever row is the result of such comparisons.
Note: combs
is another variable which can potentially take a lot of memory. Since we won't need it from now on you can as well clear combs
.
Fifth, count the number of not-equal symbols for every tested combination:
counter=sum(comparisons,2);
having comparisons
as a matrix of 1s and 0s, one can simply sum every row to get the number of 1s per row (i.e. the number of different symbols per row).
Note: counter
is a vector whose size is card(alphabet)^k where card(alphabet) is the number of values in alphabet
, because it must contain values for all possible combinations. Such vector might as well be huge in some cases and taking into account that every item in such vector takes 8 Bytes, the whole vector might take a good amount of memory. Now if you've cleared C
and combs
it shouldn't be a problem, but just for the sake of completeness you might want to cast counter
using an integer type that takes less then 8 Bytes per item. More infos about numeric types in Matlab can be found here.
Sixth, count the number of combinations within m
mismatches from kmerUT
:
result=sum(counter<=m);