Search code examples
stringalgorithmdata-structurescombinationsenumeration

How to calculate the number of neighbors of a string with exact and at most d mismatches?


Given a string, and a set of four alphabets (A, B, C, D) for generating strings of length n. I need a generalized mathematical formula to calculate the number of neighbors for any string of length n with at most d mismatches, and the number of neighbors with exactly d mismatches.

For example: Given a string=”AAA” and d=3

  • We have 9 Strings with exactly d=1
    BAA  
    CAA  
    DAA  
    ABA  
    ACA  
    ADA  
    AAB   
    AAC   
    AAD  
  • We have 27 Strings with exactly d=2
BBA   BCA   BDA   
BAB   BAC   BAD
CBA   CCA   CDA   
CAB   CAC   CAD
DBA   DCA   DDA   
DAB   DAC   DAD
ABB   ABC   ABD
ACB   ACC   ACD
ADB   ADC   ADD
  • We have 27 Strings with exactly d=3
BBB   CBB   DBB   
BCB   CCB   DCB
BDB   CDB   DDB   
BBC   CBC   DBC
BCC   CCC   DCC    
BDC   CDC   DDC   
BBD   CBD   DBD    
BCD   CCD   DCD   
BDD   CDD   DDD
  • Number of Strings with at most d=3 are 9+27+27=63 strings

Solution

  • Let's consider a string of size n.
    We want to know how many 'neighbors' this string has, with a distance d. The first thing we remark, with your definition of 'distance', is that it means that we must choose d characters among the n of the string and modify them. So there are n choose d possible combinations of charactersto modify.
    Each of these can be modified in 3 different manners (since the size of the alphabet is 4.
    So ultimately, we have:

    • n choose d possible combinations of characters that will be modified
    • d characters will be modified, and each of them can be modified in 3different manners.

    So the formula is ultimately (s - 1) ^ d * (n choose d), where s is the size of the alphabet (here 4). I let you verify that it fits the first examples you provided.