algorithm, bioinformatics, suffix-tree

Infer adapter sequence from set of fragments


I have a set S of strings produced by DNA sequencing with a specific adapter fragment. Every string in S therefore ends in a suffix that approximately matches a prefix of the adapter sequence (approximately, because of sequencing errors). Given only the set S, how can I infer the most likely adapter sequence used to generate it?

The set S is very large: roughly 1 million fragments, each 50 characters long. I know that building a generalized suffix tree over S will greatly help with this problem, but I am unsure what method to use on it to find the most likely adapter sequence.


Solution

  • Maybe this will suit your needs:

    http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0164228
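
Separately from the linked paper (this is not its method), a simple baseline that avoids the suffix tree can be sketched with k-mer counting. The idea rests on an assumption: insert sequences vary from read to read, while the adapter prefix recurs in every read that overlaps it by at least k bases, so the most frequent k-mer is probably the start of the adapter. From that seed, the sequence is extended one base at a time by following the most frequent overlapping k-mer. The function name `infer_adapter` and the parameters `k`, `min_support`, and `max_len` are made up for illustration, and the default values are guesses that would need tuning on real data.

```python
from collections import Counter


def infer_adapter(reads, k=8, min_support=0.01, max_len=64):
    """Heuristic sketch: seed with the most frequent k-mer, then extend right.

    Assumes inserts are diverse and that the adapter start is the most
    common k-mer in the data. Low-complexity or highly repetitive inserts
    can break this assumption.
    """
    if k < 2:
        raise ValueError("k must be at least 2")

    # Count every k-mer in every read (exact matches only).
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    if not counts:
        return ""

    # The adapter's first k-mer occurs in every read that overlaps the
    # adapter by >= k bases, so it should be the most frequent k-mer.
    seed, seed_count = counts.most_common(1)[0]
    threshold = max(2, min_support * seed_count)

    adapter = seed
    while len(adapter) < max_len:
        tail = adapter[-(k - 1):]
        # Pick the base whose k-mer (tail + base) is seen most often.
        best_base, best_count = max(
            ((base, counts[tail + base]) for base in "ACGT"),
            key=lambda pair: pair[1],
        )
        if best_count < threshold:
            break  # support dropped off: likely past the end of the adapter
        adapter += best_base
    return adapter
```

Exact k-mer counting tolerates a moderate sequencing error rate because, for any given window, most reads are still error-free there and the erroneous variants are outvoted. With about 1 million reads of length 50, the counter sees tens of millions of k-mers, so running it on a random sample of the reads is likely sufficient. The result is only a candidate: it is worth checking it against known adapter sequences or against a tool built for this purpose.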