I have a set S of strings generated from DNA sequencing using a specific adapter fragment. This means that all the strings in S contain a suffix that approximately matches (due to sequencing errors) a prefix of the adapter sequence. How can I, given only the set S, infer the most likely adapter sequence used to generate S?
The set S is very large: roughly 1 million fragments, each 50 characters long. I know that building a generalized suffix tree over S would help greatly with this problem, but I am unsure what method to use to find the most likely adapter sequence.
Maybe this will suit your needs:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0164228
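If you want something quick to try first, here is a minimal sketch of a simple k-mer heuristic. To be clear, this is not the method from the linked paper, and it does not use a suffix tree. It assumes the adapter prefix is by far the most over-represented substring in S, takes the most frequent k-mer as a seed, and greedily extends it to the right one base at a time. The function names and the parameters `k`, `min_count` and `max_len` are my own illustrative choices, not taken from any library.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in any read."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def infer_adapter_prefix(reads, k=8, min_count=3, max_len=40):
    """Greedy guess at the adapter prefix (heuristic sketch, k >= 2).

    Assumption: adapter k-mers are far more frequent than k-mers from
    the sequenced inserts, so the most frequent k-mer lies at or near
    the start of the adapter.
    """
    counts = kmer_counts(reads, k)
    if not counts:
        return ""
    seed, _ = counts.most_common(1)[0]
    adapter = seed
    seen = {seed}  # k-mers already used, to avoid looping on repeats
    while len(adapter) < max_len:
        tail = adapter[-(k - 1):]
        # Pick the base whose resulting k-mer has the most support.
        best_base = max("ACGT", key=lambda b: counts[tail + b])
        nxt = tail + best_base
        if counts[nxt] < min_count or nxt in seen:
            break  # support ran out, or we are about to cycle
        adapter += best_base
        seen.add(nxt)
    return adapter
```

Calling `infer_adapter_prefix(S)` on your list of reads returns a candidate string. Because only reads with a long adapter overlap contain the later adapter bases, support drops toward the end, so expect to recover a prefix of the adapter rather than the whole thing. Sequencing errors mostly just lower the counts of the true k-mers, but low-complexity inserts (poly-A runs, for example) can hijack the seed, so sanity-check the result against known adapter sequences.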