python regex bioinformatics dna-sequence

Optimize python pattern matching in nucleotide sequences

I'm currently working on a bioinformatic and modelling project where I need to do some pattern matching. Let's say I have a DNA fragment as follow 'atggcgtatagagc' and I split that fragment in micro-sequences of 8 nucleotides so that I have :

'atggcgta' 'tggcgtat' 'ggcgtata' 'gcgtatag' 'cgtataga' 'gtatagag' 'tatagagc'

And for each of these fragment I want to search in a whole genome and per chromosome the number of time they appear and the positions (starting positions) of the matches.

Here is how my code looks like :

you can download the genome fasta file here :

drive to the fasta file

import re
from Bio.SeqIO.FastaIO import FastaIterator
from Bio.Seq import Seq


def reverse_complement(sequence: str) -> str:
    my_sequence = Seq(sequence)
    return str(my_sequence.reverse_complement())



# you will need to unzip the file ant change the path below according to your working directory 
path = '../data/Genome_S288c.fa'
genome = open(path, "r")
chr_sequences = {}
for record in FastaIterator(genome):
    chr_id = record.id
    seq = str(record.seq).lower()
    rc_seq = reverse_complement(seq)
    chr_sequences[chr_id] = {'5to3': seq, '3to5': rc_seq}
genome.close()


sequences = 'ATGACTAACGAAAAGGTCTGGATAGAGAAGTTGGATAATCCAACTCTTTCAGTGTTACCACATGACTTTTTACGCCCACAATCTTTAT'.lower()
micro_size = 8
micro_sequences = []
start = micro_size - 1
for i in range(start, len(sequences), 1):
    current_micro_seq = sequences[i - start:i + 1]
    micro_sequences.append(current_micro_seq)

genome_count = 0
chr_count = {}
chr_locations = {}
micro_fragment_stats = {}
for ii_micro, micro_seq in enumerate(micro_sequences):
    for chr_idx in list(chr_sequences.keys()):
        chr_counter = 0
        seq = chr_sequences[chr_idx]['5to3']
        pos = [m.start() for m in re.finditer(pattern=r'(?=(' + micro_seq + '))', string=seq)]

        rc_seq = chr_sequences[chr_idx]['3to5']
        rc_pos = [m.start() for m in re.finditer(pattern=r'(?=(' + micro_seq + '))', string=rc_seq)]

        chr_locations[chr] = {'5to3': pos, '3to5': rc_pos}
        chr_counter += len(pos) + len(rc_pos)
        chr_count[chr_idx] = chr_counter
        genome_count += chr_counter

    micro_fragment_stats[ii_micro] = {'occurrences genome': genome_count,
                                      'occurrences chromosomes': chr_count,
                                      'locations chromosomes': chr_locations}

Actually my fragment is something like 2000bp long, so I took about 1 hour to compute all the micro-sequences. \

By the way, I use the r'(?=('+self.sequence+'))' to avoid the case of pattern that overlaps itself in the sequence, for instance :

pattern = 'aaggaaaaa' 
string = 'aaggaaaaaggaaaaa' 

expected output : (0, 7)

I am looking for a more efficient regex method that I can use for my case (in python if possible).

Thanks in advance

Solution

I've been playing with this question for a while and I end up with some ideas. The algorithm is mainly divided in two parts: k-mer generation and k-mer searching in the reference.

For the k-mer generation part, I can see that your algorithm is quick, but it generates duplicates (that you have to filter afterwards when generating the dictionary). My approach has been to generate a deduplicated list directly. In my sample code I also modified your method to perform the deduplication at the same time, so you can avoid doing it later and, more important, allows for a fair time comparison with my approach. You will see that using a set to keep the kmers offers us free deduplication, and is faster than using a list, as it has not to be traversed.

For the search of the kmer in the reference, given that you were doing exact searches, using a regex is overkill. It's far more cheaper to do a standard search. In this code, I used the methods provided by the Seq class: find and index. The idea is to find the first occurrence starting from the beginning, and the repeat the search starting with the next position after the last index found (if you want to avoid overlaps, then start after the last position found plus the k-mer size).

The code generated follows:

import re
from pathlib import Path
from timeit import timeit

from Bio.Seq import Seq
from Bio.SeqIO.FastaIO import FastaIterator


def reverse_complement(sequence: Seq) -> Seq:
    return sequence.reverse_complement()


def generate_kmers(sequence: Seq, kmer_size: int) -> set[Seq]:
    return {
        Seq(sequence[i : i + kmer_size]) for i in range(len(sequence) - kmer_size + 1)
    }


def generate_kmers_original(sequence: Seq, kmer_size: int) -> list[Seq]:
    kmers: list[Seq] = []
    start = kmer_size - 1
    for i in range(start, len(sequence), 1):
        current_micro_seq = Seq(sequence[i - start : i + 1])
        # We had to add this check to avoid the duplication of k-mers
        if current_micro_seq not in kmers:
            kmers.append(current_micro_seq)

    return kmers


def load_fasta(fasta_file: str) -> dict[str, dict[str, Seq]]:
    fasta_dict: dict[str, dict[str, Seq]] = {}

    with Path(fasta_file).open("r", encoding="UTF-8") as genome:
        for record in FastaIterator(genome):
            seq = record.seq.lower()
            fasta_dict[record.id] = {"5to3": seq, "3to5": reverse_complement(seq)}

    return fasta_dict


if __name__ == "__main__":
    # Load the big fasta file
    chr_sequences = load_fasta(
        ".../Saccharomyces_cerevisiae/S288c_R64/fasta/scerevisiae.S288c_R64.fasta"
    )

    # Generate the micro-sequences
    micro_size = 8
    sequences = Seq(
        "ATGACTAACGAAAAGGTCTGGATAGAGAAGTTGGATAATCCAACTCTTTCAGTGTTACCACATGACTTTTTACGCCCACAATCTTTAT"
    ).lower()

    micro_sequences = generate_kmers(sequences, micro_size)

    # k-mer generation benchmark
    test_size = 1000
    kmer_generation_time = timeit(
        "generate_kmers(sequences, micro_size)", number=test_size, globals=globals()
    )
    kmer_generation_original_time = timeit(
        "generate_kmers_original(sequences, micro_size)",
        number=test_size,
        globals=globals(),
    )

    print(f"New k-mer generation time     : {kmer_generation_time}")
    print(f"Original k-mer generation time: {kmer_generation_original_time}")

    print(f"There are {len(micro_sequences)} k-mers")

    # Search for the kmers in the reference
    def find_kmers_original(sequence: Seq, kmer: Seq) -> list[int]:
        positions = [
            m.start()
            for m in re.finditer(
                pattern=r"(?=(" + str(kmer) + "))", string=str(sequence)
            )
        ]

        return positions

    def find_kmers_find(sequence: Seq, kmer: Seq) -> list[int]:
        current = 0
        positions: list[int] = []
        while current < len(sequence):
            index = sequence.find(kmer, current)
            if index == -1:
                break

            positions.append(index)
            current = index + 1

        return positions

    def find_kmers_index(sequence: Seq, kmer: Seq) -> list[int]:
        positions: list[int] = []

        current = 0
        try:
            while True:
                index = sequence.index(kmer, current)

                positions.append(index)
                current = index + 1
        except ValueError:
            # Exception thrown when the kmer is not found
            # This is our exit condition
            pass

        return positions

    # k-mer search benchmark
    test_size = 1000
    haystack = next(iter(chr_sequences.values()))["5to3"]
    needle = next(iter(micro_sequences))
    search_original_time = timeit(
        "find_kmers_original(haystack, needle)",
        number=test_size,
        globals=globals(),
    )
    search_find_time = timeit(
        "find_kmers_find(haystack, needle)",
        number=test_size,
        globals=globals(),
    )
    search_index_time = timeit(
        "find_kmers_index(haystack, needle)",
        number=test_size,
        globals=globals(),
    )

    print(f"Search with original time: {search_original_time}")
    print(f"Search with find time    : {search_find_time}")
    print(f"Search with index time   : {search_index_time}")

    # Actual calculus
    genome_count = 0
    chr_count: dict[str, int] = {}
    chr_locations: dict[str, dict[str, list[int]]] = {}
    micro_fragment_stats: dict[
        int, dict[str, int | dict[str, int] | dict[str, dict[str, list[int]]]]
    ] = {}
    for ii_micro, micro_seq in enumerate(micro_sequences):
        for chr_counter, (chromosome, contents) in enumerate(chr_sequences.items()):
            pos = find_kmers_find(contents["5to3"], micro_seq)
            rc_pos = find_kmers_find(contents["3to5"], micro_seq)

            chr_locations[chromosome] = {"5to3": pos, "3to5": rc_pos}
            chr_counter += len(pos) + len(rc_pos)
            chr_count[chromosome] = chr_counter
            genome_count += chr_counter

        micro_fragment_stats[ii_micro] = {
            "occurrences genome": genome_count,
            "occurrences chromosomes": chr_count,
            "locations chromosomes": chr_locations,
        }

The output of this toy example is:

New k-mer generation time     : 0.6696164240129292
Original k-mer generation time: 5.967410315992311
There are 81 k-mers
Search with original time: 3.1360475399997085
Search with find time    : 0.5738343889825046
Search with index time   : 0.5662875371053815

You can see that the k-mer generation is 9x faster and the search without the regex is around 5.5x faster.

In general, you will be better taking advantage of comprehensions and built-in data types (like the sets used here). And using the more simple approach also helps with performance. Regexes are powerful, but they need their time; if they are not required, better to avoid them. Specially in loops, where every small performance change is amplified.

Besides all of this benchmarking, you can also try to add the approach introduced by @Ghothi where the long and short sequences are exchanged. Maybe it could lead to some further improvement.

As a side note, Seq.find and Seq.index seems to offer the same performance, but I find it cleaner and more elegant the Seq.index version: you don't need a weird value to test against and the code intent is clearer. Also, the performance is slightly better, as it is avoiding a comparison in the loop, but this is a very minor improvement.