Filter strings with high proportion of lowercase letters

I have a large DataFrame (50+ million rows) in which one column holds DNA sequences, one per row. Some of these sequences contain a mix of lowercase and uppercase letters. I want to keep only the sequences that are at least 50% uppercase, i.e. drop those where more than half of the letters are lowercase. On a small subset of the DataFrame, the filtering alone took about 2 minutes, so I am looking for a more efficient approach that scales up.

Example of my DF:

label    sequence
1        aaaggGtTt...
0        AAAggccCCC...

Here is the function I am using.

def remove_low_complexity_seqs(sequence, threshold=0.5):
    """
    Check if more than a given threshold proportion of the sequence is lowercase (low complexity).

    Args:
    - sequence (str): The nucleotide sequence.
    - threshold (float): The proportion threshold (default is 0.5 for 50%).

    Returns:
    - bool: True if more than threshold proportion is lowercase, otherwise False.
    """
    lowercase_count = sum(map(str.islower, sequence))
    proportion = lowercase_count / 10000  # 10k is the length of all seqs

    return proportion > threshold

Code I ran:

# mask = control_seqs['sequence'].apply(lambda seq: not remove_low_complexity_seqs(seq, context)) # long runtime 115secs
# control_seqs = control_seqs[mask] # quick runtime


  • Assuming the sequences contain only the letters "acgtACGT", these ways of computing lowercase_count seem 10-30 times faster than the original (version 4 being the fastest):

    Version 1:

    lowercase_count = sum(map(sequence.count, 'acgt'))

    Version 2:

    lowercase_count = sum(map(sequence.encode().count, b'acgt'))

    Version 3, with lower_to_a = bytes.maketrans(b'cgt', b'aaa') prepared once, before your function:

    lowercase_count = sequence.encode().translate(lower_to_a).count(b'a')

    Version 4, with del_upper = str.maketrans('', '', 'ACGT') prepared once, before your function:

    lowercase_count = len(sequence.translate(del_upper))
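
For completeness, here is a minimal sketch of how version 4 might slot into a full predicate. The name is_mostly_lowercase and the sample list are hypothetical, and the sketch divides by len(sequence) instead of the hard-coded 10000, which is an assumption on my part so that the short sample strings work. It still relies on the sequences being non-empty and containing only "acgtACGT".

```python
# Translation table that deletes the uppercase bases; built once and reused.
DEL_UPPER = str.maketrans('', '', 'ACGT')


def is_mostly_lowercase(sequence, threshold=0.5):
    """Return True if more than `threshold` of the sequence is lowercase.

    Assumes a non-empty sequence made up only of the letters acgtACGT.
    """
    # After deleting A, C, G and T, only the lowercase letters remain.
    lowercase_count = len(sequence.translate(DEL_UPPER))
    # Assumption: divide by the actual length rather than a fixed 10000.
    return lowercase_count / len(sequence) > threshold


# Hypothetical sample data standing in for the 'sequence' column.
sequences = ["aaaggGtTt", "AAAggccCCC", "ACGTacgt"]
kept = [seq for seq in sequences if not is_mostly_lowercase(seq)]
print(kept)
```

With pandas, the same predicate could presumably be applied as control_seqs[~control_seqs['sequence'].map(is_mostly_lowercase)], though I have not timed that on the full dataset.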