
Randomly changing letters in list of string based on probability


Given the following

data = ['AAACGGGATT\n','CTGTGTCAGT\n','AATCTCTACT\n']

For every letter in a string (not including \n), I'd like a 50% chance of a change (i.e. a probability of 0.5). When a change happens, the letter should be replaced with a randomly selected letter from the options (A, G, T, C), with the caveat that the replacement cannot be the same as the original.

This is what I've attempted thus far:

import random

def Applyerrors(string, string_length, probability):
    i=0
    while i < string_length:
        i = i + 1
        p = i/string_length
        if p > probability:
            new_var = string[i]
            options = ['A', 'G', 'T', 'C'] 
            [item.replace(new_var, '') for item in options]
            replacer = random.choice(options)
            [res.replace(new_var, replacer) for res in string]
        else:
            pass
        
# Testing
data_updated = [Applyerrors(unit, 10, 0.5) for unit in data]
data_updated

The result from this:

[None, None, None]

In addition to not getting the desired result, my probability doesn't make sense as I'm hoping to achieve 50% overall change in the data_updated file.

Any insight would be greatly appreciated. Thanks.


Solution

  • The Problem

    There are a few problems that I can see right away.

    1. You are not returning anything from Applyerrors, so every value produced by [Applyerrors(unit, 10, 0.5) for unit in data] will be None.

    2. When you do [res.replace(new_var, replacer) for res in string] you are replacing every instance of a letter with another rather than only the letter at index i, so there would be a change 50% of the time, but the change would cover more than 50% of the data.

    3. [item.replace(new_var, '') for item in options] does not remove the letter from options. The comprehension builds a new list, in which the letter becomes the empty string (''), and that list is then discarded. options itself is unchanged, so random.choice can still return the original letter.

    4. You increment i before using it, so it will always skip the first character of the string.

    5. You don't do a check to avoid changing the newline character.
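
Points 1 to 3 share a root cause that a short sketch may make clearer: Python strings are immutable, so str.replace returns a new string, and a list comprehension whose result is never assigned is simply thrown away. The names discarded and no_return below are illustrative only and are not part of the original code.

```python
options = ['A', 'G', 'T', 'C']
new_var = 'A'

# The comprehension builds a new list; options itself is not modified.
discarded = [item.replace(new_var, '') for item in options]

print(options)    # ['A', 'G', 'T', 'C']
print(discarded)  # ['', 'G', 'T', 'C']

def no_return(s):
    # str.replace returns a new string, which is dropped here
    s.replace('A', 'G')

result = no_return('AAAC')
print(result)     # None, because the function has no return statement
```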


    The Solution

    import random

    def Applyerrors(string, string_length, probability):
        i = 0
        while i < string_length:
            if string[i] == "\n":
                i = i + 1  # advance first, otherwise continue would loop forever
                continue
            p = i/string_length
            if p > probability:
                new_var = string[i]
                options = ['A', 'G', 'T', 'C']
                options.remove(new_var)
                replacer = random.choice(options)
                # Rebuild the string with only the character at index i swapped
                string = string[:i] + replacer + string[i+1:]
            i = i + 1
        return string
    
    1. return string: when Applyerrors is done, it returns the edited string.

    2. string = string[:i] + replacer + string[i+1:] replaces just the character at index i. (string[:i] is everything from index 0 up to, but not including, i. string[i+1:] is everything after index i to the end of the string.)

    3. options.remove(new_var) removes the value from the list entirely rather than replacing it with an empty string.

    4. i = i+1 was moved to the end of the loop to allow for i to be 0 for an iteration to include the first value.

    5. if string[i] == "\n" adds a check to skip any newline characters. (continue skips the rest of the current iteration of the loop.)
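
As a small illustration of point 2, the sketch below compares slicing with str.replace on one of the sample strings. The index and replacement letter are arbitrary choices for the example.

```python
s = 'CTGTGTCAGT\n'
i = 0
replacer = 'A'

# Slicing swaps only the character at index i
by_slice = s[:i] + replacer + s[i+1:]
print(repr(by_slice))    # 'ATGTGTCAGT\n'

# str.replace swaps every occurrence of that letter
by_replace = s.replace(s[i], replacer)
print(repr(by_replace))  # 'ATGTGTAAGT\n'
```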


    The Solution Continued

    While none of the following changes are necessary, I have recreated the function below to show some best practices.

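    import random

    def apply_errors(data, prob):
        for i in range(len(data)):
            options = ['A', 'G', 'T', 'C']
            current = data[i]
            # Skip non-bases (such as \n), and skip when the random draw exceeds prob
            if current not in options or random.random() > prob:
                continue
            options.remove(current)  # the replacement must differ from the original
            data = data[:i] + random.choice(options) + data[i+1:]
        return data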
    
    • for i in range(len(data)) removes the need for inputting the length of the string.
    • if current not in options or random.random() > prob
      • current not in options checks if the value is not present in options (so now it will ignore \n and anything else that shouldn't be changed)
      • random.random() > prob skips the rest of the iteration when a random number is greater than prob (random.random() returns a random float in the range [0, 1)). This makes the input probability represent the chance that a value is changed. The way you have it currently, the input probability is instead the fraction of the string that is not changed.
    • random.random() > prob makes it so that any value has the same probability of changing. The old version guarantees that the last 50% (when probability = 0.5) of the string will change.
    • string is renamed to data. It's not great to have a variable named the same as a data type (despite strings being represented as str in Python), and string is also the name of a standard library module.
    • The new function name apply_errors uses snake case, which is the preferred style for naming variables and functions in Python.
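
A possible way to try this on the data from the question, and to check that roughly half of the letters change overall, is sketched below. apply_errors is repeated so the snippet runs on its own. The changed_fraction helper, the seed value, and the repetition count of 1000 are assumptions made only for this illustration.

```python
import random

def apply_errors(data, prob):
    # Same logic as the function above, repeated so this sketch is self-contained
    for i in range(len(data)):
        options = ['A', 'G', 'T', 'C']
        current = data[i]
        if current not in options or random.random() > prob:
            continue
        options.remove(current)
        data = data[:i] + random.choice(options) + data[i+1:]
    return data

def changed_fraction(before, after):
    # Fraction of A/G/T/C positions that differ between two lists of strings
    pairs = [(a, b) for s, t in zip(before, after)
             for a, b in zip(s, t) if a in 'AGTC']
    return sum(a != b for a, b in pairs) / len(pairs)

data = ['AAACGGGATT\n', 'CTGTGTCAGT\n', 'AATCTCTACT\n']

random.seed(0)  # arbitrary seed, only so that a run can be repeated
data_updated = [apply_errors(unit, 0.5) for unit in data]
print(data_updated)

# With only 30 letters the changed fraction can stray well away from 0.5,
# so repeat the data to get a more stable estimate.
many = data * 1000
many_updated = [apply_errors(unit, 0.5) for unit in many]
print(changed_fraction(many, many_updated))  # should land close to 0.5
```

Each letter is changed independently, so the overall share of changed letters is only approximately 50%, not exactly 50%.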