Given the following
data = ['AAACGGGATT\n','CTGTGTCAGT\n','AATCTCTACT\n']
For every letter in a string (not including \n), if its probability is greater than 0.5 (i.e. there is a 50% chance of a change), I'd like to replace the letter with a randomly selected letter from the options (A, G, T, C), with the caveat that the replacement cannot be the same as the original.
This is what I've attempted thus far:
    import random
    def Applyerrors(string, string_length, probability):
        i=0
        while i < string_length:
            i = i + 1
            p = i/string_length
            if p > probability:
                new_var = string[i]
                options = ['A', 'G', 'T', 'C']
                [item.replace(new_var, '') for item in options]
                replacer = random.choice(options)
                [res.replace(new_var, replacer) for res in string]
            else:
                pass
    # Testing
    data_updated = [Applyerrors(unit, 10, 0.5) for unit in data]
    data_updated
The result from this:
[None, None, None]
In addition to not getting the desired result, my probability doesn't make sense as I'm hoping to achieve 50% overall change in the data_updated file.
Any insight would be greatly appreciated. Thanks.
There are a few problems that I can see right away.
You are not returning anything in Applyerrors, so every value produced by [Applyerrors(unit, 10, 0.5) for unit in data] will be None.
When you do [res.replace(new_var, replacer) for res in string] you are trying to replace every instance of a letter with another, so there would be a change 50% of the time, but the change would cover more than 50% of the data. The result is also thrown away, because str.replace returns a new string instead of modifying the original.
When you do [item.replace(new_var, '') for item in options] you are not removing the value from the list of options. The result is discarded, so replacer can still pick the original letter, and even if it were assigned back, the letter would become an empty string ('') that replacer could choose as the replacement.
You increment i before using it, so it will always skip the first character of the string.
You don't do a check to avoid changing the newline character.
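To make the points about discarded results concrete, here is a small sketch (the variable names result and discarded are only illustrative) showing that str.replace leaves the original string and list untouched:

```python
# Strings are immutable: str.replace returns a new string.
s = "AAACGGGATT"
result = s.replace("A", "C")
print(s)       # AAACGGGATT  (unchanged)
print(result)  # CCCCGGGCTT  (every A replaced, not just one position)

# A list comprehension builds a new list; options itself is not modified.
options = ['A', 'G', 'T', 'C']
discarded = [item.replace('A', '') for item in options]
print(options)    # ['A', 'G', 'T', 'C']
print(discarded)  # ['', 'G', 'T', 'C']
```

Unless the new values are assigned to a name or returned, they are simply lost.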
    def Applyerrors(string, string_length, probability):
        i = 0
        while i < string_length:
            if string[i] == "\n":
                i = i + 1  # advance first, otherwise continue would loop forever
                continue
            p = i/string_length
            if p > probability:
                new_var = string[i]
                options = ['A', 'G', 'T', 'C']
                options.remove(new_var)  # drop the original letter from the choices
                replacer = random.choice(options)
                string = string[:i] + replacer + string[i+1:]
            i = i + 1
        return string
return string: when Applyerrors is done, it returns the edited string at the end.
string = string[:i] + replacer + string[i+1:] replaces just the character at index i. (string[:i] is everything from index 0 up to, but not including, i; string[i+1:] is everything after index i to the end of the string.)
options.remove(new_var) removes the value from the list entirely rather than replacing it with an empty string.
i = i + 1 was moved to the end of the loop so that i is 0 on the first iteration and the first character is included.
if string[i] == "\n" adds a check to skip any newline characters (continue skips the rest of the current iteration of the loop).
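As a quick illustration of the slicing and list removal described above, here is a minimal sketch with an arbitrarily chosen index and replacement letter:

```python
s = "AAACGGGATT"
i = 3  # index of the character to replace (the 'C')

print(s[:i])      # AAA     (everything before index i)
print(s[i + 1:])  # GGGATT  (everything after index i)

edited = s[:i] + "T" + s[i + 1:]
print(edited)     # AAATGGGATT

# list.remove deletes the first matching element in place.
options = ['A', 'G', 'T', 'C']
options.remove(s[i])
print(options)    # ['A', 'G', 'T']
```

Only the character at index i changes; the rest of the string is copied as-is.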
While none of the following changes are necessary, I have recreated the function below to show some best practices.
    def apply_errors(data, prob):
        for i in range(len(data)):
            options = ['A', 'G', 'T', 'C']
            current = data[i]
            if current not in options or random.random() > prob:
                continue
            options.remove(current)
            data = data[:i] + random.choice(options) + data[i+1:]
        return data
for i in range(len(data)) removes the need to pass in the length of the string.
if current not in options or random.random() > prob combines two checks:
current not in options is true when the character is not one of the options, so \n and anything else that shouldn't be changed is skipped.
random.random() > prob skips the rest of the iteration when a random number is greater than prob (random.random() returns a random float from 0 up to, but not including, 1). This makes the input probability represent the chance that a value is changed. The way you have it currently, a higher input probability means fewer changes.
random.random() > prob also means every character has the same probability of changing. The old version always changes the same positions: with probability = 0.5, every character in roughly the last half of the string.
string was renamed to data. It's not great to have a variable named after a data type (even though strings are represented as str in Python).
apply_errors uses snake case, which is the preferred style for naming variables and functions in Python.
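For completeness, here is a rough usage sketch of the rewritten function on the data from the question. The fraction_changed helper, the seed, and the long random test sequence are my own additions for illustration, not part of the original code:

```python
import random

def apply_errors(data, prob):
    # Same logic as the rewritten function above, repeated so this block runs on its own.
    for i in range(len(data)):
        options = ['A', 'G', 'T', 'C']
        current = data[i]
        if current not in options or random.random() > prob:
            continue
        options.remove(current)
        data = data[:i] + random.choice(options) + data[i+1:]
    return data

def fraction_changed(before, after):
    # Hypothetical helper: share of A/G/T/C positions that differ.
    positions = [i for i, ch in enumerate(before) if ch in 'AGTC']
    return sum(before[i] != after[i] for i in positions) / len(positions)

random.seed(0)  # arbitrary seed, only so repeated runs give the same output

data = ['AAACGGGATT\n', 'CTGTGTCAGT\n', 'AATCTCTACT\n']
data_updated = [apply_errors(unit, 0.5) for unit in data]
print(data_updated)

# Short strings vary a lot, so measure the rate on a long made-up sequence.
long_seq = ''.join(random.choice('AGTC') for _ in range(10000)) + '\n'
rate = fraction_changed(long_seq, apply_errors(long_seq, 0.5))
print(rate)  # should land close to 0.5

all_changed = apply_errors(data[0], 1)  # prob=1: every letter is replaced
print(fraction_changed(data[0], all_changed))  # 1.0
```

Note that each character is changed independently, so a 10-letter string will not always have exactly 5 changes; the 50% figure only holds on average.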