I have a large array of tab-delimited data. I'd like to calculate the mean value of each column. The problem is that some values are 'None', and I'd like to exclude these data points from the calculation.
The data structure looks like this:
0.0 0.5 0.0 0.142857142857 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.8 0.0 None 0.0 0.0
I'm using this code. Not sure how to add the condition into this:
data = [float(l.split('\t')[target_column_val])
        for l in open(target_file, 'r').readlines()]
mean = sum(data) / len(data)
open has a default mode of 'r' (read), so I do not pass 'r' to open here. We get a file object from it as f. f is iterable, so we loop over all the lines in f.
For each line, we split on whitespace, which is why we use for item in var.split(). This gives us a list of strings formed by splitting the line in f. With no argument, split() splits on any run of whitespace, so it handles tabs as well as spaces.
We use if item != 'None' because this is one way of getting rid of the 'None' values here. Finally, we call float(item) on each remaining item, because we want floats and not strings.
with open('target_file.txt') as f:
    # 'None' is a string in this file, not the None object.
    final_list = [float(item) for var in f for item in var.split() if item != 'None']
print(final_list)
Try the code above. In a list comprehension, you can add an if condition after the iterable.
You can then calculate the mean like so:
mean = sum(final_list) / len(final_list)
We can use the sum function to add up all the floats in a list. sum takes an iterable object, such as a list (our case) or a tuple, and len gives you the length of a list.
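Note that final_list pools the values from every column into one list, whereas the question asks for a mean per column. Here is a minimal sketch of one way to get that, keeping a running sum and count for each column and skipping 'None' entries. The helper name column_means and the choice to report None for a column with no numeric values are my own assumptions, not something from the question.

```python
def column_means(lines):
    """Return the mean of each column, skipping 'None' entries.

    `lines` is any iterable of whitespace-delimited strings,
    for example an open file object. (Hypothetical helper.)
    """
    totals = []  # running sum per column
    counts = []  # number of numeric values seen per column
    for line in lines:
        for i, item in enumerate(line.split()):
            # Grow the accumulators the first time a column index appears.
            if i == len(totals):
                totals.append(0.0)
                counts.append(0)
            if item != 'None':  # 'None' is a string in the data, not the None object
                totals[i] += float(item)
                counts[i] += 1
    # Assumption: a column with no numeric values gets None instead of a mean.
    return [t / c if c else None for t, c in zip(totals, counts)]


# The three sample rows from the question, tab-delimited.
sample = [
    "0.0\t0.5\t0.0\t0.142857142857\t0.0\t0.0",
    "0.0\t0.0\t0.0\t0.0\t0.0\t0.0",
    "0.0\t0.8\t0.0\tNone\t0.0\t0.0",
]
means = column_means(sample)
print(means)
```

To run it on the real data, pass the file object instead of the sample list, e.g. open target_file in a with block and call column_means(f). The fourth column is then averaged over two values rather than three, because the 'None' entry is not counted.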