Tags: python, math, multiple-columns, mean, nonetype

Calculate mean val from column excluding 'None' values


I have a large array of tab-delimited data, and I'd like to calculate the mean value of each column. The problem is that some values are 'None', and I'd like to exclude those data points from the calculation.

The data structure looks like this:

0.0     0.5     0.0     0.142857142857  0.0     0.0
0.0     0.0     0.0     0.0             0.0     0.0
0.0     0.8     0.0     None            0.0     0.0

I'm using this code. Not sure how to add the condition into this:

data = [float(l.split('\t')[target_column_val]) \
           for l in open(target_file, 'r').readlines()]
mean = sum(data) / len(data)

Solution

  • open defaults to mode 'r' (read), so there is no need to pass 'r' explicitly. It returns a file object, bound here to f. Since f is iterable, we can loop over its lines directly.

    For each line, var.split() splits on whitespace (tabs included) and returns a list of strings, which is why the comprehension contains for item in var.split().

    The condition if item != 'None' filters out the "None" entries, which are strings in this file rather than Python's None object. Each remaining item is converted with float(item) because we want floats, not strings.

    with open('target_file.txt') as f:
        # 'None' is a string in this file, not Python's None object.
        final_list = [float(item) for var in f for item in var.split() if item != 'None']

    print(final_list)
    

    Try the code above. A list comprehension can take an if clause after the iterable to filter items.

    You can then calculate the mean of everything in final_list (the values from all columns pooled together) like so:

    mean = sum(final_list) / len(final_list)
    

    sum adds up all the floats in the list; it accepts any iterable, such as a list (our case) or a tuple. len gives the length of the list. If every value in the file is 'None', final_list is empty and the division raises ZeroDivisionError.
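
The question asks for a mean per column, while the comprehension above pools every column into one list. A minimal sketch of a per-column variant follows, assuming every row has its columns in the same order and that missing cells are the literal string 'None'. The helper name column_means and its missing parameter are hypothetical, introduced only for illustration.

```python
def column_means(lines, missing="None"):
    """Return one mean per column, skipping cells equal to `missing`.

    `lines` is any iterable of whitespace-delimited strings, such as an
    open file object. A column with no numeric cells yields None.
    """
    sums = []    # running total for each column
    counts = []  # number of numeric cells seen in each column
    for line in lines:
        for i, item in enumerate(line.split()):
            # Grow the accumulators the first time a column index appears.
            if i >= len(sums):
                sums.append(0.0)
                counts.append(0)
            if item != missing:
                sums[i] += float(item)
                counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# The three sample rows from the question, written as tab-delimited strings.
sample = [
    "0.0\t0.5\t0.0\t0.142857142857\t0.0\t0.0",
    "0.0\t0.0\t0.0\t0.0\t0.0\t0.0",
    "0.0\t0.8\t0.0\tNone\t0.0\t0.0",
]

means = column_means(sample)
print(means)
```

Because a file object is iterable line by line, the same function should also accept the result of open(target_file) directly. In the fourth column, only the two numeric cells are averaged, so the 'None' row does not drag the mean down.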