Tags: pandas, scipy, graphlab

GraphLab: find all the columns that have at least one None value


How should one find all the columns in an SFrame that have at least one None value? One way would be to iterate through every column and check whether any value in it is None. Is there a better way to do the job?


Solution

  • To find None values in an SFrame, use the SArray method num_missing, which returns the number of missing values in a column.

    Solution

    >>> col_w_none = [col for col in sf.column_names() if sf[col].num_missing()>0]
    

    Example

    >>> import graphlab as gl
    >>> sf = gl.SFrame({'foo':[1,2,3,4], 'bar':[1,None,3,4]})
    >>> print sf
    +------+-----+
    | bar  | foo |
    +------+-----+
    |  1   |  1  |
    | None |  2  |
    |  3   |  3  |
    |  4   |  4  |
    +------+-----+
    [4 rows x 2 columns]
    >>> print [col for col in sf.column_names() if sf[col].num_missing()>0]
    ['bar']
    

    Caveats

    • It isn't optimal, since it doesn't stop scanning a column at the first None value; num_missing counts every missing value.
    • It won't detect NaN or empty strings:
    >>> sf = gl.SFrame({'foo':[1,2,3,4], 'bar':[1,None,3,4], 'baz':[1,2,float('nan'),4], 'qux':['spam', '', 'ham', 'eggs']} )
    >>> print sf
    +------+-----+-----+------+
    | bar  | baz | foo | qux  |
    +------+-----+-----+------+
    |  1   | 1.0 |  1  | spam |
    | None | 2.0 |  2  |      |
    |  3   | nan |  3  | ham  |
    |  4   | 4.0 |  4  | eggs |
    +------+-----+-----+------+
    [4 rows x 4 columns]
    >>> print [col for col in sf.column_names() if sf[col].num_missing()>0]
    ['bar']