
Finding rows with a "Not Applicable" (NA) value in a specific column of a GraphLab SFrame


Given a Graphlab.SFrame object with the following column names:

>>> import graphlab
>>> sf = graphlab.SFrame.read_csv('some.csv')
>>> sf.column_names()
['Dataset', 'Domain', 'Score', 'Sent1', 'Sent2']

It is easy to drop the rows that have a "not applicable" (NA) / None value in a particular column. For example, to drop the rows with NA values in the "Score" column, I can do this:

>>> sf.dropna('Score')

Or, to replace the None values with a given value (say -1), I can do this:

>>> sf.fillna('Score', -1)

After checking the SFrame docs at https://dato.com/products/create/docs/generated/graphlab.SFrame.html, I could not find a built-in function that returns the rows containing None in a given column, something like sf.findna('Score'). I may have missed it, though.

If there is such a function, what is it called?

If there isn't, how should I extract the rows that have an NA value in a specified column?


Solution

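  • I think you can use a boolean mask to identify the rows with missing values in a given column. Comparing the column to None gives an array of 0s and 1s, with 1 marking the rows where the value is missing.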

    >>> import graphlab
    >>> sf = graphlab.SFrame({'a': [1, 2, None, 4],
    ...                       'b': [None, 3, 1, None]})
    >>> mask = sf['a'] == None
    >>> mask
    dtype: int
    Rows: 4
    [0, 0, 1, 0]
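
    To get from the mask to the rows themselves, the mask still has to be applied as a filter. In GraphLab that should be sf[mask], since an SFrame can be indexed with a binary SArray. Below is a minimal plain-Python sketch of the same mask-then-filter logic that runs without graphlab installed. The `columns` dict and the `find_na` helper are hypothetical stand-ins for illustration only, not part of the GraphLab API.

```python
# Plain-Python stand-in for the SFrame above (assumption: column name -> list of values).
columns = {'a': [1, 2, None, 4],
           'b': [None, 3, 1, None]}


def find_na(columns, name):
    """Hypothetical helper: return (mask, rows) for rows where column `name` is None."""
    # Counterpart of `mask = sf[name] == None`: 1 where the value is missing, else 0.
    mask = [int(value is None) for value in columns[name]]
    # Counterpart of `sf[mask]`: keep only the rows whose mask entry is 1.
    rows = [{col: values[i] for col, values in columns.items()}
            for i in range(len(mask)) if mask[i]]
    return mask, rows


mask, rows = find_na(columns, 'a')
print(mask)  # [0, 0, 1, 0]
print(rows)  # [{'a': None, 'b': 1}]
```

    Calling find_na(columns, 'b') picks out the first and last rows instead. With a real SFrame the whole thing would presumably collapse to sf[sf['Score'] == None].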