python pandas python-itertools

Check combinations of columns in a DataFrame to return unique rows


for a, b in itertools.combinations(number_of_notes_cols, 2):
    weekly_meetings_difference = all_meetings_data[(all_meetings_data[a] != all_meetings_data[b]) == True]

The code above used to work: for every pair of columns, it returned the rows where the two column values differed, so a row was returned if this was true for any pair of columns. Now weekly_meetings_difference gives me some, but not all, of the rows where the column values changed.


Edit with some code:

Before (when everything seemed to be working fine):

              Number of Notes 03112016  Number of Notes 03192016  Number of Notes 03272016  Number of Notes 04042016
Meeting Name
X                                 12.0                       NaN                       NaN                       NaN
Y                                  5.0                       5.0                       NaN                       NaN
Z                                  2.0                       NaN                       NaN                       NaN
W                                  NaN                       6.0                     713.0                     740.0

After (now that I've updated the original dataframe from which I want information):

              Number of Notes 03112016  Number of Notes 03192016  Number of Notes 03272016  Number of Notes 04042016  Number of Notes 04122016  Emails 04122016
Meeting Name
A                                 37.0                      37.0                      38.0                      38.0                      37.0
X                                 12.0                       NaN                       NaN                       NaN                       NaN              NaN
Y                                  5.0                       5.0                       NaN                       NaN                       NaN              NaN
Z                                  2.0                       NaN                       NaN                       NaN                       NaN              NaN

Since making this update, I've noticed that row A was added once the extra column was in the dataframe, and row W was removed. Both rows should show up each time.


Solution

  • First, let me make sure that I understand the problem. Are you looking for rows in a dataframe that have more than one unique value? That is, the value changes at least one time in the row.

    import pandas as pd
    df = pd.DataFrame({'a': [1, 1, 1], 'b': [1, 2, 3], 'c': [1, 1, 3]})
    
        a  b  c
    0|  1  1  1
    1|  1  2  1
    2|  1  3  3
    

    In the dataframe above, you would want rows 1 and 2. If so, I would do something like:

    df.apply(pd.Series.nunique, axis=1)
    

    Which returns the number of unique values in each row of the dataframe:

    0    1
    1    2
    2    2
    dtype: int64
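
    One caveat for data like the question's, where many cells are NaN: nunique skips NaN by default, so a row such as X (12.0 followed only by NaN) counts as having a single unique value. The sketch below uses the built-in DataFrame.nunique(axis=1), which should give the same counts as the apply call, and also shows the dropna=False variant. The short column names n1 to n4 are placeholders for the "Number of Notes" columns; the values are copied from the question's "Before" table.

```python
import numpy as np
import pandas as pd

# Values from the question's "Before" table; n1..n4 are placeholder
# names standing in for the four "Number of Notes ..." columns.
notes = pd.DataFrame(
    {
        'n1': [12.0, 5.0, 2.0, np.nan],
        'n2': [np.nan, 5.0, np.nan, 6.0],
        'n3': [np.nan, np.nan, np.nan, 713.0],
        'n4': [np.nan, np.nan, np.nan, 740.0],
    },
    index=pd.Index(['X', 'Y', 'Z', 'W'], name='Meeting Name'),
)

# Row-wise count of distinct values; NaN is skipped (dropna=True is the default).
ignoring_nan = notes.nunique(axis=1)

# Same count, but NaN is treated as one more distinct value.
counting_nan = notes.nunique(axis=1, dropna=False)
```

    With the default, only W has more than one unique value. With dropna=False, every row here does, because each row mixes numbers and NaN. Which behaviour is right depends on whether a missing value should count as a change.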
    

    Using that result, we can select the rows we care about with:

    df[df.apply(pd.Series.nunique, axis=1) > 1]
    

    Which returns the expected:

        a  b  c
    1|  1  2  1
    2|  1  3  3
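
    If you would rather keep the pairwise itertools.combinations loop from the question, note that assigning to weekly_meetings_difference inside the loop replaces the previous result on every pass, so only the last pair of columns is reflected at the end. A minimal sketch that accumulates across all pairs instead, assuming that NaN means "no data" rather than a change (that assumption is mine, not something the question states). The names notes, n1 to n4, differs and changed_rows are hypothetical; the values come from the question's "Before" table.

```python
import itertools

import numpy as np
import pandas as pd

# Values from the question's "Before" table; column names are placeholders.
notes = pd.DataFrame(
    {
        'n1': [12.0, 5.0, 2.0, np.nan],
        'n2': [np.nan, 5.0, np.nan, 6.0],
        'n3': [np.nan, np.nan, np.nan, 713.0],
        'n4': [np.nan, np.nan, np.nan, 740.0],
    },
    index=pd.Index(['X', 'Y', 'Z', 'W'], name='Meeting Name'),
)

# Start from "no row differs" and OR in the result of each column pair,
# instead of overwriting the result on every iteration.
differs = pd.Series(False, index=notes.index)
for a, b in itertools.combinations(notes.columns, 2):
    # Assumption: only compare cells where both values are present,
    # because NaN != anything (including NaN) evaluates to True.
    both_present = notes[a].notna() & notes[b].notna()
    differs |= both_present & (notes[a] != notes[b])

changed_rows = notes[differs]
```

    On this data the loop selects only row W, which matches what the nunique approach gives when NaN is ignored. Dropping the both_present guard would flag every row, since each one contains at least one NaN next to a number.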
    

    Is this what you're after, or is it something else? Happy to edit if you clarify.