Search code examples
pythonpandasdataframedata-analysisdata-cleaning

Pandas: How to filter on a parameter with both one to many and one to one relationships


I have a dataset of several tables. Some of the fields overlap but on some tables they may have a one to many relationship while on other tables they may have a one to one relationship. I am trying to create a new dataframe where I can take the values associated with one field (one to one) and the values associated with that same field but in another table (one to many) and have them all listed on the new dataframe (one to many).

One dataframe:

         finishtId eventId instanceId  ...       value statusId finishType
0               1     18        1  ...           218.3        1   Positive
1               2     18        2  ...         217.586        1   Positive
2               3     18        3  ...         216.719        1   Positive
3               4     18        4  ...         215.464        1   Positive
4               5     18        5  ...         218.385        1   Negative

Another dataframe:

      eventId  instanceId red blue     time duration  milliseconds
0        841       153     1    1  17:05:23   26.898         26898
1        841        30     1    1  17:05:52   25.021         25021
2        841        17     1   11  17:20:48   23.426         23426
3        841         4     1   12  17:22:34   23.251         23251
4        841        13     1   13  17:24:10   23.842         23842
5.       841.      153.    2   45. 17:45:30.  24.786.        26473
     ...       ...   ...  ...       ...      ...           ...
7633    1036       822     2   48  16:20:38   22.143         22143
7634    1036         1     2   50  16:23:05   21.853         21853
7635    1036       849     2   49  16:24:00   22.475         22475
7636    1036       154     2   62  16:42:16   24.010         24010
7637    1036       822     3   64  16:42:47   22.607         22607

I want to create a new dataframe that appends all of the values from dataframe2 (red, blue, time, duration, milliseconds) to the instanceId and eventId fields so that dataframe1 shows the one to many relationships. Also I want to create a new field that tells me how many many reds per instanceId and eventId (numRed) Basically something like this:

          eventId instanceId  red  numRed blue  ...     time  duration   value statusId finishType
0             841    153        1       2   17  ... 17:05:23    26.898   218.3        1   Positive
1             841    153        2       2   52  ... 17:45:30    24.786 217.586        1   Positive
1             841    146        1       1   40  ... 17:32:30    24.986 217.586        1   Negative

So essentially every red, blue, time, duration, value, statusId, and finishType is listed for every instanceId for every eventId. I'm new to Pandas so I've been digging through functions but I keep getting errors usually associated with data type (float vs str) etc...

UPDATE: After getting the solution from Edunne I realized that what I think would work better for the dataset is actually something else. I'd prefer to instead merge down the rows of 'red' values for each 'instanceId' for each 'eventId'. The values that are different would be averaged, so the mean of the 'duration' and the mean of the 'value' fields. Something looking like this:

          eventId instanceId  numRed ...  duration   value statusId finishType
0             841    153          2  ...     25.842   218.3        1   Positive
1             841    146          1  ...     24.986 217.586        1   Negative

Solution

  • You should show us what you tried! Makes it easier for people to answer.

    Pandas Merge is how I'd approach it though. Something like:

    new_df = df2.merge(df1, on=["eventID", "instanceId"], how="outer")
    

    The new_df will contain all rows in df2 and any matching rows from df2.

    You might run into issue if the datatypes for "eventID" or "instanceId" are different across the two dataframes, but that should be easy enough to address...

    Edit Probably looking for group_by. You should perform the aggregation on the second dataframe before joining/merging with the other.

    # Dictionary with keys as column names and values as the aggregation/summary method.
    agg_dict = {
        "duration": "mean",
        "value": "mean"
    }
    group_by_columns = ["eventID", "instanceId"] # We'll get one row in output for each combination of these columns
    new_df2 = df2.groupby(group_by_columns).agg(agg_dict).reset_index()
    
    result = new_df2.merge(df1, on=["eventID", "instanceId"], how="outer")
    

    Let me know how it goes!