I have a dataset of several tables. Some fields overlap, but in some tables they have a one-to-many relationship while in others they have a one-to-one relationship. I am trying to create a new dataframe that takes the values associated with a field in one table (one-to-one) and the values associated with that same field in another table (one-to-many) and lists them all together in the new dataframe (one-to-many).
One dataframe:
finishtId eventId instanceId ... value statusId finishType
0 1 18 1 ... 218.3 1 Positive
1 2 18 2 ... 217.586 1 Positive
2 3 18 3 ... 216.719 1 Positive
3 4 18 4 ... 215.464 1 Positive
4 5 18 5 ... 218.385 1 Negative
Another dataframe:
eventId instanceId red blue time duration milliseconds
0 841 153 1 1 17:05:23 26.898 26898
1 841 30 1 1 17:05:52 25.021 25021
2 841 17 1 11 17:20:48 23.426 23426
3 841 4 1 12 17:22:34 23.251 23251
4 841 13 1 13 17:24:10 23.842 23842
5 841 153 2 45 17:45:30 24.786 26473
... ... ... ... ... ... ...
7633 1036 822 2 48 16:20:38 22.143 22143
7634 1036 1 2 50 16:23:05 21.853 21853
7635 1036 849 2 49 16:24:00 22.475 22475
7636 1036 154 2 62 16:42:16 24.010 24010
7637 1036 822 3 64 16:42:47 22.607 22607
I want to create a new dataframe that appends all of the values from dataframe2 (red, blue, time, duration, milliseconds) to the instanceId and eventId fields, so that dataframe1 shows the one-to-many relationships. I also want to create a new field (numRed) that tells me how many reds there are per instanceId and eventId. Basically something like this:
eventId instanceId red numRed blue ... time duration value statusId finishType
0 841 153 1 2 17 ... 17:05:23 26.898 218.3 1 Positive
1 841 153 2 2 52 ... 17:45:30 24.786 217.586 1 Positive
1 841 146 1 1 40 ... 17:32:30 24.986 217.586 1 Negative
So essentially every red, blue, time, duration, value, statusId, and finishType is listed for every instanceId for every eventId. I'm new to Pandas, so I've been digging through functions, but I keep getting errors, usually related to data types (float vs. str), etc.
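A minimal sketch of that one-to-many merge plus the numRed count, using tiny made-up stand-ins for the two tables (the column values here are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for the two tables in the question.
df1 = pd.DataFrame({
    "eventId": [841, 841],
    "instanceId": [153, 146],
    "value": [218.3, 217.586],
    "statusId": [1, 1],
    "finishType": ["Positive", "Negative"],
})
df2 = pd.DataFrame({
    "eventId": [841, 841, 841],
    "instanceId": [153, 153, 146],
    "red": [1, 2, 1],
    "blue": [17, 52, 40],
    "duration": [26.898, 24.786, 24.986],
})

# One-to-many merge: each df1 row is repeated for every matching df2 row.
merged = df2.merge(df1, on=["eventId", "instanceId"], how="left")

# numRed: how many 'red' entries exist per (eventId, instanceId) pair.
merged["numRed"] = merged.groupby(["eventId", "instanceId"])["red"].transform("count")
print(merged)
```

Here `transform("count")` broadcasts each group's row count back onto every row of that group, so the one-to-many shape is preserved.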
UPDATE: After getting the solution from Edunne, I realized that something else would actually work better for this dataset. I'd prefer to instead collapse the rows of 'red' values for each 'instanceId' for each 'eventId'. Fields whose values differ would be averaged, e.g. the mean of the 'duration' and 'value' fields. Something like this:
eventId instanceId numRed ... duration value statusId finishType
0 841 153 2 ... 25.842 218.3 1 Positive
1 841 146 1 ... 24.986 217.586 1 Negative
You should show us what you tried! That makes it easier for people to answer.
Pandas merge is how I'd approach it, though. Something like:
new_df = df2.merge(df1, on=["eventId", "instanceId"], how="outer")
The new_df will contain all rows from df2 along with any matching rows from df1 (and with how="outer", unmatched rows from either side are kept too).
You might run into issues if the datatypes for "eventId" or "instanceId" differ across the two dataframes, but that should be easy enough to address...
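A quick sketch of that dtype fix, assuming (as an example) that df1 holds the keys as strings while df2 holds them as integers:

```python
import pandas as pd

# Hypothetical mismatch: df1's keys are strings, df2's are integers.
df1 = pd.DataFrame({"eventId": ["841"], "instanceId": ["153"], "value": [218.3]})
df2 = pd.DataFrame({"eventId": [841], "instanceId": [153], "red": [1]})

# Cast both key columns to a common dtype before merging.
for col in ["eventId", "instanceId"]:
    df1[col] = df1[col].astype("int64")

merged = df2.merge(df1, on=["eventId", "instanceId"], how="outer")
print(merged)
```

Without the cast, the rows wouldn't match on the keys (or the merge would raise, depending on the pandas version); after it, the single row lines up.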
Edit: You're probably looking for groupby. Perform the aggregation on the second dataframe before joining/merging with the other.
# Dictionary with keys as column names and values as the aggregation/summary method.
# Note: 'value' lives in df1 (one-to-one), so it comes along unchanged in the merge;
# only df2's own columns need aggregating here.
agg_dict = {
    "duration": "mean"
}
group_by_columns = ["eventId", "instanceId"]  # One output row per combination of these columns
new_df2 = df2.groupby(group_by_columns).agg(agg_dict).reset_index()
result = new_df2.merge(df1, on=["eventId", "instanceId"], how="outer")
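Since the update also asks for a numRed count alongside the mean duration, named aggregation can produce both in one pass. A runnable sketch with made-up stand-ins for df1 and df2 (numbers chosen to mirror the sample rows in the question):

```python
import pandas as pd

# Hypothetical stand-ins for the real tables.
df1 = pd.DataFrame({
    "eventId": [841, 841],
    "instanceId": [153, 146],
    "value": [218.3, 217.586],
    "statusId": [1, 1],
    "finishType": ["Positive", "Negative"],
})
df2 = pd.DataFrame({
    "eventId": [841, 841, 841],
    "instanceId": [153, 153, 146],
    "red": [1, 2, 1],
    "duration": [26.898, 24.786, 24.986],
})

# Named aggregation: count the reds and average the duration per key pair.
summary = df2.groupby(["eventId", "instanceId"]).agg(
    numRed=("red", "count"),
    duration=("duration", "mean"),
).reset_index()

result = summary.merge(df1, on=["eventId", "instanceId"], how="outer")
print(result)
```

Each (eventId, instanceId) pair collapses to a single row, with df1's one-to-one columns (value, statusId, finishType) attached by the merge.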
Let me know how it goes!