Search code examples
pythonpandasdataframerecursionvariable-assignment

How to create a dataframe column (with repetition allowed!) based on conditional statements involving another dataframe column?


I have two dataframes that each have datetime columns. df1 has repeated datetimes, df2 does not.

I'm trying to assign values from df2 to a new column in df1 at every place where the datetimes match, including at all repeated values in df1.

Now, I can do this by making nested for loops and iterating over the index values, but this is decidedly slow and apparently considered bad form with dataframes as it doesn't make use of their most excellent methods.

example (non-functional) code:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'datetime': ['3/10/2000 17:03:00', '3/10/2000 17:03:00', '3/10/2000 17:04:00'],
                'value1': [2, 3, 4],
                'value2': ['x', 'y', 'z']})
df1['datetime'] = pd.to_datetime(df1['datetime'])
print('df1',df1)

df2 = pd.DataFrame({'datetime': ['3/10/2000 17:03:00', '3/10/2000 17:04:00', '3/10/2000 17:05:00'],
                'value1': [2, 3, 4],
                'value2': ['x', 'y', 'z'],
                'value3': ['p', 'd', 'q']})
df2['datetime'] = pd.to_datetime(df2['datetime'])
print('df2',df2)

df1.loc[:, 'value3'] = df2.loc[df1.loc[:, 'datetime'] == df2.loc[:, 'datetime'], 'value3']
print('modified df1',df1)

output:

df1              datetime  value1 value2
0 2000-03-10 17:03:00       2      x
1 2000-03-10 17:03:00       3      y
2 2000-03-10 17:04:00       4      z
df2              datetime  value1 value2 value3
0 2000-03-10 17:03:00       2      x      p
1 2000-03-10 17:04:00       3      y      d
2 2000-03-10 17:05:00       4      z      q
modified df1              datetime  value1 value2 value3
0 2000-03-10 17:03:00       2      x      p
1 2000-03-10 17:03:00       3      y    NaN
2 2000-03-10 17:04:00       4      z    NaN

Note the output (printed as "modified df1") has p, NaN, NaN in df1['value3'] when I need it to have p, p, d.

How to make this work? Is there some sort of recursive flag I can set or is this totally the wrong approach using .loc?

Again, I'm trying to avoid bad form (and Pandas warnings!) of doing this with nested for loops. Thank you all for the help.


Solution

  • Stef has the right idea in the comment.

    import pandas as pd
    import numpy as np
    
    df1 = pd.DataFrame({'datetime': ['3/10/2000 17:03:00', '3/10/2000 17:03:00', '3/10/2000 17:04:00'],
                    'value1': [2, 3, 4],
                    'value2': ['x', 'y', 'z']})
    df1['datetime'] = pd.to_datetime(df1['datetime'])
    print('df1',df1)
    
    df2 = pd.DataFrame({'datetime': ['3/10/2000 17:03:00', '3/10/2000 17:04:00', '3/10/2000 17:05:00'],
                    'value1': [2, 3, 4],
                    'value2': ['x', 'y', 'z'],
                    'value3': ['p', 'd', 'q']})
    df2['datetime'] = pd.to_datetime(df2['datetime'])
    print('df2',df2)
    
    df1 = df1.merge(df2[["datetime", "value3"]], on = "datetime", suffixes=("", ""))
    print('modified df1',df1)