Search code examples
pythondataframeduplicatesanalysis

Find date and price difference between duplicate columns in a data frame. Append information to rows in data frame


Need to find date difference and price difference between duplicate columns (by name) for entire Data Frame which is much larger than 7 rows. If the duplicates can all be added to the same row that would be preferable as seen in example below.

Sample Code:

import pandas as pd
data = {'CarMake':['Toyota', 'Ford', 'Nissan', 'Hyundai','Toyota', 'Ford', 'Nissan', 'Hyundai'],
        'DateSold':['1-2-18','1-2-18','1-3-18','1-3-18','1-2-20','1-2-20','1-3-20','1-3-20'],
        'Price':['20000','21000','22000','23000','15000','16000','17000','18000']}         
df = pd.DataFrame(data)

df['Price']=df['Price'].astype(str).astype(float)
df['DateSold']=pd.to_datetime(df['DateSold'])

To see data types:

df.dtypes

Expected Output:

   CarMake   DateSold    Price
0   Toyota 2018-01-02  20000.0
1     Ford 2018-01-02  21000.0
2   Nissan 2018-01-03  22000.0
3  Hyundai 2018-01-03  23000.0
4   Toyota 2020-01-02  15000.0
5     Ford 2020-01-02  16000.0
6   Nissan 2020-01-03  17000.0
7  Hyundai 2020-01-03  18000.0

Desired Output: It would be preferable to have the date in [months].

   CarMake DateSold DateSold2  Price Price2 PriceDifference DateDifference
0   Toyota   1-2-18    1-2-20  20000  15000           -5000             24
1     Ford   1-2-18    1-2-20  21000  16000           -5000             24
2   Nissan   1-3-18    1-3-20  22000  17000           -5000             24
3  Hyundai   1-3-18    1-3-20  23000  18000           -5000             24

Solution

  • You need to compare every row with every other row that has the same CarMake value, then add the comparisons to a new table. The most efficient way to do this is to first sort the list by CarMake and then iterate over all the rows. Once the CarMake field in the next row doesn't match the CarMake field in the current row, you know you have found all cars of that make and can change which row you are comparing to. This is much faster than comparing to all rows.

    Here is an example that does just this, as well as checking for singletons and entering them in the new table with no comparison data.

    # set up example: added a singleton and triple-duplicate example and varied dates more
    import pandas as pd
    data = {'CarMake':['Toyota', 'Ford', 'Nissan', 'Hyundai','Toyota', 'Ford', 'Nissan', 'Toyota'],
            'DateSold':['1-2-18','1-2-18','5-3-17','1-3-18','1-2-20','1-2-20','1-3-20','6-3-20'],
            'Price':['20000','21000','22000','23000','15000','16000','17000','18000']}         
    df = pd.DataFrame(data)
    
    df['Price']=df['Price'].astype(str).astype(float)
    df['DateSold']=pd.to_datetime(df['DateSold'])
    
    # sort our data
    df.sort_values('CarMake', inplace=True)
    # make list for building the new table
    new_data = []
    
    # loop through all rows once
    for one in range(len(df)):
        one_row = df.iloc[one]
        one_make = one_row['CarMake']
    
        # check for singleton (non-duplicate row)
        # if the previous make or next make match, there is a duplicate,
        # otherwise it is a singleton
        prev_make = df.iloc[one - 1]['CarMake'] if one > 0 else None
        next_make = df.iloc[one + 1]['CarMake'] if one < len(df) - 1 else None
        if one_make != prev_make and one_make != next_make:
            # found a singleton
            new_data.append({
                'CarMake': one_make,
                'DateSold': one_row['DateSold'],
                'DateSold2': None,
                'Price': one_row['Price'],
                'Price2': None,
                'PriceDifference': None,
                'DateDifference': None,
            })
            continue
        
        # there is at least one duplicate, find them all
        for two in range(one + 1, len(df)):
            two_row = df.iloc[two]
            two_make = two_row['CarMake']
    
            if one_make == two_make:
                # found a duplicate
                new_data.append({
                    'CarMake': one_make,
                    'DateSold': one_row['DateSold'],
                    'DateSold2': two_row['DateSold'],
                    'Price': one_row['Price'],
                    'Price2': two_row['Price'],
                    'PriceDifference': abs(one_row['Price'] - two_row['Price']),
                    'DateDifference': abs(
                        (one_row['DateSold'].year  - two_row['DateSold'].year) * 12 + 
                        (one_row['DateSold'].month - two_row['DateSold'].month)
                    ),
                })
            else:
                break # no more matches, move `one` forward
    
    new_df = pd.DataFrame(new_data)
    print(new_df)
    

    Output:

       CarMake   DateSold  DateSold2    Price   Price2  PriceDifference  DateDifference
    0     Ford 2018-01-02 2020-01-02  21000.0  16000.0           5000.0            24.0
    1  Hyundai 2018-01-03        NaT  23000.0      NaN              NaN             NaN
    2   Nissan 2017-05-03 2020-01-03  22000.0  17000.0           5000.0            32.0
    3   Toyota 2018-01-02 2020-01-02  20000.0  15000.0           5000.0            24.0
    4   Toyota 2018-01-02 2020-06-03  20000.0  18000.0           2000.0            29.0
    5   Toyota 2020-01-02 2020-06-03  15000.0  18000.0           3000.0             5.0