Search code examples
pythonpandasdataframeinsertconditional-statements

Insert rows in Python dataframe with conditions


I have a large data file as shown below. Example Data File

Edited to include an updated example:

Updated Example

I wanted to add two new columns (E and F) next to column D and move the suite # when applicable and City/State data in cell D3 and D4 to E2 and F2, respectively. The challenge is not every entry has the suite number. I would need to insert a row first for those entries that don't have the suite number, only for them, not for those that already have the suite information.

I know how to do loops, but am having trouble to define the conditions. One way is to count the length of the string. How should I get started? Much appreciate your help!


Solution

  • This is how I would do it. I don't recommend looping when using pandas. There are a lot of tools that it is often not needed. Some caution on this. Your spreadsheet has NaN and I think that is actually numpy np.nan equivalent. You also have blanks I am thinking that it is a "" equivalent.

    import pandas as pd
    import numpy as np
    
    # dictionary of your data
    companies = {
        'Comp ID': ['C1', '', np.nan, 'C2', '', np.nan, 'C3',np.nan],
        'Address': ['10 foo', 'Suite A','foo city', '11 spam','STE 100','spam town', '12 ham', 'Myhammy'],
        'phone': ['888-321-4567', '', np.nan, '888-321-4567', '', np.nan, '888-321-4567',np.nan],
        'Type': ['W_sale', '', np.nan, 'W_sale', '', np.nan, 'W_sale',np.nan],
    }
    # make the frames needed. 
    df = pd.DataFrame( companies)
    df1 = pd.DataFrame() # blank frame for suite and town columns
    
    # Edit here to TEST the data types 
    for r in range(0, 5):
        v = df['Comp ID'].values[r]
        print(f'this "{v}" is a ', type(v))
    
    # So this will tell us the data types so we can construct our where(). Back to prior answer....
    
    # Need a where clause it is similar to a if() statement in excel
    df1['Suite'] = np.where( df['Comp ID']=='', df['Address'], np.nan)
    df1['City/State'] = np.where( df['Comp ID'].isna(), df['Address'], np.nan)
    # copy values to rows above
    df1 = df1[['Suite','City/State']].backfill()
    # joint the frames together on index
    df = df.join(df1)
    df.drop_duplicates(subset=['City/State'], keep='first', inplace=True)
    # set the column order to what you want
    df = df[['Comp ID', 'Type', 'Address', 'Suite', 'City/State', 'phone' ]]
    

    output

    Comp ID Type Address Suite City/State phone
    C1 W_sale 10 foo Suite A foo city 888-321-4567
    C2 W_sale 11 spam STE 100 spam town 888-321-4567
    C3 W_sale 12 ham Myhammy 888-321-4567

    Edit: the numpy where statement:

    numpy is brought in by the line import numpy as np at the top. We are creating calculated column that is based on the 'Comp ID' column. The numpy does this without loops. Think of the where like an excel IF() function.

    df1(return value) = np.where(df[test] > condition, true, false)

    The pandas backfill Some times you have a value that is in a cell below and you want to duplicate it for the blank cell above it. So you backfill. df1 = df1[['Suite','City/State']].backfill().