I have a large data file as shown below.
Edited to include an updated example:
I wanted to add two new columns (E and F) next to column D and move the suite # when applicable and City/State data in cell D3 and D4 to E2 and F2, respectively. The challenge is not every entry has the suite number. I would need to insert a row first for those entries that don't have the suite number, only for them, not for those that already have the suite information.
I know how to do loops, but am having trouble to define the conditions. One way is to count the length of the string. How should I get started? Much appreciate your help!
This is how I would do it. I don't recommend looping when using pandas. There are a lot of tools that it is often not needed. Some caution on this. Your spreadsheet has NaN and I think that is actually numpy np.nan equivalent. You also have blanks I am thinking that it is a "" equivalent.
import pandas as pd
import numpy as np
# dictionary of your data
companies = {
'Comp ID': ['C1', '', np.nan, 'C2', '', np.nan, 'C3',np.nan],
'Address': ['10 foo', 'Suite A','foo city', '11 spam','STE 100','spam town', '12 ham', 'Myhammy'],
'phone': ['888-321-4567', '', np.nan, '888-321-4567', '', np.nan, '888-321-4567',np.nan],
'Type': ['W_sale', '', np.nan, 'W_sale', '', np.nan, 'W_sale',np.nan],
}
# make the frames needed.
df = pd.DataFrame( companies)
df1 = pd.DataFrame() # blank frame for suite and town columns
# Edit here to TEST the data types
for r in range(0, 5):
v = df['Comp ID'].values[r]
print(f'this "{v}" is a ', type(v))
# So this will tell us the data types so we can construct our where(). Back to prior answer....
# Need a where clause it is similar to a if() statement in excel
df1['Suite'] = np.where( df['Comp ID']=='', df['Address'], np.nan)
df1['City/State'] = np.where( df['Comp ID'].isna(), df['Address'], np.nan)
# copy values to rows above
df1 = df1[['Suite','City/State']].backfill()
# joint the frames together on index
df = df.join(df1)
df.drop_duplicates(subset=['City/State'], keep='first', inplace=True)
# set the column order to what you want
df = df[['Comp ID', 'Type', 'Address', 'Suite', 'City/State', 'phone' ]]
output
Comp ID | Type | Address | Suite | City/State | phone |
---|---|---|---|---|---|
C1 | W_sale | 10 foo | Suite A | foo city | 888-321-4567 |
C2 | W_sale | 11 spam | STE 100 | spam town | 888-321-4567 |
C3 | W_sale | 12 ham | Myhammy | 888-321-4567 |
Edit: the numpy where statement:
numpy is brought in by the line import numpy as np
at the top. We are creating calculated column that is based on the 'Comp ID' column. The numpy does this without loops. Think of the where like an excel IF() function.
df1(return value) = np.where(df[test] > condition, true, false)
The pandas backfill
Some times you have a value that is in a cell below and you want to duplicate it for the blank cell above it. So you backfill. df1 = df1[['Suite','City/State']].backfill()
.