I have several text files, with some chronological metrics on real estate data over time. I would like to import this data into a dataframe, and then calculate a new dataframe that is the difference between a metric for a specific city, relative to the average of all the cities, at that date. For a given date, I would also like the output to be NaN if the individual city had no data, even if there is a relevant average for the other cities.
A couple issues I've encountered or am concerned about
--> I am not sure how the best way is to avoid a situation where some Memphis data = 'NaN', but the average of multiple cities was 17 so Memphis - Average = -17. My desired behavior is for Memphis (NaN) minus Average (17) = NaN.
--> My column order changes to alphabetical when doing a subtraction of my average from the individual cities, and I don't know why. While I have manually checked, I am concerned I'm doing something wrong in terms of indexing, so that I may be subtracting misaligned columns or rows without knowing it.
--> I am extracting off some text data (a 'Record' string and the 'City' name) from my original dataset, and concatenating it back onto the finalized relative city. This seems a bit inefficient, and similarly, I am concerned this could be adding indexing risk.
import glob
import pandas as pd
df = pd.concat([pd.read_csv(f, na_values = ['#VALUE!', '#DIV/0!'], keep_default_na = True) for f in glob.glob('city Text Files/*.txt')], ignore_index = True)
df_headers = df.loc[:,['Record','City']]
df_average = df.groupby(['Date'], as_index=False).mean()
df_rel_cities_wip = (df.set_index('Date')-df_average.set_index('Date').reindex(df.Date)).reset_index()
df_rel_cities_wip = df_rel_cities_wip.drop(['Record','City'], axis = 1) #these otherwise stay behind as blank columns because they're strings
df_rel_cities_wip = df_rel_cities_wip.round(6) #Gets rid of floating point almost-zeros
df_rel_cities = pd.concat([df_headers, df_rel_cities_wip], axis=1, sort=False) #Bolts city names and alpha record numbers back to the relative dataset
EDIT - ADDING SOME SAMPLE DATA TO ILLUSTRATE First dataframe (df) would be the original records. For various reasons, there may not be viable $/sqft or days-on-market (DOM) available for Memphis on January 3.
Record City Date $/SQFT DOM
M12 Memphis 01/01/2018 100 18
M13 Memphis 01/02/2018 112 73
M14 Memphis 01/03/2018 NaN NaN
D73 Dallas 01/01/2018 300 36
D74 Dallas 01/02/2018 320 53
D75 Dallas 01/03/2018 325 43
A19 Atlanta 01/01/2018 255 11
A20 Atlanta 01/03/2018 263 18
The average metric on each day for all these cities is as follows (df_average)
Date $/SQFT DOM
01/01/2018 218.3 21.7
01/02/2018 216.0 63.0
01/03/2018 294.0 30.5
And my final dataframe (df_rel_cities) would show the difference between the 'city average' on each date, and the actual. Note that M14 Memphis is NaN to begin - so the desired outcome is to see NaN as the difference versus the industry, given that Memphis had problematic data that day. When I run my code above, I find it alphabetizes my column order as well.
Record City Date $/SQFT DOM
M12 Memphis 01/01/2018 -118.3 -3.7
M13 Memphis 01/02/2018 -104.0 10.0
M14 Memphis 01/03/2018 NaN NaN
D73 Dallas 01/01/2018 81.7 14.3
D74 Dallas 01/02/2018 104.0 -10.0
D75 Dallas 01/03/2018 31.0 12.5
A19 Atlanta 01/01/2018 36.7 -10.7
A20 Atlanta 01/03/2018 -31.0 -12.5
I believe need:
df = pd.concat([pd.read_csv(f, na_values = ['#VALUE!', '#DIV/0!'], keep_default_na = True)
for f in glob.glob('city Text Files/*.txt')])
#get only numeric columns
cols = df.select_dtypes(np.number).columns
#create DataFrame with same size as original with means
df_average = df.groupby('Date')[cols].transform('mean')
print (df_average)
$/SQFT DOM
0 218.333333 21.666667
1 216.000000 63.000000
2 294.000000 30.500000
3 218.333333 21.666667
4 216.000000 63.000000
5 294.000000 30.500000
6 218.333333 21.666667
7 294.000000 30.500000
#substract only numeric columns
df[cols] = (df[cols] - df_average ).round(1)
print (df)
Record City Date $/SQFT DOM
0 M12 Memphis 01/01/2018 -118.3 -3.7
1 M13 Memphis 01/02/2018 -104.0 10.0
2 M14 Memphis 01/03/2018 NaN NaN
3 D73 Dallas 01/01/2018 81.7 14.3
4 D74 Dallas 01/02/2018 104.0 -10.0
5 D75 Dallas 01/03/2018 31.0 12.5
6 A19 Atlanta 01/01/2018 36.7 -10.7
7 A20 Atlanta 01/03/2018 -31.0 -12.5