Search code examples
pythonpandassubtraction

Working with Pandas Dataframe Subtraction and Indexes


I have several text files, with some chronological metrics on real estate data over time. I would like to import this data into a dataframe, and then calculate a new dataframe that is the difference between a metric for a specific city, relative to the average of all the cities, at that date. For a given date, I would also like the output to be NaN if the individual city had no data, even if there is a relevant average for the other cities.

A couple issues I've encountered or am concerned about

--> I am not sure how the best way is to avoid a situation where some Memphis data = 'NaN', but the average of multiple cities was 17 so Memphis - Average = -17. My desired behavior is for Memphis (NaN) minus Average (17) = NaN.

--> My column order changes to alphabetical when doing a subtraction of my average from the individual cities, and I don't know why. While I have manually checked, I am concerned I'm doing something wrong in terms of indexing, so that I may be subtracting misaligned columns or rows without knowing it.

--> I am extracting off some text data (a 'Record' string and the 'City' name) from my original dataset, and concatenating it back onto the finalized relative city. This seems a bit inefficient, and similarly, I am concerned this could be adding indexing risk.

import glob
import pandas as pd

df = pd.concat([pd.read_csv(f, na_values = ['#VALUE!', '#DIV/0!'], keep_default_na = True) for f in glob.glob('city Text Files/*.txt')], ignore_index = True)

df_headers = df.loc[:,['Record','City']]
df_average = df.groupby(['Date'], as_index=False).mean()

df_rel_cities_wip = (df.set_index('Date')-df_average.set_index('Date').reindex(df.Date)).reset_index()
df_rel_cities_wip = df_rel_cities_wip.drop(['Record','City'], axis = 1) #these otherwise stay behind as blank columns because they're strings
df_rel_cities_wip = df_rel_cities_wip.round(6) #Gets rid of floating point almost-zeros
df_rel_cities = pd.concat([df_headers, df_rel_cities_wip], axis=1, sort=False) #Bolts city names and alpha record numbers back to the relative dataset

EDIT - ADDING SOME SAMPLE DATA TO ILLUSTRATE First dataframe (df) would be the original records. For various reasons, there may not be viable $/sqft or days-on-market (DOM) available for Memphis on January 3.

Record  City    Date        $/SQFT  DOM
M12 Memphis 01/01/2018  100 18
M13 Memphis 01/02/2018  112 73
M14 Memphis 01/03/2018  NaN NaN
D73 Dallas  01/01/2018  300 36
D74 Dallas  01/02/2018  320 53
D75 Dallas  01/03/2018  325 43
A19 Atlanta 01/01/2018  255 11
A20 Atlanta 01/03/2018  263 18

The average metric on each day for all these cities is as follows (df_average)

Date        $/SQFT  DOM
01/01/2018  218.3   21.7
01/02/2018  216.0   63.0
01/03/2018  294.0   30.5

And my final dataframe (df_rel_cities) would show the difference between the 'city average' on each date, and the actual. Note that M14 Memphis is NaN to begin - so the desired outcome is to see NaN as the difference versus the industry, given that Memphis had problematic data that day. When I run my code above, I find it alphabetizes my column order as well.

Record  City    Date        $/SQFT  DOM
M12 Memphis 01/01/2018  -118.3  -3.7
M13 Memphis 01/02/2018  -104.0  10.0
M14 Memphis 01/03/2018  NaN NaN
D73 Dallas  01/01/2018  81.7    14.3
D74 Dallas  01/02/2018  104.0   -10.0
D75 Dallas  01/03/2018  31.0    12.5
A19 Atlanta 01/01/2018  36.7    -10.7
A20 Atlanta 01/03/2018  -31.0   -12.5

Solution

  • I believe need:

    df = pd.concat([pd.read_csv(f, na_values = ['#VALUE!', '#DIV/0!'], keep_default_na = True) 
                    for f in glob.glob('city Text Files/*.txt')])
    
    #get only numeric columns
    cols = df.select_dtypes(np.number).columns
    
    #create DataFrame with same size as original with means
    df_average = df.groupby('Date')[cols].transform('mean')
    print (df_average)
           $/SQFT        DOM
    0  218.333333  21.666667
    1  216.000000  63.000000
    2  294.000000  30.500000
    3  218.333333  21.666667
    4  216.000000  63.000000
    5  294.000000  30.500000
    6  218.333333  21.666667
    7  294.000000  30.500000
    
    #substract only numeric columns 
    df[cols] = (df[cols] - df_average ).round(1)
    print (df)
      Record     City        Date  $/SQFT   DOM
    0    M12  Memphis  01/01/2018  -118.3  -3.7
    1    M13  Memphis  01/02/2018  -104.0  10.0
    2    M14  Memphis  01/03/2018     NaN   NaN
    3    D73   Dallas  01/01/2018    81.7  14.3
    4    D74   Dallas  01/02/2018   104.0 -10.0
    5    D75   Dallas  01/03/2018    31.0  12.5
    6    A19  Atlanta  01/01/2018    36.7 -10.7
    7    A20  Atlanta  01/03/2018   -31.0 -12.5