Search code examples
pythonpandasdataframedata-analysis

Subtraction between rows with pandas


I'm trying to parse a csv file and print certain timeseries graphs.

About csv file: The csv file contains a lot of data from which I need to handle confirmedInfections of it based on the id inside a for loop. The csv file looks like that:

ID,name,date,confirmedInfections
DE2,BAYERN,2020-02-24,19
DE2,BAYERN,2020-02-25,19
DE2,BAYERN,2020-02-26,21
DE1,BADEN-WÃœRTTEMBERG,2020-02-24,1
DE1,BADEN-WÃœRTTEMBERG,2020-02-25,3
DE1,BADEN-WÃœRTTEMBERG,2020-02-26,7

From that file, I need to parse every row and subtract the confirmedInfections in order to find the daily infections. At this moment if I choose the city with ID DE2 and date 2020-03-01 the infections are a summation from the beggining until the day I have chosen. So, I need to edit this file in order to subtract and find daily cases to print proper time series. For that purpose of subtraction I found out pandas.shift(1) method but it seems that it doesn't work.

My code:

def main(file):
    id_array = ['DE2', 'DE1']
    df = pd.read_csv(file, header='infer', parse_dates=['date'])
    for key in id_array:
        if (df.loc[df['ID'] == key]):
            df['confirmedInfections'] = df['confirmedInfections']-df['confirmedInfections'].shift(1)
    print(df)


main('data.txt')

Desired output:

#For DE2
ID,name,date,confirmedInfections
DE2,BAYERN,2020-02-24,19
DE2,BAYERN,2020-02-25,0
DE2,BAYERN,2020-02-26,2

#For DE1
ID,name,date,confirmedInfections
DE1,BADEN-WÃœRTTEMBERG,2020-02-24,1
DE1,BADEN-WÃœRTTEMBERG,2020-02-25,2
DE1,BADEN-WÃœRTTEMBERG,2020-02-26,4

Error that I get:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Any thoughts?


Solution

  • Considering df is read as:

    df = pd.read_csv(file, header='infer', parse_dates=['date'])
    

    Update "confirmedInfections" col as required diff:

    df["confirmedInfections"] = df.groupby('ID')['confirmedInfections'].diff().fillna(df.confirmedInfections).astype(int)
    

    Explanation Grouping the df by ID and updating rows 'confirmedInfections' column with difference in values between rows then updating the NaNs(first row) of each groupedID with original value. Since the calculations on 'confirmedInfections' resulted in float type hence typecasted the column to int. Finally splitting df into multiple dfs below.

    Split df in multiple dfs

    for id in df.ID.unique():
        print('DataFrame', id)
        print(df.groupby('ID').get_group(id))
    

    Output

    DataFrame DE2
        ID    name        date  confirmedInfections
    0  DE2  BAYERN  2020-02-24                   19
    1  DE2  BAYERN  2020-02-25                    0
    2  DE2  BAYERN  2020-02-26                    2
    DataFrame DE1
        ID               name        date  confirmedInfections
    3  DE1  BADEN-WÃRTTEMBERG  2020-02-24                    1
    4  DE1  BADEN-WÃRTTEMBERG  2020-02-25                    2
    5  DE1  BADEN-WÃRTTEMBERG  2020-02-26                    4