python pandas dataframe pandas-groupby

How to subtract rows of a Pandas dataframe based upon some conditions?


I am performing analysis on this dataset.

After using the code below, I am left with the cleaned version of the data.

covid_df.drop(columns = ["Sno", "Time"], inplace = True)
covid_df["State/UnionTerritory"] = covid_df["State/UnionTerritory"].replace({
    "Bihar****": "Bihar",
    "Maharashtra***": "Maharashtra", 
    "Madhya Pradesh***": "Madhya Pradesh", 
    "Karanataka": "Karnataka",
    "Telangana": "Telengana",
    "Himanchal Pradesh": "Himachal Pradesh",
    "Dadra and Nagar Haveli": "Dadra and Nagar Haveli and Daman and Diu",
    "Daman & Diu": "Dadra and Nagar Haveli and Daman and Diu"
    })

invalid_states = ["Cases being reassigned to states", "Unassigned"]

# Keep only rows whose state is not one of the placeholder labels
covid_df = covid_df[~covid_df["State/UnionTerritory"].isin(invalid_states)]
covid_df = covid_df.groupby(["State/UnionTerritory", "Date"], as_index = False).sum()
covid_df["Date"] = pd.to_datetime(covid_df["Date"])
covid_df.sort_values(by = ["State/UnionTerritory", "Date"], inplace = True)

This cleaned data has the cumulative cases for each State/UnionTerritory for each Date. How can I extract the daily new cases for each State/UnionTerritory?

This is what I have tried:

daily_cases_data = [list(covid_df.iloc[0, 2:])]

for index in range(1, covid_df.shape[0]):
  previous_row = covid_df.iloc[index - 1, :]
  current_row = covid_df.iloc[index, :]

  if previous_row["State/UnionTerritory"] == current_row["State/UnionTerritory"]:
    daily_cases_data.append(list(current_row[2:] - previous_row[2:]))
  else:
    daily_cases_data.append(list(current_row[2:]))

Is there a more efficient way to do this?


Solution

  • Edited answer: use groupby.shift, as shown in "Use pandas.shift() within a group".

    df = pd.DataFrame(
        {
            'state': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c'],
            'cumul': [1, 2, 5, 3, 4, 5, 8, 9, 9]
        }
    )
    df['quantity'] = df['cumul'] - df.groupby('state')['cumul'].shift()
    

    Previous answer:

    You could use shift. For example:

    df = pd.DataFrame({'cumul': [0, 2, 3, 5, 7]})
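    # Subtract the previous row's value; select the column explicitly so this
    # keeps working when the dataframe has more than one column
    df['quantity'] = df['cumul'] - df['cumul'].shift(1)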
    

    quantity will be:

       quantity
    0    NaN
    1    2.0
    2    1.0
    3    2.0
    4    2.0
    

    You can then use fillna, or simply replace the first value of quantity (the NaN) with the first value of cumul.

    Edit: prepare the dataframe by applying your conditions first, then compute the differences :-)
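
To tie the answer back to the question, here is a minimal sketch of the grouped approach applied to several cumulative columns at once. It uses groupby.diff, which is equivalent to subtracting groupby.shift from the current value, and fills the first row of each state with its cumulative value so the result matches the loop in the question. The sample data and the value column names ("Confirmed", "Deaths") are assumptions made for illustration; substitute the columns of the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned covid_df; the value column names
# ("Confirmed", "Deaths") are assumptions, not taken from the question.
covid_df = pd.DataFrame(
    {
        "State/UnionTerritory": ["Bihar", "Bihar", "Bihar", "Kerala", "Kerala"],
        "Date": pd.to_datetime(
            ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-01", "2020-03-02"]
        ),
        "Confirmed": [1, 4, 9, 3, 5],
        "Deaths": [0, 1, 1, 0, 2],
    }
)

value_cols = ["Confirmed", "Deaths"]

# Rows must be ordered by date within each state before differencing.
covid_df = covid_df.sort_values(["State/UnionTerritory", "Date"])

# Difference from the previous row of the same state; the first row of each
# state has no predecessor and comes back as NaN.
daily = covid_df.groupby("State/UnionTerritory")[value_cols].diff()

# Fill those NaNs with the cumulative value itself (as the loop in the
# question does), then restore the integer dtype that NaN forced to float.
daily = daily.fillna(covid_df[value_cols]).astype(int)

# Reattach the key columns to get one row of daily figures per state and date.
daily_df = pd.concat([covid_df[["State/UnionTerritory", "Date"]], daily], axis=1)
print(daily_df)
```

Because the differencing is vectorised per group, this avoids the row-by-row iloc lookups, which tend to be slow on large frames.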