Search code examples
pythonpandasdataframedata-cleaning

Using iloc to replace a range of rows in the dataframe is not saved when assigning that index range to the original dataframe


I have a dataframe with the following structure:

date    position    quantity    Unit
0   2015-01-01 00:00:00 1   720 MAW

Based on the conditions I set I wish to reorder dates during the daylight savings time in October (1 hour gets added here) based on the "position" column. So I subseted the dataframe and got the index values back:

indexData = data[(data.date.dt.hour >= 2) & (data.date.dt.hour < 3) & (data.date.dt.day == 25) & (data.date.dt.month == 10) & (data.date.dt.year == 2015)].index

So when I do this I can see that the range is ordered correctly (but not yet replaced in the dataframe):

data.iloc[indexData[0]:(indexData[-1])+1].sort_values("position")

date    position    quantity    Unit
28521   2015-10-25 02:00:00 25305   420 MAW
28523   2015-10-25 02:15:00 25306   418 MAW
28524   2015-10-25 02:30:00 25307   415 MAW
28526   2015-10-25 02:45:00 25308   415 MAW
28520   2015-10-25 02:00:00 25309   410 MAW
28522   2015-10-25 02:15:00 25310   407 MAW
28525   2015-10-25 02:30:00 25311   399 MAW
28527   2015-10-25 02:45:00 25312   378 MAW

BUT when I try to assign this range to the original data drame it does not change:

data.iloc[indexData[0]:(indexData[-1])+1] = data.iloc[indexData[0]:(indexData[-1])+1].sort_values("position")

# here I check the range if updated
data[(data.date.dt.hour >= 2) & (data.date.dt.hour < 3) & (data.date.dt.day == 25) & (data.date.dt.month == 10) & (data.date.dt.year == 2015)]

date    position    quantity    Unit
28520   2015-10-25 02:00:00 25309   410 MAW
28521   2015-10-25 02:00:00 25305   420 MAW
28522   2015-10-25 02:15:00 25310   407 MAW
28523   2015-10-25 02:15:00 25306   418 MAW
28524   2015-10-25 02:30:00 25307   415 MAW
28525   2015-10-25 02:30:00 25311   399 MAW
28526   2015-10-25 02:45:00 25308   415 MAW
28527   2015-10-25 02:45:00 25312   378 MAW

Any ideas where I am going wrong or should try?


Solution

  • Let's assume we have data like the following and we want to sort by position rows only for October (basically we want to swap lines 4 and 5):

    date    position    quantity    Unit
    0   2015-09-25 02:15:00 25310   407 MAW
    1   2015-09-25 02:00:00 25309   410 MAW
    2   2015-09-25 02:30:00 25311   399 MAW
    3   2015-09-25 02:45:00 25312   378 MAW
    4   2015-10-25 02:15:00 25306   418 MAW
    5   2015-10-25 02:00:00 25305   420 MAW
    6   2015-10-25 02:30:00 25307   415 MAW
    7   2015-10-25 02:45:00 25308   415 MAW
    

    Choosing October with some other filters you applied (basically using your logic here):

    subset_index = data[(data.date.dt.hour >= 2) & (data.date.dt.hour < 3) & (data.date.dt.day == 25) & (data.date.dt.month == 10) & (data.date.dt.year == 2015)].index.values
    

    subset_index is array([4, 5, 6, 7]) - these are rows we want to sort.

    Having extracted index values to sort, you have two options:

    Option 1: keep the index and reassign values in the right order(right index).

    Just assign the values(numpy array) in the correct order to a dataframe subset:

    data.loc[subset_index] = data.loc[subset_index].sort_values("position").values
    

    The result is

    date    position    quantity    Unit
    0   2015-09-25 02:15:00 25310   407 MAW
    1   2015-09-25 02:00:00 25309   410 MAW
    2   2015-09-25 02:30:00 25311   399 MAW
    3   2015-09-25 02:45:00 25312   378 MAW
    4   2015-10-25 02:00:00 25305   420 MAW
    5   2015-10-25 02:15:00 25306   418 MAW
    6   2015-10-25 02:30:00 25307   415 MAW
    7   2015-10-25 02:45:00 25308   415 MAW
    

    As you can see only October rows have been sorted by position, September rows 0 and 1 were left unsorted.

    Option 2: sort the index in the right order and reindex the dataframe.

    Create a new sorted index order for this subset

    reordered_subset_index = data.loc[subset_index].sort_values("position").index.values
    

    reordered_subset_index is array([5, 4, 6, 7]). Now we need to create a correct index order for the whole dataframe:

    new_index = data.index.values
    new_index[subset_index] = new_index[reordered_subset_index]
    

    new_index is array([0, 1, 2, 3, 5, 4, 6, 7]). This index defines the correct order of rows in our initial dataframe.

    So now we can reindex our dataframe and just sort it by our new index:

    data.reindex(new_index).sort_index()
    

    The result is the same as in Option 1.