
Most efficient method to modify values within large dataframes - Python


Overview: I am working with pandas dataframes of census information. They have only two columns, but they are several hundred thousand rows long. One column is a census block ID number and the other is a 'place' value, which is unique to the city in which that census block resides.

Example Data:

    BLOCKID          PLACEFP
0    60014001001000  53000
1    60014001001001  53000
...
5844 60014099004021  53000
5845 60014100001000    
5846 60014100001001
5847 60014100001002  53000     

Problem: As shown above, several place values are blank even though their rows have a census block ID. In several instances, I found that a census block missing its place value is located within the same city as the surrounding blocks that do have one, especially when the bookend place values are the same. In the example above, the two blocks at index 5845 and 5846 are in the same general area as the blocks around them (index 5844 and 5847) and just seem to be missing the place value.

Goal: I want to be able to go through this dataframe, find these instances and fill in the missing place value, based on the place value before the missing value and the place value that immediately follows.

Current State & Obstacle: I wrote a loop that goes through the dataframe to correct these issues, shown below.

import pandas

current_state_blockid_df = pandas.DataFrame({'BLOCKID':[60014099004021,60014100001000,60014100001001,60014100001002,60014301012019,60014301013000,60014301013001,60014301013002,60014301013003,60014301013004,60014301013005,60014301013006], 
'PLACEFP': [53000,'','',53000,11964,'','','','','','',11964]})

for i in current_state_blockid_df.index:
    if current_state_blockid_df.loc[i, 'PLACEFP'] == '':
        #Get value before blank
        prior_place_fp = current_state_blockid_df.loc[i - 1, 'PLACEFP']
        next_place_fp = ''
        _n = 1

        # Find the end of the blank section
        while next_place_fp == '':
            next_place_fp = current_state_blockid_df.loc[i + _n, 'PLACEFP']
            if next_place_fp == '':
                _n += 1

        # if the blanks could likely be in the same city, assign them the city's place value
        if prior_place_fp == next_place_fp:
            # the blank run covers rows i through i + _n - 1
            for _i in range(i, i + _n):
                current_state_blockid_df.loc[_i, 'PLACEFP'] = prior_place_fp

However, as expected, it is very slow when dealing with hundreds of thousands of rows of data. I have considered using something like ThreadPoolExecutor to split up the work, but I haven't quite figured out the logic I'd use to get that done. One way to speed it up slightly is to skip the check for where the gap ends and just fill it in with whatever the previous place value was before the blanks. While that may end up being my go-to, there's still a chance it's too slow, and ideally I'd like it to fill in only if the before and after values match, eliminating the possibility of a block being mistakenly assigned. If someone has another suggestion as to how this could be achieved quickly, it would be very much appreciated.


Solution

  • You can use shift to help speed up the process. However, this only handles isolated blanks; it doesn't cover cases where there are multiple blanks in a row.

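    # Assumes blanks are NaN; if they are empty strings, convert them first, e.g.
    # df['PLACEFP'] = df['PLACEFP'].replace('', float('nan'))
    df['PLACEFP_PRIOR'] = df['PLACEFP'].shift(1)   # value from the row above
    df['PLACEFP_AFTER'] = df['PLACEFP'].shift(-1)  # value from the row below
    
    criteria1 = df['PLACEFP'].isnull()
    criteria2 = df['PLACEFP_PRIOR'] == df['PLACEFP_AFTER']
    df.loc[criteria1 & criteria2, 'PLACEFP'] = df.loc[criteria1 & criteria2, 'PLACEFP_PRIOR']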
    

    If you end up needing to iterate over the dataframe, use df.itertuples. You can access the column values in the row via dot notation (row.column_name).

    for row in df.itertuples():
        # row.Index is the index label; columns are row.BLOCKID, row.PLACEFP
        # logic goes here
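
For runs of several blanks in a row, one possible vectorized approach (not part of the answer above, so treat it as a sketch) is to compare a forward fill with a backward fill and fill only where the two agree. It assumes blanks are stored as empty strings, as in the question's sample frame; the names `fill_matching_gaps`, `before`, `after` and `fillable` are made up for illustration.

```python
import pandas as pd


def fill_matching_gaps(df, col="PLACEFP"):
    """Fill runs of blanks only when the values on both sides of the run agree."""
    out = df.copy()
    # Treat empty strings as missing so ffill/bfill can skip over them.
    values = out[col].mask(out[col] == "")
    before = values.ffill()  # nearest non-blank value above each row
    after = values.bfill()   # nearest non-blank value below each row
    # A blank row is fillable when its two bookend values match.
    # Leading/trailing blanks have NaN on one side, so they never match.
    fillable = values.isna() & (before == after)
    out.loc[fillable, col] = before[fillable]
    return out


# Small illustrative frame (BLOCKID values are placeholders).
df = pd.DataFrame({
    "BLOCKID": list(range(1, 10)),
    "PLACEFP": [53000, "", "", 53000, 11964, "", 22222, "", ""],
})
result = fill_matching_gaps(df)
```

Here the two blanks between the 53000 rows are filled, the blank between 11964 and 22222 is left alone because the bookends differ, and the trailing blanks stay blank because nothing follows them. Depending on the pandas version, the filled values may come back as floats rather than ints.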