Tags: python, pandas, dataframe, pandas-groupby, rows

pandas combine rows based on conditions


Hi guys, I am working on a dataset like the following example:

[image: sample of the dataset]

The data contains start_time, end_time, id and url. For each (id, url) group I have different "in" and "out" values; the problem is that the in and out values are in different rows, and I want to fill the missing end_time/start_time values. For this I have to use the following logic (a small sketch of these rules on toy data follows the list):

  1. If a row has a start_time but its end_time is null, fill the end_time with the closest end_time such that end_time >= start_time, and delete the used/matched row.
  2. After all rows having a start_time are filled and the used/matched rows are deleted, if some rows still have an empty start_time, fill the start_time with the same value as the end_time.
  3. If no matching end_time is found for a given start_time, fill the end_time with the same value as the start_time.
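
To make the rules concrete, here is a rough pure-Python sketch of the matching logic on a few made-up timestamps for a single (id, url) group (toy values, chosen only for illustration):

    from datetime import datetime

    # toy data: rows that only have start_time ("in") and rows that only
    # have end_time ("out") within one (id, url) group
    starts = [datetime(2021, 8, 25, 15, 23, 37), datetime(2021, 8, 25, 15, 43, 56)]
    ends = [datetime(2021, 8, 25, 15, 23, 28), datetime(2021, 8, 25, 15, 23, 52)]

    pairs = []
    for s in sorted(starts):
        # rule 1: closest end_time with end_time >= start_time
        candidates = [e for e in ends if e >= s]
        if candidates:
            e = min(candidates)
            ends.remove(e)            # delete the used/matched row
            pairs.append((s, e))
        else:
            pairs.append((s, s))      # rule 3: no match, so end_time = start_time
    pairs += [(e, e) for e in ends]   # rule 2: leftover rows get start_time = end_time

    for s, e in pairs:
        print(s.time(), "->", e.time())
    # 15:23:37 -> 15:23:52
    # 15:43:56 -> 15:43:56
    # 15:23:28 -> 15:23:28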

Considering the above, the expected result should be similar to the following. I am giving the output in two stages so that it is easier to understand:

  1. Fill each start_time's matching end_time and delete the used/matched rows: [image: intermediate result]

  2. Final output, after filling the remaining start_time/end_time values: [image: final result]

Currently I am using the following approach, but I feel it is not optimized:

import numpy as np
import pandas as pd

def process(self, param, context):
    df = context['data']
    # df = df.drop_duplicates()
    key_cols = param['keys_cols']
    start_time_col = param['start_time_col']
    end_time_col = param['end_time_col']
    guid_col = param.get('guid_col','guid')
    df_groupby = df.groupby(key_cols).size().reset_index()
    final_dfs = []
    condition = ''
    for key in key_cols:
        if condition == '':
            condition = '(df[\''+str(key)+"\']==row[\'"+str(key)+"\'])"
        else:
            condition = condition + ' & ' +'(df[\'' + str(key) + "\']==row[\'" + str(key) + "\'])"
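    # e.g. for key_cols = ['id', 'url'] the loop above builds the string
    # "(df['id']==row['id']) & (df['url']==row['url'])", which is eval'd below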
    for index, row in df_groupby.iterrows():
        sub_df = df[eval(condition)]
        if sub_df[start_time_col].isnull().sum() != len(sub_df[start_time_col]) and (sub_df[end_time_col].isnull().sum() != len(sub_df[end_time_col])):
            sub_df = sub_df.sort_values([start_time_col, end_time_col], ascending=True)
            subdf_start_time_not_null = sub_df[sub_df[start_time_col].notnull()]
            subdf_end_time_not_null = sub_df[sub_df[end_time_col].notnull()]
            subdf_end_time_not_null['combined'] = subdf_end_time_not_null[end_time_col] +"__"+ subdf_end_time_not_null[guid_col]
            end_time_values = subdf_end_time_not_null['combined'].values.tolist()
            for row_number, (stime_index, stime_row) in enumerate(subdf_start_time_not_null.iterrows()):
                delete_index = row_number
                if row_number < len(end_time_values):
                    end_time_value = np.nan
                    if int(str(subdf_start_time_not_null.at[stime_index,start_time_col]).replace(":","").replace(" ","").replace("-","")) <= int(str(end_time_values[row_number]).split("__")[0].replace(":","").replace(" ","").replace("-","")):
                        end_time_value = end_time_values[row_number]
                        subdf_start_time_not_null.at[stime_index,end_time_col] = str(end_time_values[row_number]).split("__")[0]
                    else:
                        prev_index = end_time_values.index(end_time_values[row_number])
                        for end_time in end_time_values:
                            current_index = end_time_values.index(end_time)
                            if current_index > prev_index:
                                if int(str(subdf_start_time_not_null.at[stime_index,start_time_col]).replace(":","").replace(" ","").replace("-","")) <= int(str(end_time_values[current_index]).split("__")[0].replace(":","").replace(" ","").replace("-","")):
                                    subdf_start_time_not_null.at[stime_index, end_time_col] = end_time_values[current_index]
                                    delete_index = current_index
                                    end_time_value = end_time_values.pop(delete_index)
                                    break
                    subdf_end_time_not_null = subdf_end_time_not_null[subdf_end_time_not_null[guid_col]!=end_time_value.split("__")[1]]
                else:
                    subdf_start_time_not_null.at[stime_index,end_time_col] = subdf_start_time_not_null.at[stime_index,start_time_col]
            subdf_end_time_not_null.drop('combined', axis=1, inplace=True)
            sub_df = pd.concat([subdf_start_time_not_null,subdf_end_time_not_null])
        sub_df[start_time_col] = np.where(sub_df[start_time_col].isnull(),sub_df[end_time_col],sub_df[start_time_col])
        sub_df[end_time_col] = np.where(sub_df[end_time_col].isnull(),sub_df[start_time_col],sub_df[end_time_col])
        final_dfs.append(sub_df)
        # LOGGER.info('do something' +str(index))
    df = pd.concat(final_dfs)
    context['data'] = df
    context['continue'] = True
    return context

where param is as follows (guid_col is optional and defaults to 'guid' in the code above):

param = {"keys_cols":['id', 'url'], "start_time_col":"start_time","end_time_col":"end_time"}

and "df" is the data.

Please help review this and suggest how to make it more optimized. I have more than 70,000 rows of data with more than 12,000 (id, url) pairs in one file.

Looking forward to your suggestions.

Thanks


Solution

  • If I understand the requirements correctly, we can do all of this within pandas. There are essentially two steps here:

    1. use pandas.merge_asof to fill in the nearest end_time
    2. use drop_duplicates to remove the out records we used in step 1
    from io import StringIO

    import pandas as pd

    text = StringIO(
        """
                 id                url type          start_time            end_time
    o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:23:28
    o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:27:34
    o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:23:52
    o6FlbuA_5565423  https://vaa.66new   in 2021-08-25T15:23:37                 NaT
    o6FlbuA_5565423  https://vaa.66new   in 2021-08-25T15:43:56                 NaT  # note: no record with `end_time` after this records `start_time`
    o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:10:29
    o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:25:00
    o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:15:49
    o6FlbuA_5565423  https://vaa.66new   in 2021-08-25T15:33:37 2021-08-25T15:34:37  # additional already complete record
    """
    )
    df = pd.read_csv(text, sep=r"\s+", parse_dates=["start_time", "end_time"], comment="#")
    
    # separate out unmatched `in` records and unmatched `out` records
    df_in_unmatched = (
        df[(df.type == "in") & ~df.start_time.isna() & df.end_time.isna()]
        .drop(columns=["end_time"])
        .sort_values("start_time")
    )
    df_out_unmatched = (
        df[(df.type == "out") & df.start_time.isna() & ~df.end_time.isna()]
        .drop(columns=["type", "start_time"])
        .sort_values("end_time")
    )
    
    # match `in` records to closest `out` record with `out.end_time` >= `in.start_time`
    df_in_matched = pd.merge_asof(
        df_in_unmatched,
        df_out_unmatched,
        by=["id", "url"],
        left_on="start_time",
        right_on="end_time",
        direction="forward",
        allow_exact_matches=True,
    )
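
    # note: merge_asof is a left join: it attaches to each `in` row the nearest
    # `out` end_time at or after its start_time (direction="forward"), but it
    # does not remove matched rows from df_out_unmatched; the drop_duplicates
    # step below takes care of those "used" rows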
    
    # fill in missing `end_time` for records with only `start_time`
    df_in_matched["end_time"] = df_in_matched["end_time"].combine_first(
        df_in_matched["start_time"]
    )
    
    # combine matched records with remaining unmatched and deduplicate
    # in order to remove "used" records
    df_matched = (
        pd.concat([df_in_matched, df_out_unmatched], ignore_index=True)
        .drop_duplicates(subset=["id", "url", "end_time"], keep="first")
        .dropna(subset=["end_time"])
        .fillna({"type": "out"})
    )
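
    # keep="first" keeps the matched `in` rows (concatenated first) and drops the
    # `out` rows whose end_time they consumed; fillna restores the `type` column
    # that was dropped from df_out_unmatched earlier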
    
    # fill in missing `start_time` for records with only `end_time`
    df_matched["start_time"] = df_matched["start_time"].combine_first(
        df_matched["end_time"]
    )
    
    # combine matched records with unprocessed records: i.e. records
    # that had both `start_time` and `end_time` (if extant)
    df_final = pd.concat(
        [df_matched, df.dropna(subset=["start_time", "end_time"])], ignore_index=True
    )
    

    Result:

                   id               url type         start_time             end_time
    0 o6FlbuA_5565423 https://vaa.66new   in 2021-08-25 15:23:37 2021-08-25 15:23:52
    1 o6FlbuA_5565423 https://vaa.66new   in 2021-08-25 15:43:56 2021-08-25 15:43:56
    2 o6FlbuA_5565423 https://vaa.66new  out 2021-08-25 15:10:29 2021-08-25 15:10:29
    3 o6FlbuA_5565423 https://vaa.66new  out 2021-08-25 15:15:49 2021-08-25 15:15:49
    4 o6FlbuA_5565423 https://vaa.66new  out 2021-08-25 15:23:28 2021-08-25 15:23:28
    5 o6FlbuA_5565423 https://vaa.66new  out 2021-08-25 15:25:00 2021-08-25 15:25:00
    6 o6FlbuA_5565423 https://vaa.66new  out 2021-08-25 15:27:34 2021-08-25 15:27:34
    7 o6FlbuA_5565423 https://vaa.66new   in 2021-08-25 15:33:37 2021-08-25 15:34:37
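
As a quick sanity check, you can confirm that no timestamps are left unfilled in the final output:

    # every row should now have both timestamps populated
    assert df_final["start_time"].notna().all()
    assert df_final["end_time"].notna().all()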