Search code examples
pythonpandasmultiple-columns

Pandas: Create a new column with coulmn name and cell of matching string


enter image description here

I am searching through a large spreadsheet with 300 columns and over 200k rows. I would like to create a column that has the the column header and matching cell value. Some thing that looks like "Column||Value." I have the search term and the join aggregator. I can get the row index name but I'm struggling getting the matching column and specific cell. Here's me code so far

df = pd.read_excel (r"Test_file")

mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann','Midm'])).any(1)

df['extract'] = df.loc[mask] #This only give me the index name. I would like the actual matched cell contents.

df['extract2'] = Column name

df['Match'] = df[['extract', 'extract2']].agg('||'.join.axis=1)

df.drop(['extract', 'extract2'], axis=1)

Final output should look something like Output


Solution

  • You can create a mask for a specific column first (I edited your 2nd line a bit), then create a new 'Match' column with all values initialized to 'No Match', and finally, change the values to your desired format ("Column||Value") for rows that are returned after applying the mask. I implemented this in the following sample code:

    def match_column(df, column_name):
        column_mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann','Midm']))[column_name]
        df['Match'] = 'No Match'
        df.loc[column_mask, 'Match'] = column_name + ' || ' + df[column_name]
        return df
    
    df = {
        'Segment': ['Government', 'Government', 'Midmarket', 'Midmarket', 'Government', 'Channel Partners'],
        'Country': ['Canada', 'Germany', 'France', 'Canada', 'France', 'France']
    }
    df = pd.DataFrame(df)
    display(df)
    
    df = match_column(df, 'Segment')
    display(df)
    

    Output:

    enter image description here

    However, this only works for a single column. I don't know what output you want for cases when there are matches in multiple columns (if you can, please specify).

    UPDATE:

    If you want to use a list of columns as input and match with the first instance, you can use this instead:

    def match_first_column(df, column_list): 
        df['Match'] = 'No Match'
        # iterate over rows
        for index, row in df.iterrows():
            # iterate over column names
            for column_name in column_list:
                column_value = row[column_name]
                substrings = ['Chann', 'Midm', 'Fran']
                # if a match is found
                if any(x in column_value for x in substrings):
                    # add match string
                    df.loc[index, 'Match'] = column_name + ' || ' + column_value
                    # stop iterating and move to next row
                    break
        return df
    
    df = {
        'Segment': ['Government', 'Government', 'Midmarket', 'Midmarket', 'Government', 'Channel Partners'],
        'Country': ['Canada', 'Germany', 'France', 'Canada', 'France', 'France']
    }
    df = pd.DataFrame(df)
    display(df)
    
    column_list= df.columns.tolist() 
    match_first_column(df, column_list)
    

    Output:

    enter image description here