Search code examples
pythonexcelpandasdataframestring-search

pandas, python, excel, search for substring in column of df 1 to write string to column in df2


I am using the package pandas in python to work with and read and write to excel spreadsheets. I have created 2 different dataframes (df1 and df2) that have cells that are all of the data type string. df1 has over 50,000 rows. There are many cells in each column of df1 that were “Nan”and I have converted to to a string that says “Empty”. df2 has over 9000 rows. Every row in “WHSE_Nbr” and “WHSE_Desc_HR” contains an accurate string value. Only some rows have values other than the string “Empty” in the last 2 columns of df2. The “Warehouse” column in df1 has many cells containing names with only words. The rows of the “Warehouse” column in df1 that I am interested in identifying are the ones that contain any of the warehouse numbers that are found in df2 in the “WHSE_Nbr” column.

Example of dataframe1 - df1
Job         Warehouse          GeneralDescription      Purpose
Empty       AP                 Accounts Payable        Accounting
Empty       Empty              Empty                   Empty
Empty       Cyber Security GA  Security & Compliance   Data Security
Empty       Merch|04-1854      Empty                   Empty
Empty       WH -1925           Empty                   Empty
Empty       Montreal-10        Empty                   Empty
Empty       canada| 05-4325    Empty                   Empty

        Example of dataframe2 - df2


WHSE_Nbr    WHSE_Desc_HR         WHSE_Desc_AD    WHSE_Abrv
1           Technology                           Tech
2           Finance                 
...         ...                 
10          Recruiting           Campus Outreach
1854        Community Relations
...         ...
1925        HumanResources
4325        Global People
9237        International Tech                          

Example of dataframe2 df2

So I want to iterate through all rows of the “Warehouse Column” of df1 to search for WHSE numbers that appear in the WHSE_Nbr column of df2. In this example, I would want my code to find 1854 in the “Warehouse” column of df1 and map that number to the associated cell in the WHSE_Desc_HR column of df2 and write “Community Relations” in the “GeneralDescription” column of df1 (to same row that contains substring “1854” In Warehouse column. And it would also write “Human Resources” to Warehouse column in same row substring “1925” appears in the Warehouse column. When the iteration reaches “Montreal 10”, I would want my code to write “Campus Outreach” to the GeneralDescription column of df1 since if there is a value in WHSE_Desc_AD of df2, this will serve as an override to what is in column “WHSE_Desc_HR” of df2. I have become familiar enough with pandas to read excel files (.xlsx) and make the data frames and change datatypes within the data frame for iteration purposes, view the data frames , but can’t figure out the most effective and efficient way to structure this code to accomplish this goal. I had to edit this question just now because i realized i left out something very important. Any time a number appears in the Warehouse column, the number I want to match always follows a hyphen or dash (-). So in df1, the Warehouse row that says"canada | 05-4325" should recognize 4325, match it with df2, and write "Global People" to the GeneralDescription column in df1. sorry guys. Help is so much appreciated and the two answers below make a very good start. Thanks

import pandas as pd

excel_file='/Users/cbri/anaconda3/WHSE_gen.xlsx'
df1 = pd.read_excel(excel_file, usecols [1,5,6,7])
excel_file='/Users/cbri/PycharmProjects/True_Dept/HR_excel.xlsx'
df2 = pd.read_excel(excel_file)
df1=df1.replace(np.nan, "Empty",regex=True)
df2=df2.replace(np.nan, "Empty",regex=True)
df1=pd.DataFrame(df1, dtype='str')
df2=pd.DataFrame(df2, dtype='str')

#yeah i need a push in the right direction, guess i should use ieriterms()?
for column in df1:
     if (df1['Warehouse'])    
#so i got as far as returning all records that contained the substring "1854" but obviously that's without the for and if statement above
     df1[df1['Warehouse'].str.contains("1854", na=False)]

Solution

  • What I would do is write a regex expression to extract the numbers from your column join the tables, and maybe do the rest in excel... (the column updates)

    df1 = pd.DataFrame({'Department' : ['Merch - 1854', '1925 - WH','Montreal 10'],'TrueDeparment' : ['Empty','empty','empty']})
    df2 = pd.DataFrame({'Dept_Nbr' : [1854, 1925, 10], 'Dept_Desc_HR' : ['Community Relations','Human Resources','Recruiting']})
    

    Then here you can try what the function does:

    line = 'Merch - 1854 '
    match = re.search(r'[0-9]+', line)
    if match is None:
        print(0)
    else:
        print(int(match[0]))
    

    If you need the match after a character as specified in your comment use this one:

    line = '12125 15151 Merch -1854 '
    match = re.search(r'(?<=-)[0-9]+', line)
    if match is None:
        print(0)
    else:
        print(int(match[0]))
    

    Note that if you have spaces or other characters after the "-" you need to add it to the regex to work!

    Important - you suppose that you only have one number in your text - if not it returns 0 you can change it as you wish the point is that at least it doesn't fail

    Write the function:

    def extract_number(field):
        match = re.search(r'(?<=-)[0-9]+', field)
        if match is None:
             return 0
        else:
             return int(match[0])
    

    Apply to dataframe:

     df1['num_col'] = df1[['Department']].apply(lambda row:extract_number(row['Department']),axis=1)
    

    Lastly do the join:

    df1.merge(df2, left_on = ['num_col'], right_on = ['Dept_Nbr'])
    

    From here you can figure out which column you need whether here in Python, or in excel.