Search code examples
pythonpandasdataframedata-cleaning

stripping one column to create a new column through query


I have a large pandas data frame where one of the columns is wrapped in double quotes and some of the rows have multiple values, so they are comma separated. I want to extract these values from the names column, by stripping the quotes and then iterate through each name (we don't have to skip the blank ones since the query result will just be blank) and get a query result that will be placed in a new column, also comma separated. The other thing is that I don't want any duplicates in my query results listed. I don't need help with how to get the query results, more so the other parts in between like the stripping of quotes and then making sure the results are placed in the correct places with no duplicates.

(simplified) original pandas dataset

 ID1     ID2       names         
 01      01         "John"         
 01      02         "Kate, Ashten"       
 01      03        
 01      04         "Emily, Cathy, Joy"       

desired dataset

 ID1     ID2       names                  query_result_fav_color         
 01      01         "John"                    "pink"    
 01      02         "Kate, Ashten"            "blue"
 01      03
 01      04         "Emily, Cathy, Joy"       "red, green, blue"

what I've tried

values=original_df['names'] # series 
values=values.to_frame()
values = values.apply(lambda x: x.str.replace('"',''))


#do the query (get a 'results' list)

results= ["pink", "blue", "blue", "NA", "red", "green", "blue"]

#put the results from the query in a new column that corresponds to the correct row (need help with this)
 ID1     ID2       names                  query_result_fav_color         
 01      01         "John"                    "pink"    
 01      02         "Kate, Ashten"            "blue", "blue"
 01      03                                   "NA"
 01      04         "Emily, Cathy, Joy"       "red", "green", "blue"

# remove duplicates in same row, and make sure only 1 set of quotes around the results (need help with this)


EDIT

#this works in the sample, but when I try it on my actual data set, I'm having some issues:

df = df.assign(names = df["names"].str.split(",")).explode(column="names").assign(query_result_fav_color=results).groupby(["ID1", "ID2"])["names", "query_result_fav_color"].agg(list)

df.assign(
    names=df["names"].apply(lambda x: ", ".join(x) if x else ""),
    query_result_fav_color=df["query_result_fav_color"].apply(lambda x: ", ".join(x))
).reset_index()

#this is what my dataset looks like when I try the solution (please focus on rows 3-5). It ends up duplicating whenever there is more than 1 entry and then will continue listing the other color in the next row. 
-------------------------------------------------
    ID1 ID2 names               query_result_fav_color
0   01  01  John                pink
1   01  02  Kate, Ashton        blue, blue
2   01  03                      NA
3   01  04  Emily, Cathy, Joy   red, red, red
4   01  05  Jason               green
5   01  06                      blue  

-------------------------------------------------

Solution

  • Assuming that you already replaced '"' with '' (not necessary for the code to run) in your frame so that you have the following data frame and list:

    import pandas as pd
    
    df = pd.DataFrame({"ID1": ["01"] * 4, "ID2": ["01", "02", "03", "04"], "names": ["John", "Kate, Ashton", "", "Emily, Cathy, Joy"]})
    results= ["pink", "blue", "blue", "NA", "red", "green", "blue"]
    

    You can convert the values from 'names' to lists and explode your data frame. Then you just have to assign your list to the exploded frame (which matches the length of the list now) and re-group it back by your 'ID' columns. For this, you do not really need another package besides of pandas. Just do something like this:

    df = df.assign(names = df["names"].str.split(",")).explode(column="names").assign(query_result_fav_color=results).groupby(["ID1", "ID2"])["names", "query_result_fav_color"].agg(list)
    
    df.assign(
        names=df["names"].apply(lambda x: ", ".join(x) if x else ""),
        query_result_fav_color=df["query_result_fav_color"].apply(lambda x: ", ".join(x))
    ).reset_index()
    
    -------------------------------------------------
        ID1 ID2 names               query_result_fav_color
    0   01  01  John                pink
    1   01  02  Kate, Ashton        blue, blue
    2   01  03                      NA
    3   01  04  Emily, Cathy, Joy   red, green, blue
    -------------------------------------------------
    

    If you want to have an one-liner that does this, here you go:

    (
        df.assign(query_result_fav_color=df.assign(names = df["names"].str.split(","))
               .explode(column="names").drop(columns=["names"])
               .assign(query_result_fav_color=results)
               .groupby(["ID1", "ID2"])["query_result_fav_color"]
               .agg(list).apply(lambda x: ", ".join(x) if x else "").values)
    )