
How to extract data from lists as strings, and select data by value, in pandas?


I have a dataframe like this:

col1              col2
[abc, bcd, dog]   [[.4], [.5], [.9]]
[cat, bcd, def]   [[.9], [.5], [.4]]

The numbers in the col2 lists describe the element at the same list index in col1, so ".4" in col2 describes "abc" in col1.

I want to create two new columns: one that keeps only the elements of col1 whose number in col2 is >= .9, and another that holds that number from col2, so ".9" for both rows.

Result:

col3     col4
[dog]   .9
[cat]   .9

I think a route that removes the nested lists from col2 would be fine, but that's harder than it sounds. I've been trying for an hour to remove those brackets.

Attempts:

spec_chars3 = ["[","]"]

for char in spec_chars3: # didn't work, turned everything to nan
    df1['avg_jaro_company_word_scores'] = df1['avg_jaro_company_word_scores'].str.replace(char, '')

df.col2.str.strip('[]') #didn't work b/c the nested list is still in a list, not a string

I haven't even figured out how to pull out the list index and filter col1 on it.


Solution

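  • You can use list comprehensions to populate the new columns according to your criteria. zip pairs each value in col1 with its score in col2, and score[0] unwraps the single-element inner list so it can be compared as a number.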

    df['col3'] = [
        [value for value, score in zip(c1, c2) if score[0] >= 0.9]
        for c1, c2 in zip(df['col1'], df['col2'])
    ]
    df['col4'] = [
        [score[0] for score in c2 if score[0] >= 0.9]
        for c2 in df['col2']
    ]

    Output

                  col1                   col2   col3   col4
    0  [abc, bcd, dog]  [[0.4], [0.5], [0.9]]  [dog]  [0.9]
    1  [cat, bcd, def]  [[0.9], [0.5], [0.4]]  [cat]  [0.9]
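
The question asked for col4 as a plain number (.9) rather than a list. Here is a minimal, self-contained sketch of one way to get that. It assumes the cells hold real Python lists (not strings) and that every inner list of col2 has exactly one number. The names THRESHOLD and flat_scores are illustrative, not part of the original answer. Taking the maximum when several scores qualify is also an assumption, since the question does not say what should happen in that case.

```python
import pandas as pd

# Sample data rebuilt from the question; cells are real lists, not strings.
df = pd.DataFrame({
    "col1": [["abc", "bcd", "dog"], ["cat", "bcd", "def"]],
    "col2": [[[0.4], [0.5], [0.9]], [[0.9], [0.5], [0.4]]],
})

THRESHOLD = 0.9  # illustrative name for the cutoff used in the question

# Remove one level of nesting: [[0.4], [0.5], [0.9]] -> [0.4, 0.5, 0.9]
flat_scores = [[inner[0] for inner in scores] for scores in df["col2"]]

# Keep the col1 values whose score at the same index meets the threshold.
df["col3"] = [
    [value for value, score in zip(values, scores) if score >= THRESHOLD]
    for values, scores in zip(df["col1"], flat_scores)
]

# One scalar per row: the largest qualifying score, or None if there is none
# (assumption: the question does not define the multi-match case).
df["col4"] = [
    max((score for score in scores if score >= THRESHOLD), default=None)
    for scores in flat_scores
]

print(df[["col3", "col4"]])
```

Flattening first keeps the filtering step free of score[0] indexing, and a row with no qualifying score ends up with an empty list in col3 and a missing value in col4.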