Search code examples
pythonregexfor-loopapplyregex-lookarounds

Not able to figure-out where I am making a mistake


I have one data frame column dtc_mined which contains value like below seperated by |

P18A253|P18A0|P18A2|P18A043|P2B61

here it contains some values some of its length is 5 (P18A2) or some of its length is 7 (P18A043) now I want to check if the value of 5 length word is available in the value of 7 length words then the 5 length word should be removed

Below is my expected output

P18A253|P18A043|P2B61

Below is my code which I have tried

import pandas as pd

# Sample DataFrame
data = {'dtc_mined': ['P18A253|P18A0|P18A2|P18A043|P2B61']}
df = pd.DataFrame(data)

# Split the values and create sets of 5 and 7 character words
df['split_values'] = df['dtc_mined'].str.split('|')
df['words_5'] = df['split_values'].apply(lambda lst: set(word for word in lst if len(word) == 5))
df['words_7'] = df['split_values'].apply(lambda lst: set(word for word in lst if len(word) == 7))

# Remove 5-character words that have a corresponding 7-character word
df['filtered_values'] = df.apply(lambda row: '|'.join(word for word in row['split_values'] if len(word) == 7 or word not in row['words_7']), axis=1)

# Drop intermediate columns and display the result
result = df.drop(['split_values', 'words_5', 'words_7'], axis=1)
print(result)

I also tried another approach

# Remove 5-character words that have a corresponding 7-character word

def Check1(row):
    for word in row['words_5']:
        if word not in row['words_7']:
            row['words_7'].add(word)
    return row['words_7']

df['filtered_values'] = df.apply(Check1, axis=1)

Solution

  • The problem in your code is that word not in row['words_7'] allows all the 5-letter words to pass. What you need to check is that none of the words in row['words_7'] start with the 5-letter word i.e.

    df['filtered_values'] = df.apply(lambda row: '|'.join(word for word in row['split_values'] if len(word) == 7 or not any(w2.startswith(word) for w2 in row['words_7'])), axis=1)
    

    This is probably most easily done by doing everything in a function and applying that to the values:

    def filter_words(ll):
        words = ll.split('|')
        w7 = set(w for w in words if len(w) == 7)
        return '|'.join(w for w in words if w in w7 or not any(w2.startswith(w) for w2 in w7))
    

    This function forms a set of the 7 letter words, then filters the word list based on whether the word is in that set, or none of the words in the set start with the same letters.

    To use, just apply:

    df['filtered_values'] = df['dtc_mined'].apply(filter_words)
    

    Output (for your sample data):

                               dtc_mined        filtered_values
    0  P18A253|P18A0|P18A2|P18A043|P2B61  P18A253|P18A043|P2B61