Search code examples
pythonpandasdataframestring-comparison

Pandas row-wise check if element string starts with string from other element (2 different columns)


I received a .csv file with 2 columns (simplified). One column contains a data and the other a filename. Unfortunately, this filename might be incorrect, which I have to determine by comparing the date and file name.

What I want

# initial situation
d = {
    'call_date': ["20200102-09", "20191203-04", "20200103-10"],
    'filename': ["20200102-09xx.wav", "20200102-10yy.wav", "20200103-10zz.wav"]
}
df = pd.DataFrame(data=d)
print(df)

#      call_date           filename
# 0  20200102-09  20200102-09xx.wav
# 1  20191203-04  20200102-10yy.wav
# 2  20200103-10  20200103-10zz.wav
...

# desired result
print(pd.Series([True, False, True]))
# 0     True
# 1    False
# 2     True
# dtype: bool

With the desired result I can count how many wrong files I have and filter the DataFrame to only contain valid entries.

What I tried

Normally a comparison would work like this:

# True / False
df["call_date"] == df["filename"]
# filter DF
df[df["call_date"] == df["filename"]]

Pandas has a pandas.Series.str.startswith function, however that works only for a single string and not for something such as:

df["filename"].str.startswith(df["call_date"])

# 0   NaN
# 1   NaN
# 2   NaN
# Name: filename, dtype: float64

Question

How do I do a row-wise comparison whether the elements in column "filename" start with the string found in column "call_date"?


Solution

  • Use list comprehension with startswith - output is list, which can be used for filtering by boolean indexing

    m = [x.startswith(y) for x, y in df[['filename','call_date']].values]
    

    Or:

    m = [x.startswith(y) for x, y in zip(df['filename'], df['call_date'])]
    

    print (m)
    [True, False, True]
    

    Another solution, but slowier:

    m = df.apply(lambda x: x['filename'].startswith(x['call_date']), axis=1)
    print (m)
    0     True
    1    False
    2     True
    dtype: bool