I received a .csv file with 2 columns (simplified). One column contains a data and the other a filename. Unfortunately, this filename might be incorrect, which I have to determine by comparing the date and file name.
# initial situation
d = {
'call_date': ["20200102-09", "20191203-04", "20200103-10"],
'filename': ["20200102-09xx.wav", "20200102-10yy.wav", "20200103-10zz.wav"]
}
df = pd.DataFrame(data=d)
print(df)
# call_date filename
# 0 20200102-09 20200102-09xx.wav
# 1 20191203-04 20200102-10yy.wav
# 2 20200103-10 20200103-10zz.wav
...
# desired result
print(pd.Series([True, False, True]))
# 0 True
# 1 False
# 2 True
# dtype: bool
With the desired result I can count how many wrong files I have and filter the DataFrame to only contain valid entries.
Normally a comparison would work like this:
# True / False
df["call_date"] == df["filename"]
# filter DF
df[df["call_date"] == df["filename"]]
Pandas has a pandas.Series.str.startswith function, however that works only for a single string and not for something such as:
df["filename"].str.startswith(df["call_date"])
# 0 NaN
# 1 NaN
# 2 NaN
# Name: filename, dtype: float64
How do I do a row-wise comparison whether the elements in column "filename" start with the string found in column "call_date"?
Use list comprehension with startswith
- output is list, which can be used for filtering by boolean indexing
m = [x.startswith(y) for x, y in df[['filename','call_date']].values]
Or:
m = [x.startswith(y) for x, y in zip(df['filename'], df['call_date'])]
print (m)
[True, False, True]
Another solution, but slowier:
m = df.apply(lambda x: x['filename'].startswith(x['call_date']), axis=1)
print (m)
0 True
1 False
2 True
dtype: bool