I need to check several DataFrames. If a DataFrame does not contain a match for a regular expression, I need to discard it. I don't know which column the match would be in.
How can I check an entire DataFrame for a regular expression match, without looping over each column?
This is how I do it now:
import pandas as pd
import numpy as np
import re
import codecs
# read file
folder = 'folder_path'
file = 'file_name.html'
html_df = pd.read_html(folder + '/' + file)
# check dataframes
html_match = re.compile(r'_TOM$|_TOD$')
# add DF number with html_match
df_check = []
for i, df in enumerate(html_df):
    for col in df.columns:
        try:
            if len(df[df[col].str.contains(html_match) == True]) != 0:
                df_check.append(i)
            else:
                continue
        except AttributeError:
            continue
The logic is not fully clear, but if I understand correctly, you want to filter the output of read_html
(which is a list of DataFrames) to keep only those that contain a specific match:
import numpy as np
import pandas as pd
html_df = [pd.DataFrame([['A', 'B', 'C_TOM'], ['D', 'E', 'F']]),
           pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F']]),
           pd.DataFrame([['A', 'B_TOD', 'C'], ['D', 'E', 'F']]),
           ]
out = []
for d in html_df:
    if np.any(d.apply(lambda s: s.str.contains(r'_TOM$|_TOD$'))):
        out.append(d)
Or as a list comprehension:
out = [d for d in html_df
       if np.any(d.apply(lambda s: s.str.contains(r'_TOM$|_TOD$')))]
Output:
[   0  1      2
 0  A  B  C_TOM
 1  D  E      F,
    0      1  2
 0  A  B_TOD  C
 1  D      E  F]
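One caveat: tables returned by read_html often contain numeric columns, and `.str` raises AttributeError on those (which is what the try/except in the question works around). A more defensive sketch, assuming it is acceptable to compare every cell as text, flattens each DataFrame into a single Series so one `str.contains` call covers all cells. The helper names `has_match` and `matching_indices` are made up for illustration and are not part of pandas; the second one returns positions, like `df_check` in the question:

```python
import pandas as pd

# Same suffix pattern as in the answer above.
PATTERN = r'_TOM$|_TOD$'


def has_match(df, pattern=PATTERN):
    """Return True if any cell of df, rendered as text, matches pattern."""
    # astype(str) makes numeric cells searchable as text; ravel() flattens
    # the 2-D values so a single str.contains call covers every cell.
    cells = pd.Series(df.astype(str).to_numpy().ravel(), dtype=object)
    # na=False treats any missing value as "no match".
    return bool(cells.str.contains(pattern, na=False).any())


def matching_indices(frames, pattern=PATTERN):
    """Return the positions of the frames in which at least one cell matches."""
    return [i for i, df in enumerate(frames) if has_match(df, pattern)]


frames = [
    pd.DataFrame([['A', 'B', 'C_TOM'], ['D', 'E', 'F']]),
    pd.DataFrame([[1, 2.5, None], ['D', 'E', 'F']]),  # mixed types, no match
    pd.DataFrame([['A', 'B_TOD', 'C'], ['D', 3, 'F']]),
]
print(matching_indices(frames))  # [0, 2]
```

To keep the DataFrames themselves rather than their positions, `[df for df in frames if has_match(df)]` works the same way. There is still a loop over the list of DataFrames, but none over the columns.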