
How to check an entire pd.DataFrame for a regular expression match?


I need to check several DataFrames. If a DataFrame does not contain a match for a regular expression, I need to discard it. I don't know which column the match will be in.

How can I check a whole DataFrame for a regular expression match, without looping over each column?

This is how I do it now:

import pandas as pd
import numpy as np
import re
import codecs

# read file
folder = 'folder_path'
file = 'file_name.html'
html_df = pd.read_html(folder + '/' + file)

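# check dataframes
html_match = re.compile(r'_TOM$|_TOD$')
# collect the index of every DataFrame that contains a match
df_check = []
for i, df in enumerate(html_df):
    for col in df.columns:
        try:
            # na=False counts missing values as non-matches
            if df[col].str.contains(html_match, na=False).any():
                df_check.append(i)
                break  # one matching column is enough; avoids duplicate indices
        except AttributeError:
            # non-string columns have no .str accessor
            continue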

Solution

  • The logic is not fully clear, but if I understand correctly, you want to filter the output of read_html (which is a list of DataFrames) and keep only those that contain a match for the pattern:

    import numpy as np
    import pandas as pd
    
    html_df = [pd.DataFrame([['A', 'B', 'C_TOM'], ['D', 'E', 'F']]),
               pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F']]),
               pd.DataFrame([['A', 'B_TOD', 'C'], ['D', 'E', 'F']]),
              ]
    
    out = []
    
    for d in html_df:
        if np.any(d.apply(lambda s: s.str.contains(r'_TOM$|_TOD$'))):
            out.append(d)
    

    Or as a list comprehension:

    out = [d for d in html_df
           if np.any(d.apply(lambda s: s.str.contains(r'_TOM$|_TOD$')))]
    

    Output:

    [   0  1      2
     0  A  B  C_TOM
     1  D  E      F,
        0      1  2
     0  A  B_TOD  C
     1  D      E  F]
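
One caveat for real read_html output: the .str accessor raises AttributeError on non-string columns (numeric ones, for example), which is what the try/except in the question guards against. Below is a minimal sketch that sidesteps this, assuming it is acceptable to compare every cell by its string form. The names contains_match, frames, kept and kept_idx are illustrative and not part of the original code.

```python
import pandas as pd

PATTERN = r'_TOM$|_TOD$'

def contains_match(df, pattern=PATTERN):
    """Return True if any cell of df, viewed as text, matches pattern."""
    # Assumption: casting to str is fine here, so numeric columns
    # no longer break the .str accessor.
    as_text = df.astype(str)
    # Column-wise boolean mask; na=False counts missing values as non-matches.
    hits = as_text.apply(lambda s: s.str.contains(pattern, na=False))
    # axis=None reduces over rows and columns to a single boolean.
    return bool(hits.any(axis=None))

# Sample data with a numeric column mixed in.
frames = [pd.DataFrame([['A', 1, 'C_TOM'], ['D', 2, 'F']]),
          pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F']]),
          pd.DataFrame([['A', 'B_TOD', None], ['D', 'E', 'F']]),
         ]

# Keep the matching DataFrames, or just their positions in the list.
kept = [d for d in frames if contains_match(d)]
kept_idx = [i for i, d in enumerate(frames) if contains_match(d)]
```

kept_idx plays the role of df_check in the question, while kept corresponds to out above.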