Search code examples
pandasnumericalphadrop

How to drop pandas column containing both alpha and numeric values across all records


I have a pandas dataframe called df of about 2 million records. There is a column called transaction_id that might contain:

  • alpha values (e.g. "abscdwew") for some records
  • numeric values (e.g. "123454") for some records
  • both alpha and numeric values (e.g. "asd12354") for some records
  • alpha, numeric and special characters (e.g. "asd435_!") for some records
  • special characters (e.g. "_-!")

I want to drop that column if ALL values (i.e. across ALL records) contain:

  • combination of alpha and numeric values (e.g. "aseder345")
  • combination of alpha and special characters (e.g. "asedre_!")
  • combination of numeric and special characters (e.g. "123_!")
  • all special characters (e.g. "!")

Is there a pythonic way of doing so?

So, if a column contains across al


Solution

  • Given the following toy dataframe, in which col1 should be removed and col2 should be kept according to your criteria:

    import pandas as pd
    
    df = pd.DataFrame(
        {
            "col1": [
                "abs@&wew",
                "123!45!4",
                "asd12354",
                "asdfzf_!",
                "123_!",
                "asd435_!",
                "_-!",
            ],
            "col2": [
                "abscdwew",
                "123454",
                "asd12354",
                "a_!sdfzf",
                "123_!",
                "asd435_!",
                "_-!",
            ],
        }
    )
    

    Here is one way to do it:

    test = lambda x: True if x.isalpha() or x.isdigit() else False
    cols_to_keep = df.apply(lambda x: any(test(x) for x in x))
    
    df = df.loc[:, cols_to_keep]
    
    print(df)
    # Output
           col2
    0  abscdwew
    1    123454
    2  asd12354
    3  a_!sdfzf
    4     123_!
    5  asd435_!
    6       _-!