databricks, azure-databricks, spark-notebook, azure-notebooks

Filter DataFrame rows containing special characters


I have a DataFrame with roughly 50K to 100K rows and about 4 columns.

We need to filter the DataFrame to discard any rows containing special characters:

  • 199 Central Avenue
  • 1664 O'block Road
  • 1630 Hahn's Dairy Road
  • N 40 Degrees 36' 15" W -75 Degrees -27' -52"
  • 4061605 North Lat-7538'39* West Long.
  • 40�3'13" North

Only the last row should be filtered out; all other rows should remain in the DataFrame.

How do we filter the DataFrame so that only this last row is excluded?


Solution

  • You can define a function has_special_chars. This function uses a regular expression to detect non-ASCII characters, as shown below:

    import re

    def has_special_chars(s):
        return bool(re.search(r'[^\x00-\x7F]', s))
    

    Apply the function to the DataFrame as shown below:

    filtered_df = df[~df.apply(lambda row: any(has_special_chars(str(cell)) for cell in row), axis=1)]
    

    It filters out rows where any cell contains non-ASCII characters. Here is a sample DataFrame with the defined function applied:

    import pandas as pd
    import re
    
    data = {
        'column1': [
            '199 Central Avenue',
            '1664 O\'block Road',
            '1630 Hahn\'s Dairy Road',
            'N 40 Degrees 36\' 15" W -75 Degrees -27\' -52"',
            '4061605 North Lat-7538\'39* West Long.',
            '40�3\'13" North',
            '40@$%½3\'13" North'
        ],
        'column2': [
            'Some other data',
            'More data',
            'Even more data',
            'Data',
            'Random text',
            'Special char text',
            'cvs'
        ],
        'column3': [
            '123',
            '456',
            '789',
            '012',
            '345',
            '678',
            '243'
        ],
        'column4': [
            'Another column',
            'With some text',
            'Without special char',
            'Text',
            'More text',
            'More special char text',
            'd1'
        ]
    }
    
    df = pd.DataFrame(data)
    
    # Define a function to check for special characters in a string
    def has_special_chars(s):
        return bool(re.search(r'[^\x00-\x7F]', s))
    
    # Apply the function to filter the DataFrame
    filtered_df = df[~df.apply(lambda row: any(has_special_chars(str(cell)) for cell in row), axis=1)]
    
    print(filtered_df)
    

    You will get the output as shown below:

                                            column1          column2 column3               column4
    0                            199 Central Avenue  Some other data     123        Another column
    1                             1664 O'block Road        More data     456        With some text
    2                        1630 Hahn's Dairy Road   Even more data     789  Without special char
    3  N 40 Degrees 36' 15" W -75 Degrees -27' -52"             Data     012                  Text
    4         4061605 North Lat-7538'39* West Long.      Random text     345             More text
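    As a side note, a row-wise `apply` with a Python lambda can be slow at 50K to 100K rows. A vectorized sketch using pandas string methods instead (column names here are illustrative, not from the original data):

    ```python
    import pandas as pd

    df = pd.DataFrame({
        'column1': ['199 Central Avenue', '40\ufffd3\'13" North'],
        'column2': ['Some other data', 'Special char text'],
    })

    # Build one boolean Series per column with vectorized str.contains,
    # then flag a row if any of its cells matched a non-ASCII character.
    mask = df.astype(str).apply(
        lambda col: col.str.contains(r'[^\x00-\x7F]', regex=True)
    ).any(axis=1)

    filtered_df = df[~mask]
    print(filtered_df)
    ```

    This keeps the regex evaluation inside pandas' string machinery rather than calling a Python function once per cell.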

    If it is a Spark DataFrame, you can use the code below:

    import re
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType
    
    # Initialize Spark session
    spark = SparkSession.builder.appName("CreateDataFrame").getOrCreate()
    # Given DataFrame
    data = [
        ('199 Central Avenue', 'Some other data', '123', 'Another column'),
        ('1664 O\'block Road', 'More data', '456', 'With some text'),
        ('1630 Hahn\'s Dairy Road', 'Even more data', '789', 'Without special char'),
        ('N 40 Degrees 36\' 15" W -75 Degrees -27\' -52"', 'Data', '012', 'Text'),
        ('4061605 North Lat-7538\'39* West Long.', 'Random text', '345', 'More text'),
        ('40�3\'13" North', 'Special char text', '678', 'More special char text')
    ]
    
    schema = ["column1", "column2", "column3", "column4"]
    # Create DataFrame
    df = spark.createDataFrame(data, schema)
    # Define the function to check for special characters
    def has_special_chars(s):
        return bool(re.search(r'[^\x00-\x7F]', s))
    # Register the UDF
    has_special_chars_udf = udf(has_special_chars, BooleanType())
    # Apply the UDF to filter the DataFrame
    filtered_df = df.filter(~has_special_chars_udf(df.column1) & 
                            ~has_special_chars_udf(df.column2) & 
                            ~has_special_chars_udf(df.column3) & 
                            ~has_special_chars_udf(df.column4))
    # Show the filtered DataFrame
    filtered_df.show()
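    If the DataFrame has many columns, a UDF-free variant using Spark's built-in `rlike` avoids the Python UDF serialization overhead entirely. A sketch, with a reduced two-column example rather than the original schema:

    ```python
    from functools import reduce

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("FilterNonAscii").getOrCreate()

    df = spark.createDataFrame(
        [('199 Central Avenue', '123'), ('40\ufffd3\'13" North', '678')],
        ["column1", "column2"],
    )

    # rlike is evaluated inside the JVM, so no Python round-trip is needed.
    # OR the per-column conditions together across all columns.
    non_ascii = r'[^\x00-\x7F]'
    has_special = reduce(
        lambda acc, c: acc | col(c).rlike(non_ascii),
        df.columns[1:],
        col(df.columns[0]).rlike(non_ascii),
    )
    filtered_df = df.filter(~has_special)
    filtered_df.show()
    ```

    Because the condition is built from `df.columns`, it adapts automatically to any number of columns instead of listing each one by hand.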