databricks, azure-databricks, spark-notebook, azure-notebooks

Filter DataFrame rows containing special characters


I have a DataFrame with roughly 50K to 100K rows and about 4 columns.

We need to filter the DataFrame to discard any rows containing special characters:

  • 199 Central Avenue
  • 1664 O'block Road
  • 1630 Hahn's Dairy Road
  • N 40 Degrees 36' 15" W -75 Degrees -27' -52"
  • 4061605 North Lat-7538'39* West Long.
  • 40�3'13" North

Only the last row should be filtered out; all other rows should remain in the DataFrame.

How do we filter the DataFrame so that only this last row is excluded?


Solution

  • You can define a function has_special_chars. This function uses a regular expression to detect non-ASCII characters, as shown below:

    import re

    def has_special_chars(s):
        return bool(re.search(r'[^\x00-\x7F]', s))
    

    Apply the function to the DataFrame as shown below:

    filtered_df = df[~df.apply(lambda row: any(has_special_chars(str(cell)) for cell in row), axis=1)]
    

    It filters out rows where any cell contains non-ASCII characters. Here is a sample DataFrame with the defined function applied:

    import pandas as pd
    import re
    
    data = {
        'column1': [
            '199 Central Avenue',
            '1664 O\'block Road',
            '1630 Hahn\'s Dairy Road',
            'N 40 Degrees 36\' 15" W -75 Degrees -27\' -52"',
            '4061605 North Lat-7538\'39* West Long.',
            '40�3\'13" North',
            '40@$%½3\'13" North'
        ],
        'column2': [
            'Some other data',
            'More data',
            'Even more data',
            'Data',
            'Random text',
            'Special char text',
            'cvs'
        ],
        'column3': [
            '123',
            '456',
            '789',
            '012',
            '345',
            '678',
            '243'
        ],
        'column4': [
            'Another column',
            'With some text',
            'Without special char',
            'Text',
            'More text',
            'More special char text',
            'd1'
        ]
    }
    
    df = pd.DataFrame(data)
    
    # Define a function to check for special characters in a string
    def has_special_chars(s):
        return bool(re.search(r'[^\x00-\x7F]', s))
    
    # Apply the function to filter the DataFrame
    filtered_df = df[~df.apply(lambda row: any(has_special_chars(str(cell)) for cell in row), axis=1)]
    
    print(filtered_df)
    

    You will get the output as shown below:

                                            column1          column2 column3               column4
    0                            199 Central Avenue  Some other data     123        Another column
    1                             1664 O'block Road        More data     456        With some text
    2                        1630 Hahn's Dairy Road   Even more data     789  Without special char
    3  N 40 Degrees 36' 15" W -75 Degrees -27' -52"             Data     012                  Text
    4         4061605 North Lat-7538'39* West Long.      Random text     345             More text
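    As a side note, a row-wise `apply` with a Python lambda can be slow at 50K to 100K rows. A vectorized sketch using pandas string methods instead (column names here are illustrative, not from the original data):

    ```python
    import pandas as pd

    df = pd.DataFrame({
        'column1': ['199 Central Avenue', '40\ufffd3\'13" North'],
        'column2': ['Some other data', 'Special char text'],
    })

    # Build one boolean Series per column with vectorized str.contains,
    # then flag a row if any of its cells matched a non-ASCII character.
    mask = df.astype(str).apply(
        lambda col: col.str.contains(r'[^\x00-\x7F]', regex=True)
    ).any(axis=1)

    filtered_df = df[~mask]
    print(filtered_df)
    ```

    This keeps the regex evaluation inside pandas' string machinery rather than calling a Python function once per cell.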

    If it is a Spark DataFrame, you can use the code below:

    import re
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType
    
    # Initialize Spark session
    spark = SparkSession.builder.appName("CreateDataFrame").getOrCreate()
    # Given DataFrame
    data = [
        ('199 Central Avenue', 'Some other data', '123', 'Another column'),
        ('1664 O\'block Road', 'More data', '456', 'With some text'),
        ('1630 Hahn\'s Dairy Road', 'Even more data', '789', 'Without special char'),
        ('N 40 Degrees 36\' 15" W -75 Degrees -27\' -52"', 'Data', '012', 'Text'),
        ('4061605 North Lat-7538\'39* West Long.', 'Random text', '345', 'More text'),
        ('40�3\'13" North', 'Special char text', '678', 'More special char text')
    ]
    
    schema = ["column1", "column2", "column3", "column4"]
    # Create DataFrame
    df = spark.createDataFrame(data, schema)
    # Define the function to check for special characters
    def has_special_chars(s):
        return bool(re.search(r'[^\x00-\x7F]', s))
    # Register the UDF
    has_special_chars_udf = udf(has_special_chars, BooleanType())
    # Apply the UDF to filter the DataFrame
    filtered_df = df.filter(~has_special_chars_udf(df.column1) & 
                            ~has_special_chars_udf(df.column2) & 
                            ~has_special_chars_udf(df.column3) & 
                            ~has_special_chars_udf(df.column4))
    # Show the filtered DataFrame
    filtered_df.show()
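    If the DataFrame has many columns, a UDF-free variant using Spark's built-in `rlike` avoids the Python UDF serialization overhead entirely. A sketch, with a reduced two-column example rather than the original schema:

    ```python
    from functools import reduce

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("FilterNonAscii").getOrCreate()

    df = spark.createDataFrame(
        [('199 Central Avenue', '123'), ('40\ufffd3\'13" North', '678')],
        ["column1", "column2"],
    )

    # rlike is evaluated inside the JVM, so no Python round-trip is needed.
    # OR the per-column conditions together across all columns.
    non_ascii = r'[^\x00-\x7F]'
    has_special = reduce(
        lambda acc, c: acc | col(c).rlike(non_ascii),
        df.columns[1:],
        col(df.columns[0]).rlike(non_ascii),
    )
    filtered_df = df.filter(~has_special)
    filtered_df.show()
    ```

    Because the condition is built from `df.columns`, it adapts automatically to any number of columns instead of listing each one by hand.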