python-3.x pyspark

How can we implement CDC (Change Data Capture) in Apache Spark?


I want to find the difference between the previous day's and the current day's data using Apache Spark (with Python). I am new to it. Any help would be appreciated.

So far I have installed Apache Spark and read the previous day's CSV in PySpark.


Solution

  • First, I assume you have Apache Spark installed on your PC along with Python 3. After setting up your venv, install the pyspark package with pip (pip install pyspark).

    Next, get a Spark session using SparkSession. Then read the CSV files for the previous and current day:

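    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("cdc").getOrCreate()  # reuse or create a session
    old_df = spark.read.csv("/data/2024-02-01.csv", header=True, inferSchema=True)  # previous day
    new_df = spark.read.csv("/data/2024-02-02.csv", header=True, inferSchema=True)  # current day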
    

    After that, you need to join the two DataFrames. Use the join function on the column (or columns) that uniquely identifies a row. A full outer join keeps the rows that exist on only one of the two days.

    Finally, call the withColumn function to add a column that compares the previous-day and current-day values, so you can see what changed in the columns you are interested in.

    Hope this helps. If you still get an error, please share it.