pyspark, databricks, azure-databricks

PySpark code shows different values when displaying the DataFrame, but only for some customers


I am using customer transaction data to create several features, which I store as calculated columns. After creating all the calculated columns, I saved the DataFrame as a Parquet file. When I download the Parquet file and read it in Python on my local machine, the values do not match what I expect.

For the same customers, I tried displaying their data in Databricks itself using my original PySpark code, and something strange happened. When I display the output for one customer with only 3 columns, the values are as expected, but when I display more columns, the values are different.

I also tried caching the DataFrame before displaying it, and the result was even worse: neither the 3-column display nor the display with more than 12 columns gives the expected result.

I also made sure that the data type of calculated_column_9 stays the same throughout all the steps.

What could be the reason behind this behavior?


Solution

  • I rewrote my code to avoid some join statements that were not actually required after calculated_column_9 is created. The issue no longer occurs.