I am using customer transaction data to create several features, which I name as calculated columns. After creating all the calculated columns, I saved the DataFrame as a Parquet file. When I download the Parquet file and read it in Python on my local machine, the values do not match my expectations.
For the same customers, I tried displaying their data in Databricks itself using my original PySpark code, and something strange happened: when I display the output for one customer with only 3 columns, the values are as expected, but when I display more columns, the values are different. This is strange to me.
I even tried caching the DataFrame before displaying it, and the result was worse: neither with 3 columns nor with more than 12 columns do I get the expected values.
I also made sure the data type of calculated_column_9 is the same throughout all the steps.
What could be the reason behind this behavior?
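To make the symptom concrete, here is a minimal plain-Python sketch (no Spark required) of what I suspect is happening: a lazily evaluated, non-deterministic expression gets re-computed on every materialization, so each display can show a different value, while pinning (caching) one evaluation makes the value stable. The `LazyColumn` class here is purely illustrative, not Spark's API.

```python
import random

class LazyColumn:
    """Toy model of a lazily evaluated column: the expression is re-run
    on every materialization unless one evaluation has been pinned."""

    def __init__(self, expr):
        self.expr = expr      # thunk, re-run on each materialization
        self._cached = None

    def materialize(self):
        if self._cached is not None:
            return self._cached   # pinned value, stable across "actions"
        return self.expr()        # fresh evaluation each call

    def cache(self):
        self._cached = self.expr()  # pin a single evaluation
        return self

# A non-deterministic expression, analogous to rand() in Spark SQL.
col = LazyColumn(lambda: random.random())

a, b = col.materialize(), col.materialize()
# a and b differ: two separate "actions" each re-ran the expression.

col.cache()
c, d = col.materialize(), col.materialize()
# c and d are identical: both actions see the pinned evaluation.
```

In real Spark the situation is subtler, because `cache()` is itself lazy and a cached partition can be evicted and recomputed, which may be why caching did not help me here.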
Update: I rewrote my code to avoid some join statements that were not strictly required after the creation of calculated_column_9. The issue no longer exists.