Tags: apache-spark, random, pyspark, databricks, apache-spark-mllib

Random split generates different splits if dataframe columns are reversed


I've been playing with Databricks notebooks and I'm running into a weird issue. The logic is that I read parsed_points_df from a file, cache it, and then create a DataFrame out of it. However, depending on the column order, MLlib's randomSplit() generates different datasets, and as a result the average of the label column is also different. Since averaging is commutative and associative, shouldn't the result be the same regardless of column order?

I'm not sure what exactly the issue is. I've looked at different blogs and tried different techniques such as caching and repartitioning, but nothing seems to work.

Code snippets 1 and 2 are below:

Code Snippet 1

  parsed_data_df = parsed_points_df.select(parsed_points_df['labels'] - min_year,
                                           'features') \
                                   .withColumnRenamed('(labels - 1922.0)', 'label')  # COLUMN ORDER 1

  weights = [.8, .1, .1]
  seed = 42
  parsed_train_data_df, parsed_val_data_df, parsed_test_data_df = \
      parsed_data_df.randomSplit(weights, seed=seed)
  average_train_year = parsed_train_data_df.selectExpr('avg(label) as avg').first()

Code Snippet 2

  parsed_data_df = parsed_points_df.select('features',
                                           parsed_points_df['labels'] - min_year) \
                                   .withColumnRenamed('(labels - 1922.0)', 'label')  # COLUMN ORDER 2

  weights = [.8, .1, .1]
  seed = 42
  parsed_train_data_df, parsed_val_data_df, parsed_test_data_df = \
      parsed_data_df.randomSplit(weights, seed=seed)
  average_train_year = parsed_train_data_df.selectExpr('avg(label) as avg').first()

Solution

  • Even if you have a seed specified, the splits can still differ because of the way df.sample is implemented: the seed is applied per partition, so anything that changes how rows are assigned to partitions (including a different column order in the query plan) can change which rows land in which split. You can read the following blog post, which dives into why this happens.

    The general recommendation would be to read the source data, perform the splits once, store each of the resulting dataframes as a separate file, and then always use these saved dataframes in all your experiments.
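
    That recommendation can be sketched as follows. This is a minimal illustration, not the asker's actual pipeline: the file paths and the Parquet format are assumptions, and the SparkSession setup would already exist in a Databricks notebook.

      # Split once, persist each split, and reuse the saved splits everywhere.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # Hypothetical input path; substitute your own source data.
      parsed_data_df = spark.read.parquet('/data/parsed_data')

      # Perform the random split exactly once.
      train_df, val_df, test_df = parsed_data_df.randomSplit([.8, .1, .1], seed=42)

      # Persist each split so later runs see identical rows regardless of
      # column order, partitioning, or query-plan changes.
      train_df.write.mode('overwrite').parquet('/data/train')
      val_df.write.mode('overwrite').parquet('/data/val')
      test_df.write.mode('overwrite').parquet('/data/test')

      # In all subsequent experiments, load the saved splits instead of
      # calling randomSplit again.
      train_df = spark.read.parquet('/data/train')

    Because the splits are materialized on disk, every experiment reads the same rows, and quantities like avg(label) become reproducible.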