apache-spark

Does Spark preserve record order when reading in ordered files?


I'm using Spark to read in records (in this case from CSV files) and process them. The files are already in some order, but that order isn't reflected by any column (think of it as a time series, but without a timestamp column -- each row is simply in a relative order within the file). I'd like to use this ordering information in my Spark processing, for example to compare a row with the previous row. I can't explicitly order the records, since there is no ordering column.

Does Spark maintain the order of records it reads in from a file? Or, is there any way to access the file-order of records from Spark?


Solution

  • Yes, when Spark reads a file, the records come out in the order they appear in the file (within each input partition). But once a shuffle occurs, that order is no longer preserved. So to keep the ordering you either have to write your job so that no shuffle happens, or you attach a sequence number to each record and use those sequence numbers during processing.

    In a distributed framework like Spark, where data is split across a cluster for fast processing, shuffling is almost inevitable. So the most robust solution is to assign a sequential number to each row as it is read and use that number for ordering, as the sketch below illustrates.
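
    Here is a minimal PySpark sketch of that approach. The input path, the `header` option, and the `value` column compared via `lag` are assumptions for illustration; adapt them to your data. The idea is to tag each row with a consecutive index using `zipWithIndex`, which reflects the order the records were read, and then use that index to order a window and look at the previous row.

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, LongType
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("ordered-read-sketch").getOrCreate()

    # Hypothetical input path and options -- replace with your own.
    df = spark.read.csv("data/readings.csv", header=True, inferSchema=True)

    # zipWithIndex assigns consecutive indices in the order records are read,
    # partition by partition, so the index mirrors the original file order.
    indexed_rdd = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1],))
    new_schema = StructType(df.schema.fields + [StructField("seq", LongType(), False)])
    indexed_df = spark.createDataFrame(indexed_rdd, new_schema)

    # The sequence number now survives any later shuffles: use it to order a
    # window and compare each row with the previous one (hypothetical 'value' column).
    # Note: a window with no partitionBy pulls all rows into one partition,
    # which is fine for a sketch but worth partitioning in a real job.
    w = Window.orderBy("seq")
    result = indexed_df.withColumn("prev_value", F.lag("value").over(w))
    result.show()
    ```

    A lighter-weight alternative is `F.monotonically_increasing_id()`, which also increases in read order, but its values are not consecutive across partitions, so `zipWithIndex` is the safer choice when you need a true row sequence.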