Tags: ruby-on-rails, ruby, data-analysis, data-processing

Fastest way to remove values from one array based on another array


I am working on some data processing in a Rails app and am trying to resolve a performance pain point. I have two arrays, x_data and y_data, each of which looks as follows (with different values, of course):

[
  { 'timestamp_value' => '2017-01-01 12:00', 'value' => '432' },
  { 'timestamp_value' => '2017-01-01 12:01', 'value' => '421' },
  ...
]

Each array has up to perhaps 25k items. I need to prepare this data for further x-y regression analysis.

Now, some values in x_data or y_data can be nil. If either array has a nil value at a given timestamp, I need to remove that timestamp's entry from both arrays. I then need to return only the values (not the timestamps) for both arrays.
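To make the requirement concrete, here is a toy pair of inputs (the timestamps and values are made up for illustration) and the result I'm after:

```ruby
x_data = [
  { 'timestamp_value' => '2017-01-01 12:00', 'value' => '1' },
  { 'timestamp_value' => '2017-01-01 12:01', 'value' => nil },
  { 'timestamp_value' => '2017-01-01 12:02', 'value' => '3' }
]
y_data = [
  { 'timestamp_value' => '2017-01-01 12:00', 'value' => '4' },
  { 'timestamp_value' => '2017-01-01 12:01', 'value' => '5' },
  { 'timestamp_value' => '2017-01-01 12:02', 'value' => nil }
]

# 12:01 is dropped (nil in x_data) and 12:02 is dropped (nil in y_data),
# leaving only 12:00, so the desired result is:
#   x_values == ['1'], y_values == ['4']
```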

In my current approach, I am first extracting the timestamps from both arrays where the values are not nil, then performing a set intersection on the timestamps to produce a final timestamps array. I then select values using that final array of timestamps. Here's the code:

def values_for_regression(x_data, y_data)
  x_timestamps = timestamps_for(x_data)
  y_timestamps = timestamps_for(y_data)

  # Get final timestamps as the intersection of the two
  timestamps = x_timestamps.intersection(y_timestamps)

  x_values = values_for(x_data, timestamps)
  y_values = values_for(y_data, timestamps)

  [x_values, y_values]
end

def timestamps_for(data)
  Set.new data.reject { |row| row['value'].nil? }.
              map { |row| row['timestamp_value'] }
end

def values_for(data, timestamps)
  data.select { |row| timestamps.include?(row['timestamp_value']) }.
      map { |row| row['value'] }
end

This approach isn't terribly performant, and I need to do this on several sets of data in quick succession. The overhead of the multiple loops adds up. There must be a way to at least reduce the number of loops necessary.

Any ideas or suggestions will be appreciated.


Solution

  • You're doing a lot of redundant iterating and creating a lot of intermediate arrays of data.

    Your timestamps_for and values_for both perform a select followed by a map. The select creates an intermediate array; since your arrays are up to 25,000 items, this is potentially an intermediate throw-away array of the same size. You're doing this four times: once each for the x and y timestamps, and once each for the x and y values. You produce yet another intermediate array by taking the intersection of the two sets of timestamps.

    You also scan both arrays for nils twice: once to find the timestamps with non-nil values, and again to map the timestamps you just extracted back to their values.

    While it's definitely more readable to functionally transform the input arrays, you can dramatically reduce memory usage and execution time by combining the various iterations and transformations.

    All the iterations can be combined into a single loop over one data set (plus the setup cost of building a timestamp-to-value lookup hash for the second set). Any timestamp missing from one set causes the matching entry in the other set to be ignored anyway, so there is no reason to collect all the timestamps from both sets only to then intersect them.

    def values_for_regression(x_data, y_data)
      x_values = []
      y_values = []
    
      y_map = y_data.each_with_object({}) { |data, hash| hash[data['timestamp_value']] = data['value'] }
    
      x_data.each do |data|
        next unless (x_value = data['value'])
        next unless (y_value = y_map[data['timestamp_value']])
        x_values << x_value
        y_values << y_value
      end
    
      [x_values, y_values]
    end
    

    I think this is functionally identical, and a quick benchmark shows a ~70% reduction in runtime:

                     user     system      total        real
    yours        9.640000   0.150000   9.790000 (  9.858914)
    mine         2.780000   0.060000   2.840000 (  2.845621)
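    For reference, a benchmark along these lines can be reproduced with Ruby's Benchmark module. The data here is synthetic; the row count, timestamp format, and ~10% nil rate are assumptions, not the asker's real data, so the absolute timings will differ from the table above:

    ```ruby
    require 'benchmark'
    require 'set'

    # --- original approach: select/map per array plus a set intersection ---
    def timestamps_for(data)
      Set.new data.reject { |row| row['value'].nil? }.map { |row| row['timestamp_value'] }
    end

    def values_for(data, timestamps)
      data.select { |row| timestamps.include?(row['timestamp_value']) }.map { |row| row['value'] }
    end

    def original(x_data, y_data)
      timestamps = timestamps_for(x_data) & timestamps_for(y_data)
      [values_for(x_data, timestamps), values_for(y_data, timestamps)]
    end

    # --- single-pass approach from the answer ---
    def single_pass(x_data, y_data)
      x_values = []
      y_values = []
      y_map = y_data.each_with_object({}) { |d, h| h[d['timestamp_value']] = d['value'] }
      x_data.each do |d|
        next unless (x = d['value'])
        next unless (y = y_map[d['timestamp_value']])
        x_values << x
        y_values << y
      end
      [x_values, y_values]
    end

    # Synthetic series: n rows, each with a unique timestamp and ~nil_rate nils.
    def build_series(n, nil_rate)
      Array.new(n) do |i|
        { 'timestamp_value' => "ts-#{i}",
          'value' => rand < nil_rate ? nil : rand(1000).to_s }
      end
    end

    x_data = build_series(25_000, 0.1)
    y_data = build_series(25_000, 0.1)

    Benchmark.bm(12) do |bm|
      bm.report('original')    { 50.times { original(x_data, y_data) } }
      bm.report('single-pass') { 50.times { single_pass(x_data, y_data) } }
    end
    ```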