I am working on some data processing in a Rails app and am trying to resolve a performance pain-point. I have two arrays, x_data and y_data, that each look as follows (with different values, of course):
[
{ 'timestamp_value' => '2017-01-01 12:00', 'value' => '432' },
{ 'timestamp_value' => '2017-01-01 12:01', 'value' => '421' },
...
]
Each array has up to perhaps 25k items. I need to prepare this data for further x-y regression analysis.
Now, some values in x_data or y_data can be nil. I need to remove entries from both arrays if either x_data or y_data has a nil value at that timestamp, and then return only the values for both arrays.
In my current approach, I am first extracting the timestamps from both arrays where the values are not nil, then performing a set intersection on the timestamps to produce a final timestamps array. I then select values using that final array of timestamps. Here's the code:
def values_for_regression(x_data, y_data)
  x_timestamps = timestamps_for(x_data)
  y_timestamps = timestamps_for(y_data)

  # Get final timestamps as the intersection of the two
  timestamps = x_timestamps.intersection(y_timestamps)

  x_values = values_for(x_data, timestamps)
  y_values = values_for(y_data, timestamps)

  [x_values, y_values]
end

def timestamps_for(data)
  Set.new data.reject { |row| row['value'].nil? }.
          map { |row| row['timestamp_value'] }
end

def values_for(data, timestamps)
  data.select { |row| timestamps.include?(row['timestamp_value']) }.
       map { |row| row['value'] }
end
This approach isn't terribly performant, and I need to do this on several sets of data in quick succession. The overhead of the multiple loops adds up. There must be a way to at least reduce the number of loops necessary.
Any ideas or suggestions will be appreciated.
You're doing a lot of redundant iterating and creating a lot of intermediate arrays of data.
Your timestamps_for and values_for methods both perform a select followed by a map. The select creates an intermediate array; since your arrays are up to 25,000 items, that is potentially a throw-away array of nearly the same size. You do this four times: once each for the x and y timestamps, and once each for the x and y values. You produce yet another intermediate array by taking the intersection of the two sets of timestamps. You also do a complete scan of both arrays for nils twice: once to find the timestamps with non-nil values, and again when mapping the timestamps you just extracted back to their values.
While it's definitely more readable to functionally transform the input arrays, you can dramatically reduce memory usage and execution time by combining the various iterations and transformations.
All the iterations can be combined into a single loop over one data set (plus the setup cost of building a timestamp-to-value lookup hash for the second set). Any timestamp not present in the first set would cause the corresponding entry in the second set to be ignored anyway, so there is no reason to collect all the timestamps in both sets only to then intersect them.
def values_for_regression(x_data, y_data)
  x_values = []
  y_values = []

  # Build a timestamp -> value lookup for the second set up front.
  y_map = y_data.each_with_object({}) { |row, hash| hash[row['timestamp_value']] = row['value'] }

  x_data.each do |row|
    next unless (x_value = row['value'])
    next unless (y_value = y_map[row['timestamp_value']])

    x_values << x_value
    y_values << y_value
  end

  [x_values, y_values]
end
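As a quick sanity check, here is the single-pass version run against a tiny hand-made dataset (the method is repeated so the snippet is standalone; note the lookups must use the 'timestamp_value' key from your data):

```ruby
def values_for_regression(x_data, y_data)
  x_values = []
  y_values = []

  # Build a timestamp -> value lookup for the second set up front.
  y_map = y_data.each_with_object({}) { |row, hash| hash[row['timestamp_value']] = row['value'] }

  x_data.each do |row|
    next unless (x_value = row['value'])
    next unless (y_value = y_map[row['timestamp_value']])

    x_values << x_value
    y_values << y_value
  end

  [x_values, y_values]
end

x_data = [
  { 'timestamp_value' => '2017-01-01 12:00', 'value' => '432' },
  { 'timestamp_value' => '2017-01-01 12:01', 'value' => nil },
  { 'timestamp_value' => '2017-01-01 12:02', 'value' => '410' }
]
y_data = [
  { 'timestamp_value' => '2017-01-01 12:00', 'value' => '15' },
  { 'timestamp_value' => '2017-01-01 12:02', 'value' => '18' },
  { 'timestamp_value' => '2017-01-01 12:03', 'value' => '20' }
]

values_for_regression(x_data, y_data)
# => [["432", "410"], ["15", "18"]]
```

Rows whose timestamp is missing from either array, or whose value is nil, are dropped from both result arrays.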
I think this is functionally identical, and a quick benchmark shows a ~70% reduction in runtime:
             user     system      total        real
yours    9.640000   0.150000   9.790000 (  9.858914)
mine     2.780000   0.060000   2.840000 (  2.845621)
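If you want to reproduce the comparison yourself, a self-contained harness along these lines works with Ruby's standard Benchmark module (the synthetic data shape, nil pattern, and iteration counts here are my own choices, not from the original benchmark):

```ruby
require 'benchmark'
require 'set'

# Original approach: two select/map passes per array plus a set intersection.
def slow_values_for_regression(x_data, y_data)
  timestamps = timestamps_for(x_data).intersection(timestamps_for(y_data))
  [values_for(x_data, timestamps), values_for(y_data, timestamps)]
end

def timestamps_for(data)
  Set.new data.reject { |row| row['value'].nil? }.map { |row| row['timestamp_value'] }
end

def values_for(data, timestamps)
  data.select { |row| timestamps.include?(row['timestamp_value']) }.map { |row| row['value'] }
end

# Single-pass approach: one hash lookup per row of x_data.
def fast_values_for_regression(x_data, y_data)
  x_values = []
  y_values = []
  y_map = y_data.each_with_object({}) { |row, hash| hash[row['timestamp_value']] = row['value'] }

  x_data.each do |row|
    next unless (x_value = row['value'])
    next unless (y_value = y_map[row['timestamp_value']])
    x_values << x_value
    y_values << y_value
  end

  [x_values, y_values]
end

# Synthetic data: n rows keyed by a fake timestamp, every tenth value nil.
def build_data(n)
  Array.new(n) do |i|
    { 'timestamp_value' => "ts-#{i}", 'value' => (i % 10).zero? ? nil : i.to_s }
  end
end

x_data = build_data(25_000)
y_data = build_data(25_000)

Benchmark.bm(6) do |bm|
  bm.report('yours') { 10.times { slow_values_for_regression(x_data, y_data) } }
  bm.report('mine')  { 10.times { fast_values_for_regression(x_data, y_data) } }
end
```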