Tags: ruby, csv, bigdata, large-files, command-line-arguments

Ruby to Search and Combine CSV files when dealing with large files


Summary

Looking at other questions that cover similar ground hasn't helped, because I'm already reading the file line by line, so I'm not running out of memory on the large file. In fact my memory usage is quite low, but it is taking a very long time to create the smaller file so that I can search it and concatenate the other CSV into it.

Question

It has been 5 days and I'm not sure how much further I have to go, but it still hasn't finished the foreach over the main file; there are 17.8 million records in the CSV file. Is there a faster way to handle this processing in Ruby? Is there anything I can do on macOS to optimize it? Any advice would be great.

require 'csv'

# -------------------------------------------------------------------------------------
# USED TO GET ID NUMBERS OF THE SPECIFIC ITEMS THAT ARE NEEDED
# -------------------------------------------------------------------------------------
etas_title_file = './HathiTrust ETAS Titles.csv'
oclc_id_array = []
angies_csv = []
CSV.foreach(etas_title_file, headers: true, header_converters: :symbol) do |row|
  oclc_id_array << row[:oclc]
  angies_csv << row.to_h
end 
oclc_id_array.uniq!


# -------------------------------------------------------------------------------------
# RUN ONCE IF DATABASE IS NOT POPULATED
# -------------------------------------------------------------------------------------

headers = %i[htid access rights ht_bib_key description source source_bib_num oclc_num isbn issn lccn title imprint rights_reason_code rights_timestamp us_gov_doc_flag rights_date_used pub_place lang bib_fmt collection_code content_provider_code responsible_entity_code digitization_agent_code access_profile_code author]

remove_keys = %i[access rights description source source_bib_num isbn issn lccn title imprint rights_reason_code rights_timestamp us_gov_doc_flag rights_date_used pub_place lang bib_fmt collection_code content_provider_code responsible_entity_code digitization_agent_code access_profile_code author]

new_hathi_csv = []
processed_keys = []
CSV.foreach('./hathi_full_20200401.txt', headers: headers, col_sep: "\t", quote_char: "\0") do |row|
  next unless oclc_id_array.include? row[:oclc_num]
  next if processed_keys.include? row[:oclc_num]
  puts "#{row[:oclc_num]} included? #{oclc_id_array.include? row[:oclc_num]}"
  new_hathi_csv << row.to_h.except(*remove_keys)
  processed_keys << row[:oclc_num]
end 

Solution

  • As far as I was able to determine, OCLC IDs are alphanumeric. This means we want to use a Hash to store these IDs. A Hash has an average lookup complexity of O(1), while your unsorted Array has a lookup complexity of O(n).

    If you use an Array, your worst-case lookup is 18 million comparisons (to find a single element, Ruby has to go through all 18 million IDs), while with a Hash it will be one comparison. To put it simply: using a Hash will be millions of times faster than your current implementation (the small benchmark sketch at the end of this answer illustrates the gap).

    The pseudocode below will give you an idea of how to proceed; a fuller sketch adapted to your exact loops follows it. We will use a Set, which is like a Hash, but handy when all you need to do is check for inclusion:

    require 'set'
    
    oclc_ids = Set.new
    
    CSV.foreach(...) {
      oclc_ids.add(row[:oclc])  # Add ID to Set
      ...
    }
    
    # No need to call uniq! on a Set:
    # the elements in a Set are always unique.
    
    processed_keys = Set.new
    
    CSV.foreach(...) {
       next unless oclc_ids.include?(row[:oclc_num])   # Extremely fast lookup
       next if processed_keys.include?(row[:oclc_num]) # Extremely fast lookup
       ...
       processed_keys.add(row[:oclc_num])
    }
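
    To make this concrete, below is a rough sketch of the same idea applied to your two loops. It is not a tested drop-in script: it reuses your file names, headers and quote_char unchanged, the remove_keys line is just an equivalent way of keeping only htid, ht_bib_key and oclc_num, and Hash#except needs Ruby 3.0+ (or ActiveSupport), exactly as in your original code.

    require 'csv'
    require 'set'

    # Collect the OCLC IDs you care about. A Set gives O(1) lookups and
    # de-duplicates automatically, so no uniq! call is needed.
    etas_title_file = './HathiTrust ETAS Titles.csv'
    oclc_ids   = Set.new
    angies_csv = []

    CSV.foreach(etas_title_file, headers: true, header_converters: :symbol) do |row|
      oclc_ids.add(row[:oclc])
      angies_csv << row.to_h
    end

    # Filter the big HathiTrust file against that Set.
    headers = %i[htid access rights ht_bib_key description source source_bib_num
                 oclc_num isbn issn lccn title imprint rights_reason_code
                 rights_timestamp us_gov_doc_flag rights_date_used pub_place lang
                 bib_fmt collection_code content_provider_code
                 responsible_entity_code digitization_agent_code
                 access_profile_code author]

    # Equivalent to your remove_keys list: drop every column except the three you keep.
    remove_keys = headers - %i[htid ht_bib_key oclc_num]

    new_hathi_csv  = []
    processed_keys = Set.new

    CSV.foreach('./hathi_full_20200401.txt',
                headers: headers, col_sep: "\t", quote_char: "\0") do |row|
      next unless oclc_ids.include?(row[:oclc_num])   # O(1) membership test
      next if processed_keys.include?(row[:oclc_num]) # O(1) membership test
      # Hash#except requires Ruby 3.0+ (or ActiveSupport), as in your original code.
      new_hathi_csv << row.to_h.except(*remove_keys)
      processed_keys.add(row[:oclc_num])
    end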
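
    If you want to convince yourself of the gap before rewriting anything, a small benchmark along these lines (with made-up IDs standing in for your real OCLC numbers) compares Array#include? against Set#include?:

    require 'benchmark'
    require 'set'

    # Made-up IDs purely for illustration; the real values would be
    # the OCLC numbers read from your CSV.
    ids_array = (1..100_000).map { |n| "oclc#{n}" }
    ids_set   = Set.new(ids_array)
    lookups   = ids_array.sample(1_000)

    Benchmark.bm(12) do |x|
      x.report('Array') { lookups.each { |id| ids_array.include?(id) } }
      x.report('Set')   { lookups.each { |id| ids_set.include?(id) } }
    end

    On 18 million IDs the difference will be far more dramatic than on this toy data, since each Array lookup scans the whole list while each Set lookup stays constant-time.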