I need to merge 2 large csv files (approximately 40 million data elements in each so ~500mb) by common row or column labels, which can be specified by the user. For example, if dataset1.csv contained:
patient_id x1 x2 x3
pi1 1 2 3
pi3 4 5 6
and dataset2.csv contained:
patient_id y1 y2 y3
pi0 0 0 0
pi1 11 12 13
pi2 99 98 97
pi3 14 15 16
The user could specify to merge these two files by their row labels (the patient ids) and the resulting output.csv would be:
patient_id x1 x2 x3 y1 y2 y3
pi1 1 2 3 11 12 13
pi3 4 5 6 14 15 16
Since we combine only the information for the patient ids that are common (intersection) to both input files. My strategy to this problem was to create a HashMap, where the row or column labels to be merged (in this case the row labels, which are patient ids) are the keys and the data for the patient id is stored as an ArrayList as the value. I create a HashMap for each input data file and then merge the values based on similar keys. I represent the data as a 2-d ArrayList of type ArrayList> so the merged data also has this type. I then simply iterate through the merged ArrayList> object, which I call a Data type object, and print it to a file. The code is below:
Below is the DataMerge class that is dependent on the below Data class file.
import java.util.HashMap;
import java.util.ArrayList;
public class DataMerge {
/**Merges two Data objects by a similar label. For example, if two data sets represent
* different data for the same set of patients, which are represented by their unique patient
* ID, mergeData will return a data set containing only those patient IDs that are common to both
* data sets along with the data represented in both data sets. labelInRow1 and labelInRow2 separately
* indicate whether the common labels are in separate rows(true) of d1 and d2, respectively, or separate columns otherwise.*/
public static Data mergeData(Data d1, Data d2, boolean labelInRow1,
boolean labelInRow2){
ArrayList<ArrayList<String>> mergedData = new ArrayList<ArrayList<String>>();
HashMap<String,ArrayList<String>> d1Map = d1.mapFeatureToData(labelInRow1);
HashMap<String,ArrayList<String>> d2Map = d2.mapFeatureToData(labelInRow2);
ArrayList<String> d1Features;
ArrayList<String> d2Features;
if (labelInRow1){
d1Features = d1.getColumnLabels();
} else {
d1Features = d1.getRowLabels();
}
if (labelInRow2){
d2Features = d2.getColumnLabels();
} else {
d2Features = d2.getRowLabels();
}
d1Features.trimToSize();
d2Features.trimToSize();
ArrayList<String> mergedFeatures = new ArrayList<String>();
if ((d1.getLabelLabel() != "") && (d1.getLabelLabel() == "")) {
mergedFeatures.add(d1.getLabelLabel());
}
else if ((d1.getLabelLabel() == "") && (d1.getLabelLabel() != "")) {
mergedFeatures.add(d2.getLabelLabel());
} else {
mergedFeatures.add(d1.getLabelLabel());
}
mergedFeatures.addAll(d1Features);
mergedFeatures.addAll(d2Features);
mergedFeatures.trimToSize();
mergedData.add(mergedFeatures);
for (String key : d1Map.keySet()){
ArrayList<String> curRow = new ArrayList<String>();
if (d2Map.containsKey(key)){
curRow.add(key);
curRow.addAll(d1Map.get(key));
curRow.addAll(d2Map.get(key));
curRow.trimToSize();
mergedData.add(curRow);
}
}
mergedData.trimToSize();
Data result = new Data(mergedData, true);
return result;
}
}
Below is the Data type object along with its associated HashMap generating functions with some row and column label extraction methods.
import java.util.*;
import java.io.*;
/**Represents an unlabeled or labeled data set as a series of nested ArrayLists, where each nested
* ArrayList represents a line of the input data.*/
public class Data {
private ArrayList<String> colLabels = new ArrayList<String>(); //row labels
private ArrayList<String> rowLabels = new ArrayList<String>(); //column labels
private String labelLabel;
private ArrayList<ArrayList<String>> unlabeledData; //data without row and column labels
/**Returns an ArrayList of ArrayLists, where each nested ArrayList represents a line
*of the input file.*/
@SuppressWarnings("resource")
private static ArrayList<ArrayList<String>> readFile(String filePath, String fileSep){
ArrayList<ArrayList<String>> result = new ArrayList<ArrayList<String>>();
try{
BufferedReader input = new BufferedReader(new FileReader (filePath));
String line = input.readLine();
while (line != null){
String[] splitLine = line.split(fileSep);
result.add(new ArrayList<String>(Arrays.asList(splitLine)));
line = input.readLine();
}
}
catch (Exception e){
System.err.println(e);
}
result.trimToSize();;
return result;
}
/**Returns an ArrayList of ArrayLists, where each nested ArrayList represents a line of the input
* data but WITHOUT any row or column labels*/
private ArrayList<ArrayList<String>> extractLabelsAndData(String filePath, String fileSep){
ArrayList<ArrayList<String>> tempData = new ArrayList<ArrayList<String>>();
tempData.addAll(readFile(filePath, fileSep));
tempData.trimToSize();
this.colLabels.addAll(tempData.remove(0));
this.labelLabel = this.colLabels.remove(0);
this.colLabels.trimToSize();
for (ArrayList<String> line : tempData){
this.rowLabels.add(line.remove(0));
}
this.rowLabels.trimToSize();
return tempData;
}
/**Returns an ArrayList of ArrayLists, where each nested ArrayList represents a line of the input
* data but WITHOUT any row or column labels. Does mutate the original data*/
private ArrayList<ArrayList<String>> extractLabelsAndData (ArrayList<ArrayList<String>> data){
ArrayList<ArrayList<String>> result = new ArrayList<ArrayList<String>>();
for (ArrayList<String> line : data){
ArrayList<String> temp = new ArrayList<String>();
for (String element : line){
temp.add(element);
}
temp.trimToSize();
result.add(temp);
}
this.colLabels.addAll(result.remove(0));
this.labelLabel = this.colLabels.remove(0);
this.colLabels.trimToSize();
for (ArrayList<String> line : result){
this.rowLabels.add(line.remove(0));
}
this.rowLabels.trimToSize();
result.trimToSize();
return result;
}
/**Returns the labelLabel for the data*/
public String getLabelLabel(){
return this.labelLabel;
}
/**Returns an ArrayList of the labels while maintaining the order
* in which they appear in the data. Row indicates that the desired
* features are all in the same row. Assumed that the labels are in the
* first row of the data. */
public ArrayList<String> getColumnLabels(){
return this.colLabels;
}
/**Returns an ArrayList of the labels while maintaining the order
* in which they appear in the data. Column indicates that the desired
* features are all in the same column. Assumed that the labels are in the
* first column of the data.*/
public ArrayList<String> getRowLabels(){
return this.rowLabels;
}
/**Creates a HashMap where a list of feature labels are mapped to the entire data. For example,
* if a data set contains patient IDs and test results, this function can be used to create
* a HashMap where the keys are the patient IDs and the values are an ArrayList of the test
* results. The boolean input isRow, which, when true, designates that the
* desired keys are listed in the rows or false if they are in the columns.*/
public HashMap<String, ArrayList<String>> mapFeatureToData(boolean isRow){
HashMap<String, ArrayList<String>> featureMap = new HashMap<String,ArrayList<String>>();
if (!isRow){
for (ArrayList<String> line : this.unlabeledData){
for (int i = 0; i < this.colLabels.size(); i++){
if (featureMap.containsKey(this.colLabels.get(i))){
featureMap.get(this.colLabels.get(i)).add(line.get(i));
} else{
ArrayList<String> firstValue = new ArrayList<String>();
firstValue.add(line.get(i));
featureMap.put(this.colLabels.get(i), firstValue);
}
}
}
} else {
for (int i = 0; i < this.rowLabels.size(); i++){
if (!featureMap.containsKey(this.rowLabels.get(i))){
featureMap.put(this.rowLabels.get(i), this.unlabeledData.get(i));
} else {
featureMap.get(this.rowLabels.get(i)).addAll(this.unlabeledData.get(i));
}
}
}
return featureMap;
}
/**Writes the data to a file in the specified outputPath. sep indicates the data delimiter.
* labeledOutput indicates whether or not the user wants the data written to a file to be
* labeled or unlabeled. If the data was unlabeled to begin with, then labeledOutput
* should not be set to true. */
public void writeDataToFile(String outputPath, String sep){
try {
PrintStream writer = new PrintStream(new BufferedOutputStream (new FileOutputStream (outputPath, true)));
String sol = this.labelLabel + sep;
for (int n = 0; n < this.colLabels.size(); n++){
if (n == this.colLabels.size()-1){
sol += this.colLabels.get(n) + "\n";
} else {
sol += this.colLabels.get(n) + sep;
}
}
for (int i = 0; i < this.unlabeledData.size(); i++){
ArrayList<String> line = this.unlabeledData.get(i);
sol += this.rowLabels.get(i) + sep;
for (int j = 0; j < line.size(); j++){
if (j == line.size()-1){
sol += line.get(j);
} else {
sol += line.get(j) + sep;
}
}
sol += "\n";
}
sol = sol.trim();
writer.print(sol);
writer.close();
} catch (Exception e){
System.err.println(e);
}
}
/**Constructor for Data object. filePath specifies the input file directory,
* fileSep indicates the file separator used in the input file, and hasLabels
* designates whether the input data has row and column labels. Note that if
* hasLabels is set to true, it is assumed that there are BOTH row and column labels*/
public Data(String filePath, String fileSep, boolean hasLabels){
if (hasLabels){
this.unlabeledData = extractLabelsAndData(filePath, fileSep);
this.unlabeledData.trimToSize();
} else {
this.unlabeledData = readFile(filePath, fileSep);
this.unlabeledData.trimToSize();
}
}
/**Constructor for Data object that accepts nested ArrayLists as inputs*/
public Data (ArrayList<ArrayList<String>> data, boolean hasLabels){
if (hasLabels){
this.unlabeledData = extractLabelsAndData(data);
this.unlabeledData.trimToSize();
} else {
this.unlabeledData = data;
this.unlabeledData.trimToSize();
}
}
}
The program works for small datasets but it's been 5+ days and the merge still hasn't finished. I'm looking for a more efficient time and memory solution. Someone suggested using byte arrays instead of Strings, which may make it run faster. Anyone have any suggestions?
EDIT: I did some digging around in my code and found that reading the input files and merging them takes almost no time (like 20 seconds). Writing the file is the part that takes 5+ days
You are concatenating all your data fields for all your millions of rows of data into one ginormous string then writing that single string. This is slow death by memory thrashing as you allocate and reallocate extremely large strings, copying them over and over and over for each field and separator you're adding to the string. Around the 3rd or 4th day each string is ... multiple millions of characters long? ... and your poor garbage collector is sweating and taking it out on you.
Don't do that.
Build each line of your output file separately and write it. Then build the next line.
Furthermore, use the StringBuilder
class to build the lines, though you'll get such an improvement for the previous step you might not even bother with this. Though it's the way to do it and you should learn how.