Search code examples
javacsvarraylistmergelarge-files

Efficiently merge 2 large csv files in java by common labels


I need to merge 2 large csv files (approximately 40 million data elements in each so ~500mb) by common row or column labels, which can be specified by the user. For example, if dataset1.csv contained:

patient_id    x1     x2    x3
pi1           1      2     3
pi3           4      5     6

and dataset2.csv contained:

patient_id    y1    y2    y3
pi0           0     0     0
pi1           11    12    13
pi2           99    98    97
pi3           14    15    16

The user could specify to merge these two files by their row labels (the patient ids) and the resulting output.csv would be:

patient_id    x1   x2   x3   y1    y2   y3
pi1           1    2    3    11    12   13
pi3           4    5    6    14    15   16

Since we combine only the information for the patient ids that are common (intersection) to both input files. My strategy to this problem was to create a HashMap, where the row or column labels to be merged (in this case the row labels, which are patient ids) are the keys and the data for the patient id is stored as an ArrayList as the value. I create a HashMap for each input data file and then merge the values based on similar keys. I represent the data as a 2-d ArrayList of type ArrayList> so the merged data also has this type. I then simply iterate through the merged ArrayList> object, which I call a Data type object, and print it to a file. The code is below:

Below is the DataMerge class that is dependent on the below Data class file.

import java.util.HashMap;
import java.util.ArrayList;

public class DataMerge {


/**Merges two Data objects by a similar label. For example, if two data sets represent
 * different data for the same set of patients, which are represented by their unique patient
 * ID, mergeData will return a data set containing only those patient IDs that are common to both
 * data sets along with the data represented in both data sets. labelInRow1 and labelInRow2 separately 
 * indicate whether the common labels are in separate rows(true) of d1 and d2, respectively, or separate columns otherwise.*/


public static Data mergeData(Data d1, Data d2, boolean labelInRow1, 
        boolean labelInRow2){
    ArrayList<ArrayList<String>> mergedData = new ArrayList<ArrayList<String>>();
    HashMap<String,ArrayList<String>> d1Map = d1.mapFeatureToData(labelInRow1);
    HashMap<String,ArrayList<String>> d2Map = d2.mapFeatureToData(labelInRow2);
    ArrayList<String> d1Features;
    ArrayList<String> d2Features;

    if (labelInRow1){
        d1Features = d1.getColumnLabels();
    } else {
        d1Features = d1.getRowLabels();
    }
    if (labelInRow2){
        d2Features = d2.getColumnLabels();
    } else {
        d2Features = d2.getRowLabels();
    }
    d1Features.trimToSize();
    d2Features.trimToSize();

    ArrayList<String> mergedFeatures = new ArrayList<String>();
    if ((d1.getLabelLabel() != "") && (d1.getLabelLabel() == "")) {
        mergedFeatures.add(d1.getLabelLabel());
    }
    else if ((d1.getLabelLabel() == "") && (d1.getLabelLabel() != "")) {
        mergedFeatures.add(d2.getLabelLabel());
    } else {
        mergedFeatures.add(d1.getLabelLabel());
    }

    mergedFeatures.addAll(d1Features);
    mergedFeatures.addAll(d2Features);
    mergedFeatures.trimToSize();
    mergedData.add(mergedFeatures);

    for (String key : d1Map.keySet()){
        ArrayList<String> curRow = new ArrayList<String>();
        if (d2Map.containsKey(key)){
            curRow.add(key);
            curRow.addAll(d1Map.get(key));
            curRow.addAll(d2Map.get(key));
            curRow.trimToSize();
            mergedData.add(curRow);
        }
    }
    mergedData.trimToSize();
    Data result = new Data(mergedData, true);
    return result;
}

}

Below is the Data type object along with its associated HashMap generating functions with some row and column label extraction methods.

import java.util.*;
import java.io.*;

/**Represents an unlabeled or labeled data set as a series of nested     ArrayLists, where each nested 
 * ArrayList represents a line of the input data.*/

public class Data {
private ArrayList<String> colLabels = new ArrayList<String>();  //row labels

private ArrayList<String> rowLabels = new ArrayList<String>();  //column labels

private String labelLabel;

private ArrayList<ArrayList<String>> unlabeledData; //data without row and column labels



/**Returns an ArrayList of ArrayLists, where each nested ArrayList represents a line
*of the input file.*/
@SuppressWarnings("resource")
private static ArrayList<ArrayList<String>> readFile(String filePath, String fileSep){
    ArrayList<ArrayList<String>> result = new ArrayList<ArrayList<String>>();
    try{
        BufferedReader input = new BufferedReader(new FileReader (filePath));
        String line = input.readLine();
        while (line != null){
            String[] splitLine = line.split(fileSep);
            result.add(new ArrayList<String>(Arrays.asList(splitLine)));
            line = input.readLine();
        }
    }
    catch (Exception e){
        System.err.println(e);
    }
    result.trimToSize();;
    return result;
}


/**Returns an ArrayList of ArrayLists, where each nested ArrayList represents a line of the input
 * data but WITHOUT any row or column labels*/


private ArrayList<ArrayList<String>> extractLabelsAndData(String filePath, String fileSep){
    ArrayList<ArrayList<String>> tempData = new ArrayList<ArrayList<String>>();
    tempData.addAll(readFile(filePath, fileSep));
    tempData.trimToSize();
    this.colLabels.addAll(tempData.remove(0));
    this.labelLabel = this.colLabels.remove(0);
    this.colLabels.trimToSize();
    for (ArrayList<String> line : tempData){
        this.rowLabels.add(line.remove(0));
    }
    this.rowLabels.trimToSize();
    return tempData;
}




/**Returns an ArrayList of ArrayLists, where each nested ArrayList represents a line of the input
 * data but WITHOUT any row or column labels. Does mutate the original data*/
private ArrayList<ArrayList<String>> extractLabelsAndData (ArrayList<ArrayList<String>> data){
    ArrayList<ArrayList<String>> result = new ArrayList<ArrayList<String>>();
    for (ArrayList<String> line : data){
        ArrayList<String> temp = new ArrayList<String>();
        for (String element : line){
            temp.add(element);
        }
        temp.trimToSize();
        result.add(temp);
    }
    this.colLabels.addAll(result.remove(0));
    this.labelLabel = this.colLabels.remove(0);
    this.colLabels.trimToSize();
    for (ArrayList<String> line : result){
        this.rowLabels.add(line.remove(0));
    }
    this.rowLabels.trimToSize();
    result.trimToSize();
    return result;
}


/**Returns the labelLabel for the data*/
public String getLabelLabel(){
    return this.labelLabel;
}


/**Returns an ArrayList of the labels while maintaining the order
* in which they appear in the data. Row indicates that the desired
* features are all in the same row. Assumed that the labels are in the
* first row of the data. */
public ArrayList<String> getColumnLabels(){
    return this.colLabels;
}


/**Returns an ArrayList of the labels while maintaining the order
* in which they appear in the data. Column indicates that the desired
* features are all in the same column. Assumed that the labels are in the
* first column of the data.*/
public ArrayList<String> getRowLabels(){
    return this.rowLabels;
}


/**Creates a HashMap where a list of feature labels are mapped to the entire data. For example,
 * if a data set contains patient IDs and test results, this function can be used to create
 * a HashMap where the keys are the patient IDs and the values are an ArrayList of the test
 * results. The boolean input isRow, which, when true, designates that the
 * desired keys are listed in the rows or false if they are in the columns.*/
public HashMap<String, ArrayList<String>> mapFeatureToData(boolean isRow){
    HashMap<String, ArrayList<String>> featureMap = new HashMap<String,ArrayList<String>>();
    if (!isRow){
        for (ArrayList<String> line : this.unlabeledData){
            for (int i = 0; i < this.colLabels.size(); i++){
                if (featureMap.containsKey(this.colLabels.get(i))){
                    featureMap.get(this.colLabels.get(i)).add(line.get(i));
                } else{
                    ArrayList<String> firstValue = new ArrayList<String>();
                    firstValue.add(line.get(i));
                    featureMap.put(this.colLabels.get(i), firstValue);
                }
            }
        }
    } else {
        for (int i = 0; i < this.rowLabels.size(); i++){
            if (!featureMap.containsKey(this.rowLabels.get(i))){
                featureMap.put(this.rowLabels.get(i), this.unlabeledData.get(i));
            } else {
                featureMap.get(this.rowLabels.get(i)).addAll(this.unlabeledData.get(i));
            }
        }
    }
    return featureMap;
} 


/**Writes the data to a file in the specified outputPath. sep indicates the data delimiter.
 * labeledOutput indicates whether or not the user wants the data written to a file to be 
 * labeled or unlabeled. If the data was unlabeled to begin with, then labeledOutput 
 * should not be set to true. */
public void writeDataToFile(String outputPath, String sep){
    try {
        PrintStream writer = new PrintStream(new BufferedOutputStream (new FileOutputStream (outputPath, true)));
        String sol = this.labelLabel + sep;
        for (int n = 0; n < this.colLabels.size(); n++){
            if (n == this.colLabels.size()-1){
                sol += this.colLabels.get(n) + "\n";
            } else {
                sol += this.colLabels.get(n) + sep;
            }
        }
        for (int i = 0; i < this.unlabeledData.size(); i++){
            ArrayList<String> line = this.unlabeledData.get(i);
            sol += this.rowLabels.get(i) + sep;
            for (int j = 0; j < line.size(); j++){
                if (j == line.size()-1){
                    sol += line.get(j);
                } else {
                    sol += line.get(j) + sep;
                }
            }
            sol += "\n";
        }
        sol = sol.trim();
        writer.print(sol);
        writer.close();

    } catch (Exception e){
        System.err.println(e);
    }
}


/**Constructor for Data object. filePath specifies the input file directory,
 * fileSep indicates the file separator used in the input file, and hasLabels
 * designates whether the input data has row and column labels. Note that if 
 * hasLabels is set to true, it is assumed that there are BOTH row and column labels*/
public Data(String filePath, String fileSep, boolean hasLabels){
    if (hasLabels){
        this.unlabeledData = extractLabelsAndData(filePath, fileSep);
        this.unlabeledData.trimToSize();
    } else {
        this.unlabeledData = readFile(filePath, fileSep);
        this.unlabeledData.trimToSize();
    }

}


/**Constructor for Data object that accepts nested ArrayLists as inputs*/
public Data (ArrayList<ArrayList<String>> data, boolean hasLabels){
    if (hasLabels){
        this.unlabeledData = extractLabelsAndData(data);
        this.unlabeledData.trimToSize();
    } else {
        this.unlabeledData = data;
        this.unlabeledData.trimToSize();
    }
}
}

The program works for small datasets but it's been 5+ days and the merge still hasn't finished. I'm looking for a more efficient time and memory solution. Someone suggested using byte arrays instead of Strings, which may make it run faster. Anyone have any suggestions?

EDIT: I did some digging around in my code and found that reading the input files and merging them takes almost no time (like 20 seconds). Writing the file is the part that takes 5+ days


Solution

  • You are concatenating all your data fields for all your millions of rows of data into one ginormous string then writing that single string. This is slow death by memory thrashing as you allocate and reallocate extremely large strings, copying them over and over and over for each field and separator you're adding to the string. Around the 3rd or 4th day each string is ... multiple millions of characters long? ... and your poor garbage collector is sweating and taking it out on you.

    Don't do that.

    Build each line of your output file separately and write it. Then build the next line.

    Furthermore, use the StringBuilder class to build the lines, though you'll get such an improvement for the previous step you might not even bother with this. Though it's the way to do it and you should learn how.