Tags: java, optimization, large-data

Optimizing large data file reading in Java


I'm writing an application to help improve machine translations for my dissertation. For this, I need a huge amount of ngram data. I've got the data from Google, but it's not in a useful format.

Here's how Google's data is formatted:

ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE

Here's what I'm after:

ngram total_match_count_for_all_years
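
For illustration, with made-up counts, three input lines for the same ngram:

```text
analysis is	1991	5	2	1
analysis is	1992	3	2	1
analysis is	1993	4	2	1
```

would collapse to the single output line `analysis is 12`.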

So, I've written a small application to run through the files, pull out the ngrams, and aggregate the data over multiple years to get the total counts. It seems to run fine, but since the Google files are so big (1.5GB each, and there are 99 of them >.<), it's taking a long time to get through them all.

Here's the code:

import java.io.*;
import java.util.*;

public class mergeData
{
    private static List<String> storedNgrams    = new ArrayList<String>(100001);
    private static List<String> storedParts     = new ArrayList<String>(100001);
    private static List<String> toWritePairs    = new ArrayList<String>(100001);
    private static int          rows            = 0;
    private static int          totalFreq       = 0;

    public static void main(String[] args) throws Exception
        {
            File bigram = new File("data01");
            BufferedReader in = new BufferedReader(new FileReader(bigram));
            File myFile = new File("newData.txt");
            Writer out = new BufferedWriter(new FileWriter(myFile));
            while (true)      
                {
                    rows = 0;
                    merge(in, out);
                }
        }

    public static void merge(BufferedReader in, Writer out) throws IOException
        {

            while (rows != 1000000)
                {
                    storedNgrams.add(in.readLine());
                    rows++;
                }

            while (!(storedNgrams.isEmpty()))
                {

                    storedParts.addAll(new ArrayList<String>(Arrays.asList(storedNgrams.get(0).split("\\s"))));

                    storedNgrams.remove(0);

                }
            while (storedParts.size() >= 8)
                {
                    System.out.println(storedParts.get(0) + " " + storedParts.get(1) + " " + storedParts.get(6)
                            + " " + storedParts.get(7));
                    if (toWritePairs.size() == 0 && storedParts.get(0).equals(storedParts.get(6))
                            && storedParts.get(1).equals(storedParts.get(7)))
                        {

                            totalFreq = Integer.parseInt(storedParts.get(3)) + Integer.parseInt(storedParts.get(9));

                            toWritePairs.add(storedParts.get(0));
                            toWritePairs.add(storedParts.get(1));

                            toWritePairs.add(Integer.toString(totalFreq));
                            storedParts.subList(0, 11).clear();

                        }
                    else if (!(toWritePairs.isEmpty()) && storedParts.get(0).equals(toWritePairs.get(0))
                            && storedParts.get(1).equals(toWritePairs.get(1)))
                        {

                            int totalFreq = Integer.parseInt(storedParts.get(3))
                                    + Integer.parseInt(toWritePairs.get(2));

                            toWritePairs.remove(2);
                            toWritePairs.add(Integer.toString(totalFreq));
                            storedParts.subList(0, 5).clear();
                        }
                    else if ((!toWritePairs.isEmpty())
                            && !(storedParts.get(0).equals(storedParts.get(6)) && storedParts.get(1).equals(
                                    storedParts.get(7))))
                        {
                            toWritePairs.add(storedParts.get(0));
                            toWritePairs.add(storedParts.get(1));
                            toWritePairs.add(storedParts.get(2));
                            storedParts.subList(0, 2).clear();
                        }

                    else if (!(toWritePairs.isEmpty()))
                        {
                            out.append(toWritePairs.get(0) + " " + toWritePairs.get(1) + " " + toWritePairs.get(2)
                                    + "\n");
                            toWritePairs.subList(0, 2).clear();

                        }

                    out.flush();
                }
        }

}

If anyone has any ideas on how to improve the processing speed for these files, it would help me immensely.


Solution

  • I suggest you process the data as you go rather than reading in large amounts of data and processing it later. It's not clear from your program exactly what information you are trying to extract/aggregate.

    Even on a fast machine, I would expect this to take about 20 seconds per file.
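
    A minimal sketch of that streaming approach, assuming the files are sorted by ngram (as the Google data is, since all years for one ngram appear on consecutive lines) — the class and method names here are made up for illustration, not taken from the question:

    ```java
    import java.io.*;

    public class NgramAggregator {
        // Streams the input line by line, summing match_count over
        // consecutive lines that share the same ngram. No per-chunk
        // buffering lists are needed; memory use stays constant.
        public static void aggregate(BufferedReader in, Writer out) throws IOException {
            String prev = null;   // ngram currently being accumulated
            long total = 0;       // running match_count for prev
            String line;
            while ((line = in.readLine()) != null) {
                // Format: ngram TAB year TAB match_count TAB page_count TAB volume_count
                String[] parts = line.split("\t");
                if (parts.length < 3) continue; // skip malformed lines
                String ngram = parts[0];
                long count = Long.parseLong(parts[2]);
                if (ngram.equals(prev)) {
                    total += count;
                } else {
                    if (prev != null) out.write(prev + " " + total + "\n");
                    prev = ngram;
                    total = count;
                }
            }
            if (prev != null) out.write(prev + " " + total + "\n");
            out.flush(); // flush once at the end, not once per line
        }

        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
                 Writer out = new BufferedWriter(new FileWriter(args[1]))) {
                aggregate(in, out);
            }
        }
    }
    ```

    Because each line is touched exactly once and output is written as soon as an ngram is finished, this avoids the repeated `remove(0)` and `subList(...).clear()` calls on large `ArrayList`s (each of which shifts every remaining element) and the `out.flush()` inside the loop, which are likely the main costs in the original code.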