I'm writing an application to help improve machine translations for my dissertation. For this, I require a huge amount of ngram data. I've got the data from Google, but it's not in a useful format.
Here's how Google's data is formatted:
ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE
Here's what I'm after:
ngram total_match_count_for_all_years
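For example (the counts here are made up), input rows like

analysis is    1991    10    8    5
analysis is    1992    15    12    9

should collapse into a single output row:

analysis is 25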
So, I've written a small application to run through the files, pull out the ngrams, and aggregate the data over multiple years to get a total count for each one. It seems to run fine, but since the Google files are so big (1.5 GB each, and there are 99 of them >.<), it's taking a very long time to get through them all.
Here's the code:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeData
{
    private static List<String> storedNgrams = new ArrayList<String>(1000001);
    private static List<String> storedParts = new ArrayList<String>(1000001);
    private static List<String> toWritePairs = new ArrayList<String>(3);
    private static int rows = 0;

    public static void main(String[] args) throws Exception
    {
        File bigram = new File("data01");
        BufferedReader in = new BufferedReader(new FileReader(bigram));
        File myFile = new File("newData.txt");
        Writer out = new BufferedWriter(new FileWriter(myFile));

        // keep merging batches until the input file runs out
        boolean more = true;
        while (more)
        {
            rows = 0;
            more = merge(in, out);
        }
        in.close();
        out.close();
    }

    // Reads up to a million lines, then collapses consecutive rows that share
    // a bigram into a single "w1 w2 total_count" line. Returns false at EOF.
    public static boolean merge(BufferedReader in, Writer out) throws IOException
    {
        String line;
        while (rows != 1000000 && (line = in.readLine()) != null)
        {
            storedNgrams.add(line);
            rows++;
        }
        boolean more = (rows == 1000000); // batch limit hit, so more input may remain

        // split each stored line into its six fields:
        // w1, w2, year, match_count, page_count, volume_count
        while (!storedNgrams.isEmpty())
        {
            storedParts.addAll(Arrays.asList(storedNgrams.remove(0).split("\\s")));
        }

        while (storedParts.size() >= 6)
        {
            String w1 = storedParts.get(0);
            String w2 = storedParts.get(1);
            long matchCount = Long.parseLong(storedParts.get(3)); // long: totals can overflow an int
            // System.out.println(w1 + " " + w2 + " " + matchCount); // debug output

            if (toWritePairs.isEmpty())
            {
                // start accumulating a new bigram
                toWritePairs.add(w1);
                toWritePairs.add(w2);
                toWritePairs.add(Long.toString(matchCount));
            }
            else if (w1.equals(toWritePairs.get(0)) && w2.equals(toWritePairs.get(1)))
            {
                // same bigram, another year: add this row's count to the running total
                long totalFreq = matchCount + Long.parseLong(toWritePairs.get(2));
                toWritePairs.set(2, Long.toString(totalFreq));
            }
            else
            {
                // a new bigram starts: write out the finished one, then begin again
                out.append(toWritePairs.get(0) + " " + toWritePairs.get(1) + " "
                        + toWritePairs.get(2) + "\n");
                toWritePairs.clear();
                toWritePairs.add(w1);
                toWritePairs.add(w2);
                toWritePairs.add(Long.toString(matchCount));
            }
            storedParts.subList(0, 6).clear(); // drop the six fields just consumed
        }

        if (!more && !toWritePairs.isEmpty())
        {
            // end of input: flush the last accumulated bigram
            out.append(toWritePairs.get(0) + " " + toWritePairs.get(1) + " "
                    + toWritePairs.get(2) + "\n");
            toWritePairs.clear();
        }
        out.flush();
        return more;
    }
}
If anyone has any ideas on how to improve the processing speed for these files, it would help me immensely.
I suggest you process the data as you go rather than reading in a large batch and processing it later. It's not entirely clear from your program what information you are trying to extract/aggregate, but assuming you want one total match count per ngram, a single streaming pass over each file is enough.
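Here is a minimal sketch of that idea. It assumes, as your own code already does, that all rows for a given ngram are adjacent in the file, and that the fields are tab-separated; the file names and class name are just placeholders:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class StreamingMerge
{
    public static void main(String[] args) throws IOException
    {
        BufferedReader in = new BufferedReader(new FileReader("data01"));
        Writer out = new BufferedWriter(new FileWriter("newData.txt"));

        String currentNgram = null; // the ngram whose counts we are summing
        long totalMatches = 0;
        String line;
        while ((line = in.readLine()) != null)
        {
            // ngram TAB year TAB match_count TAB page_count TAB volume_count
            String[] fields = line.split("\t");
            if (!fields[0].equals(currentNgram))
            {
                // a new ngram begins: write out the finished one
                if (currentNgram != null)
                    out.write(currentNgram + " " + totalMatches + "\n");
                currentNgram = fields[0];
                totalMatches = 0;
            }
            totalMatches += Long.parseLong(fields[2]);
        }
        if (currentNgram != null) // don't forget the last ngram
            out.write(currentNgram + " " + totalMatches + "\n");
        in.close();
        out.close();
    }
}

This never holds more than one line in memory, so all the ArrayList shuffling disappears (remove(0) on an ArrayList shifts every remaining element, which is very expensive on a million-entry list), and the BufferedWriter is left to decide for itself when to flush.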
Even on a fast machine, I would expect this to take about 20 seconds per file: at a sequential read speed of roughly 100 MB/s, just pulling 1.5 GB off the disk takes around 15 seconds, so a one-pass program is essentially I/O-bound.