[I am new to Java and Stackoverflow. My last question was closed. I have added a complete code this time. thanks] I have a large txt file of 4GB (vocab.txt). It contains plain Bangla(unicode) words. Each word is in newline with its frequency(equal sign in between). Such as,
আমার=5
তুমি=3
সে=4
আমার=3 //duplicate of 1st word of with different frequency
করিম=8
সে=7 //duplicate of 3rd word of with different frequency
As you can see, it has same words multiple times with different frequencies. How to keep only a single word (instead of multiple duplicates) and with summation of all frequencies of the duplicate words. Such as, the file above would be like (output.txt),
আমার=8 //5+3
তুমি=3
সে=11 //4+7
করিম=8
I have used HashMap to solve the problem. But I think I made some mistakes somewhere. It runs and shows the exact data to output file without changing anything.
package data_correction;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.OutputStreamWriter;
import java.util.*;
import java.awt.Toolkit;
public class Main {
public static void main(String args[]) throws Exception {
FileInputStream inputStream = null;
Scanner sc = null;
String path="C:\\DATA\\vocab.txt";
FileOutputStream fos = new FileOutputStream("C:\\DATA\\output.txt",true);
BufferedWriter bufferedWriter = new BufferedWriter(
new OutputStreamWriter(fos,"UTF-8"));
try {
System.out.println("Started!!");
inputStream = new FileInputStream(path);
sc = new Scanner(inputStream, "UTF-8");
while (sc.hasNextLine()) {
String line = sc.nextLine();
line = line.trim();
String [] arr = line.split("=");
Map<String, Integer> map = new HashMap<>();
if (!map.containsKey(arr[0])){
map.put(arr[0],Integer.parseInt(arr[1]));
}
else{
map.put(arr[0], map.get(arr[0]) + Integer.parseInt(arr[1]));
}
for(Map.Entry<String, Integer> each : map.entrySet()){
bufferedWriter.write(each.getKey()+"="+each.getValue()+"\n");
}
}
bufferedWriter.close();
if (sc.ioException() != null) {
throw sc.ioException();
}
} finally {
if (inputStream != null) {
inputStream.close();
}
if (sc != null) {
sc.close();
}
}
System.out.print("FINISH");
Toolkit.getDefaultToolkit().beep();
}
}
Thanks for your time.
This should do what you want with some mor eJava magic:
public static void main(String[] args) throws Exception {
String separator = "=";
Map<String, Integer> map = new HashMap<>();
try (Stream<String> vocabs = Files.lines(new File("test.txt").toPath(), StandardCharsets.UTF_8)) {
vocabs.forEach(
vocab -> {
String[] pair = vocab.split(separator);
int value = Integer.valueOf(pair[1]);
String key = pair[0];
if (map.containsKey(key)) {
map.put(key, map.get(key) + value);
} else {
map.put(key, value);
}
}
);
}
System.out.println(map);
}
For test.txt
take the correct file path. Pay attention that the map is kept in memory, so this is maybe not the best approach. If necessary replace the map with a e.g. database backed approach.