
Summing frequencies of duplicate words in Java


[I am new to Java and Stack Overflow. My last question was closed, so I have added the complete code this time. Thanks.] I have a large 4 GB text file (vocab.txt). It contains plain Bangla (Unicode) words. Each word is on its own line, followed by an equals sign and its frequency. For example:

আমার=5 
তুমি=3
সে=4 
আমার=3 //duplicate of the 1st word, with a different frequency
করিম=8 
সে=7    //duplicate of the 3rd word, with a different frequency

As you can see, the same word appears multiple times with different frequencies. How can I keep a single entry per word, with its frequency being the sum of the frequencies of all its duplicates? For the file above, the result (output.txt) would be:

আমার=8   //5+3
তুমি=3
সে=11      //4+7
করিম=8 

I tried to solve the problem with a HashMap, but I think I made a mistake somewhere. The program runs, but it writes exactly the same data to the output file without changing anything.

package data_correction;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.OutputStreamWriter;
import java.util.*;

import java.awt.Toolkit;
public class Main {

    public static void main(String args[]) throws Exception { 
            FileInputStream inputStream = null;
            Scanner sc = null;
            String path="C:\\DATA\\vocab.txt";
            FileOutputStream fos = new FileOutputStream("C:\\DATA\\output.txt",true);
            
            BufferedWriter bufferedWriter = new BufferedWriter(
                    new OutputStreamWriter(fos,"UTF-8"));
            try {
                System.out.println("Started!!");
                inputStream = new FileInputStream(path);
                sc = new Scanner(inputStream, "UTF-8");
                while (sc.hasNextLine()) {
                        String line = sc.nextLine();
                        line = line.trim();
                        String [] arr = line.split("=");
                        Map<String, Integer> map = new HashMap<>();
                            if (!map.containsKey(arr[0])){
                                 map.put(arr[0],Integer.parseInt(arr[1]));
                            } 
                            else{
                                 map.put(arr[0], map.get(arr[0]) + Integer.parseInt(arr[1]));
                            }

                            for(Map.Entry<String, Integer> each : map.entrySet()){
                                bufferedWriter.write(each.getKey()+"="+each.getValue()+"\n"); 
                            }

                }
                bufferedWriter.close();
                if (sc.ioException() != null) {
                    throw sc.ioException();
                }
            } finally {
                if (inputStream != null) {
                    inputStream.close();
                }
                if (sc != null) {
                    sc.close();
                }
            }
            System.out.print("FINISH");
            Toolkit.getDefaultToolkit().beep();
            }
    }

Thanks for your time.


Solution

  • This should do what you want, with some more Java magic:

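        import java.nio.charset.StandardCharsets;
        import java.nio.file.Files;
        import java.nio.file.Paths;
        import java.util.HashMap;
        import java.util.Map;
        import java.util.stream.Stream;

        public class Main {
            public static void main(String[] args) throws Exception {
                String separator = "=";
                // Declared once, before any line is processed, so totals accumulate.
                Map<String, Integer> map = new HashMap<>();
                // Files.lines reads lazily; try-with-resources closes the file.
                try (Stream<String> vocabs = Files.lines(Paths.get("test.txt"), StandardCharsets.UTF_8)) {
                    vocabs.forEach(vocab -> {
                        // trim() guards against trailing spaces, which parseInt rejects.
                        String[] pair = vocab.trim().split(separator);
                        String key = pair[0];
                        int value = Integer.parseInt(pair[1].trim());
                        // Insert the count, or add it to the existing total.
                        map.merge(key, value, Integer::sum);
                    });
                }
                System.out.println(map);
            }
        }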
    

    Replace test.txt with the correct file path. Note that the whole map is kept in memory, so this may not be the best approach for a file this large. If necessary, replace the map with, for example, a database-backed approach.
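For completeness, here is a minimal sketch that also writes the merged result to a file, which is what the question asks for. In the question's code the HashMap is created inside the while loop and the write loop also runs inside it, so every line is put into a fresh map and written straight back out; that is why the output matched the input. The class name VocabMerger, the method name sumFrequencies, and the default file names are assumptions made for illustration. Long is used for the totals on the assumption that sums over a 4 GB file could exceed the int range, and LinkedHashMap is used only to keep first-seen order.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class VocabMerger {

    // Sums the counts of duplicate words; the map lives outside the read loop.
    public static Map<String, Long> sumFrequencies(BufferedReader reader) throws IOException {
        Map<String, Long> totals = new LinkedHashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            int sep = line.lastIndexOf('=');
            if (sep <= 0 || sep == line.length() - 1) {
                continue; // skip blank lines and lines missing a word or a count
            }
            String word = line.substring(0, sep).trim();
            long count = Long.parseLong(line.substring(sep + 1).trim());
            // Insert the count, or add it to the running total for this word.
            totals.merge(word, count, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default paths; pass real ones as arguments.
        Path input = Paths.get(args.length > 0 ? args[0] : "vocab.txt");
        Path output = Paths.get(args.length > 1 ? args[1] : "output.txt");

        Map<String, Long> totals;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            totals = sumFrequencies(reader);
        }

        // Write only after the whole input has been read and merged.
        try (BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Long> entry : totals.entrySet()) {
                writer.write(entry.getKey() + "=" + entry.getValue());
                writer.newLine();
            }
        }
    }
}
```

Run on the sample from the question (without the // annotations, which would not parse as numbers), this should write আমার=8, তুমি=3, সে=11 and করিম=8, one per line. The memory caveat above still applies: the map holds one entry per distinct word.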