Parse WhatsApp log file in Java


I'm currently working on a little tool that analyses the usage of a group chat in WhatsApp.

I'm trying to do this with the WhatsApp log file. I managed to convert the raw .txt into the following format so that I can work with the formatted text:

29. Jan. 12:01 - Random Name: message text
29. Jan. 12:22 - Random Name: message text
29. Jan. 12:24 - Random Name: message text
29. Jan. 12:38 - Random Name: message text
29. Jan. 12:52 - Random Name: message text

So far, so good. The problem is that there are a few sloppy lines, where a message continues on the following line, like:

29. Jan. 08:42 - Random Name2: message text 1
                 additional text of the message 1
29. Jan. 08:43 - Random Name2: message text 2

or even worse:

15. Jan. 14:00 - Random Name: First part of the message
                 second part
                 third part
                 forth part
                 fifth part    
29. Jan. 08:43 - Random Name2: message text 2

I guess I need some kind of algorithm to solve this problem, but I'm pretty new to programming and can't come up with such a complex algorithm.

The same problem in Python: parse a whatsApp conversation log

[EDIT]

This is my code, which doesn't work. (I know it's pretty bad.)

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class FormatList {

    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub
        FileReader fr = new FileReader("Whatsapp_formated.txt");
        BufferedReader br = new BufferedReader(fr);

        FileWriter fw = new FileWriter("Whatsapp_formated2.txt");
        BufferedWriter ausgabe = new BufferedWriter(fw);

        String line="";
        String buffer="";

        while((line = br.readLine())!=null)
        {
            System.out.println("\n"+line);

            if(line.isEmpty())
            {

            }
            else{
                if(line.charAt(0)=='0'||line.charAt(0)=='1'||line.charAt(0)=='2'||line.charAt(0)=='3'||line.charAt(0)=='4'||line.charAt(0)=='5'||line.charAt(0)=='6'||line.charAt(0)=='7'||line.charAt(0)=='8'||line.charAt(0)=='9')
                {
                    buffer = line;

                }
                else
                {
                    buffer += line;
                }

                 ausgabe.write(buffer);
                 ausgabe.newLine();
                System.out.println(buffer);
            }

            ausgabe.close();

        }




    }

}

[EDIT 2]

In the end I want to read the file and analyse each line:

29. Jan. 12:01 - Random Name: message text

I can tell when it was sent, who sent it, and what (and how much) they wrote.

If I now get the following line:

additional text of the message 1

I can tell neither when it was written nor who sent it.
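
To make the goal concrete, here is a minimal sketch of the per-line analysis described above. It assumes every message already sits on a single line in the layout "date time - sender: text" and that sender names contain no colon. The class ChatLineParser and its nested ChatMessage type are hypothetical names chosen for illustration; they are not part of the original code.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question: splits one
// single-line log entry into timestamp, sender and message text.
final class ChatLineParser {

    // Holds the three parts of a single message line.
    static final class ChatMessage {
        final String timestamp; // e.g. "29. Jan. 12:01", kept as plain text
        final String sender;    // e.g. "Random Name"
        final String text;      // e.g. "message text"

        ChatMessage(String timestamp, String sender, String text) {
            this.timestamp = timestamp;
            this.sender = sender;
            this.text = text;
        }
    }

    // Assumed layout: "<dd>. <Mon>. <hh:mm> - <sender>: <text>".
    // The sender group stops at the first ':' after the dash, so this
    // sketch assumes sender names contain no colon.
    private static final Pattern LINE = Pattern.compile(
            "^(\\d{2}\\.\\s+\\p{L}{3}\\.\\s+\\d{2}:\\d{2}) - ([^:]+): (.*)$");

    // Returns the parsed message, or empty if the line has no date header
    // (for example a continuation line that was not merged yet).
    static Optional<ChatMessage> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new ChatMessage(m.group(1), m.group(2), m.group(3)));
    }

    public static void main(String[] args) {
        parse("29. Jan. 12:01 - Random Name: message text").ifPresent(msg ->
                System.out.println(msg.timestamp + " | " + msg.sender
                        + " | " + msg.text.length() + " chars"));
    }
}
```

A line such as "additional text of the message 1" yields an empty Optional here, which is exactly why the continuation lines have to be attached to their message before this step.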


Solution

  • Well, I believe I came up with a solution to your problem, as far as I understood it.

    Given a file with this format:

    29. Jan. 12:01 - Random Name: message text
    29. Jan. 12:22 - Random Name: message text
    29. Jan. 12:24 - Random Name: message text
    29. Jan. 12:38 - Random Name: message text
    29. Jan. 12:52 - Random Name: message text
    29. Jan. 08:42 - Random Name2: message text 1
                     additional text of the message 1
    29. Jan. 08:43 - Random Name2: message text 2
    15. Jan. 14:00 - Random Name: First part of the message
                     second part
                     third part
                     forth part
                     fifth part    
    29. Jan. 08:43 - Random Name2: message text 2
    

    (This is a file called "wsp.log" in my "data" folder, so the path to it is "data/wsp.log".)

    I expect something like this:

    29. Jan. 12:01 - Random Name: message text
    29. Jan. 12:22 - Random Name: message text
    29. Jan. 12:24 - Random Name: message text
    29. Jan. 12:38 - Random Name: message text
    29. Jan. 12:52 - Random Name: message text
    29. Jan. 08:42 - Random Name2: message text 1 additional text of the message 1
    29. Jan. 08:43 - Random Name2: message text 2
    15. Jan. 14:00 - Random Name: First part of the message second part third part forth part fifth part
    29. Jan. 08:43 - Random Name2: message text 2
    

    Based on that, I implemented the following class:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LogReader {

        public void processWspLogFile() throws IOException {
            // a. Reference the file
            File wspLogFile = new File("data/wsp.log");

            // This will hold the reformatted content of the file
            StringBuilder stringFormatter = new StringBuilder();
            // The first line always starts with a date, so it is appended as-is; the checks start from the second line
            boolean firstIterationDone = false;

            // Now I can use some regex (I'm not really good at this stuff, I just used a web page: http://txt2re.com/)
            /* This regex matches lines that start with a date in the format "29. Jan. 12:22". Looking at your file,
               the "additional text of the message" lines do not start with a date, so I can use that as my point of separation.
               The leading ^ keeps a date-like string in the middle of a message from being mistaken for a new message. */
            String regex = "^\\d{2}\\.\\s+[a-z]{3}\\.\\s+\\d{2}:\\d{2}";
            // Compile the regex into a Pattern so every line can be checked against it
            Pattern wspLogDatePattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

            // Use the line separator of the OS
            String lineSeparator = System.lineSeparator();

            // b. Read the file with a BufferedReader; try-with-resources closes it when done
            try (BufferedReader bufferedReader = new BufferedReader(new FileReader(wspLogFile))) {
                String currLine; // The current line (like my cursor)

                while ((currLine = bufferedReader.readLine()) != null) {

                    if (!firstIterationDone) {
                        stringFormatter.append(currLine);
                        firstIterationDone = true;
                    } else {
                        Matcher wspLogDateMatcher = wspLogDatePattern.matcher(currLine);

                        if (wspLogDateMatcher.find()) {
                            // The line starts with a date, so it is a "normal" line: begin a new output line
                            stringFormatter.append(lineSeparator).append(currLine);
                        } else {
                            // Otherwise it is a continuation: append it to the current output line
                            stringFormatter.append(" ").append(currLine.trim());
                        }
                    }
                }
            }
            System.out.println(stringFormatter.toString());
        }
    }
    

    I invoke it this way:

    public static void main(String[] args) throws IOException {
        new LogReader().processWspLogFile();
    }
    

    I hope this gives you some ideas or is useful for your purposes. I know some improvements are needed (code can always use refactoring :)), but for now it produces the expected format. Happy coding :).
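
As a variation that stays closer to the buffer idea from the question, here is a minimal sketch that collects the merged messages in a list instead of printing them. It differs from the loop in the question in two ways: a buffered message is only emitted once the next dated line (or the end of the input) is reached, and nothing is closed inside the loop. The class ContinuationMerger and its merge method are hypothetical names, and the sketch assumes that every new message starts with a date such as "29. Jan. 12:01".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: merges continuation lines into the message they belong to.
final class ContinuationMerger {

    // Assumption: a new message starts with a date such as "29. Jan. 12:01".
    private static final Pattern MESSAGE_START =
            Pattern.compile("^\\d{2}\\.\\s+\\p{L}{3}\\.\\s+\\d{2}:\\d{2}");

    static List<String> merge(List<String> rawLines) {
        List<String> merged = new ArrayList<>();
        StringBuilder buffer = null; // message currently being assembled

        for (String line : rawLines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            if (MESSAGE_START.matcher(line).find()) {
                // A new message begins: the previous one is complete, so flush it.
                if (buffer != null) {
                    merged.add(buffer.toString());
                }
                buffer = new StringBuilder(line.trim());
            } else if (buffer != null) {
                // Continuation line: attach it to the message in the buffer.
                buffer.append(' ').append(line.trim());
            }
            // A continuation line that appears before any dated line is dropped.
        }
        if (buffer != null) {
            merged.add(buffer.toString()); // flush the last message
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
                "29. Jan. 08:42 - Random Name2: message text 1",
                "                 additional text of the message 1",
                "29. Jan. 08:43 - Random Name2: message text 2");
        merge(raw).forEach(System.out::println);
    }
}
```

Returning a list rather than one big string keeps each message as a separate element, which should make the later per-message analysis (when, who, how much) easier to add.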