Java / JavaScript: File comparison line by line while ignoring certain sections


QUESTION: Is there a better way to compare two small (about 100 KB) files line by line while selectively ignoring certain portions of the text, and to report the differences?

I am looking for default/existing Java libraries or native Windows apps.

Below is the scenario:

Expected file (file 1) is located at D:\expected\FileA_61613.txt
Actual file (file 2) is located at D:\actuals\FileA_61613.txt

Content of the expected file:

Some first line here


There may be whitespaces, line breaks, indentation and here is another line

Key        : SomeValue
Date       : 01/02/2012
Time       : 18:20
key2       : Value2
key3       : Value3
key4       : Value4
key5       : Value5

Some other text again to indicate that his is end of this file.

Content of the actual file to be compared:

Some first line here


There may be whitespaces, line breaks, indentation and here is another line

Key        : SomeValue
Date       : 18/09/2013
Timestamp  : 15:10.345+10.00
key2       : Value2
key3       : Value3
key4       : Something Different
key5    : Value5


Some other text again to indicate that his is end of this file.

Files 1 and 2 need to be compared line by line, WITHOUT ignoring
whitespace, indentation, or line breaks.

The comparison result should look something like this:
Line 8 - Expected Time, but actual Timestamp
Line 8 - Expected HH.mm, but actual HH.mm .345+10.00
Line 10 - Expected Value4, but actual Something different.
Line 11 - Expected indentation N spaces, but actual only X spaces
Line 13 - Expected a line break, but no linebreak present.

The following have also changed but SHOULD BE IGNORED:
Line 7 - Expected 01/02/2012, but actual 18/09/2013 (exactly and only the 10 chars)
Line 8 - Expected 18:20, but actual 15:10 (exactly and only 5 chars should be ignored)
Note: The remaining .345+10.00 should be reported

It is fine even if the result contains only the line numbers and no analysis of why each line failed.
But it should not just report a failure at line 8 and exit.
It should report all the changes, except for the excluded "date" and "time" values.

Some search results pointed to solutions using Perl, but I am looking for Java / JavaScript solutions. The input to the solution would be the full file paths of both files.

My current work-around:
Replace the text to be ignored with '#'. During the comparison, if a '#' is encountered, it is not treated as a difference. Below is my working code, but I need to know whether I can use default/existing libraries or functions to achieve this.

import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class fileComparison {
    public static void main(String[] args) throws IOException {
        FileInputStream fstream1 = new FileInputStream(
                "D:\\expected\\FileA_61613.txt");
        FileInputStream fstream2 = new FileInputStream(
                "D:\\actuals\\FileA_61613.txt");
        DataInputStream in1 = new DataInputStream(fstream1);
        BufferedReader br1 = new BufferedReader(new InputStreamReader(in1));
        DataInputStream in2 = new DataInputStream(fstream2);
        BufferedReader br2 = new BufferedReader(new InputStreamReader(in2));
        int lineNumber = 0;
        String strLine1 = null;
        String strLine2 = null;
        StringBuilder sb = new StringBuilder();
        System.out.println(sb);
        boolean isIgnored = false;

        while (((strLine1 = br1.readLine()) != null)
                && ((strLine2 = br2.readLine()) != null)) {
            lineNumber++;
            if (!strLine1.equals(strLine2)) {
                int strLine1Length = strLine1.length();
                int strLine2Length = strLine2.length();
                int maxIndex = Math.min(strLine1Length, strLine2Length);
                if (maxIndex == 0) {
                    sb.append("Mismatch at line " + lineNumber
                            + " all characters " + '\n');
                    break;
                }
                int i;
                for (i = 0; i < maxIndex; i++) {
                    if (strLine1.charAt(i) == '#') {
                        isIgnored = true;
                        continue;
                    }
                    if (strLine1.charAt(i) != strLine2.charAt(i)) {
                        isIgnored = false;
                        break;
                    }
                }
                if (isIgnored) {
                    sb.append("Ignored line " + lineNumber + '\n');
                } else {
                    sb.append("Mismatch at line " + lineNumber + " at char "
                            + i + '\n');
                }
            }
        }
        System.out.println(sb.toString());
        br1.close();
        br2.close();

    }
}

I get the following output:

Ignored line 7
Mismatch at line 8 at char 4
Mismatch at line 11 at char 13
Mismatch at line 12 at char 8
Mismatch at line 14 all characters 

However, when there are multiple differences in the same line, I am not able to log them all, because I am comparing char by char and not word by word.
I did not choose word-by-word comparison because I thought it would not be possible to compare line breaks and whitespace that way. Is my understanding right?


Solution

  • java.lang.StringIndexOutOfBoundsException comes from this code:

    for (int i = 0; i < strLine1.length(); i++) {
       if (strLine1.charAt(i) != strLine2.charAt(i)) {
           System.out.println("char not same at " + i);
       }   
    }
    

    When strLine1 is longer than strLine2 (for example, because the line in the second file is shorter than the one in the first), the loop index i eventually goes past the last valid index of strLine2. At that point strLine2.charAt(i) throws the exception, because strLine2 has no character at that position.
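
One way to avoid the exception and still log every difference on a line is to compare only up to the length of the shorter line and then report the leftover characters separately. The following is a minimal sketch, not a library API: the class LineDiff and the methods mask and mismatches are hypothetical names, the '#' marker follows the work-around in the question, and the date and time patterns are assumptions based on the sample files (dd/MM/yyyy and HH:mm).

```java
import java.util.ArrayList;
import java.util.List;

final class LineDiff {
    // Marker for characters in the expected line that must not be compared
    // (same convention as the '#' work-around in the question).
    static final char IGNORE = '#';

    // Replaces dates (assumed dd/MM/yyyy) and times (assumed HH:mm) in the
    // expected line with '#' so that exactly those characters are skipped.
    static String mask(String expectedLine) {
        return expectedLine
                .replaceAll("\\d{2}/\\d{2}/\\d{4}", "##########")
                .replaceAll("\\d{2}:\\d{2}", "#####");
    }

    // Returns every index at which the two lines differ. Only the common
    // prefix is compared with charAt, so no index can be out of range;
    // positions beyond the shorter line are all reported as differences.
    static List<Integer> mismatches(String expected, String actual) {
        List<Integer> diffs = new ArrayList<>();
        int common = Math.min(expected.length(), actual.length());
        for (int i = 0; i < common; i++) {
            char e = expected.charAt(i);
            if (e != IGNORE && e != actual.charAt(i)) {
                diffs.add(i);
            }
        }
        int longest = Math.max(expected.length(), actual.length());
        for (int i = common; i < longest; i++) {
            diffs.add(i);
        }
        return diffs;
    }

    public static void main(String[] args) {
        // The date differs, but the masked characters are ignored: prints []
        System.out.println(mismatches(mask("Date       : 01/02/2012"),
                                      "Date       : 18/09/2013"));
        // A single differing character: prints [2]
        System.out.println(mismatches("abc", "abd"));
        // The actual line is shorter: prints [3, 4, 5]
        System.out.println(mismatches("abcdef", "abc"));
    }
}
```

Because the method returns a list of positions instead of stopping at the first mismatch, a caller can print all differences for a line together with its line number. Whitespace and indentation are still compared, since every character other than '#' takes part in the comparison.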