python regex diff araxis

Automating a directory diff while ignoring some particular lines in files


I need to compare two directories and produce some sort of structured output (a text file is fine) of the differences. That is, the output might look something like this:

file1 exists only in directory2
file2 exists only in directory1
file3 is different between directory1 and directory2

I don't care about the format, so long as the information is there. The second requirement is that I need to be able to ignore certain character sequences when diffing two files. Araxis Merge has this ability: you can type in a regex, and any files whose only differences are in character sequences matching that regex will be reported as identical.

That would make Araxis Merge a good candidate, but so far I have found no way to produce structured output of the diff. Even when launching consolecompare.exe with command-line arguments, it just opens an Araxis GUI window showing the differences.

So, does either of the following exist?

  • A way to get Araxis Merge to print a diff result to a text file?
  • Another utility that does a diff while ignoring certain character sequences, and produces structured output?

Extra credit if such a utility exists as a module or plugin for Python. Please keep in mind this must be done entirely from a command line / Python script - no GUIs.


Solution

  • To some extent, the plain old diff command can do just that, i.e. compare directory contents while ignoring changes that match a certain regex pattern (using the -I option).

    From man diff:

    -I regexp
          Ignore changes that just insert or delete lines that match  regexp.
    

    Quick demo:

    [me@home]$ diff images/ images2
    Only in images2: x
    Only in images/: y
    diff images/z images2/z
    1c1
    < zzz
    ---
    > zzzyy2
    
    [me@home]$ # a less verbose version
    [me@home]$ diff -q images/ images2
    Only in images2: x
    Only in images/: y
    Files images/z and images2/z differ
    
    [me@home]$ # ignore diffs on lines that contain "zzz"
    [me@home]$ diff -q -I ".*zzz.*" images/ images2/
    Only in images2/: x
    Only in images/: y
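
For the "extra credit" Python route, here is a minimal sketch using only the standard library (filecmp, os and re). The function name compare_dirs and the helper _filtered_lines are made up for this illustration. Note that it behaves a little differently from diff -I: instead of ignoring whole lines that match, it strips every match of the regex from each line before comparing, which is closer to what the question describes for Araxis Merge. It assumes the files are text; also be aware that filecmp.dircmp skips a few names (such as CVS and RCS) by default.

```python
import filecmp
import os
import re


def _filtered_lines(path, ignore):
    """Read a text file and remove every match of `ignore` from each line."""
    # Assumption: files are text; undecodable bytes are replaced, not fatal.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [ignore.sub("", line) for line in handle]


def compare_dirs(dir1, dir2, ignore_pattern):
    """Return a list of report lines describing how dir1 and dir2 differ.

    Files whose only differences lie in text matching `ignore_pattern`
    are treated as identical and produce no report line.
    """
    ignore = re.compile(ignore_pattern)
    report = []

    def walk(node, rel):
        # Entries (files or directories) present on one side only.
        for name in sorted(node.left_only):
            report.append(f"{os.path.join(rel, name)} exists only in {dir1}")
        for name in sorted(node.right_only):
            report.append(f"{os.path.join(rel, name)} exists only in {dir2}")
        # Files present on both sides: compare contents after filtering.
        for name in sorted(node.common_files):
            left = os.path.join(node.left, name)
            right = os.path.join(node.right, name)
            if _filtered_lines(left, ignore) != _filtered_lines(right, ignore):
                report.append(
                    f"{os.path.join(rel, name)} is different "
                    f"between {dir1} and {dir2}"
                )
        # Recurse into subdirectories that exist on both sides.
        for name, sub in sorted(node.subdirs.items()):
            walk(sub, os.path.join(rel, name))

    walk(filecmp.dircmp(dir1, dir2), "")
    return report
```

Calling something like print("\n".join(compare_dirs("images", "images2", r"zzz\w*"))) would print one line per difference, and the same list can just as easily be written to a text file.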