c#  asp.net-core  text-comparison  google-diff-match-patch

Text comparison using a diff library in .NET


Hi, I am working on a text comparison component using the diff-match-patch library. I have the three strings below:

string s1 = "the Charterers as follows: ";
string s2 = "the Charterers as follows: added";
string s3 = "the Charterers as follows: added new content again";

s1 is the initial string. In s2 I inserted the text "added" (first change). In s3 I inserted the text "new content again" (second change).

In the final output I need to identify the last modified text, that is, "new content again". So the final output should look something like this:

[
    {
        "text": "the Charterers as follows: ",
        "operation": 2
    },
    {
        "text": "added",
        "operation": 1
    },
    {
        "text": "new content again",
        "operation": 4
    }
]

If we concatenate all the text values in the output above, we get the final string:

the Charterers as follows: added new content again

That should be the final output. Operation 4 is not currently supported by the library, which has these operation codes:

0 - DELETE, 1 - INSERT, 2 - EQUAL

How can I apply logic on top of the diff library to find the last modified content from these three strings and get output like the above?

Below is sample code for comparing two strings using the library:

            string s1 = "the Charterers as follows: ";
            string s2 = "the Charterers as follows: added";
            string s3 = "the Charterers as follows: added new content again";
            diff_match_patch dmp = new diff_match_patch();
            List<NewDiff> newDiffList = new List<NewDiff>();
            List<Diff> diff = dmp.diff_main($"{s1}", $"{s3}");
            dmp.diff_cleanupSemantic(diff);
            for (int k = 0; k < diff.Count; k++)
            {
                NewDiff newDiff = new NewDiff();
                newDiff.Operation = diff[k].operation;
                newDiff.Text = diff[k].text;
                newDiffList.Add(newDiff);
            }

How can we enhance this code to identify the last modified content? Any help would be appreciated. Thanks.


Solution

  • To identify the last modified content with the diff library, compare the strings sequentially and track the changes across iterations. The library has no direct way to report only the most recent changes, so you need to add your own logic on top of the diff results to track the modifications.

    What you can do is:

    1) Compare s1 with s2 to find the first difference.

    2) Compare s2 with s3 to find the second difference.

    3) Combine the differences from both comparisons to reconstruct the full change history.

    Here is the sample code you could try:

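    // Assumes Google's C# port, which lives in the DiffMatchPatch namespace;
    // adjust the last using if your package exposes a different namespace.
    using System.Collections.Generic;
    using System.Linq;
    using DiffMatchPatch;

    string s1 = "the Charterers as follows: ";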
    string s2 = "the Charterers as follows: added";
    string s3 = "the Charterers as follows: added new content again";
    
    diff_match_patch dmp = new diff_match_patch();
    List<NewDiff> combinedDiffList = new List<NewDiff>();
    
    List<Diff> diff1 = dmp.diff_main(s1, s2);
    dmp.diff_cleanupSemantic(diff1);
    
    foreach (Diff aDiff in diff1)
    {
        combinedDiffList.Add(new NewDiff() { Operation = aDiff.operation, Text = aDiff.text });
    }
    
    List<Diff> diff2 = dmp.diff_main(s2, s3);
    dmp.diff_cleanupSemantic(diff2);
    
    foreach (Diff aDiff in diff2)
    {
        switch (aDiff.operation)
        {
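            case Operation.EQUAL:
                // Do nothing for equal text
                break;
            case Operation.INSERT:
                // Text inserted between s2 and s3, i.e. the most recent change
                combinedDiffList.Add(new NewDiff() { Operation = aDiff.operation, Text = aDiff.text });
                break;
            case Operation.DELETE:
                // Heuristic: strip the deleted text from the first entry that contains it.
                // This matches by content, not by position, so repeated text may hit the wrong entry.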
                var existingDiff = combinedDiffList.FirstOrDefault(d => d.Text.Contains(aDiff.text));
                if (existingDiff != null)
                {
                    // Update the existing entry to reflect the deletion
                    existingDiff.Text = existingDiff.Text.Replace(aDiff.text, "");
                    if (string.IsNullOrEmpty(existingDiff.Text))
                    {
                        // If all text from an entry is deleted, remove the entry
                        combinedDiffList.Remove(existingDiff);
                    }
                }
                else
                {
                    // If the deleted text does not exist in the combined list, add it as a deletion
                    combinedDiffList.Add(new NewDiff() { Operation = aDiff.operation, Text = aDiff.text });
                }
                break;
        }
    }
    
    // The combinedDiffList now contains the final set of differences
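
The sample above appends second-pass inserts with the ordinary INSERT code, so the result does not yet distinguish the latest change the way the question's "operation": 4 does. Below is a minimal sketch of one way to do that, not a feature of the library: label each character of s2 by how it arrived (from the s1 to s2 diff), then walk the s2 to s3 diff and carry those labels forward, tagging new inserts with a custom code. The `Op` enum, the `LastInsert = 4` value, and the `DiffHistory.Merge` helper are all hypothetical names introduced for illustration; the sketch works on plain (operation, text) pairs, so you would first copy each `Diff`'s `operation` and `text` into that shape.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical codes: 0-2 follow the numbering quoted in the question,
// 4 is the custom "inserted in the latest revision" code the question asks for.
public enum Op { Delete = 0, Insert = 1, Equal = 2, LastInsert = 4 }

public static class DiffHistory
{
    // first:  diff of s1 -> s2, second: diff of s2 -> s3,
    // each expressed as (operation, text) pairs.
    public static List<(Op Operation, string Text)> Merge(
        IEnumerable<(Op Operation, string Text)> first,
        IEnumerable<(Op Operation, string Text)> second)
    {
        // Label every character of s2 with how it got there (Equal or Insert).
        var s2Labels = new List<Op>();
        foreach (var (op, text) in first)
        {
            if (op == Op.Delete) continue; // deleted text is not part of s2
            for (int i = 0; i < text.Length; i++) s2Labels.Add(op);
        }

        var result = new List<(Op Operation, string Text)>();
        int pos = 0; // current position within s2

        foreach (var (op, text) in second)
        {
            switch (op)
            {
                case Op.Delete:
                    // Removed in the latest revision: skip those s2 characters.
                    pos += text.Length;
                    break;
                case Op.Insert:
                    // New in the latest revision: tag with the custom code.
                    Append(result, Op.LastInsert, text);
                    break;
                case Op.Equal:
                    // Unchanged since s2: keep the label from the first pass.
                    foreach (char c in text)
                        Append(result, s2Labels[pos++], c.ToString());
                    break;
            }
        }
        return result;
    }

    // Adds text to the result, extending the previous segment when the code matches.
    private static void Append(List<(Op Operation, string Text)> result, Op op, string text)
    {
        if (text.Length == 0) return;
        int last = result.Count - 1;
        if (last >= 0 && result[last].Operation == op)
            result[last] = (op, result[last].Text + text);
        else
            result.Add((op, text));
    }
}
```

For the question's strings, passing the pairs (Equal, "the Charterers as follows: "), (Insert, "added") as the first diff and (Equal, "the Charterers as follows: added"), (Insert, " new content again") as the second should give three segments with codes 2, 1 and 4. Because labels are tracked by position rather than by searching for matching text, deletions in the second pass do not depend on the deleted text being unique.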