c# .net asp.net-core string-comparison

Word comparison in a line of text in C#


Hi, I am using C# in my project and I am trying to get output like the one described below.

 string str1 = "Cat meet's a dog has";
 string str2 = "Cat meet's a dog and a bird";

 string[] str1Words = str1.ToLower().Split(' ');
 string[] str2Words = str2.ToLower().Split(' ');

 var uniqueWords = str2Words
   .Except(str1Words)
   .Concat(str1Words.Except(str2Words))
   .ToList();

This gives me the output has, and, a, bird, which is correct, but what I would like is something like the following:

has - present in first string not present in second string

and a bird - not present in first string but present in second string

For example, a second use case:

String S1 = "Added";
String S2 = "Edited";

Here the output should be:

Added - present in first string not present in second string

Edited - not present in first string but present in second string

I would like some indication of which words are present in the first string but not in the second, and which are present in the second but not in the first. The comparison should be word by word rather than character by character. Can someone please help me with this? Any help would be appreciated. Thanks.


Solution

  • I suggest matching words with the help of a regular expression, where a word is defined as follows:

    Let a word be a sequence of letters and apostrophes.

    Please note that splitting on spaces doesn't take punctuation into account, so "cat", "cat," and "cat!" would be considered three different words. Once the words are matched, query the matches for the two given strings:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions; 
    
    ...
    
    private static readonly Regex WordsRegex = new Regex(@"[\p{L}']+"); 
    
    // presentAt flags: 1 - in text1 only, 2 - in text2 only, 3 - in both text1 and text2
    private static List<(string word, int presentAt)> MyWords(string text1, string text2) {
      HashSet<string> words1 = WordsRegex
        .Matches(text1)
        .Cast<Match>()
        .Select(match => match.Value)
        .ToHashSet(StringComparer.OrdinalIgnoreCase);
    
      HashSet<string> words2 = WordsRegex
        .Matches(text2)
        .Cast<Match>()
        .Select(match => match.Value)
        .ToHashSet(StringComparer.OrdinalIgnoreCase);
    
      return words1
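        .Union(words2, StringComparer.OrdinalIgnoreCase) // same comparer as the sets, so "Cat" and "cat" are not listed twice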
        .Select(word => (word, presentAt: (words1.Contains(word) ? 1 : 0) | 
                                          (words2.Contains(word) ? 2 : 0)))
        .ToList();
    }
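
    The difference between splitting and matching mentioned above can be seen in a minimal sketch. It reuses the same pattern as WordsRegex; the class name and the helpers BySplit and ByRegex are hypothetical and only serve the illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitVersusRegexDemo {
  // Same pattern as WordsRegex in the answer: letters and apostrophes
  private static readonly Regex WordsRegex = new Regex(@"[\p{L}']+");

  // Hypothetical helper: distinct tokens produced by splitting on spaces
  public static string[] BySplit(string text) =>
    text.Split(' ').Distinct().ToArray();

  // Hypothetical helper: distinct words produced by the regular expression
  public static string[] ByRegex(string text) =>
    WordsRegex.Matches(text)
      .Cast<Match>()
      .Select(match => match.Value)
      .Distinct()
      .ToArray();

  public static void Main() {
    string text = "cat cat, cat!";

    // Splitting keeps the punctuation attached, giving three tokens: cat | cat, | cat!
    Console.WriteLine(string.Join(" | ", BySplit(text)));

    // The regular expression gives a single word: cat
    Console.WriteLine(string.Join(" | ", ByRegex(text)));
  }
}
```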
    

    Demo:

    string str1 = "Cat meet's a dog has";
    string str2 = "Cat meet's a dog and a bird";
        
    var result = MyWords(str1, str2);
        
    var report = string.Join(Environment.NewLine, result);
        
    Console.Write(report);
    

    Output:

    (Cat, 3)         # 3: in both str1 and str2 
    (meet's, 3)      # 3: in both str1 and str2
    (a, 3)           # 3: in both str1 and str2
    (dog, 3)         # 3: in both str1 and str2 
    (has, 1)         # 1: in str1 only
    (and, 2)         # 2: in str2 only
    (bird, 2)        # 2: in str2 only 
    

    If you want a wordy output:

    string str1 = "Cat meet's a dog has";
    string str2 = "Cat meet's a dog and a bird";

    var result = MyWords(str1, str2);

    string[] options = new string[] {
      "not present",
      "present in first string not present in second string",
      "not present in first string but present in second string",
      "present in first string and present in second string"
    };
            
    var report = string.Join(Environment.NewLine, result
      .Select(pair => $"{pair.word} - {options[pair.presentAt]}"));
    
    Console.Write(report);
    

    Output:

    Cat - present in first string and present in second string
    meet's - present in first string and present in second string
    a - present in first string and present in second string
    dog - present in first string and present in second string
    has - present in first string not present in second string
    and - not present in first string but present in second string
    bird - not present in first string but present in second string
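
    If one line per category is preferred, as in the question, the pairs can be grouped by their flag. The following is only a sketch under assumptions: the class name and the GroupedReport helper are hypothetical, and the sample list is written out by hand so that the block runs on its own (in practice it would come from MyWords(str1, str2)). Since the comparison is set-based, "a" counts as present in both strings, so the second line reads "and bird" rather than "and a bird".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupedReportDemo {
  // Hypothetical helper: one line per category, words joined by spaces.
  // presentAt uses the same flags as MyWords: 1 - first only, 2 - second only.
  public static List<string> GroupedReport(
      IEnumerable<(string word, int presentAt)> words) {
    var labels = new Dictionary<int, string> {
      [1] = "present in first string not present in second string",
      [2] = "not present in first string but present in second string"
    };

    return words
      .Where(pair => labels.ContainsKey(pair.presentAt)) // skip words found in both
      .GroupBy(pair => pair.presentAt)                   // keeps word order within a group
      .OrderBy(group => group.Key)
      .Select(group =>
        string.Join(" ", group.Select(pair => pair.word)) + " - " + labels[group.Key])
      .ToList();
  }

  public static void Main() {
    // Hand-written sample matching the output shown in the answer
    var result = new List<(string word, int presentAt)> {
      ("Cat", 3), ("meet's", 3), ("a", 3), ("dog", 3),
      ("has", 1), ("and", 2), ("bird", 2)
    };

    foreach (string line in GroupedReport(result))
      Console.WriteLine(line);
  }
}
```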