Tags: c#, .net, regex, perl, pcre

.NET Regular Expression (perl-like) for detecting text that was pasted twice in a row


I've got a ton of JSON files that, due to a UI bug in the program that made them, often contain text that was accidentally pasted twice in a row (with no space separating the two copies).

Example: {FolderLoc = "C:\testC:\test"}

I'm wondering if it's possible for a regular expression to match this. It would be per-line. If I can do this, I can use FNR, which is a batch text processing tool that supports .NET RegEx, to get rid of the accidental duplicates.

I regret not having one of my own attempts to show, but this is an unusual problem and I couldn't find anything resembling it on search engines to base a solution on.

Any help would be appreciated.


Solution

  • We can capture text along the string (in .+ style) and follow it with a lookahead that checks for what has just been captured, i.e. for an immediate repetition of it, like

    /(.+)(?=\1)/;  # but this needs more restrictions
    

    However, this gets tripped even just on double leTTers, so it needs at least a little more. For example, our pattern can require the text which gets repeated to be at least two words long.
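To make the double-letter problem concrete, here is a small check of the unrestricted pattern. It is a sketch in Python rather than Perl or .NET (an assumption made only to keep the example self-contained; the `(.+)(?=\1)` syntax is the same in all three), and the sample strings and the names `naive`, `samples` and `hits` are purely illustrative.

```python
import re

# Unrestricted pattern: any run of text immediately followed by a copy of itself.
naive = re.compile(r"(.+)(?=\1)")

samples = [
    "double leTTers",            # an ordinary doubled letter
    "This has no repetitions.",  # "titi" inside "repetitions"
]

# Collect what the pattern captures in each sample (None if it finds nothing).
hits = []
for s in samples:
    m = naive.search(s)
    hits.append(m.group(1) if m else None)

print(hits)  # ['T', 'ti']
```

Even a sentence with no pasted text at all gets flagged, which is why the pattern needs a stricter notion of what counts as repeated text.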

    Here is a basic and raw example. Please also see the note on regex at the end.

    use warnings;
    use strict;
    use feature 'say';
    
    my @lines = (
        q(It just wasn't able just wasn't able no matter how hard it tried.),
        q(This has no repetitions.),
        q({FolderLoc = "C:\testC:\test"}),
    );
    
    my $re_rep = qr/(\w+\W+\w+.+)(?=\1)/;  # at least two words, and then some
    
    for (@lines) { 
        if (/$re_rep/) {
            # Other conditions/filtering on $1 (the capture) ?
            say $1;
        } 
    }
    

    This requires the repeated text to contain at least two words: a word (\w+), then non-word characters (\W+), then another word, then at least one more character of anything (.+). That'll still catch some legitimate data, but it's a start that can now be customized to your data. We can tweak the regex and/or further scrutinize our catch inside that if branch.

    The pattern doesn't allow for any intervening text (the repetition must follow immediately), which is easily changed if needed; the question is whether some legitimate repetitions would then get flagged.
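As a rough illustration of that change, the sketch below compares the original form with a relaxed one that tolerates text between the two copies. It is written in Python for a self-contained example; the relaxed pattern, the sample line and the names `adjacent` and `gapped` are assumptions for illustration, not part of the program above.

```python
import re

# Original form: the copy must follow the captured text immediately.
adjacent = re.compile(r"(\w+\W+\w+.+)(?=\1)")

# Relaxed form (an assumption): `.*` inside the lookahead lets other text
# sit between the captured text and its copy.
gapped = re.compile(r"(\w+\W+\w+.+)(?=.*\1)")

line = "big cat, then big cat"

adjacent_hit = adjacent.search(line)
gapped_hit = gapped.search(line)

print(adjacent_hit)         # None
print(gapped_hit.group(1))  # big cat
```

The relaxed form flags any phrase that occurs twice on a line, so it is far more likely to catch ordinary prose than the adjacent-only form.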

    The program above prints

    just wasn't able 
    C:\test
    

    Note on regex   This quest, to find repeated text, is much too generic as it stands and it will surely flag someone's good data. It is enough to note that I had to require at least two words (with just one word, a legitimate "that that" is flagged), which is arbitrary and still insufficient. For one, repeated numbers realistically found in data files (3,3,3,3,3) will be matched as well.

    So this needs further specialization, and for that we need to know more about the data.
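One possible form of such specialization, combined with the actual removal the question asks about, is sketched below in Python. Everything here is an assumption for illustration: the pattern consumes the copy with `\1` instead of a lookahead so a substitution can drop it, and `MIN_LEN`, `collapse` and `keep_one` are hypothetical names. In a .NET-style replacement the kept group would be written `$1`; this has not been tested in FNR.

```python
import re

# Same two-word pattern, but consuming the copy (\1) rather than only
# looking ahead, so that a substitution can drop it.
doubled = re.compile(r"(\w+\W+\w+.+)\1")

MIN_LEN = 5  # hypothetical threshold; tune it to the data


def collapse(line: str) -> str:
    """Replace 'XX' with 'X' when X is long enough to look like a paste error."""
    def keep_one(m):
        captured = m.group(1)
        # Short captures (such as "3,3,") are left exactly as they were.
        return captured if len(captured) >= MIN_LEN else m.group(0)
    return doubled.sub(keep_one, line)


fixed = collapse(r'{FolderLoc = "C:\testC:\test"}')
numbers = collapse("3,3,3,3,3")

print(fixed)    # {FolderLoc = "C:\test"}
print(numbers)  # 3,3,3,3,3
```

A length threshold is only one crude filter; restricting the match to quoted values or to known keys would depend on what the files actually contain.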