
Search for a pattern, put the matches in a buffer, and sort them using regex (Notepad++, Cygwin shell, or JSFiddle)


I'd like to identify lines matching a certain pattern and move them to a specific part of the file, thereby rearranging its contents. I'd prefer a Notepad++ solution, but if that is too complex, a Cygwin shell (awk) or JSFiddle solution also works. I'll illustrate with the example below.

I have lines that follow this pattern:
"col<variable space>stat<variable space>col ( axx,bvb,ccc) on mr.dan"  (<some word> confidence)
e.g. 
"col  stat  col ( a123,b6949,c4433) on Mr.Randy"  (Low confidence) 
"col         stat       col     ( a1fddf23, b6ff949,c4433 ) on    John.Doe  "  (Low confidence) 
"col     stat   col     ( ax ) on    John.Dane  "  (Ok confidence) 
"col stat col ( axdf,fsdds ) on    Jane.Dame "  (  Fair confidence ) 

What it should do

  • Get rid of all the quotes, get rid of the (<word> confidence) part, and put a ";" at the end of each line (I can manage this part and don't need help here)
  • Rearrange the lines based on the col ( axdf,fsdds ) expression, which has the pattern

col\s+\(\s*word1\s*,\s*word2\s*,\s*wordN\s*\)\s*on\s*word\.word\s*


Lines matching the above pattern need to be rearranged so that those with one word, col ( word ), come first, followed by those with two words, col ( word1, word2 ), and so on, in ascending order of the number of words in the col ( word ) expression.
So the output for the above should be

col     stat   col     ( ax ) on    John.Dane  ;    # 1 word in col (word) expr 
col stat col ( axdf,fsdds ) on    Jane.Dame ;     # 2 words in col (word) expr 
col         stat       col     ( a1fddf23, b6ff949,c4433 ) on    John.Doe  ;    # 3 words in col (word) expr 
col  stat  col ( a123,b6949,c4433) on Mr.Randy; 

What I did
I could get the first part done by replacing "\s*\((\s*(\w+)*\s*Confidence\)) with ;

I need help with the second part: rearranging the lines by the col ( word ) expression.
The logical pseudocode for Notepad++ would be: first, isolate the word list in each of those col expressions into separate buffers. Next, count the number of words in each buffer and sort the buffers by that count. Finally, line up the expressions according to the buffer order.
I'm also open to JSFiddle or shell script regex / awk.


Solution

  • This can't be done with Notepad++ alone; I suggest using a script. Here is an example Perl script that does the job.

    The whole file is read into memory, which will be a problem if the file is very large.

    #!/usr/bin/perl
    use Modern::Perl;
    
    # Read the input file into an array
    my $file_in = 'file.txt';
    open my $fh, '<', $file_in or die "unable to open '$file_in': $!";
    my @lines = <$fh>;
    
    # Replace last quote until end of line with semicolon and remove quotes
    my @unsorted = map { s/"[^"]*$/;/; s/"//g; $_ } @lines; 
    
    # Use a Schwartzian transform for sorting
    my @sorted = 
        # remove the number of words
        map  { $_->[0] }
        # sort on number of words
        sort { $a->[1] <=> $b->[1] }
        # Add number of words
        map  { 
            # list of words inside parenthesis
            my ($words) = $_ =~ /\(([^)]+)\)/;
            # split to have number of words
            my @w = split /,/, $words;
            # add this number as second element in array
            [$_, scalar @w] 
        }
        @unsorted;
    
    # Write into output file
    my $file_out = 'file_out.txt';
    open my $fh_out, '>', $file_out or die "unable to open '$file_out': $!";
    say $fh_out $_ for @sorted;
    

    Output file:

    col     stat   col     ( ax ) on    John.Dane  ;
    col stat col ( axdf,fsdds ) on    Jane.Dame ;
    col  stat  col ( a123,b6949,c4433) on Mr.Randy;
    col         stat       col     ( a1fddf23, b6ff949,c4433 ) on    John.Doe  ;
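
Since the question also allows a JSFiddle answer, here is a minimal JavaScript sketch of the same decorate, sort, undecorate idea. It is an illustration rather than a port of the Perl script: the function names `cleanLine`, `countWords` and `sortByWordCount` are hypothetical, the sample lines are copied from the question, and it assumes each line has one closing quote followed by the confidence part and that the first parenthesised list is the one to count.

```javascript
// Sample lines copied from the question; in JSFiddle they could come from a textarea instead.
const input = [
  '"col  stat  col ( a123,b6949,c4433) on Mr.Randy"  (Low confidence) ',
  '"col         stat       col     ( a1fddf23, b6ff949,c4433 ) on    John.Doe  "  (Low confidence) ',
  '"col     stat   col     ( ax ) on    John.Dane  "  (Ok confidence) ',
  '"col stat col ( axdf,fsdds ) on    Jane.Dame "  (  Fair confidence ) ',
];

// Hypothetical helper: replace everything from the last quote to the end
// of the line with ";", then remove the remaining (opening) quote.
function cleanLine(line) {
  return line.replace(/"[^"]*$/, ';').replace(/"/g, '');
}

// Hypothetical helper: count the comma-separated words inside the first
// parenthesised list. Lines without such a list count as 0 words.
function countWords(line) {
  const match = line.match(/\(([^)]+)\)/);
  if (!match) return 0;
  return match[1].split(',').filter((word) => word.trim() !== '').length;
}

// Decorate each cleaned line with its word count, sort on the count,
// then strip the count again.
function sortByWordCount(lines) {
  return lines
    .map(cleanLine)
    .map((line) => ({ line, count: countWords(line) }))
    .sort((a, b) => a.count - b.count)
    .map((entry) => entry.line);
}

const sorted = sortByWordCount(input);
console.log(sorted.join('\n'));
```

In engines where `Array.prototype.sort` is stable (ES2019 and later), lines with the same word count keep their input order, so the two three-word lines stay in the order they appear in the file.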