
Substituting text that overlaps


I have a data file (data.txt) that, due to a user error, looks like this:

4,48
4485
4,49
4495
4,5
4505
4,51
 4,6
11445
11,45

The pattern is this: whenever there is a comma, trailing 0s have been dropped. So 4450 was improperly changed to 4,45; 4600 was changed to 4,6; and 11450 was changed to 11,45.

So, two actions should be performed when a comma is found:

  1. Add one or two 0s on the right, to get three digits to the right of the comma: d,dd -> d,dd0 ; or d,d -> d,d00
  2. Delete the comma: d,dd0 -> ddd0 ; or d,d00 -> dd00

The end result should be:

4480
4485
4490
4495
4500
4505
4510
4600
11445
11450

How could I use a regex in sed (or another program) to get this result?

  1. One solution would involve splitting the data into two files, dataa.txt and datab.txt:

dataa.txt:

4,48
4485
4,49
4495
4505
4,51
11445
11,45

and datab.txt:

4,5
4,6

For the first file:

$ sed -E 's/(\,[0-9][0-9])/\10/g;s/\,//g' dataa.txt

and for the second file:

$ sed -E 's/(\,[0-9])/\100/g;s/\,//g' datab.txt 

Then, concatenate the files. It would be better to do this without the extra steps (splitting and concatenating).

  2. There are very good solutions using awk (thank you!), and one is reproduced below:

    $ awk '{gsub(/,/, ""); printf "%.4s\n", $0 * 1000}' data.txt

But it also fails on 5-digit numbers (you can spot them by the number of digits to the left of the comma), so it would also require splitting the data.
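
To see the failure concretely, here is a quick check of the awk command above on the single value 11,45, for which the wanted result is 11450. This is only a sketch; the variable name result is for illustration.

```shell
# Run the awk command on one five-digit case.
# gsub turns "11,45" into "1145"; 1145 * 1000 = 1145000;
# "%.4s" then keeps only the first four characters.
result=$(printf '11,45\n' | awk '{gsub(/,/, ""); printf "%.4s\n", $0 * 1000}')
echo "$result"   # prints 1145, not the wanted 11450
```

The fixed width of 4 in "%.4s" is what ties this approach to four-digit results.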

How could we achieve the end result, without splitting the data?

(edited for clarity)


Solution

  • First make sure you have enough digits after the comma. Next, cut everything after the third digit following the comma and remove the comma:

    sed -r 's/(,.*)/\1000/; s/,(...).*/\1/ ' data.txt
    

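    Note: \1000 is read as the backreference \1 (the text matched by the first group) followed by a literal 000; sed backreferences are a single digit, so it is never taken as group 1000.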
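
As a rough trace of the two substitutions, the sketch below feeds three of the sample values through sed and shows the intermediate result after the first expression alone. The variable names sample, step1 and result are only for illustration. It uses -E, which both GNU sed and BSD sed accept for extended regular expressions; with GNU sed, -r behaves the same way.

```shell
# Three values from the question: one digit after the comma,
# no comma at all, and a five-digit case.
sample=$(printf '4,5\n4485\n11,45\n')

# Step 1 only: append 000 to whatever follows the comma.
step1=$(printf '%s\n' "$sample" | sed -E 's/(,.*)/\1000/')
echo "$step1"
# 4,5000
# 4485
# 11,45000

# Both steps: keep the three characters after the comma,
# then drop the comma and everything after those three characters.
result=$(printf '%s\n' "$sample" | sed -E 's/(,.*)/\1000/; s/,(...).*/\1/')
echo "$result"
# 4500
# 4485
# 11450
```

Lines without a comma match neither expression and pass through unchanged, which is why no splitting is needed.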