bash sed openstreetmap text-processing large-files

How to search for any member of a list of values with sed


Not sure how to ask the question appropriately, but here's the use case:

  • I have an ~18GB XML file (OpenStreetMap); ~250M lines
  • The file has ~250 offending entries that are corrupting the dataset
  • The entries to be removed are multiline & of the form: <way id="foo">... </way>
  • I have those ids in a file bad_ways

I could write a for loop & cycle through a bunch of sed statements like this:

sed -i.bu '/<way id="1_bad_way_entry".*/,/<\/way>/d' in.xml

but... this requires ~250 cycles through an 18G file & associated disk writes, etc., which right now takes about 18min per cycle (spinning disk... will fix that shortly by switching machines. Update: SSD improves to about 6.5 min per cycle).

Is there any way to ask sed to match any entry in bad_ways and do this in 1 pass?

Or, are there better tools for this than sed? Thanks in advance!


Solution

  • You can use command substitution to assemble the sed script on the fly.

    (Note: in the following I use sed's -E option to save some backslashes; if you don't, you have to add the backslashes as needed when creating the sed script. The inner sed also uses -z, which is a GNU sed option.)

    For instance, assuming the bad_ways file is like this:

    one
    two
    three
    

    and that the huge_file is like this:

    everything starts with a zero, then one is next, then two, then three, finally four
    

    the following command replaces every pattern listed in bad_ways with XXX:

    sed -E 's/'"$(sed -zE 's/\n([^$])/|\1/g' bad_ways)"'/XXX/g' huge_file
    

    The output is

    everything starts with a zero, then XXX is next, then XXX, then XXX, finally four
    

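    As you can see, the sed script that acts on the huge_file is built by concatenating three strings:

    1. s/, which is single quoted (you should always prefer single quotes, unless you need double quotes, as in 2.)
    2. the output of sed -zE 's/\n([^$])/|\1/g' bad_ways, which is double quoted to allow command substitution, and which generates one|two|three. With -z, sed reads the whole file as a single record, so it can match the newlines and replace each one that is followed by another character with |; the final newline is left alone and then stripped by the command substitution.
    3. /XXX/g, which is single quoted again.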

    All this results in the string s/one|two|three/XXX/g.

    This is clearly not the string that you need for your script, but I hope this answer shows you an example of how to use command substitution $(…) and appropriate quoting with ' and " to craft a command (sed, awk or whatever) dynamically.

    In hindsight, this answer is based on the same "philosophy" as the answer linked in a comment. However, I'm not temporarily saving the script to a file. That difference probably matters little when the script itself is small (and it is small, based on your description).
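Applied to the range deletion from the question, the same quoting trick might look like the sketch below. The sample ids and the tiny in.xml are made up for illustration, and the sketch assumes bad_ways holds one numeric id per line with no blank lines or regex metacharacters. It uses paste to join the ids instead of the inner sed, which is a swapped-in alternative rather than what the command above does, and it writes to out.xml instead of editing in place.

```shell
# Hypothetical sample data standing in for the real bad_ways and in.xml
printf '%s\n' 101 303 > bad_ways
printf '%s\n' \
  '<osm>' \
  '<way id="101">' '<nd ref="1"/>' '</way>' \
  '<way id="202">' '<nd ref="2"/>' '</way>' \
  '<way id="303">' '<nd ref="3"/>' '</way>' \
  '</osm>' > in.xml

# Join the ids with | to get the alternation 101|303
ids=$(paste -s -d '|' bad_ways)

# One pass: delete every range from a listed <way id="..."> line to the next </way>.
# The closing quote after the group keeps an id such as 10 from matching 101.
sed -E '/<way id="('"$ids"')"/,/<\/way>/d' in.xml > out.xml

cat out.xml
```

With about 250 ids the alternation stays short enough to pass on the command line, so the file is read and written once instead of once per id.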