Search and replace in a large single-line file (~2GB) in Linux


I have a large XML file which is approximately 2GB in size. To make things interesting, all of the data is on a single line.

I am trying to insert a newline character at the end of specific tags in this file to make it a multiline file which will allow me to split it and do more with it.

root@server:~# sed -i -e 's/\<\/Dummy\>/\<\/Dummy\>\\\n/g' file_name

I've tried sed, vi and joe with no luck. The length of each node in the XML is different so I cannot split the file based on number of characters.

Is there a way for me to make this large single line file into a multiline file via the command line?


Solution

  • I think I would actually do this with gawk rather than sed.

    You haven't included input data, so I'll make some up.

    $ printf '<a><b></b><b></b></a><a><c></c></a>' | gawk -vRS='</a>' '{print $0 RS}'
    <a><b></b><b></b></a>
    <a><c></c></a>
    

    Normally, awk (or gawk) will treat each line as a separate record, with each record split into fields delimited by whitespace.

    If instead you split records by some XML tag, you can rely on the fact that print will append a newline as the default ORS (output record separator) after printing each "input record".

    Unlike a sed solution, which will attempt to read one entire "record" (line) into memory in order to perform actions on it, I suspect that this solution would step through your file using only enough memory to hold the data between one record separator and the next. (This addresses the "large file" concern.)

    Three other things to note.

    First, a record separator is NOT a concept native to XML, so any solution using sed, awk, or anything that does not natively interpret XML is a hack. You will always get better results using tools which natively support your data format.

    Second, since in my example I've specified a record separator that is the close of an XML tag, the input data could be thought to have THREE RECORDS, the third of which is null. If you have a newline after your final "record separator", that third record may be terminated with yet another RS in your output. Be warned. This is the result of thing #1.

    Third, this is a gawk solution, not an awk solution, because POSIX leaves the behaviour unspecified when RS is longer than one character. gawk treats a multi-character RS as a regular expression; other awk implementations may not support it.

    YMMV. This is not a great solution, but it may be sufficient for your needs.