
sort words per section


I have a text file whose words I need to sort within each section.

# cat raw_file.txt

== other info ==
===instructions===
===english words===
this
is
only
test


=== missing words ===

==== include words ====
some
more
words

==== customer name ====
ram
sham
amar
akbar
anthony

==== cities ====
mumbai
delhi
pune


=== prefix ===

the
a
an

If I sort the file as is, all the heading lines (the ones with two, three, or four equal signs) come first, followed by all the words in a single list. How do I sort the words within each section separately?

# sort raw_file.txt

== other info ==
=== missing words ===
=== prefix ===
==== cities ====
==== customer name ====
==== include words ====
===english words===
===instructions===
a
akbar
amar
an
anthony
delhi
is
more
mumbai
only
pune
ram
sham
some
test
the
this
words

This is MediaWiki format, if that matters. Right now I am sorting each and every section separately, and that takes a lot of time.

# cat expected_output.txt

== other info ==
===instructions===
===english words===
is
only
test
this

=== missing words ===

==== include words ====
more
some
words

==== customer name ====
akbar
amar
anthony
ram
sham

==== cities ====
delhi
mumbai
pune

=== prefix ===
a
an
the

Solution

  • If you're not worried about keeping the blank lines you could use:

    awk '/=/ {c++} {print c+1, $0}' file.txt | sort -n | cut -d' ' -f2- | sed '/^$/d'
    >== other info ==
    >===instructions===
    >===english words===
    >is
    >only
    >test
    >this
    >=== missing words ===
    >==== include words ====
    >more
    >some
    >words
    >==== customer name ====
    >akbar
    >amar
    >anthony
    >ram
    >sham
    >==== cities ====
    >delhi
    >mumbai
    >pune
    >=== prefix ===
    >a
    >an
    >the
    

    This approach works by prepending an index number to every line, incrementing the index by one each time a line contains an '='. A heading and the words under it therefore share the same index. The lines are then sorted by index number first and by the rest of the line second, after which the index is cut off and the blank lines (which end up at the top of each 'section' after the sort) are removed.

    Edit

    I just saw @Bing Wang's comment; this is basically what he suggested you do.
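
As a hedged variation on the pipeline above (a sketch, not the original answer's exact command): the result relies on '=' sorting before letters, which holds in the C locale but may not hold in locales where `sort` ignores punctuation. The version below forces `LC_ALL=C` for the sort step and uses `/^=/` so that only lines starting with '=' count as headings. Both changes are assumptions about what is wanted, and the sample data and the temporary file are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample input, a shortened version of the file in the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
== other info ==
===english words===
this
is
only
test

=== prefix ===

the
a
an
EOF

# 1. awk: bump the counter on every heading line and prefix each line with it,
#    so a heading and its words carry the same number.
# 2. sort -n: order by that number; ties fall back to a byte-wise comparison
#    of the whole line (LC_ALL=C), which puts blank lines and '=' headings
#    ahead of the words.
# 3. cut: drop the number again.  4. sed: drop the blank lines.
result=$(awk '/^=/ {c++} {print c+1, $0}' "$tmp" \
  | LC_ALL=C sort -n \
  | cut -d' ' -f2- \
  | sed '/^$/d')

printf '%s\n' "$result"
rm -f "$tmp"
```

For this sample the script prints the two headings in their original order, each followed by its own words sorted: `is only test this` under `===english words===` and `a an the` under `=== prefix ===`. Blank lines are still dropped, as in the original command.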