
How to query text file version of LibreOffice Thesaurus in bash (joining lines)


I am trying to write a simple bash script to query the LibreOffice thesaurus extension, extracted as a text file. For each input query string, the output should be all of the related strings.

To download and extract the thesaurus, I do

wget "https://extensions.libreoffice.org/assets/downloads/41/1653961771/dict-en-20220601_lo.oxt" # download LO dictionary & thesaurus

unzip -p dict-en-20220601_lo.oxt th_en_US_v2.dat > lo # extract contents of thesaurus to text file

Taking a look at part of the text file:

nine|3
(adj)|9|ix|cardinal (similar term)
(noun)|9|IX|niner|Nina from Carolina|ennead|digit (generic term)|figure (generic term)
(noun)|baseball club|ball club|club|baseball team (generic term)
nine-banded armadillo|1
(noun)|peba|Texas armadillo|Dasypus novemcinctus|armadillo (generic term)
nine-fold|1
(adj)|nonuple|ninefold|multiple (similar term)
nine-membered|1
(adj)|9-membered|membered (similar term)
nine-sided|1
(adj)|multilateral (similar term)|many-sided (similar term)
nine-spot|1
(noun)|spot (generic term)

So, for example, I want to be able to input "nine" as a query and have bash return something like

9
ix
cardinal
9
IX
niner
Nina from Carolina
ennead
digit
figure
baseball club
ball club
club
baseball team

I think this should be fairly easy to do using the right syntax with awk or sed, especially since all of the lines containing query terms do NOT begin with "(" and all of the lines containing related terms DO begin with "(".

But I'm still somewhat of a newbie, and haven't been able to figure it out yet. The crux of the matter for me seems to be getting the query term and all related terms onto a single line. From there, I know how to sed my way to victory. But getting to that point has proven challenging for me.

TIA for your help!

p.s. I'm trying to do something similar to this, but my situation is a little different, and I don't understand the syntax well enough to modify it for my needs: https://www.unix.com/unix-for-dummies-questions-and-answers/184649-sed-join-lines-do-not-match-pattern.html


Solution

  • This might work for you (GNU sed):

    v=nine
    sed -n ':a;/^'"${v}"'|/{:b;n;/^[^(]/ba;s/^[^|]*|\| ([^)]*)//g;y/|/\n/;p;bb}' file
    

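    Find the line that begins with the value of the input variable followed by |, then process the lines that follow it.

    Fetch the next line. If it does not begin with (, the entry has ended, so jump back to the start and test that line against the query.

    Otherwise, remove the first field (the part of speech) and any parenthesised note such as " (generic term)", replace the field separators | with newlines, print the result, and repeat.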


    v=nine # set variable v to `nine`
    sed -n ':a # turn off implicit printing and set goto label a
            /^'"${v}"'|/{ # match a line beginning with variable v
              :b # set goto label b
              n # fetch next line (not printed, because of option -n)
              /^[^(]/ba # goto label a if line does not begin with (
              s/^[^|]*|\| ([^)]*)//g # remove first field and parens
              y/|/\n/ # translate | to newline for entire line
              p # print the result
              bb # goto label b
            }' file
    

    To see the sed script in action, run GNU sed with the --debug option.
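
For readers who find awk easier to follow than sed, the same lookup can be sketched in awk. This is not part of the original answer: the function name lookup and the file sample.dat (whose contents are copied from the excerpt in the question) are assumptions made for illustration. It relies on the same observation as the sed script, namely that headword lines do not begin with ( and the lines of related terms do.

```bash
#!/usr/bin/env bash
# A few lines copied from the excerpt in the question, used as sample input.
cat > sample.dat <<'EOF'
nine|3
(adj)|9|ix|cardinal (similar term)
(noun)|9|IX|niner|Nina from Carolina|ennead|digit (generic term)|figure (generic term)
(noun)|baseball club|ball club|club|baseball team (generic term)
nine-banded armadillo|1
(noun)|peba|Texas armadillo|Dasypus novemcinctus|armadillo (generic term)
EOF

# lookup QUERY FILE (hypothetical helper): print the related terms for QUERY,
# one per line, with trailing notes such as " (generic term)" removed.
lookup() {
  awk -F'|' -v q="$1" '
    # Headword line: remember whether it is exactly the query.
    # Appending "" forces a string comparison, even for numeric-looking words.
    !/^\(/ { found = (($1 "") == (q "")); next }
    # Line of related terms belonging to the matched headword.
    found {
      for (i = 2; i <= NF; i++) {      # field 1 is the part of speech, e.g. (noun)
        sub(/ \([^)]*\)$/, "", $i)     # drop a trailing parenthesised note
        print $i
      }
    }
  ' "$2"
}

lookup nine sample.dat
```

Against the thesaurus file extracted in the question, the call would be lookup nine lo. Because the headword is compared as a plain string rather than spliced into a regular expression, queries containing characters such as . or * need no escaping.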