
Search for four words written in different "shapes" using grep


I have a text file with the following contents:

gvožđa gvozda gvozdja
гвожђа

These are four words, but they all mean the same thing: iron.

The "d", "dj", "đ", and "ђ" are four spellings of one "phone".

I am using the following grep command to search for these four words:

grep '\s*[gг][vв]o[žжz](dj|[dđђ])a\s*' filename

This grep command gives no output at all. Why? It should output all these words from the file:

gvožđa
gvozda
gvozdja
гвожђа

Solution

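  • There are two problems. You use POSIX ERE syntax without the -E option, so grep reads the pattern as a BRE, where (, ) and | are literal characters, and nothing matches. Also, your pattern only contains Latin o and a, so it cannot match the Cyrillic о and а in гвожђа.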

    You can use

    grep -Eo '[gг][vв][oо][žжz](dj|[dđђ])[aа]' filename
    

    The \s* at both ends is redundant: it matches zero or more whitespace chars, so it never changes whether a line matches (and \s is a GNU grep extension).

    I added the -o option here to output each match on its own line, not the whole matching lines.

    See the online grep demo.
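
To see the effect of each fix separately, here is a minimal sketch that rebuilds the sample file and runs three variants of the pattern. It assumes a UTF-8 locale named C.UTF-8 is available (without a UTF-8 locale, grep may not treat multibyte letters such as đ or ђ as single characters inside brackets). The temporary file and the variable names are made up for illustration.

```shell
#!/bin/sh
# Assumption: the C.UTF-8 locale exists on this system, so that grep treats
# multibyte letters such as đ or ђ as single characters inside [...].
export LC_ALL=C.UTF-8

# Hypothetical sample file holding the four spellings from the question.
sample=$(mktemp)
printf '%s\n' 'gvožđa gvozda gvozdja' 'гвожђа' > "$sample"

# 1. Original command, no -E: the pattern is read as a BRE, where ( ) and |
#    are literal characters, so no line matches and the count is 0.
bre_count=$(grep -c '\s*[gг][vв]o[žжz](dj|[dđђ])a\s*' "$sample")

# 2. With -E, but still Latin-only o and a: the three Latin-script words
#    match, the Cyrillic word does not.
latin_matches=$(grep -Eo '[gг][vв]o[žжz](dj|[dđђ])a' "$sample")

# 3. Pattern from the answer: both Latin and Cyrillic o/a are accepted,
#    so all four spellings match.
all_matches=$(grep -Eo '[gг][vв][oо][žжz](dj|[dđђ])[aа]' "$sample")

printf 'Matching lines without -E: %s\n' "$bre_count"
printf 'Matches with -E, Latin o/a only:\n%s\n' "$latin_matches"
printf 'Matches with the corrected pattern:\n%s\n' "$all_matches"

rm -f "$sample"
```

If the last list is missing the Cyrillic word on your machine, checking the active locale with the locale command is a reasonable first step.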