php, regex, libreoffice, libreoffice-calc

How to add whitespace & punctuation marks to capture first group with regex? How to stop certain tabs dividing into two columns within LibreOffice?


Can anyone help me out? I’ve been trying to get this regex working, and it’s nearly there. All the matches seem to be correct except the first one, which should give:

word: el, la
gender: art
word_en: the (+m, f)

The first test string is:

1

el, la art the (+m, f)
• el diccionario tenía también frases útiles – the dictionary also had
useful phrases
2055835 | 201481381

The other issue is that I’ve been trying to simply copy the output from the ‘Substitution’ section into LibreOffice Calc. All I want to do is create 6 columns for the data. The problem is that the 6th column (sent_en) is sometimes split between columns ‘G’ and ‘A’, instead of all the sent_en data staying in column ‘G’. If you copy the data below ‘Substitution’ into LibreOffice Calc, you’ll get a better idea of what I mean. I just can’t figure this out, and I’d really appreciate any help. Thanks.

Here’s the link to my current regex: https://regex101.com/r/m3yySN/2/

^

(?<frequency>[0-9]+) \W+
(?<word>\pL+\W?) \h+
(?<gender> [\pL()]+ (?:, \h* [\pL()]+)* ) \h+
(?<word_en> [^•]*[^•\s]) \h* \R

• \h*
(?<sent_esp> [^–]*[^\s–] ) \s*–\s*
(?<sent_en> .* (?:\R .*)*? ) \h* \R

(?<num1> [0-9]+) \h* \| \h*
(?<num2> .*\S)

Substitution string:

\1\t\2\t\3\t\4\t\5\t\6\t

Solution

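  • This one was a bit hairy, but in the end only a small adjustment was needed. The word group now also accepts further comma-separated words, so "el, la" is captured as a single word (\pL+\W? became \pL+(?:,\h\pL+|\W)*):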

    ^
    (?<frequency>[0-9]+) \W+
    (?<word>\pL+(?:,\h\pL+|\W)*) \h+
    (?<gender> [\pL()]+ (?:, \h* [\pL()]+)* ) \h+
    (?<word_en> [^•]*[^•\s]) \h* \R
    • \h*
    (?<sent_esp> [^–]*[^\s–] ) \s*–\s*
    (?<sent_en> .* (?:\R .*)*? ) \h* \R
    (?<num1> [0-9]+) \h* \| \h*
    (?<num2> .*\S)
    

    Results look good to me now.
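
The answer above only covers the regex half of the question. For the LibreOffice half, one plausible cause (not confirmed anywhere in this thread) is that sent_en can span a line break, as in "the dictionary also had" / "useful phrases", and a line break inside pasted text starts a new row in Calc, so the rest of the sentence lands in column A. Below is a minimal PHP sketch under that assumption. It applies the corrected pattern with the x, u and m flags, collapses any line break inside a field to a single space, and joins the first six groups with tabs. The function name entry_to_tsv_rows and the exact layout of the sample text are assumptions made for illustration, and the file needs to be saved as UTF-8.

```php
<?php
// Corrected pattern from the answer, unchanged.
// Flags: x = free-spacing, u = UTF-8 mode, m = ^ matches at every line start.
$pattern = '/^
(?<frequency>[0-9]+) \W+
(?<word>\pL+(?:,\h\pL+|\W)*) \h+
(?<gender> [\pL()]+ (?:, \h* [\pL()]+)* ) \h+
(?<word_en> [^•]*[^•\s]) \h* \R
• \h*
(?<sent_esp> [^–]*[^\s–] ) \s*–\s*
(?<sent_en> .* (?:\R .*)*? ) \h* \R
(?<num1> [0-9]+) \h* \| \h*
(?<num2> .*\S)
/xum';

// Hypothetical helper: returns one tab-separated line per matched entry.
function entry_to_tsv_rows(string $text, string $pattern): array
{
    $rows = [];
    if (!preg_match_all($pattern, $text, $matches, PREG_SET_ORDER)) {
        return $rows; // no match, or a PCRE error
    }
    $names = ['frequency', 'word', 'gender', 'word_en', 'sent_esp', 'sent_en'];
    foreach ($matches as $m) {
        $fields = [];
        foreach ($names as $name) {
            // Replace a line break (plus surrounding blanks) inside a field
            // with one space, so the field stays in a single cell when pasted.
            $fields[] = preg_replace('/\h*\R\h*/u', ' ', $m[$name]);
        }
        $rows[] = implode("\t", $fields);
    }
    return $rows;
}

// First test string from the question (line layout assumed).
$text = "1\nel, la art the (+m, f)\n"
      . "• el diccionario tenía también frases útiles – the dictionary also had\n"
      . "useful phrases\n"
      . "2055835 | 201481381\n";

echo implode("\n", entry_to_tsv_rows($text, $pattern)), "\n";
```

If the split in Calc has a different cause, the normalization step is harmless and can be dropped. The six-field layout mirrors the \1 to \6 substitution string from the question, without its trailing tab.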