Tags: regex, regex-lookarounds, regex-group

Regex: Replace double double quotes (solved), but only in lines that contain a special string (subcondition unsolved)


1. Summary of the problem

I have a csv file where I want to replace normal quotes in text with typographic ones.

It was hard (because the file also contains HTML), but I have since written a regex that does just the right thing: three capturing groups match the left quotation mark, the text inside, and the right quotation mark. Replacing is then a piece of cake.

2. Regex engine

I can use the regex engine of Notepad++ (Boost) or any PCRE2-compatible engine. For development and testing I used https://regex101.com.

3. Where I'm stuck and need your help

I want to add a subcondition so that quoted text is found only in certain lines. These lines are identified by a language string, e.g. ENGLISH or FRENCH (see the example in the screenshot).

Screenshot of a sample

The string indicating the language always appears in the same line, before the text to be found. However, once the subcondition matches, only the text in quotes (the main condition) should be matched, so that I can replace it.

The csv file contains a few thousand records, so in the worst case I could do the replacement manually. But I'm pretty sure this should also be possible with a regex.

4. What I have tried

Different approaches with lookarounds and non-capturing groups didn't lead to the desired result, possibly because I haven't fully understood how they work.

An example can be found here: https://regex101.com/r/ketwwm/1

It only contains the regex that matches the three groups, WITHOUT the subcondition I'm looking for:

("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
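To illustrate how the three groups make the replacement easy, here is a minimal sketch in Python. It assumes Python's built-in `re` module rather than Notepad++ or PCRE2, and the function name `to_typographic` and the sample string are made up for illustration. The pattern itself is the one from the question, unchanged.

```python
import re

# Pattern from the question: group 1 = opening "", group 2 = text,
# group 3 = closing "". The lookahead rejects matches inside an HTML tag,
# because the next angle bracket after the match must be "<" (or the end).
QUOTED = re.compile(r'("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))')

def to_typographic(text: str) -> str:
    """Replace ""text"" with typographic quotes, skipping HTML attributes."""
    return QUOTED.sub(lambda m: "\u201c" + m.group(2) + "\u201d", text)

# Hypothetical sample: one quoted phrase in text, one attribute inside a tag.
sample = 'He said ""hello"" to <a href=""x"">me</a>'
print(to_typographic(sample))
```

Only the phrase in the running text is converted; the `href=""x""` attribute stays as it is because it is followed by `>`.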

Hopefully someone in the community can help. (I hope I haven't missed anything; it's my first post here.)

5. Update 03/18/2022: Almost resolved with two slightly different approaches (thank you all!). What is still unsolved:

  1. Solution of @Thefourthbird (see answer 1): ^(?!.*?"ENGLISH")[^"]*".*(*SKIP)(*F)|("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))

Nearly perfect; it only misses matches inside HTML sections. HTML sections in the csv file are always enclosed in double quotes and may contain line feeds (LF). https://regex101.com/r/x5shnx/1

  2. Solution of @Wiktor Stribiżew (see the comments below): ^.*?"ENGLISH".*?\K("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))

This has the same problem with matches inside HTML sections (see above). In addition, it doesn't match text in double double quotes if more than one such entry occurs within a text. https://regex101.com/r/I4NTdb/1
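Since the remaining problem is that HTML sections may span several lines, one possible workaround (not a pure regex solution, and not something the answers propose) is to let a CSV parser group the lines into records first. The sketch below uses Python's standard `csv` module. It assumes the language is in column 0 and the text in column 1, that every field is quoted, and that the delimiter is a comma; the function name `convert_records` and these column positions are hypothetical. After parsing, an escaped `""` becomes a single `"`, so the pattern is adapted accordingly.

```python
import csv
import io
import re

# After CSV parsing, "" has become a single ". Match a quoted run that
# contains no angle brackets and is not inside a tag: the next angle
# bracket after the closing quote must be "<" (or the field must end).
QUOTED = re.compile(r'"([^<>"]*)"(?=[^<>]*(?:<|\Z))')

def convert_records(csv_text, language="ENGLISH", lang_col=0, text_col=1):
    """Convert quotes only in records whose language column matches."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    # Assumption: the original file quotes every field and uses LF line ends.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        if len(row) > max(lang_col, text_col) and row[lang_col] == language:
            row[text_col] = QUOTED.sub("\u201c\\1\u201d", row[text_col])
        writer.writerow(row)
    return out.getvalue()

# Hypothetical sample: the ENGLISH record contains HTML with a line feed.
data = ('"ENGLISH","<p class=""x"">He said\n""hi""</p>"\n'
        '"FRENCH","Il a dit ""salut"""\n')
print(convert_records(data))
```

Because the parser handles the quoting, a record with embedded line feeds is treated as one unit, and the writer re-escapes the remaining `"` characters as `""` on output.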

Screenshot (only to illustrate)


Solution

  • If you want to match multiple occurrences, you can use (*SKIP)(*F) to skip all lines that do not start with "FRENCH":

    ^"(?!FRENCH")[^"]*".*(*SKIP)(*F)|("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
    

    The pattern matches:

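    • ^ Start of the string (or of each line in multiline mode)
    • " Match literally
    • (?!FRENCH") Negative lookahead, assert not FRENCH" directly to the right
    • [^"]*" Match any char except " and match "
    • .*(*SKIP)(*F) Match the rest of the line, then discard that match and fail, so the search resumes after the skipped line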
    • | Or
    • ("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$))) Your current pattern
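The skip logic described above can be approximated in engines that lack backtracking control verbs. Python's built-in `re` module does not support (*SKIP)(*F), so the following sketch swaps in a different technique: it checks each line for the language prefix first and applies the quote pattern only to matching lines. The function name `replace_in_language_lines` and the sample data are assumptions for illustration, and records that span several lines are not handled.

```python
import re

# The quote pattern from the question, unchanged.
QUOTED = re.compile(r'("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))')

def replace_in_language_lines(text: str, language: str = "FRENCH") -> str:
    """Convert ""text"" to typographic quotes only in lines that start
    with the quoted language string, e.g. "FRENCH"."""
    prefix = '"' + language + '"'
    result = []
    for line in text.splitlines(keepends=True):
        if line.startswith(prefix):
            # Equivalent of the second alternative in the PCRE pattern.
            line = QUOTED.sub(
                lambda m: "\u201c" + m.group(2) + "\u201d", line)
        # Lines without the prefix are passed through untouched,
        # which plays the role of the (*SKIP)(*F) branch.
        result.append(line)
    return "".join(result)

# Hypothetical two-line sample.
sample = ('"ENGLISH","He said ""hi"""\n'
          '"FRENCH","Il a dit ""salut"" et ""bonjour"""\n')
print(replace_in_language_lines(sample))
```

Both quoted phrases in the FRENCH line are converted, while the ENGLISH line is left as it is.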

