
Advanced text replacement (cloze deletion)


I'd like to replace specific text based on other text. That sounds odd, so here is what I mean.

The problem is how to replace text within tab-separated values. Essentially, I'd like to replace the vocabulary string, wherever it appears in the sentence, with {...}.

The value before the tab \t is the vocab (the first column); the value after the tab is the sentence (the second column).


TL;DR Version (English Version)
Essentially, I want to replace text in the second column based on the first column.

Examples:
ABCD \t 19475ABCD_97jdhgbl
would turn into
ABCD \t 19475{...}_97jdhgbl

ABCD is the first column here and 19475ABCD_97jdhgbl is the second one.
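For illustration, the ABCD case can be sketched in a few lines of Python. This is a minimal sketch, assuming the columns are separated by a real tab character and that stray spaces around the tab can be dropped; `cloze_exact` is a hypothetical name, not something from the question.

```python
def cloze_exact(line, placeholder="{...}"):
    """Replace every exact occurrence of the vocab (column 1) in the sentence (column 2)."""
    text = line.rstrip("\n")
    # Split on the first tab only, so any later tabs stay in the sentence.
    vocab, sep, sentence = text.partition("\t")
    vocab = vocab.strip()
    # Leave the line alone if there is no tab or no vocab to look for.
    if not sep or not vocab:
        return text
    return vocab + "\t" + sentence.strip().replace(vocab, placeholder)


print(cloze_exact("ABCD\t19475ABCD_97jdhgbl\n"))
```

Because `str.replace` only matches the whole vocab string, this covers exact matches only; a vocab whose ending differs in the sentence is left unchanged.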

If you don't get the context of the Long Version below, solving this ABCD problem would be fine by me. I think the code is quite simple, but it's been about 4 years since I last coded in C and I've only recently started learning Python, so I can't do it myself.


Long Version: (Japanese-specific text)
1. Case 1: (For pure Kanji)
全部 \t それ、全部ください。
would become
全部 \t それ、{...}ください。

2. Case 2: (For pure Kana)
ああ \t ああうるさい人は苦手です。
would become
ああ \t {...}うるさい人は苦手です。

あいづち \t 彼の話に私はあいづちを打ったの。
would become
あいづち \t 彼の話に私は{...}を打ったの。

For Case 1 and Case 2 the matches have to be exact, especially for kana, because otherwise other kana in the sentence might get replaced. The code for Case 3 has to be different (see next).

3. Case 3: (for mixed Kana and Kanji)
This is the most complex one. Here, I'd like the script/solution to change only the matching part of the string, i.e., it should ignore what doesn't match and replace only what does. It should take the longest possible match and replace that.
上げる \t 彼は荷物をあみだなに上げた。
would become
上げる \t 彼は荷物をあみだなに{...}た。

Note here that the first column has 上げる but the second column has 上げた because it has changed in tense (First column has る while the second one has た).

So, ideally the solution should take the longest string found in both columns, in this case 上げ, so that this is the only string replaced with {...}, while the rest of the sentence (including the た) is left as it is.

Another example
が増える \t 値段がが増える
would become
が増える \t 値段が{...}
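One way to sketch the "longest string found in both columns" idea is with `difflib.SequenceMatcher` from the Python standard library. This is only an illustration under stated assumptions: `cloze_longest` and its `min_len` parameter are hypothetical names, and the sketch replaces just the first longest shared run rather than every occurrence.

```python
from difflib import SequenceMatcher


def cloze_longest(vocab, sentence, placeholder="{...}", min_len=1):
    """Replace the longest run of characters shared by vocab and sentence."""
    # autojunk=False switches off difflib's heuristic for long, repetitive inputs.
    matcher = SequenceMatcher(None, vocab, sentence, autojunk=False)
    m = matcher.find_longest_match(0, len(vocab), 0, len(sentence))
    # m.b is where the shared run starts in the sentence, m.size is its length.
    if m.size < min_len:
        return sentence
    return sentence[:m.b] + placeholder + sentence[m.b + m.size:]


print(cloze_longest("上げる", "彼は荷物をあみだなに上げた。"))
print(cloze_longest("が増える", "値段がが増える"))
```

With the default `min_len=1`, a single shared kana is enough to trigger a replacement, so raising `min_len` may be safer for kana-heavy vocab.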


More TL;DR

I'm actually using this for Anki.

I could use Excel or Notepad++, but I don't think they can replace text based on placeholders.

My goal here is to create pseudo-cloze sentences that I can use as hints hidden in a hint field only to be used for ridiculously hard synonyms or homonyms (I have an Auditory card).

I know I'm missing a fourth case, i.e., pure kana with the possibility of a sentence having changed its tense, hence its spelling. Well, that'd be really hard to code so I'd rather do it manually so as not to mess up the other kana in the sentence.


Update
I forgot to say that the text is contained in a .txt file in this format:

全部 \t それ、全部ください。
ああ \t ああうるさい人は苦手です。
あいづち \t 彼の話に私はあいづちを打ったの。
上げる \t 彼は荷物をあみだなに上げた。

There are about 7000 lines of those things so it has to check the replacements for every line.


The code works, thanks. There is just a minor bug: for sentences with non-full replacements, it creates broken characters.

上げたxxxx 彼は荷物をあみだなに上げあ。
ABCD    ABCD123
86876   xx86876h897
全部  それ、全部ください
ああ  ああうるさい人は苦手です。
上げたxxxx 彼は荷物をあみだなに上げあ。
務める ああうるさい人は苦手で務めす。
務める ああうるさい務めす人は苦手で。

turns into:

[screenshot of the output, showing the broken characters]


I edited James' code a bit for testing purposes. I'm using this edited version to check what kinds of strings throw off the code. So far I've discovered that spaces in the vocabulary can cause some trouble.

This code prints the original line below the parsed line.
Just change this line:
fout.write(output)
to this
fout.write(output+str(line)+'\n')


Solution

  • This regex should deal with the cases you are looking for (including matching the longest possible leading part of the first column that also appears in the second column):

    ^(\S+)(\S*?)\s+?(\S*?(\1)\S*?)$

    Regex demo here.

    You can then use the match groups to make the specific replacement you are looking for. Here is an example solution in Python:

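    import re
    
    # Group 1: longest leading part of the vocab that also occurs in the sentence.
    # Group 2: the rest of the vocab. Group 3: the whole sentence (second column).
    regex = re.compile(r'^(\S+)(\S*?)\s+?(\S*?(\1)\S*?)$')
    
    # Read and write as UTF-8 so the Japanese text survives the round trip.
    with open('file.txt', 'r', encoding='utf-8') as fin, \
            open('output.txt', 'w', encoding='utf-8') as fout:
        for line in fin:
            match = regex.match(line)
            if match:  # lines that do not match are skipped
                # Blank out every occurrence of the matched part in the sentence.
                hint = match.group(3).replace(match.group(1), '{...}')
                output = '{0}\t{1}\n'.format(match.group(1) + match.group(2), hint)
                fout.write(output)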
    

    Python demo here.
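To try the regex without touching any files, the same logic can be wrapped in a small function and run on a few lines from the question. This is a sketch for checking behaviour, not part of the original answer; `make_hint` and `samples` are hypothetical names.

```python
import re

regex = re.compile(r'^(\S+)(\S*?)\s+?(\S*?(\1)\S*?)$')


def make_hint(line):
    """Return the rewritten line, or None when the regex does not match."""
    match = regex.match(line)
    if not match:
        return None
    # group(1) is the matched part of the vocab, group(3) is the sentence.
    hint = match.group(3).replace(match.group(1), '{...}')
    return '{0}\t{1}\n'.format(match.group(1) + match.group(2), hint)


samples = [
    'ABCD\t19475ABCD_97jdhgbl\n',
    '全部\tそれ、全部ください。\n',
    '上げる\t彼は荷物をあみだなに上げた。\n',
    'ABCD\tsee ABCD here\n',  # whitespace inside the sentence
]
for sample in samples:
    print(repr(make_hint(sample)))
```

The last sample returns None: the pattern only uses `\S` for the second column, so a sentence containing spaces does not match at all. That is usually not an issue for Japanese sentences, but it matters for mixed or English test data.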