Tags: python, parsing, bert-language-model

How to parse or clean my corpus in Python


I have a corpus of Dutch chat messages, and I want to remove the usernames inside the < > brackets. I am not very familiar with parsing in Python, and I am not sure parsing is even the right way to remove the usernames, so I am looking for advice. How do I remove the usernames in Python?

This is what the .txt file looks like:

<Chickaaa> Heeerlijk zo'n kopje warme chocolademelk
<ilmas-nador> 3ndak  chi  khtk
<Chickaaa> met een sultana derbij
<bellamafia> hahah
<bellamafia> welkom terug chika
<Chickaaa> dankjee
<bellamafia> ga je nog naar school
<Chickaaa> jazeker
<bellamafia> ok
<Chickaaa> ben op stage nu
<Chickaaa> nog 7 uurtjes
<Chickaaa> pff
<bellamafia> wat doe je dan
<Chickaaa> management assistent
<bellamafia> ok
<Chickaaa> jij?

To tokenize the sentences for the BERT word embedding model, I need to put each one between [CLS] and [SEP]. I am reading the .txt file as follows:

df = pd.read_fwf('moroccorp.txt')

After that I want to mark the sentences like this:

marked_text = "[CLS] " + df + " [SEP]"

and tokenize it in this way:

# Tokenize our sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(marked_text)

Solution

  • If your sample is representative, simply remove the <...> from the beginning of each line.

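    import re
    
    # Match a leading <username> tag plus the whitespace that follows it.
    user = re.compile(r'^<[^<>]+>\s+')
    filename = 'moroccorp.txt'  # the file name used in the question
    with open(filename) as corpus:
        # Each line keeps its trailing newline; only the tag is removed.
        text = [user.sub('', line) for line in corpus]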
    

    If you want to do this in Pandas, it should not be hard to find a similar recipe for doing this transformation as part of your current code.

    Parsing generally refers to picking apart a structure of some sort (like dividing a sentence into subject, verb, and object), whereas this is a simple mechanical transformation.
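
To connect this back to the [CLS]/[SEP] step in the question, here is a minimal sketch that strips the username and wraps each message separately. The helper name mark_lines and the decision to skip lines that contain only a username are assumptions made for illustration, not part of the original answer.

```python
import re

# Same pattern as in the answer: a leading <username> tag plus whitespace.
USER_TAG = re.compile(r'^<[^<>]+>\s+')


def mark_lines(lines):
    """Strip the leading <username> tag and wrap each message in BERT markers.

    mark_lines is a hypothetical helper name used only for this sketch.
    """
    marked = []
    for line in lines:
        # Remove the tag, then the trailing newline and stray spaces.
        message = USER_TAG.sub('', line).strip()
        if message:  # assumption: lines holding only a username are skipped
            marked.append('[CLS] ' + message + ' [SEP]')
    return marked


# Sample lines copied from the question; a real run would pass the open file.
sample = [
    "<Chickaaa> Heeerlijk zo'n kopje warme chocolademelk\n",
    "<bellamafia> hahah\n",
    "<Chickaaa> jij?\n",
]
for sentence in mark_lines(sample):
    print(sentence)
```

Each marked string can then be passed to the tokenizer from the question one at a time, rather than concatenating the markers with the whole DataFrame. If you would rather stay in Pandas, the same pattern should work with Series.str.replace(r'^<[^<>]+>\s+', '', regex=True) on a column that holds one message per row.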