I have a corpus of Dutch chat messages, and I want to remove the usernames inside the < > brackets. I am not very familiar with parsing in Python, and I am not sure whether parsing is even the right way to remove the usernames, so I am looking for advice. How do I remove the usernames in Python?
This is what the .txt file looks like:
<Chickaaa> Heeerlijk zo'n kopje warme chocolademelk
<ilmas-nador> 3ndak chi khtk
<Chickaaa> met een sultana derbij
<bellamafia> hahah
<bellamafia> welkom terug chika
<Chickaaa> dankjee
<bellamafia> ga je nog naar school
<Chickaaa> jazeker
<bellamafia> ok
<Chickaaa> ben op stage nu
<Chickaaa> nog 7 uurtjes
<Chickaaa> pff
<bellamafia> wat doe je dan
<Chickaaa> management assistent
<bellamafia> ok
<Chickaaa> jij?
I need to wrap each sentence in [CLS] and [SEP] markers before tokenizing it, because I want to use the BERT word-embedding model. I am reading the .txt file as follows:
df = pd.read_fwf('moroccorp.txt')
After that I want to mark the sentences like this:
marked_text = "[CLS] " + df + " [SEP]"
and tokenize it in this way:
# Tokenize our sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(marked_text)
If your sample is representative, simply remove the <...> from the beginning of each line.
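import re

# Match a leading <username> and the whitespace that follows it.
user = re.compile(r'^<[^<>]+>\s+')
with open(filename) as corpus:
    # Strip the username prefix from every line of the file.
    text = [user.sub('', line) for line in corpus]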
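As a self-contained illustration of the same idea, here is a minimal sketch that applies the pattern to two lines from your sample and then adds the [CLS] and [SEP] markers. The helper names strip_username and mark_sentence are made up for this example, and it assumes every line starts with a <username> followed by at least one space.

```python
import re

# Matches a leading "<username>" plus the whitespace that follows it.
USER = re.compile(r'^<[^<>]+>\s+')


def strip_username(line):
    """Return the message text without the leading <username> tag."""
    # Drop the trailing newline first so it does not end up inside the markers.
    return USER.sub('', line.rstrip('\n'))


def mark_sentence(line):
    """Wrap a cleaned message in [CLS] ... [SEP] markers."""
    return "[CLS] " + strip_username(line) + " [SEP]"


# Two lines copied from the sample in the question.
sample = [
    "<Chickaaa> Heeerlijk zo'n kopje warme chocolademelk\n",
    "<bellamafia> hahah\n",
]
marked = [mark_sentence(line) for line in sample]
```

Each element of marked is an ordinary string, so it can be passed to tokenizer.tokenize one message at a time.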
If you want to do this in Pandas, it should not be hard to find a similar recipe for doing this transformation as part of your current code.
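One possible pandas version, as a rough sketch: it assumes the chat lines are already in a Series of strings (built here from a small in-memory list rather than from your file) and uses Series.str.replace with regex=True. The variable names are illustrative only.

```python
import pandas as pd

# A few lines from the sample, held in memory instead of read from disk.
lines = pd.Series([
    "<bellamafia> hahah",
    "<Chickaaa> dankjee",
    "<bellamafia> ga je nog naar school",
])

# Remove the leading <username> and the whitespace after it from every row.
messages = lines.str.replace(r'^<[^<>]+>\s+', '', regex=True)

# String concatenation with a Series is applied element by element.
marked = "[CLS] " + messages + " [SEP]"
```

Because marked is a Series rather than a single string, the tokenizer would still have to be applied per row, for example with marked.apply(tokenizer.tokenize).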
Parsing generally refers to picking apart a structure of some sort (like dividing a sentence into subject, verb, and object), whereas this is a simple mechanical transformation.