I am preprocessing data for an NLP task and need to structure the data in the following way:
[tokenized_sentence] tab [tags_corresponding_to_tokens]
I have a text file with thousands of lines in this format, where the two lists are separated by a tab. Here is an example:
['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'] ['I-ORG', 'O', 'I-MISC', 'O', 'O', 'O', 'I-MISC', 'O', 'O']
and the piece of code I used to get this is:

with open('data.txt', 'w') as foo:
    for i, j in zip(range(len(text)), range(len(tags))):
        foo.write(str([item for item in text[i].split()]) + '\t' + str([tag for tag in tags[j]]) + '\n')
where text is a list of sentences (i.e. each sentence is a string) and tags is a list of tag lists (i.e. the tags corresponding to the words/tokens of one sentence form one list).
I need the string elements in the lists to have double quotes instead of single quotes while maintaining this structure. The expected output should look like this:
["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."] ["I-ORG", "O", "I-MISC", "O", "O", "O", "I-MISC", "O", "O"]
I've tried using json.dump() and json.dumps() from the json module in Python, but I didn't get the expected output as required; instead, I get the two lists as strings. My best effort was to manually add the double quotes like this (for the tags):
for i in range(len(tags)):
    for token in tags[i]:
        tkn = "\"%s\"" % token
        print(tkn)
which gives the output
"I-ORG"
"O"
"I-MISC"
"O"
"O"
"O"
"I-MISC"
"O"
"O"
"I-PER"
"I-PER"
.
.
.
However, this seems too inefficient. I have seen related questions, but they didn't address this directly.
I'm using Python 3.8
I'm pretty sure there is no way to force Python's str()/repr() of a list to render its strings with double quotes; single quotes are the default. As @deadshot commented, you can either replace the ' with " after you write the whole string to the file, or manually add the double quotes when you write each word. The answer of this post shows many different ways to do it, the simplest being f'"{your_string_here}"'. You would need to write each string separately though, as writing a list with str() automatically adds ' around every item, and that would get messy.
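A minimal sketch of that manual-quoting approach, assuming `text` and `tags` are shaped as in the question (a list of sentence strings and a list of tag lists; the sample data and the helper name `quoted_list` are illustrative):

```python
# Sample data shaped like the question's: sentences as strings, tags as lists
text = ['EU rejects German call to boycott British lamb .']
tags = [['I-ORG', 'O', 'I-MISC', 'O', 'O', 'O', 'I-MISC', 'O', 'O']]

def quoted_list(items):
    # Double-quote each string by hand and join them into a list literal,
    # e.g. ["EU", "rejects", ...]
    return '[' + ', '.join(f'"{item}"' for item in items) + ']'

with open('data.txt', 'w') as foo:
    for sentence, sentence_tags in zip(text, tags):
        foo.write(quoted_list(sentence.split()) + '\t' + quoted_list(sentence_tags) + '\n')
```

This writes the double quotes directly, so no post-hoc replacement is needed, at the cost of formatting each line yourself.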
Just do a find-and-replace of ' with " after you write the string to the file.
You can even do it with Python:

# after the string is written in 'data.txt'
with open('data.txt', "r") as f:
    text = f.read()

text = text.replace("'", '"')

with open('data.txt', "w") as f:
    f.write(text)
Do this instead of the above; it should fix most of the problems, as it searches for the string ', ', which, hopefully, only appears at the end of one string and the start of the next:
with open('data.txt', "r") as f:
    text = f.read()

# replace ' at the start of the list
text = text.replace("['", '["')
# replace ' at the end of the list
text = text.replace("']", '"]')
# replace ' at the item boundaries inside the list
text = text.replace("', '", '", "')

with open('data.txt', "w") as f:
    f.write(text)
Running this solves the problem I described in the comment and returns the expected output:
with open('data.txt', "r") as f:
    text = f.read()

# replace ' at the start of the list
text = text.replace("['", '["')
# replace ' at the end of the list
text = text.replace("']", '"]')
# replace ' at the item boundaries inside the list
text = text.replace("', '", '", "')
# catch any remaining ' next to a comma
text = text.replace("', ", '", ')
text = text.replace(", '", ', "')

with open('data.txt', "w") as f:
    f.write(text)