Tags: python, preprocessor, python-3.8

How to convert string elements with single quotes to double quotes in a Python list


I am preprocessing data for an NLP task and need to structure the data in the following way:

[tokenized_sentence] tab [tags_corresponding_to_tokens]

I have a text file with thousands of lines in this format, where the two lists are separated by a tab. Here is an example

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']    ['I-ORG', 'O', 'I-MISC', 'O', 'O', 'O', 'I-MISC', 'O', 'O']

and the piece of code I used to get this is

with open('data.txt', 'w') as foo:
    for sentence, sentence_tags in zip(text, tags):
        # str() of a list writes its items with single quotes
        foo.write(str(sentence.split()) + '\t' + str(list(sentence_tags)) + '\n')

where text is a list of sentences (each sentence is a string) and tags is a list of tag lists (each element holds the tags corresponding to the words/tokens of one sentence).

I need to get the string elements in the lists to have double quotes instead of single quotes while maintaining this structure. The expected output should look like this

["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]    ["I-ORG",  "O", "I-MISC", "O", "O", "O", "I-MISC", "O", "O"]

I've tried using json.dump() and json.dumps() from the json module, but I didn't get the expected output; instead, I get the two lists as strings. My best effort was to manually add the double quotes like this (for the tags):

for i in range(len(tags)):
    for token in tags[i]:
        tkn = "\"%s\"" %token
        print(tkn)

which gives the output

"I-ORG"
"O"
"I-MISC"
"O"
"O"
"O"
"I-MISC"
"O"
"O"
"I-PER"
"I-PER"
.
.
.

However, this seems too inefficient. I have looked at related questions, but they didn't address this directly.

I'm using Python 3.8.


Solution

  • I'm pretty sure there is no way to force Python's built-in str()/repr() of a list to use double quotes; the default is single quotes. As @deadshot commented, you can either replace the ' with " after you write the whole string to the file, or manually add the double quotes as you write each word. The answer to that post shows many different ways to do it, the simplest being f'"{your_string_here}"'. You would need to write each string separately though, as writing a list automatically adds ' around every item, and that would be very spaghetti; a rough sketch of that approach is shown right below.
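
    A minimal sketch of the manual per-token approach described above (this block is not part of the original answer; it assumes text and tags exactly as in the question):

    # sketch: quote every token explicitly instead of relying on str(list),
    # which writes items with single quotes
    with open('data.txt', 'w') as foo:
        for sentence, sentence_tags in zip(text, tags):
            tokens = ', '.join(f'"{tok}"' for tok in sentence.split())
            tag_str = ', '.join(f'"{tag}"' for tag in sentence_tags)
            foo.write(f'[{tokens}]\t[{tag_str}]\n')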

    Just do find and replace ' with " after you write the string to the file.

    You can even do it with Python:

    # after the string is written in 'data.txt'
    with open('data.txt', "r") as f:
        text = f.read()
    
    text = text.replace("'", '"')
    
    with open('data.txt', "w") as f:
        f.write(text)
    

    Edit according to OP's comment below

    Do this instead of the above. It should fix most of the problems, as it searches for the sequence ', ' (closing quote, comma, space, opening quote), which, hopefully, only appears between the end of one item and the start of the next:

    with open('data.txt', "r") as f:
        text = f.read()
    
    # replace ' at the start of the list
    text = text.replace("['", '["')
    
    # replace ' at the end of the list
    text = text.replace("']", '"]')
    
    # replace the ', ' separators between items inside the list
    text = text.replace("', '", '", "')
    
    with open('data.txt', "w") as f:
        f.write(text)
    

    (Edit by OP) New edit based on my latest comment

    Running this solves the problem I described in the comment and produces the expected output:

    with open('data.txt', "r") as f:
        text = f.read()
    
    # replace ' at the start of the list
    text = text.replace("['", '["')
    
    # replace ' at the end of the list
    text = text.replace("']", '"]')
    
    # replace the ', ' separators between items inside the list
    text = text.replace("', '", '", "')
    
    # also fix separators next to items that repr() already wrote with double
    # quotes because they contain an apostrophe (e.g. "n't")
    text = text.replace("', ", '", ')
    
    text = text.replace(", '", ', "')
    
    with open('data.txt', "w") as f:
        f.write(text)
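
    For illustration (the example list here is hypothetical, not taken from the original data): a list written as ['EU', "n't", '.'] (the middle token was already written with double quotes because it contains an apostrophe) comes out of the replacement chain above as ["EU", "n't", "."], which matches the expected format.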