Python - Extract words from quotes from a paragraph without regex


I have the following paragraph as input from a .txt file:

... Lorem "ipsum dolor sit amet, consectetur adipiscing elit.". Praesent non sem urna. Pellentesque elementum "turpi'" est, "in fermentum diam auctor aliquam!". Morbi rhoncus erat ipsum, eu "tristique" ...

Here it is as a Python string:

'Lorem "ipsum dolor sit amet, consectetur adipiscing elit.". Praesent non sem urna. Pellentesque elementum "turpi\'" est, "in fermentum diam auctor aliquam!". Morbi rhoncus erat ipsum, eu "tristique"'

I want to create a list of only the quoted phrases and isolate the words within the quotes as a list (delimited by white spaces).

Output:

['ipsum', 'dolor', 'sit', 'amet,', 'consectetur', 'adipiscing', 'elit.', "turpi'", 'in', 'fermentum', 'diam', 'auctor', 'aliquam!', 'tristique']

My thought process has been to read in the file and then somehow split the paragraph by quotes but I cannot seem to find a way to get 'split()' working how I want. I have a feeling this can be done with minimal looping and using split() as a means to organize the data WITHOUT the use of re, shlex, csv or other imported modules.

I even thought about adding the delimiter back into the list and then 'cleaning' the list. But even this feels a bit more complicated than it should be.

The code below adds double quotes to every item in the array, which is not what I want. Just a way I felt I could keep track of the quote after using split().

with open(input_file, "r") as read_file:
    for line in read_file:
        quotes = ['"' + i + '"' for i in line.split('"') if i]

Solution

  • copied from my comment:

    Once you split using " as the delimiter, the quoted text sits at the odd-indexed elements of the resulting list (the even-indexed elements are the text outside the quotes), so you can simply extract those. Then split each of them normally (on whitespace) and concatenate the resulting lists.

    Example:

    text = """Lorem "ipsum dolor sit amet, consectetur adipiscing elit.". Praesent non sem urna. Pellentesque elementum "turpi'" est, "in fermentum diam auctor aliquam!". Morbi rhoncus erat ipsum, eu "tristique" """
    
    text_split_by_quotes = text.split('"')
    # get the odd-indexed elements (here's one way to do it):
    text_in_quotes = text_split_by_quotes[1::2]
    # split each normally (by whitespace) and flatten the list (here's one way to do it):
    ans = []
    for phrase in text_in_quotes:  # separate loop name so `text` is not overwritten
        ans.extend(phrase.split())
    # print answer
    print(ans)
    
    >>> ['ipsum', 'dolor', 'sit', 'amet,', 'consectetur', 'adipiscing', 'elit.', "turpi'", 'in', 'fermentum', 'diam', 'auctor', 'aliquam!', 'tristique']
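
The same idea can be wrapped in a small reusable function. This is only a sketch: the name quoted_words and the sample string are made up for illustration, and it assumes the double quotes are balanced and never nested or escaped.

```python
def quoted_words(text):
    """Return the whitespace-separated words found inside double quotes."""
    # After splitting on '"', the pieces at odd indexes are the quoted ones.
    # Assumption: quotes are balanced and never nested or escaped.
    pieces = text.split('"')
    # Split each quoted piece on whitespace and flatten into one list.
    return [word for phrase in pieces[1::2] for word in phrase.split()]


# Hypothetical sample input, shorter than the paragraph in the question.
sample = 'Lorem "ipsum dolor" sit "amet," consectetur "adipiscing elit."'
print(quoted_words(sample))
# ['ipsum', 'dolor', 'amet,', 'adipiscing', 'elit.']
```

If the text contains an odd number of double quotes, everything after the last quote is treated as quoted, so it may be worth checking text.count('"') % 2 == 0 first. When reading from a file, passing the whole contents (read_file.read()) rather than one line at a time should also cover quotes that span several lines.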