Find and print text in quotation marks from a text file with python


I am a Python beginner and want to capture all text in quotation marks from a text file. I have tried the following:

filename = raw_input("Enter the full path of the file to be used: ")
input = open(filename, 'r')
import re
quotes = re.findall(ur'"[\^u201d]*["\u201d]', input)
print quotes

I get the error:

Traceback (most recent call last):
  File "/Users/nithin/Documents/Python/Capture Quotes", line 5, in <module>
    quotes = re.findall(ur'"[\^u201d]*["\u201d]', input)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

Can anyone help me out?


Solution

  • As Bakuriu has pointed out, you need to add .read() like so:

    quotes = re.findall(ur'[^\u201d]*[\u201d]', input.read())
    

    open() merely returns a file object, whereas calling .read() on that object returns a string, which is what re.findall expects. In addition, I'm guessing you want everything between two quotation marks rather than just zero or more occurrences of [^\u201d] before a quotation mark. So I would try this:

    quotes = re.findall(ur'[\u201d][^\u201d]*[\u201d]', input.read(), re.U)
    

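    The re.U flag (re.UNICODE) makes escapes such as \w and \b follow Unicode character properties; this pattern uses none of them, so the flag is optional here. Or, if the text is delimited by plain ASCII double quotes rather than pairs of right double quotation marks, and you don't need Unicode: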

    quotes = re.findall(r'"[^"]*"', input.read(), re.U)
    

    Finally, you may want to choose a different variable name than input: input is not a keyword, but it is the name of a built-in function in Python, and assigning to it shadows that function.

    Your result might look something like this:

    >>> input2 = """
    cfrhubecf "ehukl wehunkl echnk
    wehukb ewni; wejio;"
    "werulih"
    """
    >>> quotes = re.findall(r'"[^"]*"', input2, re.U)
    >>> print quotes
    ['"ehukl wehunkl echnk\nwehukb ewni; wejio;"', '"werulih"']
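
Putting the pieces above together, here is a minimal sketch rather than a definitive implementation. It assumes the file is UTF-8 encoded and that curly quotes come in opening (U+201C) and closing (U+201D) pairs, which may not match your data. The names QUOTE_PATTERN, find_quoted and find_quoted_in_file are made up for illustration. It is written for Python 3; io.open and the u'' literals are used so that it should also run on Python 2.7.

```python
import io
import re

# Matches either "straight" quotes or curly quotes (U+201C ... U+201D).
# The negated classes also match newlines, so a quote may span several lines.
# Assumption: curly quotes are properly paired as opening/closing marks.
QUOTE_PATTERN = re.compile(u'"[^"]*"|\u201c[^\u201d]*\u201d')


def find_quoted(text):
    """Return every quoted span in text, quotation marks included."""
    return QUOTE_PATTERN.findall(text)


def find_quoted_in_file(path, encoding="utf-8"):
    """Read the whole file as text, then search the resulting string."""
    # re.findall needs a string, not a file object, hence the .read() call.
    # Assumption: the file is encoded as UTF-8 unless stated otherwise.
    with io.open(path, "r", encoding=encoding) as handle:
        return find_quoted(handle.read())


sample = u'He said "hello" and she replied \u201cgood\nbye\u201d.'
print(find_quoted(sample))
```

For the sample string, find_quoted should return a list of two matches: the straight-quoted "hello" and the curly-quoted span that crosses the line break. Naming the file handle handle rather than input also avoids shadowing the built-in.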