python parsing readfile

f.readline versus f.read print output


I am new to Python (using Python 3.6). I have a read.txt file containing information about a firm. The file starts with several report characteristics:

CONFORMED PERIOD REPORT:             20120928 #this is 1 line
DATE OF REPORT:                      20121128 #this is another line

and then starts all the text about the firm..... #lots of lines here

I am trying to extract both dates (['20120928','20121128']) and to check whether certain strings appear in the text (if a string exists, I want a '1', otherwise a '0'). Ultimately, I want a vector with both dates followed by the 1s and 0s for the different strings, something like ['20120928','20121128','1','0']. My code is the following:

exemptions = [] #vector I want

with open('read.txt', 'r') as f:
    line2 = f.read()  # read the txt file
    for line in f:
        if "CONFORMED PERIOD REPORT" in line:
            exemptions.append(line.strip('\n').replace("CONFORMED PERIOD REPORT:\t", ""))  # add line without stating CONFORMED PERIOD REPORT, just with the date)
        elif "DATE OF REPORT" in line:
            exemptions.append(line.rstrip('\n').replace("DATE OF REPORT:\t", "")) # idem above

    var1 = re.findall("string1", line2, re.I)  # find string1 in line2, case-insensitive
    if len(var1) > 0:  # if the string appears, it will have length>0
        exemptions.append('1')
    else:
        exemptions.append('0')
    var2 = re.findall("string2", line2, re.I)
    if len(var2) > 0:
        exemptions.append('1')
    else:
        exemptions.append('0')

print(exemptions)

If I run this code, I get ['1','0']: the dates are missing, but the string checks are correct (string1 exists, so '1'; string2 does not, so '0'). What I don't understand is why the dates are not reported. Importantly, when I change that line to "line2 = f.readline()", I get ['20120928','20121128','0','0']. The dates are fine now, but I know that string1 exists, so it seems the rest of the file is not being read. If I omit "line2 = f.read()" altogether, it prints a vector of 0s for each line, in addition to my desired output. How can I omit these 0s?

My desired output would be: ['20120928','20121128','1','0']

Sorry for bothering. Thank you anyway!


Solution

  • This is how I finally solved it:

    import re

    exemptions = []  # vector I want

    with open('read.txt', 'r') as f:
        line2 = ""  # collects the whole file text while looping over the lines
        for line in f:
            line2 = line2 + line  # append each line to the string created above
            if "CONFORMED PERIOD REPORT" in line:
                # drop the label so that only the date remains
                exemptions.append(line.strip('\n').replace("CONFORMED PERIOD REPORT:\t", ""))
            elif "DATE OF REPORT" in line:
                exemptions.append(line.rstrip('\n').replace("DATE OF REPORT:\t", ""))  # same as above

        var1 = re.findall("string1", line2, re.I)  # find string1 in line2, case-insensitive
        if len(var1) > 0:  # a non-empty list means the string appears
            exemptions.append('1')
        else:
            exemptions.append('0')
        var2 = re.findall("string2", line2, re.I)
        if len(var2) > 0:
            exemptions.append('1')
        else:
            exemptions.append('0')

    print(exemptions)
    

    This is what I have so far. It works for me, although I guess using BeautifulSoup might make the code more efficient. Next step :)
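
For context on why the first attempt returned no dates: f.read() consumes the whole file and leaves the file position at the end, so the "for line in f" loop that follows has nothing left to iterate over. Likewise, f.readline() consumes only the first line, so the regex search saw just that line. The sketch below illustrates this and shows one possible alternative that reads the text once and then works on the string. It is only a sketch under assumptions: SAMPLE is a made-up stand-in for read.txt (with tab-separated header fields), and extract is a hypothetical helper name, not part of the original code.

```python
import io
import re

# Hypothetical stand-in for read.txt; tab-separated header fields are an assumption.
SAMPLE = (
    "CONFORMED PERIOD REPORT:\t20120928\n"
    "DATE OF REPORT:\t20121128\n"
    "\n"
    "Lots of text about the firm, mentioning String1 once.\n"
)

# read() leaves the file position at the end of the file,
# so iterating over the same file object afterwards yields no lines.
f = io.StringIO(SAMPLE)
whole_text = f.read()
leftover_lines = list(f)  # empty: nothing is left to iterate over


def extract(text, patterns):
    """Return the header dates followed by a '1'/'0' flag for each pattern."""
    result = []
    # Work on the already-read string instead of re-reading the file.
    for line in text.splitlines():
        for label in ("CONFORMED PERIOD REPORT:", "DATE OF REPORT:"):
            if line.startswith(label):
                # Drop the label and surrounding whitespace (tabs or spaces).
                result.append(line[len(label):].strip())
    for pattern in patterns:
        # re.search stops at the first match, which is enough for a yes/no flag.
        result.append("1" if re.search(pattern, text, re.I) else "0")
    return result


exemptions = extract(whole_text, ["string1", "string2"])
print(exemptions)
```

With a real file, the same helper could be called as extract(open('read.txt').read(), [...]) inside a with block. Calling f.seek(0) after f.read() would also let the original loop run, at the cost of reading the file twice.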