I'm trying to write a program to parse a series of HTML files and store the resulting data in a .csv spreadsheet, which is incredibly reliant on newlines being in exactly the right place. I've tried every method I can find to strip the linebreaks away from certain pieces of text, to no avail. The relevant code looks like this:
soup = BeautifulSoup(f)
ID = soup.td.get_text()
ID.strip()
ID.rstrip()
ID.replace("\t", "").replace("\r", "").replace("\n", "")
dateCreated = soup.td.find_next("td").get_text()
dateCreated.replace("\t", "").replace("\r", "").replace("\n", "")
dateCreated.strip()
dateCreated.rstrip()
# debug
print('ID:' + ID + 'Date Created:' + dateCreated)
And the resulting output looks like this:
ID:
FOO
Date Created:
BAR
This and another problem with the same program have been driving me up the wall. Help would be fantastic. Thanks.
EDIT: Figured it out, and it was a pretty stupid mistake. Instead of just doing
ID.replace("\t", "").replace("\r", "").replace("\n", "")
I should have done
ID = ID.replace("\t", "").replace("\r", "").replace("\n", "")
The problem is that you're expecting these methods to modify the string in place. Python strings are immutable, so strip(), rstrip(), and replace() each return a new string and leave the original untouched.
ID.strip() # returns the stripped value, doesn't change ID.
ID = ID.strip() # Would be more appropriate.
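To make the difference concrete, here is a minimal sketch. The value of raw_id is a made-up stand-in for whatever soup.td.get_text() returns, and the variable names are only illustrative:

```python
# Hypothetical value standing in for soup.td.get_text()
raw_id = "\n\tFOO\r\n"

# The stripped string is returned and immediately discarded;
# raw_id itself still holds the original text.
raw_id.strip()
unchanged = raw_id

# Assigning the return value is what actually keeps the result.
cleaned_id = raw_id.strip()

print(repr(unchanged))
print(repr(cleaned_id))
```

The same applies to rstrip() and replace(): each call has to be assigned back (or chained) for the change to stick.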
You could use a regex, but that is overkill here. If the unwanted characters are only at the beginning and end of the string, just pass them to strip:
ID = ID.strip('\t\r\n')
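Note that strip only touches the ends of the string. If a cell might also contain tabs or newlines in the middle of the text, one possible approach (a sketch, not the only way) is to split on whitespace and re-join. The helper name clean_cell and the sample strings below are assumptions for illustration:

```python
def clean_cell(text):
    """Collapse each run of whitespace (tabs, newlines, spaces) into a
    single space and drop leading/trailing whitespace."""
    # str.split() with no arguments splits on any whitespace run and
    # never yields empty strings, so joining with " " normalizes the text.
    return " ".join(text.split())


# Hypothetical cell contents, similar to what get_text() might return
id_text = clean_cell("\n\tFOO\r\n")
date_text = clean_cell("\r\n BAR\n\t2014\n")

print("ID:" + id_text + " Date Created:" + date_text)
```

This keeps each value on one line, which matters when every row of the .csv has to end in exactly one newline.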