I am trying to parse some HTML by pasting it into a single string object. However, when I paste in the HTML, I get a ton of underlines in PyCharm, which I suspect is because of the formatting (see screenshot). This breaks my program, because I am splitting on \n\n, which should represent a blank line.
This is what I get when I paste in the code:
However, this is what I want, which has no problems when I split the string with \n\n:
I have tried pasting the HTML that I want to use as a string into Notepad and converting it to plain text, but to no avail. I have also turned off any 'auto indent' features in PyCharm. Can anyone tell me how to fix this, so I can paste in longer chunks of HTML (of the same structure, separated by blank lines) and still have my code work? Or is there some way to know what to split the string by when I paste in long chunks of HTML (my intuition is that some tabs get added, but I can't figure it out)?
Without access to the real HTML/XML text (as text, not as an image), I can only offer some general help. Note also that the two sample texts look different when compared to each other.
Another option, since you're using BeautifulSoup: pass your "fullHtmlString" variable as a parameter together with the "lxml" parser (you have to install it first, at OS level [libxml2 and libxslt] and via pip [pip3.6 install lxml, as an example]) and let BeautifulSoup help you see what is visibly wrong in your HTML/XML text while printing it:
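from bs4 import BeautifulSoup
# Parse the pasted string with the lxml parser, then print it re-indented
soup = BeautifulSoup(fullHtmlString, 'lxml')
print(soup.prettify())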
You can also use the "Reformat Code" and "Fill Paragraph" options together in PyCharm to format your entire code, especially where it runs past the margin recommended by PEP 8. Once the code is reformatted, you can usually spot any syntax errors by yourself.
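On the splitting itself: if the paste really did add tabs or spaces to the "blank" lines (as you suspect), or the text has Windows line endings, a split on exactly \n\n will miss those separators. Here is a minimal sketch of a more tolerant split. The function name split_blocks and the sample string are made up for illustration, and I'm assuming your chunks are separated by lines that contain nothing but whitespace:

```python
import re


def split_blocks(text):
    """Split text into chunks separated by blank (or whitespace-only) lines."""
    # Normalize Windows (\r\n) and old Mac (\r) line endings to \n.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # A "blank" line may still hold spaces or tabs, so match a newline,
    # optional spaces/tabs, and another newline; the trailing \s* also
    # swallows extra blank lines and the next chunk's leading indentation.
    chunks = re.split(r"\n[ \t]*\n\s*", normalized.strip())
    return [chunk for chunk in chunks if chunk.strip()]


# Hypothetical sample: the first separator line contains a tab and two spaces,
# the second one uses Windows line endings.
sample = "<div>first</div>\n\t  \n<div>second</div>\r\n\r\n<div>third</div>"
blocks = split_blocks(sample)

print(repr(sample))  # repr() makes hidden tabs and \r characters visible
print(blocks)
```

Printing repr(fullHtmlString) the same way should show you exactly which characters sit between your chunks, so you can adjust the pattern if my assumption about the separators is wrong.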
Hope it helps (: