Cannot Paste HTML into String in Python


I am trying to parse some HTML by pasting it into a single string object. However, when I paste in the HTML, I get a ton of underlines in PyCharm, which I suspect is because of the formatting (see screenshot). This breaks my program, because I am splitting on \n\n, which should represent a blank line.

This is what I get when I paste in the code:

[Screenshot: badPyCharm]

However, this is what I want, which has no problems when I split the string with \n\n:

[Screenshot: goodPyCharm]

I have tried pasting the HTML that I want to use as a string into Notepad and converting it to plain text, but to no avail. I have also turned off any 'auto indent' features in PyCharm. Can anyone tell me how to fix this, so I can paste in longer chunks of HTML (of the same structure, separated by blank lines) and still have my code work? Or is there some way to know what to split the string by when I paste in long chunks of HTML? My intuition is that some tabs get added, but I can't figure it out.


Solution

  • Without access to the real HTML/XML text (only images of it), and given that the two samples do not look identical when compared, here are some suggestions:

    1. Your code shouldn't break because of something inside your text variable as long as you use a triple-quoted string (single or double quotes). As an off-topic note, PEP 257 recommends triple double quotes for docstrings, and PEP 8 suggests double quotes for other triple-quoted strings as well, for consistency.
    2. You can always run your text through an online HTML/XML formatter before pasting it into your script in the IDE, much as you would check the validity of JSON content. These formatters help detect what is wrong in your text according to the parsing rules.
    3. Another option, since you're using BeautifulSoup: pass your "fullHtmlString" variable to it with the "lxml" parser and print the result, so that BeautifulSoup helps you see what is visibly wrong in your HTML/XML text. The parser has to be installed first via pip (pip3.6 install lxml, as an example); depending on your system, you may also need libxml2 and libxslt installed at OS level.

      from bs4 import BeautifulSoup
      # fullHtmlString is the variable holding your pasted HTML
      soup = BeautifulSoup(fullHtmlString, 'lxml')
      print(soup.prettify())
      
    4. You can use the "Reformat Code" and "Fill Paragraph" actions together in PyCharm to format your entire file, especially where lines run past the margin recommended by PEP 8. Once the code is reformatted, any syntax errors are usually easy to spot.
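
On the last part of the question (what to split on): one possibility is that the "blank" lines in the pasted text are not truly empty but contain spaces or tabs, in which case splitting on \n\n finds no separator. Here is a minimal sketch, assuming that is the cause. The sample string is made up to imitate such a paste, and the names full_html_string, naive and chunks are illustrative, not taken from your code. repr() makes the hidden characters visible, and a regular expression splits on lines that contain only whitespace.

```python
import re

# Hypothetical sample standing in for the pasted HTML. The "blank" line
# between the two blocks actually holds four spaces and a tab, as an
# editor's indentation handling might leave behind.
full_html_string = (
    "<div>\n"
    "    <p>first</p>\n"
    "</div>\n"
    "    \t\n"
    "<div>\n"
    "    <p>second</p>\n"
    "</div>\n"
)

# repr() shows newlines, tabs and spaces explicitly, so you can see
# what really separates the blocks.
print(repr(full_html_string))

# A plain split on "\n\n" finds no separator in this sample.
naive = full_html_string.split("\n\n")

# Split on a newline followed by one or more lines that contain only
# spaces, tabs or carriage returns. Empty pieces are dropped and the
# surrounding whitespace of each chunk is stripped.
blank_line = re.compile(r"\n(?:[ \t\r]*\n)+")
chunks = [c.strip() for c in blank_line.split(full_html_string) if c.strip()]

print(len(naive), len(chunks))
```

If repr() shows something else between your blocks (for example \r\n line endings only, or no blank line at all), the pattern would need to be adjusted to match.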

    Hope it helps (: