
Python - BeautifulSoup: differentiate parsed text inside an HTML element by using internal tags


So, I'm working on an html parser to extract some text data from a list of and format it before giving an output. I have a title that I need to set as bold, and a description which I'll leave as it is. I've found myself stuck when I reached this situation:

<div class ="Content">
  <Strong>Title:</strong>
  description
</div>

As you can see the strings are actually already formatted but I can't seem to find a way to get the tags and the text out together. What my script does kinda looks like:

article = ""  # all the formatted text is collected here; I need it as one long string before output
temp1 = ""
temp2 = ""
result = soup.findAll("div", {"class": "Content"})
if result is not None:
  x = 0
  for i in result.find("strong"):  # this is the line that raises the AttributeError
    if x == 0:
      temp1 = "<strong>" + i.text + "</strong>"
      article += temp1
      x = 1
    else:
      temp2 = i.nextSibling  # I know this is wrong
      article += temp2
      x = 0
print(article)

It throws an AttributeError, but the message looks misleading to me, since it ends with "Did you call find_all() when you meant to call find()?".

I also know I can't just use .nextSibling like that, and I'm literally losing it over something that looks so simple to solve...

What I need to get is: "Title: description"

Thanks in advance for any response.


Sorry if I didn't explain clearly what I'm trying to accomplish; it's a bit involved. I need the data to generate a POST request to a CKEditor session so that it adds the text to the HTML page, but the text has to be formatted in a certain way before uploading. In this case I need to take the element inside the tags and format it, then do the same with the description, and print them one after the other. For example, a request could look like:

http://server/upload.php?desc=<ul>%0D%0A%09<li><strong>Title%26nbsp%3B<%2strong>description<%2li><%2ul>

So that the result is:

  • Title1: description

So what I need to do is differentiate between the text inside the tag and the text outside it, using the tag itself as a reference.


Solution

  • EDIT

    To select the <strong> element, use:

    strong = soup.select_one('div.Content strong')
    

    Then take its nextSibling, which is the text node that follows the closing tag:

    strong.nextSibling
    

    You may need to strip it to get rid of the surrounding whitespace:

    strong.nextSibling.strip()
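
    The steps above can be combined into one pass over every matching div. This is only a sketch: the helper name format_blocks is made up for illustration, it assumes each div holds one <strong> title followed directly by plain text, and it uses next_sibling, the current spelling of nextSibling. Looping over the result of find_all and calling find on each div also avoids the AttributeError from the question, which comes from calling find on the whole result list.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''
<div class ="Content">
  <Strong>Title:</strong>
  description
</div>
'''


def format_blocks(markup):
    """Build '<strong>title</strong> description' for each div.Content (hypothetical helper)."""
    soup = BeautifulSoup(markup, 'html.parser')
    parts = []
    # find_all returns a list-like ResultSet, so each div is handled on its own
    for div in soup.find_all('div', class_='Content'):
        strong = div.find('strong')
        if strong is None:
            continue  # assumption: divs without a title are skipped
        title = strong.get_text(strip=True)
        sibling = strong.next_sibling
        # Assumption: the description is the text node right after </strong>;
        # anything else (another tag, or nothing) is treated as an empty description
        description = sibling.strip() if isinstance(sibling, NavigableString) else ''
        parts.append(f'<strong>{title}</strong> {description}')
    return ''.join(parts)


article = format_blocks(html)
print(article)
```

    With the sample markup this prints <strong>Title:</strong> description, a single string that keeps the tag around the title.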
    

    Just in case

    You can use ANSI escape sequences to print something in bold, but I am not sure why you would want to do that. That part of your question could be made clearer.

    Example

    from bs4 import BeautifulSoup
    
    html='''
    <div class ="Content">
      <Strong>Title:</strong>
      description
    </div>
    '''
    soup = BeautifulSoup(html,'html.parser')
    text = soup.find('div', {'class': 'Content'}).get_text(strip=True).split(':')
    
    print('\033[1m'+text[0]+': \033[0m'+ text[1])
    

    Output

    Title: description
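
    For the upload step mentioned in the question, the formatted fragment has to be percent-encoded before it goes into the URL. The sketch below uses the standard library for that. The endpoint and the desc parameter are copied from the question's example URL, the fragment itself is an assumed sample, and whether the server expects the data in the query string or in a POST body is not something the question settles.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Endpoint and parameter name are taken from the example URL in the question
base_url = 'http://server/upload.php'

# Assumed sample fragment, shaped like the list item the question describes
fragment = '<ul>\r\n\t<li><strong>Title:</strong> description</li></ul>'

# urlencode percent-encodes reserved characters such as <, >, / and line breaks
url = base_url + '?' + urlencode({'desc': fragment})
print(url)

# Decoding the query string again should give back the original fragment
decoded = parse_qs(urlsplit(url).query)['desc'][0]
print(decoded == fragment)
```

    Letting urlencode do the escaping avoids hand-written sequences, which are easy to get wrong for characters like the slash in closing tags.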