python, xpath, beautifulsoup, ixmldomelement

How to get text which has no HTML tag | Add multiple delimiters in split


The following snippet selects the div element with class ajaxcourseindentfix, splits its text on "Prerequisite: ", and gives me all the content after it.

div = soup.select("div.ajaxcourseindentfix")[0]
" ".join([word for word in div.stripped_strings]).split("Prerequisite: ")[-1]

My div can contain not only Prerequisite but also the following split points:

Prerequisites
Corequisite
Corequisites

Whenever the text contains Prerequisite, the code above works fine, but whenever one of the other three appears, the split fails and gives me the whole text.

Is there a way to use multiple delimiters in the split? Or how else do I solve it?

Sample pages:

Corequisite URL: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96106&show

Prerequisite URL: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96564&show

Both: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=98590&show

[Old Thread] - How to get text which has no HTML tag


Solution

  • This code solves your problem unless you need XPath specifically. I would also suggest reviewing the BeautifulSoup documentation on the methods I've used.

    .next_element and .next_sibling can be very useful in these cases. .next_elements returns a generator, which we either convert to a list or iterate over directly.
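
To make the difference between these attributes concrete, here is a minimal sketch on an inline HTML string. The markup is made up for illustration and is not the catalog page, and 'html.parser' is used only so the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup for illustration only; it is not the catalog page.
html = '<div><h3>CPSC 101</h3>Intro text <strong>Prerequisite:</strong> MATH 1</div>'
soup = BeautifulSoup(html, 'html.parser')

h3 = soup.h3

# .next_element walks in document order, so it descends into the <h3> first
first_element = h3.next_element   # the string inside the <h3>

# .next_sibling stays on the same level and skips the <h3>'s own children
first_sibling = h3.next_sibling   # the bare string right after </h3>

# .next_elements is a generator; filter it lazily or convert it to a list
strings_after = [el.strip() for el in h3.next_elements
                 if isinstance(el, NavigableString) and el.strip()]
print(strings_after)
```

Text that has no tag of its own (here 'Intro text ' and ' MATH 1') shows up as NavigableString objects, which is why filtering on that type picks it up.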

    from bs4 import BeautifulSoup
    import requests
    
    
    url = 'http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96564&show'
    makereq = requests.get(url).text
    
    soup = BeautifulSoup(makereq, 'lxml')
    
    whole = soup.find('td', {'class': 'custompad_10'})
    # select the enclosing table cell (<td>); narrowing to it is optional here
    thedivs = whole.find_all('div')
    # list of all <div> elements nested inside that cell
    
    title_h3 = thedivs[2]
    # take the third <div> (index 2) and save it in a variable
    
    mytitle = title_h3.h3
    # .h3 navigates to the first <h3> element inside that <div>
    
    mylist = list(mytitle.next_elements)
    # the <h3> is still part of the tree, so .next_elements yields
    # everything that follows it in document order
    
    the_text = mylist[3]
    # converting the generator to a list (i.e. list(...))
    # lets us select specific elements by index
    
    prerequisite = mylist[6]
    
    which_cpsc = mylist[8]
    
    other_text = mylist[11]
    
    print(the_text, ' is the text')
    print(which_cpsc, other_text, ' is the cpsc and othertext ')
    # this is for testing purposes
    

    This solves both issues: we don't need CSS selectors or awkward list manipulations, because we simply walk the parse tree from the title onward.
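
If you would rather stay with the string-splitting approach from the question, multiple delimiters can be handled with re.split. The following is only a sketch under assumptions: the labels are spelled Prerequisite(s) and Corequisite(s) and are followed by a colon, as in the "Prerequisite: " split in the question. The helper name split_requisites and the sample string are hypothetical.

```python
import re

# Matches "Prerequisite:", "Prerequisites:", "Corequisite:" and "Corequisites:"
LABEL = re.compile(r'\b((?:Pre|Co)requisites?):\s*')

def split_requisites(text):
    """Return {label: content} for every requisite label found in text."""
    pieces = LABEL.split(text)
    # With one capturing group, re.split returns
    # [before, label1, content1, label2, content2, ...]
    labels, contents = pieces[1::2], pieces[2::2]
    return {label: content.strip() for label, content in zip(labels, contents)}

# Made-up sample text, not taken from the catalog pages
sample = 'COURSE 101 (3) Prerequisites: COURSE 100. Corequisite: COURSE 102.'
result = split_requisites(sample)
print(result)
```

The input text can be built the same way as in the question, for example with " ".join(div.stripped_strings). When no label is present, the function returns an empty dict instead of the whole text.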