kotlin, recursion

How to Avoid Duplicates and Non-Existent Paths When Using Recursion?


I need to extract data from an HTML string, and I'm using Jsoup to do it. I'm also using recursion to walk the nested elements, but it produces duplicates and paths that don't exist in the source.

HTML I'm scraping the data from:

 <body>
  <h1>Bookmarks Menu</h1>
  <dl>
   <p></p>
   <dt>
    <a href="link 1 In Bookmarks Menu" add_date="1726495666" last_modified="1726495904">In Bookmarks Menu</a>
   </dt>
   <dt>
    <a href="link 2 In Bookmarks Menu" add_date="1726495683" last_modified="1726495906">In Bookmarks Menu</a>
   </dt>
   <dt>
    <a href="link 3 In Bookmarks Menu" add_date="1726495691" last_modified="1726495908">In Bookmarks Menu</a>
   </dt>
   <dt>
    <h3 add_date="1724566401" last_modified="1726495673" personal_toolbar_folder="true">Bookmarks Toolbar</h3>
    <dl>
     <p></p>
     <dt>
      <a href="link 1 In Bookmarks Toolbar" add_date="1726495193" last_modified="1726495964" icon_uri="https://gitlab.com/favicon.ico">In Bookmarks Toolbar</a>
     </dt>
     <dt>
      <a href="link 2 In Bookmarks Toolbar" add_date="1726495297" last_modified="1726495966">In Bookmarks Toolbar</a>
     </dt>
     <dt>
      <a href="link 3 In Bookmarks Toolbar" add_date="1726495329" last_modified="1726495969">In Bookmarks Toolbar</a>
     </dt>
     <dt>
      <a href="link 4 In Bookmarks Toolbar" add_date="1726495346" last_modified="1726495971">In Bookmarks Toolbar</a>
     </dt>
     <dt>
      <h3 add_date="1726495366" last_modified="1726495656">Folder 1</h3>
      <dl>
       <p></p>
       <dt>
        <a href="link 1 In Folder 1" add_date="1726495484" last_modified="1726495941">In Folder 1</a>
       </dt>
       <dt>
        <a href="link 2 In Folder 1" add_date="1726495496" last_modified="1726495943">In Folder 1</a>
       </dt>
       <dt>
        <a href="link 3 In Folder 1" add_date="1726495506" last_modified="1726495945">In Folder 1</a>
       </dt>
       <dt>
        <a href="link 4 In Folder 1" add_date="1726495519" last_modified="1726495948">In Folder 1</a>
       </dt>
       <dt>
        <a href="link 5 In Folder 1" add_date="1726495528" last_modified="1726495951">In Folder 1</a>
       </dt>
       <dt>
        <h3 add_date="1726495578" last_modified="1726495656">Folder 2</h3>
        <dl>
         <p></p>
         <dt>
          <a href="link 1 In Folder 2" add_date="1726495615" last_modified="1726495919">In Folder 2</a>
         </dt>
         <dt>
          <a href="link 2 In Folder 2" add_date="1726495624" last_modified="1726495922">In Folder 2</a>
         </dt>
         <dt>
          <a href="link 3 In Folder 2" add_date="1726495633" last_modified="1726495923">In Folder 2</a>
         </dt>
         <dt>
          <a href="link 4 In Folder 2" add_date="1726495641" last_modified="1726495926">In Folder 2</a>
         </dt>
         <dt>
          <a href="link 5 In Folder 2" add_date="1726495649" last_modified="1726495927">In Folder 2</a>
         </dt>
         <dt>
          <a href="link 6 In Folder 2" add_date="1726495656" last_modified="1726495931">In Folder 2</a>
         </dt>
        </dl>
        <p></p>
       </dt>
      </dl>
      <p></p>
     </dt>
    </dl>
    <p></p>
   </dt>
   <dt>
    <h3 add_date="1724566401" last_modified="1726495825" unfiled_bookmarks_folder="true">Other Bookmarks</h3>
    <dl>
     <p></p>
     <dt>
      <a href="link 1 In Other Bookmarks" add_date="1726495812" last_modified="1726495874">In Other Bookmarks</a>
     </dt>
     <dt>
      <a href="link 2 In Other Bookmarks" add_date="1726495817" last_modified="1726495877">In Other Bookmarks</a>
     </dt>
     <dt>
      <a href="link 3 In Other Bookmarks" add_date="1726495821" last_modified="1726495880">In Other Bookmarks</a>
     </dt>
     <dt>
      <a href="link 4 In Other Bookmarks" add_date="1726495825" last_modified="1726495887">In Other Bookmarks</a>
     </dt>
    </dl>
    <p></p>
   </dt>
  </dl>
 </body>

Logs:

Link: In Bookmarks Menu, URL: link 1 In Bookmarks Menu
Link: In Bookmarks Menu, URL: link 2 In Bookmarks Menu
Link: In Bookmarks Menu, URL: link 3 In Bookmarks Menu
Folder: Bookmarks Toolbar
  Link: In Bookmarks Toolbar, URL: link 1 In Bookmarks Toolbar
  Link: In Bookmarks Toolbar, URL: link 2 In Bookmarks Toolbar
  Link: In Bookmarks Toolbar, URL: link 3 In Bookmarks Toolbar
  Link: In Bookmarks Toolbar, URL: link 4 In Bookmarks Toolbar
  Folder: Folder 1
    Link: In Folder 1, URL: link 1 In Folder 1
    Link: In Folder 1, URL: link 2 In Folder 1
    Link: In Folder 1, URL: link 3 In Folder 1
    Link: In Folder 1, URL: link 4 In Folder 1
    Link: In Folder 1, URL: link 5 In Folder 1
    Folder: Folder 2 // nested in folder 1
      Link: In Folder 2, URL: link 1 In Folder 2
      Link: In Folder 2, URL: link 2 In Folder 2
      Link: In Folder 2, URL: link 3 In Folder 2
      Link: In Folder 2, URL: link 4 In Folder 2
      Link: In Folder 2, URL: link 5 In Folder 2
      Link: In Folder 2, URL: link 6 In Folder 2
    Link: In Folder 2, URL: link 1 In Folder 2 // This path doesn't exist in the original script
    Link: In Folder 2, URL: link 1 In Folder 2 // This path doesn't exist in the original script
    Link: In Folder 2, URL: link 2 In Folder 2 // This path doesn't exist in the original script
    Link: In Folder 2, URL: link 3 In Folder 2 // This path doesn't exist in the original script
    Link: In Folder 2, URL: link 4 In Folder 2 // This path doesn't exist in the original script
    Link: In Folder 2, URL: link 5 In Folder 2 // This path doesn't exist in the original script
    Link: In Folder 2, URL: link 6 In Folder 2 // This path doesn't exist in the original script
  Link: In Folder 1, URL: link 1 In Folder 1 // This path doesn't exist in the original script
  Link: In Folder 1, URL: link 1 In Folder 1 // This path doesn't exist in the original script
  Link: In Folder 1, URL: link 2 In Folder 1 // This path doesn't exist in the original script
  Link: In Folder 1, URL: link 3 In Folder 1 // This path doesn't exist in the original script
  Link: In Folder 1, URL: link 4 In Folder 1 // This path doesn't exist in the original script
  Link: In Folder 1, URL: link 5 In Folder 1 // This path doesn't exist in the original script
  Folder: Folder 2 // This path doesn't exist in the original script
    Link: In Folder 2, URL: link 1 In Folder 2 // This path doesn't exist in the original script
    Link: In Folder 2, URL: link 2 In Folder 2 // This path doesn't exist in the original script
    Link: In Folder 2, URL: link 3 In Folder 2 // This path doesn't exist in the original script
    Link: In Folder 2, URL: link 4 In Folder 2 // This path doesn't exist in the original script
    Link: In Folder 2, URL: link 5 In Folder 2 // This path doesn't exist in the original script
    Link: In Folder 2, URL: link 6 In Folder 2 // This path doesn't exist in the original script
  Link: In Folder 2, URL: link 1 In Folder 2 // This path doesn't exist in the original script
  Link: In Folder 2, URL: link 1 In Folder 2 // This path doesn't exist in the original script
  Link: In Folder 2, URL: link 2 In Folder 2 // This path doesn't exist in the original script
  Link: In Folder 2, URL: link 3 In Folder 2 // This path doesn't exist in the original script
  Link: In Folder 2, URL: link 4 In Folder 2 // This path doesn't exist in the original script
  Link: In Folder 2, URL: link 5 In Folder 2 // This path doesn't exist in the original script
  Link: In Folder 2, URL: link 6 In Folder 2 // This path doesn't exist in the original script
Link: In Bookmarks Toolbar, URL: link 1 In Bookmarks Toolbar // This path doesn't exist in the original script
Folder: Other Bookmarks
  Link: In Other Bookmarks, URL: link 1 In Other Bookmarks
  Link: In Other Bookmarks, URL: link 2 In Other Bookmarks
  Link: In Other Bookmarks, URL: link 3 In Other Bookmarks
  Link: In Other Bookmarks, URL: link 4 In Other Bookmarks
Link: In Other Bookmarks, URL: link 1 In Other Bookmarks // This path doesn't exist in the original script

The function looks like this (updated):

fun getFromHtml(folder: Elements, indentLevel: Int = 0) {
    folder.forEach { dtElement ->
        val h3Elements = dtElement.select("h3")
        h3Elements.forEach { h3Element ->
            if (h3Element != null) { // this doesn't make sense here as h3Element is not at all null when it reaches here
                logTheString("${" ".repeat(indentLevel * 2)}Folder: ${h3Element.text()}")
                val subFolder = dtElement.select("dl > dt")
                getFromHtml(subFolder, indentLevel + 1)
                return
            }
        }

        val aElement = dtElement.selectFirst("a")
        if (aElement != null) {
            logTheString(
                "${" ".repeat(indentLevel * 2)}Link: ${aElement.text()}, URL: ${
                    aElement.attr("href")
                }"
            )
        }
    }
}

fun main(){
    val document = Jsoup.parse(rawHtml)
    val bookmarkMenu = document.select("body > dl > dt")
    getFromHtml(bookmarkMenu)
}

I only want the paths that actually exist and their respective links. How am I supposed to solve this? I can't wrap my head around these topics, so any help is really appreciated. Thank you.


Solution

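  • The problem comes from the fact that the nested <dl> is not a child of the <h3>, but rather its sibling inside the same <dt>. Two things go wrong. First, after the recursive call for a folder returns, the code falls through to selectFirst("a"), which finds the first link inside that same nested <dl> and logs it again at the parent's level. You need to do one or the other: if there is an <h3>, make the recursive call and move on to the next <dt>; if not, log the <a>. Inside a forEach lambda that means return@forEach, because a bare return exits getFromHtml entirely and skips the remaining siblings. Second, select("dl > dt") matches <dt> elements at every depth below dtElement, so the contents of deeper folders are also listed at the current level. Starting the selector with > restricts the match to the <dl> that is a direct child of dtElement:

            val h3Element = dtElement.selectFirst("h3")
    
            if (h3Element != null) {
                logTheString("${" ".repeat(indentLevel * 2)}Folder: ${h3Element.text()}")
                // Leading ">" is relative to dtElement: its direct child <dl>, then that <dl>'s direct <dt> children.
                val subFolder = dtElement.select("> dl > dt")
                getFromHtml(subFolder, indentLevel + 1)
                // Skip to the next <dt>; a bare return would leave getFromHtml altogether.
                return@forEach
            }
    
            val aElement = dtElement.selectFirst("a")
    

    Note that you don't need to look for the <a> element unless you know the <dt> is not a folder (has no <h3>).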
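
    The same rule (treat each <dt> as either a folder or a link, and descend only through direct children) can be sketched in Python with nothing but the standard library. This is a minimal illustration, not the Jsoup code: SAMPLE is a hypothetical, trimmed-down copy of the markup above, walk is an illustrative name, and the sketch assumes the markup is well-formed XML, which a raw bookmarks export usually is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample in the same shape as the question's markup.
SAMPLE = """
<dl>
 <dt><a href="link 1 In Bookmarks Menu">In Bookmarks Menu</a></dt>
 <dt>
  <h3>Bookmarks Toolbar</h3>
  <dl>
   <dt><a href="link 1 In Bookmarks Toolbar">In Bookmarks Toolbar</a></dt>
   <dt>
    <h3>Folder 1</h3>
    <dl>
     <dt><a href="link 1 In Folder 1">In Folder 1</a></dt>
    </dl>
   </dt>
  </dl>
 </dt>
 <dt>
  <h3>Other Bookmarks</h3>
  <dl>
   <dt><a href="link 1 In Other Bookmarks">In Other Bookmarks</a></dt>
  </dl>
 </dt>
</dl>
"""


def walk(dl, depth=0, out=None):
    """Collect one line per folder or link, visiting each <dt> exactly once."""
    if out is None:
        out = []
    indent = "  " * depth
    # findall("dt") matches direct children only, never deeper descendants.
    for dt in dl.findall("dt"):
        h3 = dt.find("h3")
        if h3 is not None:
            out.append(f"{indent}Folder: {h3.text}")
            sub = dt.find("dl")  # the <dl> sits next to the <h3>, inside the same <dt>
            if sub is not None:
                walk(sub, depth + 1, out)
            continue  # folder handled; go on to the next sibling <dt>
        a = dt.find("a")
        if a is not None:
            out.append(f"{indent}Link: {a.text}, URL: {a.get('href')}")
    return out


lines = walk(ET.fromstring(SAMPLE.strip()))
print("\n".join(lines))
```

    The continue plays the role of return@forEach, and find/findall with a bare tag name only look at direct children, so no folder's contents leak into its parent's level.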

    Followup.

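    Since you said you were willing to look at other languages, here's a solution in Python using the BeautifulSoup HTML parser. Note that BeautifulSoup's parsing (with the lxml parser used here) enforces the requirement that <dt> not be nested, so the <dt> that contains the <h3> contains only the <h3>, and the folder's <dl> appears as a following sibling rather than as a child. Also note that each link is printed with the name of its enclosing folder rather than its own text.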

    from bs4 import BeautifulSoup
    
    # Parse the saved bookmarks file; the context manager closes the file afterwards.
    with open('x.html') as f:
        soup = BeautifulSoup(f.read(), 'lxml')
    
    def processDL(element, title, depth=''):
        # Look at all of the children of this <dl>.
        for d in element.children:
    
            # If it is a <dt>, then check to see whether it contains
            # a <a> or an <h3>.
            if d.name == 'dt':
                kids = [e for e in d.contents if e != '\n']
                first = kids[0]
                if first.name == 'h3':
                    title = next(first.children)
                    print(depth, "Folder: ",title)
                elif first.name == 'a':
                    print(depth, "Link:", title, "URL:", first['href'] )
    
            # If it is a <dl>, then process it recursively.
            elif d.name == 'dl':
                processDL(d, title, depth+'  ')
    
    processDL( soup.find('dl'), 'Bookmarks Menu' )
    

    Output:

     Link: Bookmarks Menu URL: link 1 In Bookmarks Menu
     Link: Bookmarks Menu URL: link 2 In Bookmarks Menu
     Link: Bookmarks Menu URL: link 3 In Bookmarks Menu
     Folder:  Bookmarks Toolbar
       Link: Bookmarks Toolbar URL: link 1 In Bookmarks Toolbar
       Link: Bookmarks Toolbar URL: link 2 In Bookmarks Toolbar
       Link: Bookmarks Toolbar URL: link 3 In Bookmarks Toolbar
       Link: Bookmarks Toolbar URL: link 4 In Bookmarks Toolbar
       Folder:  Folder 1
         Link: Folder 1 URL: link 1 In Folder 1
         Link: Folder 1 URL: link 2 In Folder 1
         Link: Folder 1 URL: link 3 In Folder 1
         Link: Folder 1 URL: link 4 In Folder 1
         Link: Folder 1 URL: link 5 In Folder 1
         Folder:  Folder 2
           Link: Folder 2 URL: link 1 In Folder 2
           Link: Folder 2 URL: link 2 In Folder 2
           Link: Folder 2 URL: link 3 In Folder 2
           Link: Folder 2 URL: link 4 In Folder 2
           Link: Folder 2 URL: link 5 In Folder 2
           Link: Folder 2 URL: link 6 In Folder 2
     Folder:  Other Bookmarks
       Link: Other Bookmarks URL: link 1 In Other Bookmarks
       Link: Other Bookmarks URL: link 2 In Other Bookmarks
       Link: Other Bookmarks URL: link 3 In Other Bookmarks
       Link: Other Bookmarks URL: link 4 In Other Bookmarks