Tags: python, web-scraping, beautifulsoup, screen-scraping

How can I scrape data when the server returns a response like this?


I am trying to scrape a site using BeautifulSoup and Python requests. The server returns a response with content type text/javascript, and the response body contains this data:

Element.update("students", "<link href=\"https://somelinks.links\" media=\"screen\" rel=\"stylesheet\" type=\"text/css\" />\n\n<table class=\"gray_table_list\" align=\"center\" width=\"100%\" cellpadding=\"0\" cellspacing=\"0\">\n    \n    <tr class=\"main_head back_ground_color\">\n    <td class=\"sl-col\">Sl No.</td>\n    <td class=\"set_border_right\"> Name</td>\n    <td class=\"set_border_right\">IDNo.</td>\n        \n        <td class=\"set_border_right\"></td>\n      </tr>\n      <tr class=\"tr-blank\">\n\n      \n      </tr>\n      \n        <tr class=\"row-bodd\">\n        <td class=\"set_border_right col-1\">\n            1\n          </td>\n\n          <td class=\"set_border_right col-1\">\n            <a href=\"/link/linkprofile/linkID1458556\">JOHN DOE  </a>\n          </td>\n\n          <td class=\"set_border_right col-1\">\n            ID12345\n          </td>\n\n          \n\n          <td class=\"set_border_right col-1\">\n            <a href=\"/link/linkprofile/linkID1458556\\">View profile</a>\n          </td>\n        </tr>\n      \n        <tr class=\"row-beven\">\n        <td class=\"set_border_right col-1\">\n            2\n          </td>\n\n          <td class=\"set_border_right col-1\">\n            <a href=\"/link/linkprofile/linkID1458556\\">Somename here  </a>\n          </td>\n\n          <td class=\"set_border_right col-1\">\n            ID45555\n          </td>\n\n          \n\n          <td class=\"set_border_right col-1\">\n            <a href=\"/link/linkprofile/linkID1458556\">View profile</a>\n          </td>\n        </tr>\n      \n        <tr class=\"row-bodd\">\n        <td class=\"set_border_right col-1\">\n            3\n          </td>\n\n          <td class=\"set_border_right col-1\">\n            <a href=\"/link/linkprofile/linkID1458556\\">name here  </a>\n          </td>\n\n          <td class=\"set_border_right col-1\">\n            ID7878\n          </td>\n\n          \n\n          <td class=\"set_border_right 
col-1\">\n            <a href=\"/link/linkprofile/linkID1458556\\">View profile</a>\n          </td>\n        </tr>\n\n </tr>\n      \n    \n    </table>\n  \n");

I have redacted the server response to keep the question self-contained. How can I scrape the table inside Element.update? I also want to extract the link from each <a> tag in the table, for example link/linkprofile/linkID1458556 from <a href=\"/link/linkprofile/linkID1458556\">.

Thanks


Solution

  • You can use the re module to extract the HTML portion from this JavaScript call and then parse it with BeautifulSoup as usual. For example:

    import re
    from bs4 import BeautifulSoup
    
    s = """
    Element.update("students", "<link href=\"https://somelinks.links\" media=\"screen\" rel=\"stylesheet\" type=\"text/css\" />\n\n<table class=\"gray_table_list\" align=\"center\" width=\"100%\" cellpadding=\"0\" cellspacing=\"0\">\n    \n    <tr class=\"main_head back_ground_color\">\n    <td class=\"sl-col\">Sl No.</td>\n    <td class=\"set_border_right\"> Name</td>\n    <td class=\"set_border_right\">IDNo.</td>\n        \n        <td class=\"set_border_right\"></td>\n      </tr>\n      <tr class=\"tr-blank\">\n\n      \n      </tr>\n      \n        <tr class=\"row-bodd\">\n        <td class=\"set_border_right col-1\">\n            1\n          </td>\n\n          <td class=\"set_border_right col-1\">\n            <a href=\"/link/linkprofile/linkID1458556\">JOHN DOE  </a>\n          </td>\n\n          <td class=\"set_border_right col-1\">\n            ID12345\n          </td>\n\n          \n\n          <td class=\"set_border_right col-1\">\n            <a href=\"/link/linkprofile/linkID1458556\\">View profile</a>\n          </td>\n        </tr>\n      \n        <tr class=\"row-beven\">\n        <td class=\"set_border_right col-1\">\n            2\n          </td>\n\n          <td class=\"set_border_right col-1\">\n            <a href=\"/link/linkprofile/linkID1458556\\">Somename here  </a>\n          </td>\n\n          <td class=\"set_border_right col-1\">\n            ID45555\n          </td>\n\n          \n\n          <td class=\"set_border_right col-1\">\n            <a href=\"/link/linkprofile/linkID1458556\">View profile</a>\n          </td>\n        </tr>\n      \n        <tr class=\"row-bodd\">\n        <td class=\"set_border_right col-1\">\n            3\n          </td>\n\n          <td class=\"set_border_right col-1\">\n            <a href=\"/link/linkprofile/linkID1458556\\">name here  </a>\n          </td>\n\n          <td class=\"set_border_right col-1\">\n            ID7878\n          </td>\n\n          \n\n          <td class=\"set_border_right 
col-1\">\n            <a href=\"/link/linkprofile/linkID1458556\\">View profile</a>\n          </td>\n        </tr>\n\n </tr>\n      \n    \n    </table>\n  \n");
    """
    
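    # Capture everything between the opening quote of the second argument
    # and the closing '");'. re.S lets "." match the newlines in the markup.
    html_doc = re.search(r'"students", "(.*?)"\);', s, flags=re.S).group(1)
    soup = BeautifulSoup(html_doc, "html.parser")
    
    # Print the cell text of every row, tab-separated.
    for tr in soup.select("tr"):
        tds = [td.get_text(strip=True) for td in tr.select("td")]
        print(*tds, sep="\t")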
    

    Prints:

    Sl No.  Name    IDNo.
    
    1       JOHN DOE        ID12345 View profile
    2       Somename here   ID45555 View profile
    3       name here       ID7878  View profile
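
When the text comes from a live response body rather than a pasted Python literal, the escapes (\" and \n) are still literal characters and need decoding before parsing. Here is a minimal sketch, assuming the second argument of Element.update is a double-quoted JavaScript string literal that also happens to be valid JSON, with every inner quote escaped as \" (a doubled \\" like the ones in parts of the pasted sample would not match). The function name extract_update_html and the shortened sample body are made up for illustration.

```python
import json
import re


def extract_update_html(js_text):
    """Return the decoded HTML passed as the second argument of Element.update.

    Assumes the argument is a double-quoted string literal that is valid JSON.
    """
    # "(?:[^"\\]|\\.)*" matches one quoted literal, treating a backslash
    # plus the following character as a single escaped unit.
    match = re.search(
        r'Element\.update\(\s*"[^"]*"\s*,\s*("(?:[^"\\]|\\.)*")\s*\)',
        js_text,
        flags=re.S,
    )
    if match is None:
        raise ValueError("no Element.update(...) call found")
    # json.loads turns \" into " and \n into a real newline.
    return json.loads(match.group(1))


# Hypothetical raw response body; the raw string keeps the backslashes literal.
sample = r'Element.update("students", "<table>\n<tr><td><a href=\"/link/linkprofile/linkID1458556\">JOHN DOE  </a></td></tr>\n</table>\n");'

html_doc = extract_update_html(sample)
print(html_doc)
```

The returned string can then be passed to BeautifulSoup(html_doc, "html.parser") in the same way as html_doc above.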
    

    EDIT: To get the href of the first <a> tag in each row:

    for tr in soup.select("tr:has(a)"):
        print(tr.a["href"])
    

    Prints:

    /link/linkprofile/linkID1458556
    /link/linkprofile/linkID1458556\
    /link/linkprofile/linkID1458556\
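
The trailing backslashes in the last two results appear to come from the doubled escapes (\\") in the pasted sample rather than from BeautifulSoup. As a rough sketch of how the extracted values could be tidied up: strip a stray trailing backslash, drop duplicates, and resolve each path against the site root. BASE_URL and clean_hrefs are placeholders for illustration, since the real host is not given in the question.

```python
from urllib.parse import urljoin

# Placeholder host; the real site is not named in the question.
BASE_URL = "https://example.com"


def clean_hrefs(hrefs, base_url=BASE_URL):
    """Strip stray trailing backslashes, de-duplicate, and make URLs absolute."""
    seen = set()
    result = []
    for href in hrefs:
        href = href.rstrip("\\")
        if href not in seen:
            seen.add(href)
            result.append(urljoin(base_url, href))
    return result


# The three values printed above, written as Python literals.
raw = [
    "/link/linkprofile/linkID1458556",
    "/link/linkprofile/linkID1458556\\",
    "/link/linkprofile/linkID1458556\\",
]
print(clean_hrefs(raw))
```

To collect every link rather than only the first one per row, the input list could be built with [a["href"] for a in soup.select("a[href]")].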