python-3.x, web-scraping, string-parsing

How to remove JavaScript from a string using Python and then parse the remaining string into a table?


I have this string that I scraped from a university website. I want to parse it into a table where each row consists of the text before and after a colon (":").

This is the string.

'課程中文名稱 Title of Course in Chinese:論文 課程英文名稱 Title of Course in English:Thesis (Projects) 應修系級 Major:法律學系博士班2 , 授課教師 Instructor:****** 選修類別 Required/Elective:必 全半學年 Whole or Half of the Academic Year:半學年 學 分 Credit(s):0 學分 時 數 Hour(s):0 小時 (function(window, $) { var sheetID = "1qkUIt6x8ry7F-etZJLMNKmEtDr0mwYdV3RNWw8fmOko", // 試算表代號 gid = "0", // 工作表代號 sql = "select%20B,%20C,%20D,%20E,%20F%20where%20G%20=%20'M6106'", // SQL 語法 callback = "callback"; // 回呼函數名稱 $.getScript("https://spreadsheets.google.com/tq?tqx=responseHandler:" + callback + "&tq=" + sql + "&key=" + sheetID + "&gid=" + gid); window[callback] = function(json) { var rowArray = json.table.rows, colArray = json.table.cols, rowLength = rowArray.length, colLength = colArray.length, html = "", i, j, dataGroup, dataLength, colName = new Array(); for (i = 0; i < colLength; i++) { colName[i] = colArray[i].label.replace(/彈性授課方式\W/g,''); } for (i = 0; i < rowLength; i++) { dataGroup = rowArray[i].c; dataLength = dataGroup.length; for (j = 0; j < dataLength; j++) { if (!dataGroup[j]) { continue; } if(dataGroup[j].v == "Y") html += colName[j] + ","; else if(j == (dataLength - 2) && dataGroup[j].v !== null) html += colName[j] + "-" + dataGroup[j].v + ","; } //if (dataGroup[dataLength - 2].v !== null) { //html += colName[dataLength - 2] + "-" + dataGroup[dataLength - 2].v + ","; //} html = html.substring(0,html.length - 1); html += "
"; } $("#test").html(html); if(html != "") $("#highlight").show(); }; })(window, jQuery); 「請遵守智慧財產權」及「不得非法複製及影印」。授課老師尚未建置課程大綱,若有需要請直接洽該任課教師!'

I tried to remove the JavaScript using the approach from this Stack Overflow page.

An ad hoc algorithm I tried was to split the string and then pair up the resulting elements two at a time. This is the code:

spl = "the string"
spl = [spl[i:i + 2] for i in range(0, len(spl), 2)]

I know that I can access a lot more data if I execute the JavaScript in the browser DOM. My question is: how can I first strip out the JavaScript and then parse the remaining string into a table?


Solution

  • Try:

    import requests
    from bs4 import BeautifulSoup
    
    # Course page to scrape
    url = "https://sea.cc.ntpu.edu.tw/pls/dev_stud/course_query.queryGuide?g_serial=U1382&g_year=109&g_term=2&show_info=part"
    # Download the page and parse it with Python's built-in HTML parser
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    
    # Print the text of every row in the first table inside <body>
    for tr in soup.body.table.select("tr"):
        print(tr.get_text(strip=True))
        print("-" * 80)
    

    Prints:

    ...
    --------------------------------------------------------------------------------
    課程中文名稱 Title of Course in Chinese:大學英文1B課程英文名稱 Title of Course in English:College English應修系級 Major:語文通識1  ,中國文學系1  ,歷史學系1  ,休閒運動管理學系1  ,法律學系財經法組1  ,法律學系法學組1  ,法律學系司法組1  ,授課教師 Instructor:殷雅玲選修類別 Required/Elective:必向度類別 Classification:全半學年 Whole or Half of the Academic Year:全學年學  分 Credit(s):2學分時  數 Hour(s):2小時
    --------------------------------------------------------------------------------
    彈性授課方式:
    --------------------------------------------------------------------------------
    教師網址 Instructor's Website :
    --------------------------------------------------------------------------------
    教師專長 Instructor's Specialty :英語教學
    --------------------------------------------------------------------------------
    課綱附檔 Attachments :
    --------------------------------------------------------------------------------
    先修科目 Prerequisites:High school English
    --------------------------------------------------------------------------------
    
    ...and so on.
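
To go from those rows to an actual table with the JavaScript gone, one possible approach is to delete the script elements from the parsed tree before extracting any text, then split each cell on its first colon. The sketch below is only an illustration: `SAMPLE_HTML` is a made-up fragment that imitates the shape of the page, `cells_to_rows` is a hypothetical helper name, and the assumption that each label/value pair sits in its own `<td>` may not hold for the real markup. Note that the page uses a full-width colon (":", U+FF1A), so the split accepts both colon forms.

```python
import re

from bs4 import BeautifulSoup

# Made-up sample that only imitates the shape of the course page
# (assumption: one <td> per "label:value" pair, plus an inline <script>).
SAMPLE_HTML = """
<html><body>
<table>
  <tr>
    <td>課程中文名稱 Title of Course in Chinese:論文</td>
    <td>課程英文名稱 Title of Course in English:Thesis (Projects)</td>
  </tr>
  <tr>
    <td>學 分 Credit(s):0 學分
      <script>(function (window, $) { var gid = "0"; })(window, jQuery);</script>
    </td>
  </tr>
</table>
</body></html>
"""


def cells_to_rows(html):
    """Return (label, value) pairs from every table cell, ignoring scripts."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop <script> and <style> elements explicitly, so their source can
    # never end up in the extracted text.
    for tag in soup(["script", "style"]):
        tag.decompose()

    rows = []
    for cell in soup.select("td"):
        text = cell.get_text(" ", strip=True)
        # Split on the first colon only, full-width (U+FF1A) or ASCII,
        # so colons inside the value are left alone.
        parts = re.split(r"[::]", text, maxsplit=1)
        if len(parts) == 2:  # skip cells that hold no "label:value" pair
            rows.append((parts[0].strip(), parts[1].strip()))
    return rows


rows = cells_to_rows(SAMPLE_HTML)
for label, value in rows:
    print(f"{label} | {value}")
```

Working from the HTML rather than from the already flattened string avoids having to guess where one value ends and the next label begins, since the cell boundaries carry that information. The resulting list of tuples can then be written out with `csv.writer` or loaded into a DataFrame if a tabular object is needed.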