SciTE AutoHotkey: get the active browser page's innerText


I have recently moved from Excel VBA automation to AutoHotkey, following the tutorial at http://the-automator.com/web-scraping-intro-with-autohotkey/, but I don't fully understand the code. Could someone please point me in the right direction?

I am trying to make my F1 key scrape some data from the currently active page.

F1::

pwb := ComObjCreate("InternetExplorer.Application") ;create IE Object
pwb.visible:=true  ; Set the IE object to visible

pwb := WBGet()

;************Pointer to Open IE Window******************
WBGet(WinTitle="ahk_class IEFrame", Svr#=1) {               ;// based on ComObjQuery docs
   static msg := DllCall("RegisterWindowMessage", "str", "WM_HTML_GETOBJECT")
        , IID := "{0002DF05-0000-0000-C000-000000000046}"   ;// IID_IWebBrowserApp
;//     , IID := "{332C4427-26CB-11D0-B483-00C04FD90119}"   ;// IID_IHTMLWindow2
   SendMessage msg, 0, 0, Internet Explorer_Server%Svr#%, %WinTitle%

   if (ErrorLevel != "FAIL") {
      lResult:=ErrorLevel, VarSetCapacity(GUID,16,0)
      if DllCall("ole32\CLSIDFromString", "wstr","{332C4425-26CB-11D0-B483-00C04FD90119}", "ptr",&GUID) >= 0 {
         DllCall("oleacc\ObjectFromLresult", "ptr",lResult, "ptr",&GUID, "ptr",0, "ptr*",pdoc)
         return ComObj(9,ComObjQuery(pdoc,IID,IID),1), ObjRelease(pdoc)
      }
   }
}

I understand that this code creates a new IE application, but what if I don't want to create one and only want to work with the currently active window? I have seen a few scripts that get the active browser's URL, but I can't work out how to get the active browser's elements.

So far I have tried this. Can someone tell me how to get it to point to the active page and read some of its data?

F1::

wb := WBGet()
if !instr(wb.LocationURL, "https://www.google.com/")
{
   wb := ""
   return
}
doc := wb.document
h2name    := rows[0].getElementsByTagName("h2")


FileAppend, %h2name%, Somefile.txt
Run Somefile.txt
return




WBGet(WinTitle="ahk_class IEFrame", Svr#=1) {               ;// based on ComObjQuery docs
   static msg := DllCall("RegisterWindowMessage", "str", "WM_HTML_GETOBJECT")
        , IID := "{0002DF05-0000-0000-C000-000000000046}"   ;// IID_IWebBrowserApp
;//     , IID := "{332C4427-26CB-11D0-B483-00C04FD90119}"   ;// IID_IHTMLWindow2
   SendMessage msg, 0, 0, Internet Explorer_Server%Svr#%, %WinTitle%
   if (ErrorLevel != "FAIL") {
      lResult:=ErrorLevel, VarSetCapacity(GUID,16,0)
      if DllCall("ole32\CLSIDFromString", "wstr","{332C4425-26CB-11D0-B483-00C04FD90119}", "ptr",&GUID) >= 0 {
         DllCall("oleacc\ObjectFromLresult", "ptr",lResult, "ptr",&GUID, "ptr",0, "ptr*",pdoc)
         return ComObj(9,ComObjQuery(pdoc,IID,IID),1), ObjRelease(pdoc)
      }
   }
}

I tried to test whether the variable would be written to Somefile.txt, because I am not sure how to test it with MsgBox. It kept writing the whole script instead of showing the result.


Solution

  • To work on the active window's active tab (if it's an Internet Explorer window):

    q::
    WinGet, hWnd, ID, A
    WinGetClass, vWinClass, ahk_id %hWnd%
    if !(vWinClass = "IEFrame")
        Return
    wb := WBGet("ahk_id " hWnd)
    MsgBox % wb.document.activeElement.tagName "`r`n" wb.document.activeElement.innerText
    wb := ""
    Return
    

    To work on the first found Internet Explorer window's active tab:

    w::
    WinGet, hWnd, ID, ahk_class IEFrame
    wb := WBGet()
    ;wb := WBGet("ahk_class IEFrame") ;this line is equivalent to the one above
    MsgBox % wb.document.activeElement.tagName "`r`n" wb.document.activeElement.innerText
    wb := ""
    Return
    

    Regarding h2name, I don't believe that this will do anything, because 'rows' is not defined anywhere in the script.

    h2name    := rows[0].getElementsByTagName("h2")
    

    The following might work:

    h2name := ""
    try h2name := wb.document.getElementsByTagName("h2").item[0].name
    MsgBox % h2name
    
    MsgBox % wb.document.getElementsByTagName("h2").item[0].tagName
    MsgBox % wb.document.getElementsByTagName("h2").item[0].innerText
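
    To collect every h2 on the page rather than only the first, a loop over the collection is one option. This is only a sketch, assuming wb was obtained from WBGet() as in the hotkeys above. The variable names vHeadings and vText are made up for illustration, and Somefile.txt is the file name from the question. FileAppend expects a string, so the text is built from each element's innerText first:

    ```autohotkey
    vText := ""
    try
    {
        vHeadings := wb.document.getElementsByTagName("h2")
        Loop % vHeadings.length
            vText .= vHeadings.item[A_Index - 1].innerText "`r`n" ;item index is zero-based
    }
    FileDelete, Somefile.txt ;FileAppend adds to an existing file, so start fresh
    FileAppend, %vText%, Somefile.txt
    Run, Somefile.txt
    ```

    If the page has no h2 elements, or wb is not a valid object, vText stays empty and the file is written empty.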
    

    In your link I think by 'name' they are referring to LocationName (the tab's title):

    MsgBox % wb.LocationName
    MsgBox % wb.document.title ;more reliable
    

    For the entire page's innerText:

    MsgBox % wb.document.documentElement.innerText
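
    Putting the pieces together for the F1 hotkey in the question, something like the following might serve as a starting point. It is a sketch, not a tested solution. It assumes the WBGet() function from the question is included once in the script and that the active window is Internet Explorer. The output file name is just the one used in the question:

    ```autohotkey
    F1::
    WinGet, hWnd, ID, A
    WinGetClass, vWinClass, ahk_id %hWnd%
    if !(vWinClass = "IEFrame")
        Return
    wb := WBGet("ahk_id " hWnd)
    if !IsObject(wb) ;WBGet returns nothing if no IE control was found
        Return
    while wb.busy || (wb.readyState != 4) ;4 = READYSTATE_COMPLETE
        Sleep, 100
    vText := wb.document.documentElement.innerText
    FileDelete, Somefile.txt ;start from an empty file
    FileAppend, %vText%, Somefile.txt, UTF-8
    Run, Somefile.txt
    wb := ""
    Return
    ```

    The wait loop has no timeout, so a page that never finishes loading would keep the hotkey waiting. A counter or A_TickCount check could be added if that is a concern.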
    

    HTH