Search code examples
excelredditvba

Excel Macro to draw thread comments from website into cells


I am trying to store Reddit thread comments in an excel spreadsheet, however I have had trouble trying to figure out how to do this. I do not have much experience with using macros to get data from webpages, so I have been finding it hard to figure out how exactly to draw out each comment from a specified Reddit thread and place it in a cell, and whether or not it is possible to do.

This is what I have so far:

Sub getRedditData()

Dim x As Long, y As Long
Dim htm As Object

Set htm = CreateObject("htmlFile")

With CreateObject("msxml2.xmlhttp")
    .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
    .send
    htm.body.innerhtml = .responsetext
End With

With htm.getelementbyid("comments")
    Set cellrangex = .Rows(x).Cells.Length - 1
    Set cellrangey = .Rows(x).Cells.Length - 1
    Set cellrange1 = Sheets(1).Cells(x + 1, y + 1).Value
    Set cellrange2 = .Rows(x).Cells(y).innertext

    For x = 0 To cellrangex
        For y = 0 To cellrangey
            cellrange = cellrange2
        Next y
    Next x
End With


End Sub

Solution

  • You'll really need to analyze the contents of the web page you are scraping with a decent HTML editor. I would suggest navigating to the page in question in chrome and using F12 to open it's developer tool. In the "Elements" tab you can quickly see which HTML is producing which part of the page (open both the page and the developer tools next to each other).

    You'll notice as you head into the comments that the text of each comment is inside a <p> tag and each <p> tag is inside a <div>. We are looking for patterns, so this is a good start.

    You'll also notice that each one of those <div> tags has a class of md. So... Lets load all of the pages <div> tags into an object and then look for the ones that have a className that contains "md":

    Sub getRedditData()

    Dim x As Long, y As Long
    Dim htm As Object
    
    Set htm = CreateObject("htmlFile")
    
    With CreateObject("msxml2.xmlhttp")
        .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
        .send
        htm.body.innerhtml = .responsetext
    End With
    
    Set Divelements = htm.getElementsByTagName("div")
    
    For Each DivElement In Divelements
        If InStr(1, DivElement.ClassName, "md") Then
            'print contents to the Immediate window for debugging View>>Immediate Window to insure it's up in your VBE
            Debug.Print DivElement.InnerText
        End If
    Next
    

    End Sub

    With that you'll see all of the comments stuck in the Immediate window (go to View>>Immediate Window) so you can see this debug output.


    After skipping around the nodes it looks like you can navigate up a couple of elements and back down the tree to get the username:

    Sub getRedditData()
    
        Dim x As Long, y As Long
        Dim htm As Object
    
        Set htm = CreateObject("htmlFile")
    
        With CreateObject("msxml2.xmlhttp")
            .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
            .send
            htm.body.innerhtml = .responsetext
        End With
    
        Set Divelements = htm.getElementsByTagName("div")
    
    
        On Error Resume Next
    
        For Each divElement In Divelements
            If InStr(1, divElement.className, "md") And Not InStr(1, divElement.className, "md-container") Then
                Set commentEntry = divElement.ParentNode.ParentNode.ParentNode
    
                'Print the name and the comment
                Debug.Print commentEntry.FirstChild.FirstChild.NextSibling.InnerText & ":", divElement.InnerText
    
            End If
        Next
    
    End Sub
    

    To print this out to the sheet just point to a cell instead of the debug.print immediate window. Something like:

    Sub getRedditData()
    
        Dim x As Long, y As Long
        Dim htm As Object
        Dim ws As Worksheet, wsCell As Integer
    
        'set the worksheet to print to and the first row to start printing.
        Set ws = Sheets("Sheet1")
        wsCell = 1
    
        Set htm = CreateObject("htmlFile")
    
        With CreateObject("msxml2.xmlhttp")
            .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
            .send
            htm.body.innerhtml = .responsetext
        End With
    
        Set Divelements = htm.getElementsByTagName("div")
    
    
        On Error Resume Next
    
        For Each divElement In Divelements
            If InStr(1, divElement.className, "md") And Not InStr(1, divElement.className, "md-container") Then
                Set commentEntry = divElement.ParentNode.ParentNode.ParentNode
    
                'Print the name and the comment to ws sheet columns 1 and 2
                ws.Cells(wsCell, 1).Value = commentEntry.FirstChild.FirstChild.NextSibling.InnerText
                ws.Cells(wsCell, 2).Value = divElement.InnerText
    
                'iterate to the next row
                wsCell = wsCell + 1
    
            End If
        Next
    End Sub