r, selenium, web-scraping, rselenium, headless-browser

Is there a way to count characters on a web page opened in a browser application from R?


I have saved a large number of web pages as text (.txt files). They are public profile pages from a social media site, and I want a rough measure of how much content is on each profile page. When I rename these files to .html and open them in a browser, I can see the profile as presented. But the text file itself is a poor indication of how developed the profile is: character counts on the file are completely uncorrelated with how developed the viewable profile is. (So I learned that HTML files as such are not good proxies for what shows up when you view the file, since a lot of the text never gets rendered in the browser window.)

The typical parsing functions in R for extracting text from .html files seem to drop a lot of the content. I think these profile pages are not very well structured.

I can open these files in an application like Chrome from R. But is there a way (programmatically, from R) to copy the text rendered in Chrome and paste it into another file, as a way of measuring the text that appears in these profiles? I would like to build something automated in R and loop it over the files.

I'll place a Dropbox link to example files (input and output) here -> https://www.dropbox.com/sh/4fqxwbj74tnfaxq/AACtexD7OVYYrMoTDrudbacba?dl=0. The file "test2_simple_pagecode.txt" contains the page source of a sample profile. One could change its extension to .html and open it in a browser to view the page. What I want to do is open that file in a browser window, then copy and paste the text of the entire page into a separate file, like the example in "test2_simple_cutpaste.txt". That way, the new file contains only the words that are actually seen in the profile.


Solution

  • This page relies heavily on JavaScript to render its content. I suggest looking into RSelenium to process the page: it drives a real browser, so the JavaScript gets executed, and you can then use the rvest package to extract the information of interest.
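
    As a rough illustration of the RSelenium route, the sketch below opens each saved page in Chrome and counts the characters of the text the browser actually renders. It assumes the files have already been copied to a .html extension, that RSelenium can start a Chrome driver on the machine, and that a fixed two-second pause is enough for the JavaScript to finish. The names to_file_url and rendered_char_count, the port number and the pause length are all made up for this example.

    ```r
    # Hypothetical helper: turn a local path into a file:/// URL for the browser.
    to_file_url <- function(path) {
      full <- normalizePath(path, winslash = "/", mustWork = TRUE)
      paste0("file:///", sub("^/", "", full))
    }

    # Hypothetical helper: characters of rendered text for each saved .html page.
    rendered_char_count <- function(paths, port = 4567L) {
      # Start a Selenium server plus a Chrome session (needs a working driver setup)
      driver <- RSelenium::rsDriver(browser = "chrome", port = port, verbose = FALSE)
      on.exit({
        driver$client$close()
        driver$server$stop()
      }, add = TRUE)
      remote <- driver$client

      vapply(paths, function(path) {
        remote$navigate(to_file_url(path))
        Sys.sleep(2)  # crude wait for the page's JavaScript; an assumption
        body <- remote$findElement(using = "tag name", value = "body")
        # getElementText() returns a list holding the visible text of the element
        nchar(body$getElementText()[[1]])
      }, integer(1))
    }

    # Example call (not run here because it launches a browser):
    # counts <- rendered_char_count(list.files("profiles_html", full.names = TRUE))
    ```

    To keep the text itself rather than just its length, the string returned by getElementText() could be written out with writeLines() inside the loop.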

    Here is a very quick and very dirty way to extract the information stored in the person's profile, although a lot of extraneous information is stored there as well.

    It appears that the profile information is stored as JSON data in a comment in the HTML code. The example below extracts that comment, replaces the escaped Unicode sequence \u002d (a hyphen) with a space, and parses the JSON data.

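    library(stringr)
    
    # Read the saved page source and collapse it into a single string
    lines <- readLines("test2_simple_pagecode.txt")
    alllines <- paste(lines, collapse = " ")
    
    # Extract the HTML comment that holds the profile's JSON payload
    output <- str_extract(alllines, "<!--\\{\"content\"\\:\\{\"Notes\".+?-->")
    nchar(output)
    
    # Replace the escaped hyphen (\u002d) with a space, strip the comment
    # delimiters "<!--" and "-->", then parse the JSON
    output2 <- gsub("\\\\u002d", " ", output)
    jsonlite::parse_json(substr(output2, 5, nchar(output2) - 3))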
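
Since the goal is a rough size measure over many files, the extraction above can be wrapped in a function and looped. The sketch below is one way to do that, under some assumptions: the helper name count_profile_chars and the "profiles" directory are invented for illustration, every file is assumed to carry its JSON in the same kind of comment, and the count covers all string values in the JSON, including the extraneous ones.

```r
library(stringr)
library(jsonlite)

# Hypothetical helper: total characters across all string values in the
# JSON comment embedded in one saved page. Returns NA if no comment is found.
count_profile_chars <- function(path) {
  page <- paste(readLines(path, warn = FALSE), collapse = " ")
  comment <- str_extract(page, "<!--\\{\"content\":\\{\"Notes\".+?-->")
  if (is.na(comment)) return(NA_integer_)

  # Same cleanup as above: swap the escaped hyphen for a space,
  # then drop the "<!--" and "-->" delimiters
  cleaned <- gsub("\\\\u002d", " ", comment)
  parsed <- parse_json(substr(cleaned, 5, nchar(cleaned) - 3))

  # Visit every leaf of the nested list; only character leaves are counted
  lengths <- rapply(parsed, nchar, classes = "character", how = "unlist")
  sum(lengths)
}

# "profiles" is a placeholder directory holding the saved .txt pages
files <- list.files("profiles", pattern = "\\.txt$", full.names = TRUE)
counts <- vapply(files, count_profile_chars, integer(1))
```

The result is a named integer vector with one entry per file. Restricting the count to particular JSON fields would give a cleaner measure, but which fields matter depends on the page's structure.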