Extracting unknown dates from txt/HTML files using R


I want to extract dates from txt (or HTML) documents using a pattern that I identified in the text, using the R tm package. I have newspaper articles on my PC in the folders data_X_txt and data_X (in HTML). Each folder contains documents named after a company, and each document holds all newspaper articles for that company in a single txt or HTML file. I downloaded these documents in HTML from Lexis Nexis.

For each document I want to know the upload dates of the articles it contains. I found that the upload date of each article follows the word UPDATE:.

I found this question, which is similar to my problem: Extract unknown words from a recurrent pattern

But I have several problems getting to a solution.
First, I don't know how to correctly load the data from the individual documents into R for further processing with a regular expression.

Second, I have trouble understanding and applying the sub call myself. This is the call I found:

sub("^(?:https?:\\/\\/)?[^\\/]+\\/([^\\/]+).*$", "\\1", tmp[,5])

I have difficulty adapting the pattern argument of sub (the first argument, I assume) to my problem. I also don't know what the second argument means. I know the third argument is the source of the text, but I don't know what [,5] means.

Here is the code in full:

tmp <- read.csv("LaVanguardia_facebook_statuses.csv")
sub("^(?:https?:\\/\\/)?[^\\/]+\\/([^\\/]+).*$", "\\1", tmp[,5])

Here is also a txt file I use: https://www.dropbox.com/s/e24ywni8z3s8wqk/SolarWorldAG_25.03.2008_1.HTML.txt?dl=0

My knowledge of R currently comes from Swirl courses and, for text mining specifically, from https://rstudio-pubs-static.s3.amazonaws.com/31867_8236987cf0a8444e962ccd2aec46d9c3.html


Solution

  • The text mining package will not help much if all you need are the dates, but the regular expression capabilities of R are pretty useful.

    To achieve specifically what you asked for, try gregexpr with regmatches:

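    fileName <- "~/Downloads/SolarWorldAG_25.03.2008_1.HTML.txt"
    # Read the whole file into a single string
    mytxt <- readChar(fileName, file.info(fileName)$size)
    # Sanity check: regexec() returns only the first occurrence of the marker
    regmatches(mytxt, regexec("UPDATE:", mytxt))

    # gregexpr() finds every occurrence; regmatches() returns the matched text.
    # Note: inside [...] a | is a literal character, so the class is just [a-zä]
    regmatches(mytxt, gregexpr(
      "UPDATE: [A-Za-z]{0,10} ?[0-9]{1,2}\\. [A-Z][a-zä]{2,8} [0-9]{4}",
      mytxt))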
    

    In plain English, the pattern says: match the literal UPDATE: followed by a space; then 0 to 10 letters for the optional German day of the week; an optional space; a one- or two-digit day; a period; a space; one capital letter; 2 to 8 further characters, each a lowercase English letter or ä; a space; and a four-digit year. The period is escaped because an unescaped . matches any character in a regular expression, and the backslash is doubled because an R string literal needs \\ to produce a single \.

    You get:

    [1] "UPDATE: 18. März 2008"      "UPDATE: 14. März 2008"     
    [3] "UPDATE: 13. März 2008"      "UPDATE: 14. März 2008"     
    [5] "UPDATE: 28. Februar 2008"   "UPDATE: 20. Februar 2008" 
    ...
    [189] "UPDATE: 31. Dezember 2004"      "UPDATE: 3. Januar 2005"        
    [191] "UPDATE: 9. Dezember 2004"       "UPDATE: 23. November 2004"
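
    To go one step further, here is a minimal sketch (not tested against the actual Lexis Nexis files) of how the same pattern could be used to keep only the date and to process a whole folder. It assumes that a capture group is added around the date, that the articles are UTF-8 encoded .txt files in a folder such as data_X_txt, and that every date has the shape shown in the output above. The helper names extract_update_dates and extract_from_folder are made up for illustration. In sub, "\\1" stands for whatever the first parenthesised group matched.

    ```r
    # Pattern from above, with one capture group ( ... ) around the date itself.
    # Assumption: dates look like "18. März 2008", optionally preceded by a weekday.
    date_pattern <- "UPDATE: [A-Za-z]{0,10} ?([0-9]{1,2}\\. [A-Z][a-zä]{2,8} [0-9]{4})"

    # Hypothetical helper: return all date strings found in one block of text.
    extract_update_dates <- function(txt) {
      hits <- regmatches(txt, gregexpr(date_pattern, txt))[[1]]
      # Each hit matches the pattern in full, so replacing it with "\\1"
      # keeps only the captured date and drops "UPDATE: " and the weekday.
      sub(date_pattern, "\\1", hits)
    }

    # Hypothetical helper: apply the extraction to every .txt file in a folder.
    # Returns a list with one character vector of dates per file, named by path.
    extract_from_folder <- function(dir) {
      files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
      texts <- vapply(
        files,
        function(f) paste(readLines(f, warn = FALSE, encoding = "UTF-8"),
                          collapse = "\n"),
        character(1)
      )
      lapply(texts, extract_update_dates)
    }

    # Small inline example so the sketch can be tried without any files.
    sample_txt <- "Text A UPDATE: 18. März 2008 Text B UPDATE: Montag 3. Januar 2005"
    dates <- extract_update_dates(sample_txt)
    print(dates)
    # [1] "18. März 2008"  "3. Januar 2005"

    # For the real data, something like this should work (folder name assumed):
    # dates_by_file <- extract_from_folder("data_X_txt")
    ```

    A file without any UPDATE: line simply yields an empty character vector, so the result always has one entry per file.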