I want to extract dates from txt (or HTML) documents using a pattern that I identified in the text, using the R tm package. I have newspaper articles on my PC in the folders data_X_txt and data_X (in HTML). Each folder contains documents named after a company, and each document holds all newspaper articles for that company in one txt or HTML file. I downloaded these documents in HTML from Lexis Nexis.
For each document I want to know the upload dates of the articles it contains. I found that the upload date of each article follows the word UPDATE:.
I found this question, which is similar to my problem: Extract unknown words from a recurrent pattern
But I have several problems getting to the solution.
First, I don't know how to correctly load the data from the individual documents into R for further processing with a regex.
Second, I have trouble understanding and applying the sub function myself. See this call, which I found:
sub("^(?:https?:\\/\\/)?[^\\/]+\\/([^\\/]+).*$", "\\1", tmp[,5])
I have difficulties adapting the pattern part of sub (the first part I assume) to my problem. Also I don't know what the second part means. For the third part I know that this is the source of the text but I don't know what [,5] means.
Here is the code in full:
tmp <- read.csv("LaVanguardia_facebook_statuses.csv")
sub("^(?:https?:\\/\\/)?[^\\/]+\\/([^\\/]+).*$", "\\1", tmp[,5])
Here is also a txt file I use: https://www.dropbox.com/s/e24ywni8z3s8wqk/SolarWorldAG_25.03.2008_1.HTML.txt?dl=0
My knowledge of R currently comes from Swirl courses and, for text mining specifically, from https://rstudio-pubs-static.s3.amazonaws.com/31867_8236987cf0a8444e962ccd2aec46d9c3.html
The text mining package will not help much if all you need are the dates, but the regular expression capabilities of R are pretty useful.
To achieve specifically what you asked for, try gregexpr with regmatches:
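fileName <- "~/Downloads/SolarWorldAG_25.03.2008_1.HTML.txt"
# Read the whole file into a single character string
mytxt <- readChar(fileName, file.info(fileName)$size)
# Sanity check: regexec() returns only the first occurrence of "UPDATE:"
regmatches(mytxt, regexec("UPDATE:", mytxt))
# gregexpr() returns every occurrence of the full date pattern
regmatches(mytxt, gregexpr(
"UPDATE: [A-Za-z]{0,10} ?[0-9]{1,2}\\. [A-Z][a-zä]{2,8} [0-9]{4}",
mytxt))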
It says, in English: look for the literal UPDATE: followed by a space, then an optional run of 0 to 10 letters for the (optional) day of the week in German, an optional space, a 1- to 2-digit number, a period (escaped, because an unescaped . matches any character; the backslash is doubled because it sits inside an R string), a space, a capital letter, then 2 to 8 characters that are lowercase letters of the English alphabet or ä, followed by a space, followed by a 4-digit number.
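As a rough illustration of how the pieces fit together, here is a minimal sketch on a made-up string instead of the real file. The sample text and the variable names are assumptions for the example only. It also uses sub with a capture group to strip the UPDATE: prefix, which shows what the "\\1" in the replacement argument refers to.

```r
# Hypothetical sample text standing in for one downloaded document.
sample_txt <- "Headline A UPDATE: 28. Februar 2008 body text UPDATE: Montag 3. Januar 2005 more text"

# A pattern along the lines of the one above, stored for reuse.
pattern <- "UPDATE: [A-Za-z]{0,10} ?[0-9]{1,2}\\. [A-Z][a-zä]{2,8} [0-9]{4}"

# gregexpr() finds every match; regmatches() returns a list with one
# element per input string, hence the [[1]].
hits <- regmatches(sample_txt, gregexpr(pattern, sample_txt))[[1]]

# sub() with a capture group: the last pair of parentheses captures the
# date part, and "\\1" in the replacement refers back to that group.
# (?:...) is a non-capturing group for the optional weekday.
dates <- sub("^UPDATE: (?:[A-Za-z]+ )?(.*)$", "\\1", hits, perl = TRUE)
print(dates)
```

In the sub call from the question, "\\1" plays the same role: it replaces the whole matched string with whatever the first parenthesised group captured.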
You get:
[1] "UPDATE: 18. März 2008" "UPDATE: 14. März 2008"
[3] "UPDATE: 13. März 2008" "UPDATE: 14. März 2008"
[5] "UPDATE: 28. Februar 2008" "UPDATE: 20. Februar 2008"
...
[189] "UPDATE: 31. Dezember 2004" "UPDATE: 3. Januar 2005"
[191] "UPDATE: 9. Dezember 2004" "UPDATE: 23. November 2004"
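To run this over every document in a folder, something along the following lines might work. It is only a sketch: the helper name extract_update_dates, the demo folder, and the two tiny stand-in files are made up for illustration, and you would point list.files at your own data_X_txt folder instead.

```r
# Hypothetical helper: returns all "UPDATE: ..." strings found in one file.
extract_update_dates <- function(path) {
  txt <- readChar(path, file.info(path)$size)
  pattern <- "UPDATE: [A-Za-z]{0,10} ?[0-9]{1,2}\\. [A-Z][a-zä]{2,8} [0-9]{4}"
  regmatches(txt, gregexpr(pattern, txt))[[1]]
}

# Demo folder with two stand-in files (assumption: replace with "data_X_txt").
demo_dir <- file.path(tempdir(), "data_X_txt_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("Text UPDATE: 28. Februar 2008 more text",
           file.path(demo_dir, "CompanyA.txt"))
writeLines(c("Text UPDATE: 3. Januar 2005", "Text UPDATE: 23. November 2004"),
           file.path(demo_dir, "CompanyB.txt"))

# Apply the helper to every txt file; the result is a list named by file.
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
dates_by_file <- lapply(files, extract_update_dates)
names(dates_by_file) <- basename(files)
print(dates_by_file)
```

Keeping the result as a named list preserves which company file each date came from.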