I am trying to extract the date from multiple PDFs to create a date column in a dataset.
I have a folder holding all the PDFs and am trying to do topic modelling over a time period, hence I need to extract the dates.
Below is the dataset I have, which contains just the filenames.
# A tibble: 260 x 1
filename
<chr>
1 ./2012.01.18.pdf
2 ./2012.02.07.pdf
3 ./2012.03.12.pdf
4 ./2012.03.26.pdf
5 ./2012.04.02.pdf
6 ./2012.04.04.pdf
7 ./2012.04.19.pdf
8 ./2012.05.01.pdf
9 ./2012.05.07.pdf
10 ./2012.06.14.pdf
I tried as.Date with no luck; I have not been able to extract the dates from the filenames of the PDFs in the folder.
In the format argument of as.Date, we can specify the extra characters along with the custom format codes for year (%Y), month (%m) and day (%d):
df$V2 <- as.Date(df$V2, format = "./%Y.%m.%d.pdf")
-output
> df
V1 V2
1 1 2012-01-18
2 2 2012-02-07
3 3 2012-03-12
4 4 2012-03-26
5 5 2012-04-02
6 6 2012-04-04
7 7 2012-04-19
8 8 2012-05-01
9 9 2012-05-07
10 10 2012-06-14
df <- structure(list(V1 = 1:10, V2 = c("./2012.01.18.pdf", "./2012.02.07.pdf",
"./2012.03.12.pdf", "./2012.03.26.pdf", "./2012.04.02.pdf", "./2012.04.04.pdf",
"./2012.04.19.pdf", "./2012.05.01.pdf", "./2012.05.07.pdf", "./2012.06.14.pdf"
)), class = "data.frame", row.names = c(NA, -10L))
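
If the leading "./" or the extension could vary between files, a possible variation is to strip both before parsing, so that the format string only has to describe the date itself. The following is a minimal sketch, not a tested solution for the full 260-row dataset: the data frame name files and the new column name date are assumptions made for illustration, and only the filename column and its values come from the question.

```r
# Hypothetical stand-in for the tibble of file names from the question
files <- data.frame(
  filename = c("./2012.01.18.pdf", "./2012.02.07.pdf", "./2012.03.12.pdf"),
  stringsAsFactors = FALSE
)

# Drop the directory part and the file extension, leaving e.g. "2012.01.18"
stem <- tools::file_path_sans_ext(basename(files$filename))

# Parse what remains; any name that does not match the format becomes NA
files$date <- as.Date(stem, format = "%Y.%m.%d")

files
```

Checking anyNA(files$date) afterwards is a quick way to spot file names that do not follow the year.month.day pattern.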