I want to capture the download link (the href attribute) on line 406 of the page source at https://www.fda.gov/drugs/drug-approvals-and-databases/drugsfda-data-files. The HTML there should contain a 'data-entity-substitution' attribute, but when I used html_element, it gave me a missing value.
Here is my code:
html <- rvest::read_html("https://www.fda.gov/drugs/drug-approvals-and-databases/drugsfda-data-files")
rvest::html_element(html, "[data-entity-substitution]")
#> {xml_missing}
#> <NA>
In this case there is no need to scrape anything! You can use download.file to download the zip file and then unzip it. I read the files into a list since they do not all have the same number of columns.
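library(tidyverse)
# Download the zip archive to a temporary file (binary mode matters on Windows)
temp = tempfile(fileext = '.zip')
url = 'https://www.fda.gov/media/89850/download?attachment'
download.file(url, temp, method = 'auto', mode = 'wb')
# Extract into a temporary directory; unzip() returns the extracted file paths
temp2 = tempfile()
dt = unzip(zipfile = temp, exdir = temp2)
# Read each tab-delimited file into its own list element
dat = map(dt, \(x) read_delim(x, delim = '\t'))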
#> Rows: 59 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (3): ActionTypes_LookupDescription, SupplCategoryLevel1Code, SupplCatego...
#> dbl (1): ActionTypes_LookupID
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 74017 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (4): ApplNo, SubmissionType, ApplicationDocsTitle, ApplicationDocsURL
#> dbl (3): ApplicationDocsID, ApplicationDocsTypeID, SubmissionNo
#> dttm (1): ApplicationDocsDate
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
#> Rows: 27412 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (3): ApplNo, ApplType, SponsorName
#> lgl (1): ApplPublicNotes
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 62 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (1): ApplicationDocsType_Lookup_Description
#> dbl (1): ApplicationDocsType_Lookup_ID
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 48269 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (2): ApplNo, ProductNo
#> dbl (1): MarketingStatusID
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 5 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (1): MarketingStatusDescription
#> dbl (1): MarketingStatusID
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
#> Rows: 47798 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (6): ApplNo, ProductNo, Form, Strength, DrugName, ActiveIngredient
#> dbl (2): ReferenceDrug, ReferenceStandard
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 28 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (2): SubmissionClassCode, SubmissionClassCodeDescription
#> dbl (1): SubmissionClassCodeID
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 266557 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (3): ApplNo, SubmissionType, SubmissionPropertyTypeCode
#> dbl (2): SubmissionNo, SubmissionPropertyTypeID
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 182145 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (5): ApplNo, SubmissionType, SubmissionStatus, SubmissionsPublicNotes, ...
#> dbl (2): SubmissionClassCodeID, SubmissionNo
#> dttm (1): SubmissionStatusDate
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 22944 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (3): ApplNo, ProductNo, TECode
#> dbl (1): MarketingStatusID
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Created on 2024-07-02 with reprex v2.1.0
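As a possible follow-up, the list is easier to work with when each element is named after its source file, and the column-specification messages can be silenced as the output itself suggests. The sketch below is a minimal variant of the reading step; `read_named` is a hypothetical helper name, not something from readr or purrr, and it assumes every file is tab-delimited like the ones above.

```r
library(readr)
library(purrr)

# Hypothetical helper: read each tab-delimited file into a list whose
# elements are named after the file they came from.
read_named <- function(paths) {
  paths |>
    set_names(basename(paths)) |>
    map(\(p) read_delim(p, delim = "\t", show_col_types = FALSE))
}

# Usage with the `dt` vector returned by unzip() above:
# dat <- read_named(dt)
# names(dat)                                   # one name per extracted file
# keep(map(dat, problems), \(p) nrow(p) > 0)   # only files with parsing issues
```

The last commented line is one way to follow up on the two parsing warnings: `problems()` returns a data frame of rows that failed to parse, so keeping only the non-empty ones shows which files need explicit column types.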