Cannot select a specific element from the HTML with `rvest`

I want to capture the download link (the href attribute) that appears on line 406 of the page source at https://www.fda.gov/drugs/drug-approvals-and-databases/drugsfda-data-files. The element should have a 'data-entity-substitution' attribute, but when I used html_element, it returned a missing value.

Here is my code:

html <- rvest::read_html("https://www.fda.gov/drugs/drug-approvals-and-databases/drugsfda-data-files")
rvest::html_element(html, "[data-entity-substitution]")
#> {xml_missing}
#> <NA>
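
For reference, `{xml_missing}` is what `html_element()` returns when the selector matches nothing in the document that was actually parsed. The sketch below illustrates this offline with a made-up HTML fragment: the markup and the attribute value `media_download` are assumptions for illustration, not the FDA page's real source.

```r
library(rvest)

# Hypothetical fragment standing in for the page; not the real FDA markup
page <- minimal_html('
  <a href="/media/89850/download?attachment" data-entity-substitution="media_download">Download</a>
  <a href="/other">Other link</a>
')

# The attribute is present in the parsed document, so the selector matches
hit <- html_element(page, "[data-entity-substitution]")
hit_href <- html_attr(hit, "href")

# No element carries this attribute, so the result is an xml_missing node
miss <- html_element(page, "[data-no-such-attribute]")
miss_href <- html_attr(miss, "href")
```

If the same selector matches in the browser's inspector but not in R, one thing worth checking is whether the HTML that `read_html()` received really contains the attribute, for example by searching `as.character(html)` for it.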

Solution

In this case there is no need to scrape anything! You can use `download.file` to download the zip file and then unzip it. I read the files into a list because they do not all have the same number of columns.

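library(tidyverse)

# Download the zip archive to a temporary file (binary mode matters on Windows)
temp <- tempfile(fileext = ".zip")
url <- "https://www.fda.gov/media/89850/download?attachment"
download.file(url, temp, method = "auto", mode = "wb")

# Extract into a temporary directory; unzip() returns the extracted file paths
temp2 <- tempfile()
dt <- unzip(zipfile = temp, exdir = temp2)

# Read each tab-delimited file into its own data frame, collected in a list
dat <- map(dt, \(x) read_delim(x, delim = "\t"))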
#> Rows: 59 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (3): ActionTypes_LookupDescription, SupplCategoryLevel1Code, SupplCatego...
#> dbl (1): ActionTypes_LookupID
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 74017 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr  (4): ApplNo, SubmissionType, ApplicationDocsTitle, ApplicationDocsURL
#> dbl  (3): ApplicationDocsID, ApplicationDocsTypeID, SubmissionNo
#> dttm (1): ApplicationDocsDate
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> Rows: 27412 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (3): ApplNo, ApplType, SponsorName
#> lgl (1): ApplPublicNotes
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 62 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (1): ApplicationDocsType_Lookup_Description
#> dbl (1): ApplicationDocsType_Lookup_ID
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 48269 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (2): ApplNo, ProductNo
#> dbl (1): MarketingStatusID
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 5 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (1): MarketingStatusDescription
#> dbl (1): MarketingStatusID
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> Rows: 47798 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (6): ApplNo, ProductNo, Form, Strength, DrugName, ActiveIngredient
#> dbl (2): ReferenceDrug, ReferenceStandard
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 28 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (2): SubmissionClassCode, SubmissionClassCodeDescription
#> dbl (1): SubmissionClassCodeID
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 266557 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (3): ApplNo, SubmissionType, SubmissionPropertyTypeCode
#> dbl (2): SubmissionNo, SubmissionPropertyTypeID
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 182145 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr  (5): ApplNo, SubmissionType, SubmissionStatus, SubmissionsPublicNotes, ...
#> dbl  (2): SubmissionClassCodeID, SubmissionNo
#> dttm (1): SubmissionStatusDate
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 22944 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (3): ApplNo, ProductNo, TECode
#> dbl (1): MarketingStatusID
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Created on 2024-07-02 with reprex v2.1.0
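
As an optional refinement, the list can be named after the files so that each table can be looked up by name, and the column-specification messages can be silenced with `show_col_types = FALSE`. The sketch below is self-contained and uses two small made-up files (`Lookup.txt` and `Status.txt` are hypothetical names and contents) in place of the unzipped FDA files; with the real data you would pass the paths returned by `unzip()` instead.

```r
library(readr)
library(purrr)

# Two small tab-delimited files with different column counts (made-up data)
dir <- tempfile()
dir.create(dir)
write_tsv(data.frame(id = 1:2, desc = c("a", "b")),
          file.path(dir, "Lookup.txt"))
write_tsv(data.frame(appl = c("x", "y"), product = c("p", "q"), status = c(3, 1)),
          file.path(dir, "Status.txt"))

files <- list.files(dir, full.names = TRUE)

# Name each path by its file name, then read every file into a named list
dat <- files |>
  set_names(basename(files)) |>
  map(\(x) read_delim(x, delim = "\t", show_col_types = FALSE))

# Individual tables are then available by file name
status <- dat[["Status.txt"]]
```

Keeping the tables in a list rather than binding them together avoids having to reconcile their differing columns.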