
Cannot select specific element from the html with `rvest`


I want to capture the download link (the href attribute) on line 406 of the page source at https://www.fda.gov/drugs/drug-approvals-and-databases/drugsfda-data-files. The HTML there should contain a 'data-entity-substitution' attribute, but when I used html_element, it returned a missing value.

Here is my code:

html <- rvest::read_html("https://www.fda.gov/drugs/drug-approvals-and-databases/drugsfda-data-files")
rvest::html_element(html, "[data-entity-substitution]")
#> {xml_missing}
#> <NA>
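
One way to tell whether the selector or the downloaded page is at fault is to run the same selector on a small document where the attribute is known to exist. A minimal sketch, assuming a made-up anchor tag (the markup, attribute value, and href below are hypothetical placeholders, not copied from the FDA page):

```r
library(rvest)

# Hypothetical markup: the attribute value and href are placeholders
doc <- minimal_html(
  '<p><a data-entity-substitution="example" href="/media/0000/download">Data file</a></p>'
)

# Same attribute selector as in the question
link <- html_element(doc, "[data-entity-substitution]")

# Pull the href from the matched node
html_attr(link, "href")
```

If the selector matches here but still gives `{xml_missing}` on the real page, the attribute is probably not present in the HTML that `read_html()` received, which can differ from what the browser's inspector shows.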

Solution

  • In this case there is no need to scrape anything! You can use download.file to download the zip file directly and then unzip it. I read the files into a list because they do not all have the same number of columns.

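    library(tidyverse)

    # Download the zip archive to a temporary file (binary mode matters on Windows)
    temp <- tempfile(fileext = '.zip')
    url <- 'https://www.fda.gov/media/89850/download?attachment'
    download.file(url, temp, method = 'auto', mode = 'wb')

    # Extract into a temporary directory; unzip() returns the extracted file paths
    temp2 <- tempfile()
    dt <- unzip(zipfile = temp, exdir = temp2)

    # Read each tab-delimited file into its own element of a list
    dat <- map(dt, \(x) read_delim(x, delim = '\t'))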
    #> Rows: 59 Columns: 4
    #> ── Column specification ────────────────────────────────────────────────────────
    #> Delimiter: "\t"
    #> chr (3): ActionTypes_LookupDescription, SupplCategoryLevel1Code, SupplCatego...
    #> dbl (1): ActionTypes_LookupID
    #> 
    #> ℹ Use `spec()` to retrieve the full column specification for this data.
    #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    #> Rows: 74017 Columns: 8
    #> ── Column specification ────────────────────────────────────────────────────────
    #> Delimiter: "\t"
    #> chr  (4): ApplNo, SubmissionType, ApplicationDocsTitle, ApplicationDocsURL
    #> dbl  (3): ApplicationDocsID, ApplicationDocsTypeID, SubmissionNo
    #> dttm (1): ApplicationDocsDate
    #> 
    #> ℹ Use `spec()` to retrieve the full column specification for this data.
    #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    #> Warning: One or more parsing issues, call `problems()` on your data frame for details,
    #> e.g.:
    #>   dat <- vroom(...)
    #>   problems(dat)
    #> Rows: 27412 Columns: 4
    #> ── Column specification ────────────────────────────────────────────────────────
    #> Delimiter: "\t"
    #> chr (3): ApplNo, ApplType, SponsorName
    #> lgl (1): ApplPublicNotes
    #> 
    #> ℹ Use `spec()` to retrieve the full column specification for this data.
    #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    #> Rows: 62 Columns: 2
    #> ── Column specification ────────────────────────────────────────────────────────
    #> Delimiter: "\t"
    #> chr (1): ApplicationDocsType_Lookup_Description
    #> dbl (1): ApplicationDocsType_Lookup_ID
    #> 
    #> ℹ Use `spec()` to retrieve the full column specification for this data.
    #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    #> Rows: 48269 Columns: 3
    #> ── Column specification ────────────────────────────────────────────────────────
    #> Delimiter: "\t"
    #> chr (2): ApplNo, ProductNo
    #> dbl (1): MarketingStatusID
    #> 
    #> ℹ Use `spec()` to retrieve the full column specification for this data.
    #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    #> Rows: 5 Columns: 2
    #> ── Column specification ────────────────────────────────────────────────────────
    #> Delimiter: "\t"
    #> chr (1): MarketingStatusDescription
    #> dbl (1): MarketingStatusID
    #> 
    #> ℹ Use `spec()` to retrieve the full column specification for this data.
    #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    #> Warning: One or more parsing issues, call `problems()` on your data frame for details,
    #> e.g.:
    #>   dat <- vroom(...)
    #>   problems(dat)
    #> Rows: 47798 Columns: 8
    #> ── Column specification ────────────────────────────────────────────────────────
    #> Delimiter: "\t"
    #> chr (6): ApplNo, ProductNo, Form, Strength, DrugName, ActiveIngredient
    #> dbl (2): ReferenceDrug, ReferenceStandard
    #> 
    #> ℹ Use `spec()` to retrieve the full column specification for this data.
    #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    #> Rows: 28 Columns: 3
    #> ── Column specification ────────────────────────────────────────────────────────
    #> Delimiter: "\t"
    #> chr (2): SubmissionClassCode, SubmissionClassCodeDescription
    #> dbl (1): SubmissionClassCodeID
    #> 
    #> ℹ Use `spec()` to retrieve the full column specification for this data.
    #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    #> Rows: 266557 Columns: 5
    #> ── Column specification ────────────────────────────────────────────────────────
    #> Delimiter: "\t"
    #> chr (3): ApplNo, SubmissionType, SubmissionPropertyTypeCode
    #> dbl (2): SubmissionNo, SubmissionPropertyTypeID
    #> 
    #> ℹ Use `spec()` to retrieve the full column specification for this data.
    #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    #> Rows: 182145 Columns: 8
    #> ── Column specification ────────────────────────────────────────────────────────
    #> Delimiter: "\t"
    #> chr  (5): ApplNo, SubmissionType, SubmissionStatus, SubmissionsPublicNotes, ...
    #> dbl  (2): SubmissionClassCodeID, SubmissionNo
    #> dttm (1): SubmissionStatusDate
    #> 
    #> ℹ Use `spec()` to retrieve the full column specification for this data.
    #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    #> Rows: 22944 Columns: 4
    #> ── Column specification ────────────────────────────────────────────────────────
    #> Delimiter: "\t"
    #> chr (3): ApplNo, ProductNo, TECode
    #> dbl (1): MarketingStatusID
    #> 
    #> ℹ Use `spec()` to retrieve the full column specification for this data.
    #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    

    Created on 2024-07-02 with reprex v2.1.0
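
The list above is unnamed, so it can be hard to tell which table is which. A minimal sketch of one way to name each element after its file and to silence the column-specification messages, using two small made-up tab-delimited files in place of the real FDA extracts (the file names `first.txt` and `second.txt` and their contents are hypothetical):

```r
library(readr)
library(purrr)

# Hypothetical stand-ins for the unzipped files, with different column counts
out_dir <- tempfile()
dir.create(out_dir)
write_tsv(data.frame(id = 1:2, label = c("a", "b")),
          file.path(out_dir, "first.txt"))
write_tsv(data.frame(id = 1:3, code = c("x", "y", "z"), flag = c(TRUE, FALSE, TRUE)),
          file.path(out_dir, "second.txt"))

# Name each path after its file so that map() returns a named list
files <- list.files(out_dir, full.names = TRUE)
names(files) <- tools::file_path_sans_ext(basename(files))

# show_col_types = FALSE suppresses the column-specification messages
dat <- map(files, \(x) read_delim(x, delim = "\t", show_col_types = FALSE))

names(dat)
```

With the real download, the `dt` vector returned by `unzip()` would take the place of `files`.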