
Extract source metadata from downloaded file


I have a bunch of PDF files that I downloaded. Now I want to extract the download URL from each file's metadata. How do I do this programmatically? I prefer solutions in R, and I'm working on macOS Mojave.

If you want to reproduce this, you can [use this file].



Solution

  • While you could have avoided the need for this by using R to programmatically download the PDFs, we can use the xattrs package to get to the data you seek:

    library(xattrs) # https://gitlab.com/hrbrmstr/xattrs (not on CRAN)
    

    Let's see what extended attributes are available for this file:

    xattrs::list_xattrs("~/Downloads/forso/0.-miljoenennota.pdf")
    ## [1] "com.apple.metadata:kMDItemWhereFroms"
    ## [2] "com.apple.quarantine" 
    

    com.apple.metadata:kMDItemWhereFroms looks like a good target:

    xattrs::get_xattr(
      path = "~/Downloads/forso/0.-miljoenennota.pdf",
      name = "com.apple.metadata:kMDItemWhereFroms"
    ) -> from_where
    
    from_where
    ## [1] "bplist00\xa2\001\002_\020}https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdfP\b\v\x8b"
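
    The leading bplist00 in that output is the magic header of a binary property list. As a small sketch (the helper name is_bplist is made up here and is not part of xattrs), one could check the raw bytes for that header before trying to parse them. The sample bytes are invented to resemble the start of the value above:

```r
# Hypothetical helper (not part of xattrs): TRUE when a raw vector starts
# with the 8-byte binary plist magic "bplist00".
is_bplist <- function(x) {
  is.raw(x) && length(x) >= 8 && identical(x[1:8], charToRaw("bplist00"))
}

# Made-up bytes shaped like the start of the attribute value shown above
sample_bytes <- c(charToRaw("bplist00"), as.raw(c(0xa2, 0x01, 0x02)))

is_bplist(sample_bytes)               # binary plist header present
is_bplist(charToRaw("<?xml version")) # an XML plist would not match
```

    In practice the input would be whatever get_xattr_raw() returns for the attribute, assuming it is a raw vector.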
    

    But it's in binary plist format (yay Apple #sigh). Since that's "a thing", the xattrs package has a read_bplist() function. To use it, we have to fetch the attribute with get_xattr_raw() rather than get_xattr():

    xattrs::read_bplist(
      xattrs::get_xattr_raw(
        path = "~/Downloads/forso/0.-miljoenennota.pdf",
        name = "com.apple.metadata:kMDItemWhereFroms"
      )
    ) -> from_where
    
    str(from_where)
    ## List of 1
    ##  $ plist:List of 1
    ##   ..$ array:List of 2
    ##   .. ..$ string:List of 1
    ##   .. .. ..$ : chr "https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf"
    ##   .. ..$ string: list()
    ##   ..- attr(*, "version")= chr "1.0"
    

    The ugly, nested list is the fault of the really dumb binary plist file format, but the source URL is in there.
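
    To avoid hard-coding the path through that nested list, the strings can be flattened and filtered. This is only a sketch: extract_where_froms is a hypothetical helper, and the mock input below imitates the str() output above with a placeholder URL, so it runs without xattrs:

```r
# Hypothetical helper: collect every URL-looking string from a structure
# shaped like the read_bplist() result above (plist -> array -> string).
extract_where_froms <- function(parsed) {
  if (!is.list(parsed)) return(character(0))
  vals <- unlist(parsed$plist$array, use.names = FALSE)
  if (is.null(vals)) return(character(0))
  vals <- as.character(vals)
  vals[grepl("^https?://", vals)]
}

# Mock input mimicking the str() output above, with a placeholder URL
mock <- list(
  plist = structure(
    list(array = list(
      string = list("https://example.com/report.pdf"),
      string = list()
    )),
    version = "1.0"
  )
)

extract_where_froms(mock)
```

    An empty or missing array yields character(0), which is easier to test for than a NULL buried in a list.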

    We can get all of them this way with lapply (I tossed a bunch of random, interactively downloaded PDFs into a directory for this). There's also an example of this in this blog post, but it uses reticulate and a Python package to read the binary plist data instead of the package's built-in function. That built-in function is a wrapper around the macOS plutil utility or the Linux plistutil utility; Windows users can switch to a real operating system if they want to use that function.

    fils <- list.files("~/Downloads/forso", pattern = "\\.pdf$", full.names = TRUE)
    
    do.call(
      rbind.data.frame,
      lapply(fils, function(.x) {
    
        xattrs::read_bplist(
          xattrs::get_xattr_raw(
            path = .x,
            name = "com.apple.metadata:kMDItemWhereFroms"
          )
        ) -> tmp
    
        from_where <- if (length(tmp$plist$array$string) > 0) {
          tmp$plist$array$string[[1]]
        } else {
          NA_character_
        }
    
        data.frame(
          fil = basename(.x),
          url = from_where,
          stringsAsFactors=FALSE
        )
    
      })
    ) -> files_with_meta
    
    str(files_with_meta)
    ## 'data.frame': 9 obs. of  2 variables:
    ##  $ fil: chr  "0.-miljoenennota.pdf" "19180242-D02E-47AC-BDB3-73C22D6E1FDB.pdf" "Codebook.pdf" "Elementary-Lunch-Menu.pdf" ...
    ##  $ url: chr  "https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/begrotingen/2016/09/20/miljoenennota-2017/0.-miljoenennota.pdf" "http://eprint.ncl.ac.uk/file_store/production/230123/19180242-D02E-47AC-BDB3-73C22D6E1FDB.pdf" "http://apps.start.umd.edu/gtd/downloads/dataset/Codebook.pdf" "http://www.msad60.org/wp-content/uploads/2017/01/Elementary-February-Lunch-Menu.pdf" ...
    

    NOTE: IRL you should likely do more bulletproofing in the example lapply.
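
    One possible shape for that bulletproofing is to wrap the read in tryCatch() so a file without the attribute, or with unparseable data, yields NA instead of stopping the whole run. This is a sketch under assumptions: safe_where_from and its reader argument are hypothetical names introduced here, and the stand-in readers exist only so the example runs without xattrs or real files:

```r
# Hypothetical wrapper: first source URL for `path`, or NA if anything fails.
# `reader` is an assumption added for illustration so the parsing step can be
# swapped out; when NULL it falls back to the xattrs calls used above.
safe_where_from <- function(path, reader = NULL) {
  if (is.null(reader)) {
    reader <- function(p) {
      xattrs::read_bplist(
        xattrs::get_xattr_raw(
          path = p,
          name = "com.apple.metadata:kMDItemWhereFroms"
        )
      )
    }
  }
  tryCatch({
    parsed <- reader(path)
    vals <- unlist(parsed$plist$array, use.names = FALSE)
    if (length(vals) == 0) NA_character_ else as.character(vals[[1]])
  }, error = function(e) NA_character_)
}

# Stand-in readers (placeholder URL, simulated failure)
ok_reader <- function(p) {
  list(plist = list(array = list(string = list("https://example.com/a.pdf"))))
}
bad_reader <- function(p) stop("no such xattr")

paths <- c("a.pdf", "b.pdf")
urls <- c(
  safe_where_from(paths[1], reader = ok_reader),
  safe_where_from(paths[2], reader = bad_reader)
)
data.frame(fil = basename(paths), url = urls, stringsAsFactors = FALSE)
```

    With real files, something like vapply(fils, safe_where_from, character(1), USE.NAMES = FALSE) should build the url column in one step.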