R: reading old 13F txt files from SEC Edgar database using R edgar package


Hi, I'm trying to read 13F filings from the SEC EDGAR database using the R edgar package.

The challenge is that the filings I'm looking at are old (around the year 2000): https://www.sec.gov/edgar/browse/?CIK=1087699

They are in a crappy txt format, different from today's 13F filings, and unreadable with the usual text-reading functions.

An example file is here: https://www.sec.gov/Archives/edgar/data/1087699/000108769999000001/0001087699-99-000001.txt

library(edgar)

F13 <- getFilings(
  cik.no = "0001087699",
  form.type = "13F-HR",
  filing.year = 1999,
  quarter = c(1, 2, 3),
  useragent = "myname@gmail.com"
)

I tried this, and R just reports that it is busy and keeps downloading for a very long time, even though the txt file is not very big, so something is wrong. When it finally finishes, it says that no filing information was found for the given CIKs and form type, yet I'm clearly looking at the file. If the edgar package is not designed to deal with this, how can I do it?

My end goal is to have the filings in a nice data frame, with columns for stock symbols and prices and a row for each stock. A little help, please.

Is there a way to scrape this? I highlighted the lines using Inspect in Chrome, but they look weird to me (sorry, I'm not good at scraping at all).


Solution

  • I parsed the example file you provided. I first copied the data from the file into a text file, copied.txt, which needs to be in the current working directory. This should give you an idea of how to proceed.

    library(tidyverse)
    
    df <- read_file("copied.txt") %>%
      # keep only the text between the table header and the closing </TABLE> tag
      (function(x){
        tbl_beg <- str_locate(x, "Managers Sole")[2] + 1
        tbl_end <- str_locate(x, "\r\n</TABLE>")[1]
        str_sub(x, tbl_beg, tbl_end)
        }) %>%
      # trim some unwanted characters from the beginning and the end of the extracted string
      str_sub(start = 4, end = -3) %>%
      # split into individual lines
      str_split('\"\r\n\"') %>% unlist() %>%
      # remove a broken line break
      str_remove("\r\n") %>%
      # replace the space-separated text with a single token joined by an underscore,
      # because the rows are split into columns on spaces below
      str_replace_all("Sole   Managers Sole", " Managers_Sole") %>%
      # collapse extra spaces
      str_squish() %>%
      # reverse each line so that the split starts from the right: the company name
      # may contain spaces, so it has to be the last piece, where extra spaces are okay
      stringi::stri_reverse() %>%
      str_split(pattern = " ", n = 6, simplify = TRUE) %>%
      # reverse each piece back to its original orientation
      apply(MARGIN = 2, FUN = stringi::stri_reverse) %>%
      as_tibble() %>%
      # restore the original column order
      select(6:1) %>%
      set_names(nm = c("name_of_issuer", "title_of_cl", "cusip_number", "fair_market_value", "shares",  "shares_of_princip_mngrs"))
    
    # A tibble: 47 x 6
       name_of_issuer   title_of_cl cusip_number fair_market_value shares  shares_of_princip_mngrs
       <chr>            <chr>       <chr>        <chr>             <chr>   <chr>                  
     1 America Online   COM         02364J104    2,940,000         20,000  Managers_Sole          
     2 Anheuser Busch   COM         35229103     3,045,000         40,000  Managers_Sole          
     3 At Home          COM         45919107     787,500           5,000   Managers_Sole          
     4 AT&T             COM         1957109      5,985,937         75,000  Managers_Sole          
     5 Bank Toyko       COM         65379109     700,000           50,000  Managers_Sole          
     6 Bay View Capital COM         07262L101    14,958,437        792,500 Managers_Sole          
     7 Broadcast.com    COM         111310108    2,954,687         25,000  Managers_Sole          
     8 Chase Manhattan  COM         16161A108    10,578,750        130,000 Managers_Sole          
     9 Chase Manhattan  4/85C       16161A9DQ    59,375            500     Managers_Sole          
    10 Cisco Systems    COM         17275R102    4,930,312         45,000  Managers_Sole
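
All columns in the result above are character, so the values still need converting before you can compute with them. Here is a minimal sketch of that step, not a definitive implementation. It assumes the values are plain dollar amounts with commas as thousands separators, and that the CUSIPs shorter than nine characters lost their leading zeros in the filing. The `holdings` data is just the first three rows of the output typed in by hand, and `implied_price` is a hypothetical column name: the filing reports market value and share count, not a price.

```r
library(tibble)
library(dplyr)
library(readr)
library(stringr)

# First three rows of the parsed output, typed in by hand for illustration
holdings <- tribble(
  ~name_of_issuer,  ~title_of_cl, ~cusip_number, ~fair_market_value, ~shares,
  "America Online", "COM",        "02364J104",   "2,940,000",        "20,000",
  "Anheuser Busch", "COM",        "35229103",    "3,045,000",        "40,000",
  "At Home",        "COM",        "45919107",    "787,500",          "5,000"
)

holdings_num <- holdings %>%
  mutate(
    # parse_number() drops the comma grouping marks and returns doubles
    across(c(fair_market_value, shares), parse_number),
    # assumption: short CUSIPs are missing leading zeros, so pad to 9 characters
    cusip_number = str_pad(cusip_number, width = 9, side = "left", pad = "0"),
    # assumption: market value is in dollars, so value / shares is a per-share price
    implied_price = fair_market_value / shares
  )

print(holdings_num)
```

The same `mutate()` call should work on the full `df` from the answer, since it only refers to columns that `df` already has. The filing has no ticker symbols, so the CUSIP is the identifier you would use to look them up elsewhere.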