How To Return Query For Name of Stock Ticker From Corporation Name In R


Doing a project where I need to scrape https://www.sec.gov/divisions/enforce/friactions/friactions2017.shtml.

Basically, I have compiled a list of the SEC AAER releases, which ends up being a list of private and public companies. What I need to do is return the ticker for each corporation. Are there any R packages that would be useful for this?

As an example, I would want "PCRFY" returned for Panasonic Corporation. However, this might be an issue: there are two listings for KPMG, one being just "KPMG" and the other being "KPMG Inc." How can I make sure that both queries return a result?

An example of a function call would be:

    returnTicker(c("Panasonic Corporation", "Apple Corporation"))

Which would return:

    c("PCRFY", "AAPL")

Solution

  • Hopefully this comes close to what you need. It doesn't use fuzzy matching, but it should have comparable results.

    It is partially adapted from the answer to this question.

    # The TTR package includes stock symbols and names for NASDAQ, NYSE, and AMEX
    library(TTR)
    
    master <- TTR::stockSymbols()[,c('Name', 'Symbol')]
    
    # We are going to clean up the company names by removing some unimportant words.
    # Replace the words ' Incorporated', ' Corporated', and ' Corporation' with '' (no text), and put results in master$clean.
    master <- cbind(master, clean = gsub(' Incorporated| Corporated| Corporation', '', master$Name))
    
    # Some further cleaning of the master$clean column (the vertical bar | separates the alternatives we are removing)...
    # The \\b word boundary keeps ' Inc' from also matching the start of words such as ' Income'.
    master$clean <- gsub(',? (Inc|Corp|Ltd)\\b[.]?', '', master$clean)
    
    # Clean some special characters. For explanations, check out http://www.endmemo.com/program/R/gsub.php
    master$clean <- gsub('\\(The\\)|[.]|\'|,', '', master$clean)
    
    # You should also run the 3 cleaning steps above on your own company names.
    # Lastly, scroll through your data; you may find some more character strings to remove.
    
    # Create a data frame which would contain your company names....
    yourCompanyNames <- data.frame(name = c('apple', 'microsoft', 'allstate', 'ramp capital'), stringsAsFactors = FALSE)
    
    # This is the important part. Symbols are added to the data frame of yourCompanyNames....
    yourCompanyNames$sym <- sapply(X = yourCompanyNames$name, FUN = function(YOUR.NAME) {
      master[grep(pattern = YOUR.NAME, x = master$clean, ignore.case = T), 'Symbol'] })
    
    #  ------------ END ---------------
    
    # I dunno how much R experience you have, but here is a quick explanation of what is happening, chunk-by-chunk...
    
    # yourCompanyNames$sym <-
      # Create a new column in your dataframe for the symbols we will be finding
    
    # sapply(X = yourCompanyNames$name, FUN = function(YOUR.NAME) {
      # sapply() applies a function (found on the next line) to your data (X).
    
    # master[grep(
      # grep() searches for a string in a vector of strings, and will return the indices where it is found. For example...
      # grep('hel', c('hello', 'world', 'help')) returns 1 and 3
    
    # pattern = YOUR.NAME, x = master$clean, ignore.case = T),
      # The pattern which grep() is looking for is YOUR.NAME, which is an individual company name from yourCompanyNames.
      # (Remember, we are moving through yourCompanyNames one-by-one)
      # grep() looks for YOUR.NAME in each of the strings in master$clean, and ignores capitalization of the strings.
    
    # 'Symbol'] })
      # We can simplify the second line to master[grep(), 'Symbol']
      # Since grep() is returning indices where YOUR.NAME is found in master$clean,
      # the second line gives us the symbols for the companies located at those indices (rows).
      # Finally, sapply() returns the list of symbols we found, and the list is added to yourCompanyNames$sym
    
    
    # Using the 4 example companies from above, we get....
    
    #           name                                                         sym
    # 1        apple                                        AAPL, APLE, DPS, MLP
    # 2    microsoft                                                        MSFT
    # 3     allstate ALL, ALL-PA, ALL-PB, ALL-PC, ALL-PD, ALL-PE, ALL-PF, ALL-PG
    # 4 ramp capital                                                            
    
    # The word 'apple' appeared in multiple names, and 'allstate' has multiple tickers.
    # You may need to clean some of them up using fix(yourCompanyNames)
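
To get closer to the `returnTicker()` call in the question, the cleaning and lookup steps could be wrapped up roughly as follows. This is only a sketch under some assumptions: `cleanName()` and this `returnTicker()` are hypothetical helpers, not part of TTR, and the small `master` table is hard-coded so the example runs offline (in practice it would come from `TTR::stockSymbols()`). Names with no match come back as `NA`, so "KPMG" and "KPMG Inc." both return a result, and it is the same result because they clean to the same string.

```r
# Hypothetical stand-in for TTR::stockSymbols()[, c('Name', 'Symbol')],
# hard-coded so the sketch runs without a network connection.
master <- data.frame(
  Name   = c('Apple Inc.', 'Microsoft Corporation', 'Allstate Corporation (The)'),
  Symbol = c('AAPL', 'MSFT', 'ALL'),
  stringsAsFactors = FALSE
)

# cleanName() is a hypothetical helper bundling the three cleaning steps,
# so the same rules are applied to the reference names and to your names.
cleanName <- function(x) {
  x <- gsub(' Incorporated| Corporated| Corporation', '', x)
  x <- gsub(',? (Inc|Corp|Ltd)\\b[.]?', '', x)
  x <- gsub('\\(The\\)|[.]|\'|,', '', x)
  trimws(x)
}

# returnTicker() is a hypothetical wrapper: one string per company,
# NA when nothing matches, comma-separated symbols when several match.
returnTicker <- function(companies, master) {
  masterClean <- cleanName(master$Name)
  vapply(companies, function(company) {
    hits <- master$Symbol[grepl(cleanName(company), masterClean,
                                ignore.case = TRUE)]
    if (length(hits) == 0) NA_character_ else paste(hits, collapse = ', ')
  }, character(1), USE.NAMES = FALSE)
}

returnTicker(c('Apple Corporation', 'KPMG', 'KPMG Inc.'), master)
```

Note that the cleaned company name is still used as a regular expression by `grepl()`, so names containing characters such as `(` or `+` would need escaping first.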
    

    Hope this helps, or at least puts you on the right path.
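
If the substring approach misses names because of small spelling differences, a fuzzy fallback can be sketched with base R's `adist()` (edit distance). This is an assumption-laden illustration rather than part of the approach above: `fuzzyTicker()` is a hypothetical helper, the reference vectors are made up for the example, and the `maxDist = 2` cutoff is an arbitrary choice you would need to tune on your own data.

```r
# Hypothetical cleaned reference names and their symbols.
cleanNames <- c('Apple', 'Microsoft', 'Allstate')
symbols    <- c('AAPL', 'MSFT', 'ALL')

# fuzzyTicker() is a hypothetical helper: it picks the reference name with the
# smallest edit distance to the query and rejects it when that distance
# exceeds maxDist (an arbitrary cutoff chosen for this sketch).
fuzzyTicker <- function(company, cleanNames, symbols, maxDist = 2) {
  d <- utils::adist(company, cleanNames, ignore.case = TRUE)
  best <- which.min(d)
  if (d[best] <= maxDist) symbols[best] else NA_character_
}

fuzzyTicker('Microsft', cleanNames, symbols)  # misspelled query, one letter missing
fuzzyTicker('KPMG', cleanNames, symbols)      # no close reference name
```

A tight cutoff matters here: with short names, a distance of 3 or 4 can already link unrelated companies.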