Text Scraping (from EDGAR 10K Amazon) code not working


I have the code below to count occurrences of a specific word list in a financial statement text file (US SEC EDGAR 10-K). I would highly appreciate it if anyone could help me with this. I have manually cross-checked and found the words in the document, but my code is not finding any of them. I am using Python 3.5.3. Thanks in advance.

Given a URL path for an EDGAR 10-K file in .txt format for a company (CIK) in a given year, this code performs a word count:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib.request as urllib2
import time
import csv
import sys

CIK = '0001018724'
Year = '2013'
string_match1 = 'edgar/data/1018724/0001193125-13-028520.txt'
url3 = 'https://www.sec.gov/Archives/' + string_match1
response3 = urllib2.urlopen(url3)
words = [
    'anticipate',
    'believe',
    'depend',
    'fluctuate',
    'indefinite',
    'likelihood',
    'possible',
    'predict',
    'risk',
    'uncertain',
    ]
count = {}  # is a dictionary data structure in Python
for elem in words:
    count[elem] = 0
for line in response3:
    elements = line.split()
    for word in words:
        count[word] = count[word] + elements.count(word)
print(CIK)
print(Year)
print(url3)
print(count)

Here is the script output:

0001018724

2013

https://www.sec.gov/Archives/edgar/data/1018724/0001193125-13-028520.txt

{
    'believe': 0,
    'likelihood': 0,
    'anticipate': 0,
    'fluctuate': 0,
    'predict': 0,
    'risk': 0,
    'possible': 0,
    'indefinite': 0,
    'depend': 0,
    'uncertain': 0,
}

Solution

  • A simplified version of your code seems to work in Python 3.7 with the requests library:

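    import requests

    url = 'https://www.sec.gov/Archives/edgar/data/1018724/0001193125-13-028520.txt'
    response = requests.get(url)

    words = ['anticipate', 'believe', 'depend', 'fluctuate', 'indefinite',
             'likelihood', 'possible', 'predict', 'risk', 'uncertain']

    # Decode the body once, outside the loop, so every search runs on str.
    info = response.text

    count = {}  # maps each word to its number of occurrences
    for elem in words:
        # str.count matches substrings, so 'risk' also counts 'risks'.
        count[elem] = info.count(elem)

    print(count)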
    

    Output:

        {'anticipate': 9, 'believe': 32, 'depend': 39, 'fluctuate': 4, 'indefinite': 15,
         'likelihood': 15, 'possible': 25, 'predict': 6, 'risk': 55, 'uncertain': 38}
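
  • As for why the original code printed all zeros: in Python 3, iterating over a urlopen response yields bytes, so line.split() produces bytes tokens, and a bytes token never compares equal to a str word. The sketch below reproduces that on a small made-up sample (no network needed) and shows one possible fix that decodes each line first. The helper name count_words, the sample lines, and the utf-8 encoding are assumptions for illustration, not part of the original code. Unlike the substring count above, this version counts whole words only and ignores case, so its numbers on the real filing would differ from the output shown.

```python
import re

WORDS = ['anticipate', 'believe', 'depend', 'fluctuate', 'indefinite',
         'likelihood', 'possible', 'predict', 'risk', 'uncertain']


def count_words(lines, words, encoding='utf-8'):
    """Count whole-word, case-insensitive occurrences of `words` in `lines`.

    `lines` may yield bytes (as a urlopen response does) or str.
    The default encoding is an assumption; adjust it for the actual file.
    """
    counts = dict.fromkeys(words, 0)
    for line in lines:
        if isinstance(line, bytes):
            # Decode first: b'risk' never compares equal to 'risk'.
            line = line.decode(encoding, errors='replace')
        # Lowercase the line and split it into alphabetic tokens.
        for token in re.findall(r'[a-z]+', line.lower()):
            if token in counts:
                counts[token] += 1
    return counts


# Hypothetical sample standing in for the lines of the filing.
sample = [b'We believe the risk is possible.\n',
          b'Risk factors: risks fluctuate.\n']

# The original approach: bytes tokens compared against a str word.
print(sample[0].split().count('risk'))  # 0

# Decoding first finds the words ('risks' is not counted as 'risk').
print(count_words(sample, WORDS))
```

    For the real filing, the urlopen response can be passed directly as lines, for example count_words(urllib2.urlopen(url3), words). Note that SEC asks automated clients to declare a User-Agent header, so a bare request may be rejected.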