python html beautifulsoup jpeg

How to ignore embedded jpeg image data in HTML using BeautifulSoup's getText() method for SEC website


I'm downloading 8-K filings from the SEC website and trying to extract all the text data for sentiment analysis. The problem is that getText() also picks up the embedded JPEG image data and treats it as text.

Here is the URL to the filing; saving the file as .html will let you view it in a browser: https://www.sec.gov/Archives/edgar/data/2488/0000002488-18-000043.txt

The only solution I have come up with so far is a multi-pass one: I call soup.findAll('html') to get the various HTML blocks, then call getText() on each block. I have to iterate a few times to capture all the HTML. But this method ignores this data in the file, which I need. To fix this, I first have to pull it out before running soup.getText(). I was wondering if there is a simpler, cleaner way of doing this.

Thanks!

<SEC-DOCUMENT>0000002488-18-000043.txt : 20180227
<SEC-HEADER>0000002488-18-000043.hdr.sgml : 20180227
<ACCEPTANCE-DATETIME>20180227163108
ACCESSION NUMBER:       0000002488-18-000043
CONFORMED SUBMISSION TYPE:  8-K
PUBLIC DOCUMENT COUNT:      19
CONFORMED PERIOD OF REPORT: 20180227
ITEM INFORMATION:       Results of Operations and Financial Condition
ITEM INFORMATION:       Regulation FD Disclosure
ITEM INFORMATION:       Financial Statements and Exhibits
FILED AS OF DATE:       20180227
DATE AS OF CHANGE:      20180227

FILER:

    COMPANY DATA:   
        COMPANY CONFORMED NAME:         ADVANCED MICRO DEVICES INC
        CENTRAL INDEX KEY:          0000002488
        STANDARD INDUSTRIAL CLASSIFICATION: SEMICONDUCTORS & RELATED DEVICES [3674]
        IRS NUMBER:             941692300
        STATE OF INCORPORATION:         DE
        FISCAL YEAR END:            1227

    FILING VALUES:
        FORM TYPE:      8-K
        SEC ACT:        1934 Act
        SEC FILE NUMBER:    001-07882
        FILM NUMBER:        18645526

    BUSINESS ADDRESS:   
        STREET 1:       2485 AUGUSTINE DRIVE
        CITY:           SANTA CLARA
        STATE:          CA
        ZIP:            95054
        BUSINESS PHONE:     (408) 749-4000

    MAIL ADDRESS:   
        STREET 1:       2485 AUGUSTINE DRIVE
        CITY:           SANTA CLARA
        STATE:          CA
        ZIP:            95054
</SEC-HEADER>
<DOCUMENT>
<TYPE>8-K
<SEQUENCE>1
<FILENAME>a6form8-kasc606disclosurev.htm
<DESCRIPTION>8-K
<TEXT>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>

Solution

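    import requests
    from bs4 import BeautifulSoup
    
    r = requests.get(
        'https://www.sec.gov/Archives/edgar/data/2488/0000002488-18-000043.txt')
    
    # Parse the whole submission, then read text only from the <html> blocks,
    # which leaves out the encoded image data stored outside them.
    soup = BeautifulSoup(r.text, 'html.parser')
    
    for item in soup.find_all('html'):
        print(item.get_text("\n", strip=True))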
    

    Output:

    Document
    UNITED STATES
    SECURITIES AND EXCHANGE COMMISSION
    Washington, D.C. 20549
    ____________________
    FORM 8-K
    CURRENT REPORT
    Pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934        
    February 27, 2018
    Date of Report (Date of earliest event reported)
    ADVANCED MICRO DEVICES, INC.
    (Exact name of registrant as specified in its charter)
    Delaware
    001-07882
    94-1692300
    (State of Incorporation)
    (Commission File Number)
    (IRS Employer
    Identification Number)
    2485 Augustine Drive
    Santa Clara, California  95054
    (Address of principal executive offices)  (Zip Code)
    (408) 749-4000
    (Registrant’s telephone number, including area code)
    N/A
    (Former Name or Former Address, if Changed Since Last Report)
    Check the appropriate box below if the Form 8-K filing is intended to simultaneously satisfy the filing obligation of the registrant under any of the following provisions:
    ¨
    Written communications pursuant to Rule 425 under the Securities Act (17 CFR 230.425)
    ¨
    Soliciting material pursuant to Rule 14a-12 under the Exchange Act (17 CFR 240.14a-12)
    ¨
    Pre-commencement communications pursuant to Rule 14d-2(b) under the Exchange Act (17 CFR 240.14d-2(b))
    ¨
    Pre-commencement communications pursuant to Rule 13e-4(c) under the Exchange Act (17 CFR 240.13e-4(c))
    Indicate by check mark whether the registrant is an emerging growth company as defined in Rule 405 of the Securities Act of 1933 (§230.405 of this chapter) 
    or Rule 12b-2 of the Securities Exchange Act of 1934 (§240.12b-2 of this chapter).  Emerging growth company
    ¨
    If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or 
    revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act.
    ¨
    Item 2.02. Results of Operation and Financial Condition.
    Advanced Micro Devices, Inc. (the “Company”) is furnishing in Exhibit 99.1 consolidated statements of operations for 2016 and 2017, quarterly consolidated statements of operations for 2017, segment information for 2016 and 2017, quarterly segment information for 2017, consolidated balance sheets for 2016 and 2017, quarterly consolidated balance sheets for 2017, consolidated statements of cash flows - operating activities for 2016 and 2017, and quarterly consolidated statements of cash flows - operating activities for 2017, associated with the new accounting standard for revenue recognition, ASU No. 2014-09,
    Revenue from Contracts with Customers: Topic 606
    (“ASC 606”).
    Item 7.01 Regulation FD Disclosure.
    The Company adopted ASC 606 in the first quarter of 2018. The Company is furnishing Exhibit 99.1 as supplemental information regarding ASC 606.
    To supplement the Company’s financial results presented on a U.S. Generally Accepted Accounting Principles (“GAAP”) basis, the Company’s Exhibit 99.1 contains non-GAAP financial measures, including non-GAAP gross margin, non-GAAP operating expenses, non-GAAP research and development and marketing, general and administrative expenses, non-GAAP operating income (loss), non-GAAP interest expense, non-GAAP other income (expense), non-GAAP provision (benefit) for income taxes, non-GAAP net income (loss), non-GAAP earnings (loss) per share and free cash flow. The Company believes that the supplemental non-GAAP financial measures assist investors in comparing the Company's core performance by excluding items that it believes are not indicative of the Company’s underlying operating performance. The Company cautions investors to carefully evaluate the financial results calculated in accordance with GAAP and the supplemental non-GAAP financial measures and reconciliations. The Company’s non-GAAP financial measures are not intended to be considered in isolation and are not a substitute 
    for, or superior to, financial measures calculated in accordance with GAAP.   
    The information in this report furnished pursuant to Items 2.02 and 7.01, including Exhibit 99.1 attached hereto, shall not be deemed “filed” for the purposes of Section 18 of the Securities Exchange Act of 1934, as amended (the “Exchange Act”), or otherwise subject to the liabilities of that section. It may only be incorporated by reference in another filing under the Exchange Act or the Securities Act of 1933, as amended (the "Securities Act"), if such subsequent filing specifically references the information furnished pursuant to Items 2.02 and 7.01 of this Current Report on Form 8-K.
    Forward Looking Statements.
    This Current Report on Form 8-K, including its exhibits, contains “forward-looking” statements within the meaning of Section 21E of the Exchange Act and Section 27A of the Securities Act. Forward-looking statements reflect current expectations and projections about future events, including AMD’s expectations regarding its financial outlook for fiscal 2018, AMD’s focus on growing revenue 
    and increasing profitability in fiscal 2018, and AMD's expected timing of the 
    completion of deliverables for a development and intellectual property licensing agreement and the ability of AMD to recognize revenue under such agreement 
    at the expected time, and thus involve uncertainty and risk. It is possible that future events may differ from expectations due to a variety of risks and other factors such as those described in AMD’s Annual Report on Form 10-K for the fiscal year ended December 30, 2017, as filed with the U.S. Securities and Exchange Commission. It is not possible to foresee or identify all such factors. Any forward-looking statements in this Current Report on Form 8-K, including its exhibits, are based on certain assumptions and an analyses made in light 
    of AMD’s experience and perception of historical trends, current conditions, expected future developments, and other factors it believes are appropriate in 
    the circumstances. Forward-looking statements are not a guarantee of future performance and actual results or developments may differ materially from expectations. AMD does not intend to update any particular forward-looking statements contained in this Current Report on Form 8-K and its exhibits.
    Item 9.01 Financial Statements and Exhibits.
    (d) Exhibits.
    EXHIBIT INDEX
    Exhibit No.
    Description
    99.1
    AMD Adoption of ASC 606 Revenue Recognition Accounting Standard - February 27, 2018
    SIGNATURES
    Pursuant to the requirements of the Securities Exchange Act of 1934, as amended, the registrant has duly caused this report to be signed on its behalf by the undersigned hereunto duly authorized.
    Date: February 27, 2018
    ADVANCED MICRO DEVICES, INC.
    By:
    /s/ Harry A. Wolin
    Name:
    Harry A. Wolin
    Title:
    Senior Vice President, General Counsel and
    Corporate Secretary
    Exhibit
    AMD Adoption of ASC 606 Revenue Recognition Accounting Standard
    February 27, 2018
    Reconciliation for all non-GAAP financial measures discussed in this document 
    to the most directly comparable GAAP financial measures is included below     
    .
    New Revenue Recognition Accounting Standard
    AMD adopted the new revenue recognition accounting standard, ASC 606, effective Q1 2018.  ASC 606 is effective for all public companies for annual reporting periods beginning after December 15, 2017.
    We adopted the new revenue recognition accounting standard under the “full retrospective” method, meaning that adjusted financials for 2016 and 2017 are being provided as though ASC 606 was effective in those prior periods.  This method of adoption makes it easier for investors as we provide 2018 guidance, actual results going forward and comparative prior results under one consistent standard.  There is no change to our underlying business guidance under the new 
    standard and we remain focused on growing revenue and increasing profitability in 2018.  From Q1 2018 onwards all AMD financial results will be reported under the new revenue recognition accounting standard with prior period financial results adjusted for ASC 606 as provided in this document.
    The new revenue accounting standard primarily impacts AMD revenue recognition 
    for:
    •
    channel shipments on a sell-in basis (CPUs and GPUs),
    •
    inventory of custom products with a non-cancellable purchase order (semi-custom products), and
    •
    transactions that involve development and licensing agreements.
    AMD Adoption of ASC 606 Revenue Recognition Standard
    Page
    1
    February 27, 2018
    Under the new standard, revenue from sales to distributors will be recognized 
    as revenue upon the shipment of the product to the distributors (sell-in).    
    •
    Previously revenue recognition of sales to distributors was upon reported resale of the product by the distributors to their customers (sell-through).      
    Semi-custom products under non-cancellable purchases orders will be recognized as revenue based on the value of the inventory and expected margin.
    •
    Previously semi-custom product revenue was recognized upon shipment.
    Revenue associated with certain development and intellectual property licensing agreements will be recognized upon transfer of control of the intellectual property license.
    •
    Previously the fair value of these agreements was divided into an R&D credit for specific development work as the expenses were incurred and licensing revenue upon completion of the deliverables.
    Revenue recognition related to all other revenue streams remains substantially unchanged under the new standard.
    Summary of ASC 606 Impact to 2016 Financials
    •
    2016 GAAP and Non-GAAP Results:
    •
    2016
    revenue
    is $47 million higher driven by a net build in channel and semi-custom product inventory.
    •
    2016
    gross margin
    percentage does not change and gross margin
    dollars increase by $5 million due to higher revenue.
    •
    There is no impact to
    net loss per share
    .
    AMD Adoption of ASC 606 Revenue Recognition Standard
    Page
    2
    February 27, 2018
    Summary of ASC 606 Impact to 2017 Financials
    •
    2017 GAAP and non-GAAP Results:
    •
    2017
    revenue
    is $76 million lower driven by a net drain in channel and semi-custom product 
    inventory.
    •
    Revenue in each of the quarters in 2017 is adjusted based on whether there is 
    a net drain or net build of channel and semi-custom product inventory.        
    •
    2017
    gross margin
    percentage does not change and gross margin
    dollars decrease by $36 million due primarily to lower channel revenue.       
    •
    Gross margin dollars for each quarter in 2017 are adjusted based on higher or 
    lower channel and semi-custom product revenue
    .
    •
    Operating expenses
    (OPEX) for 2017 are higher by $41 million primarily due to the absence of $36 
    million of R&D credits related to a development and intellectual property licensing agreement signed in 2017.  It is expected that the deliverables for this agreement will be completed in 2018 and revenue will be recognized upon transfer of the license.  Marketing, general and administrative expenses increase slightly due to a shift in the timing of recognition of marketing fund expenses.
    •
    OPEX for each quarter in 2017
    increases primarily due to the absence of R&D credits.
    •
    Provision (benefit) for income taxes
    adjustment for 2017 relates to the reduction of withholding tax expense associated with the absence of R&D credits.
    •
    Q3 and Q4 2017 taxes were also impacted by ASC 606 adjustments.
    •
    Earnings (loss) per share
    for 2017 is lower by $0.07 due to the impact of lower gross margin dollars of 
    approximately $(0.035) as a result of lower revenue and the impact of the absence of $36 million of R&D credits of approximately $(0.035).
    AMD Adoption of ASC 606 Revenue Recognition Standard
    Page
    3
    February 27, 2018
    •
    Earnings (loss) per share for each quarter in 2017 is adjusted based primarily on changes to operating income (loss).
    Summary of the impact of ASC 606 on Reportable Segments in 2016 and 2017      
    Computing and Graphics:
    •
    Revenue
    increases in 2016 by $21 million due to a net build in channel inventory.  Revenue is lower in 2017 by $52 million due to a net drain in channel inventory. 
    •
    Revenue in each of the quarters for 2017 is adjusted based on whether there is a net increase or decrease in channel revenue
    .
    •
    Operating Income (Loss)
    decreases $5 million in 2016 primarily due to slightly higher operating expenses.  Operating income (loss) decreases $55 million in 2017 primarily due to lower revenue and the absence of R&D credits.
    •
    Operating income (loss) in each of the quarters for 2017 is adjusted based on 
    the impact of revenue and operating expenses and by the absence of R&D credits
    .
    Enterprise, Embedded and Semi-Custom:
    •
    Revenue
    increases $26 million in 2016 due primarily to an increase in semi-custom product revenue and decreases $24 million in 2017 due primarily to a decrease in semi-custom product revenue.
    •
    Revenue in each of the quarters for 2017 is adjusted based on whether there is a net increase or decrease in semi-custom product revenue
    .
    •
    Operating Income (Loss)
    increases $4 million in 2016 and decreases $22 million in 2017 primarily due to the impact of semi-custom product revenue recognition.  In addition, 2017 is impacted by the absence of R&D credits.
    AMD Adoption of ASC 606 Revenue Recognition Standard
    Page
    4
    February 27, 2018
    •
    Operating income (loss) in each of the quarters for 2017 is adjusted based on 
    the impact of revenue and the absence of R&D credits
    .
    Summary of the key impact on Balance Sheet items under ASC 606 for Annual 2016 & Annual and Quarterly 2017
    •
    Accounts receivable
    increases in all periods primarily due to the acceleration in timing of semi-custom product revenue
    .
    •
    Inventory
    decreases in all periods primarily due to the acceleration in timing of semi-custom product revenue
    .
    •
    Other current liabilities
    increases throughout 2017 due to the reclassification of R&D credits to the balance sheet as deferred revenue.  There is no change to 2016.
    •
    Deferred income on shipments to distributors
    line item, which represented the deferral of income for shipments to distributors previously recognized as revenue upon reported sale by our distributors (sell
    -
    through), goes away under ASC 606 as channel revenue is recognized upon shipment (sell-in) under ASC 606.
    2016 and 2017 Cash Flow Statements
    There is no impact on cash flow during any period from the adoption of ASC 606.
    In summary the ASC 606 adjusted 2016 and 2017 financial results, provided in this document, reflect the effects of this new revenue recognition accounting standard. There is no change to our underlying business guidance under the new 
    standard and we remain focused on growing revenue and increasing profitability in 2018.
    Investor Contacts:
    Ruth Cotter
    Laura Graves
    Alina Ostrovsky
    408-749-3887
    408-749-5467
    408-749-6688
    ruth.cotter@amd.com
    laura.graves@amd.com
    alina.ostrovsky@amd.com
    AMD Adoption of ASC 606 Revenue Recognition Standard
    Page
    5
    February 27, 2018
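
If the `<SEC-HEADER>` data is needed as well, one alternative is to split the raw submission on its wrapper tags before any HTML parsing. The sketch below is not part of the original answer and rests on several assumptions: that every embedded file sits in its own `<DOCUMENT>...</DOCUMENT>` block with a `<TYPE>` line, that its payload is wrapped in `<TEXT>...</TEXT>`, and that images are marked `<TYPE>GRAPHIC`. Only the opening tags are visible in the excerpt in the question, so check the closing tags and type names against your own filings. The names `SubmissionDoc`, `extract_header` and `extract_documents` are hypothetical helpers made up for this example.

```python
import re
from typing import List, NamedTuple, Tuple


class SubmissionDoc(NamedTuple):
    doc_type: str   # value of the <TYPE> line, e.g. "8-K"
    filename: str   # value of the <FILENAME> line, may be empty
    body: str       # raw payload found between <TEXT> and </TEXT>


# Assumption: wrapper tags are upper-case and properly closed.
_HEADER_RE = re.compile(r"<SEC-HEADER>(.*?)</SEC-HEADER>", re.DOTALL)
_DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
_TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def _field(tag: str, block: str) -> str:
    """Return the rest of the line after the first <tag>, or ''."""
    match = re.search(r"<%s>([^\r\n<]*)" % re.escape(tag), block)
    return match.group(1).strip() if match else ""


def extract_header(raw: str) -> str:
    """Return the text inside <SEC-HEADER>...</SEC-HEADER>, or ''."""
    match = _HEADER_RE.search(raw)
    return match.group(1).strip() if match else ""


def extract_documents(
    raw: str, skip_types: Tuple[str, ...] = ("GRAPHIC",)
) -> List[SubmissionDoc]:
    """Return every <DOCUMENT> payload whose <TYPE> is not in skip_types."""
    docs = []
    for block in _DOC_RE.findall(raw):
        doc_type = _field("TYPE", block)
        if doc_type.upper() in skip_types:
            continue  # assumed to be encoded image data, not readable text
        text = _TEXT_RE.search(block)
        if text:
            docs.append(
                SubmissionDoc(doc_type, _field("FILENAME", block),
                              text.group(1).strip())
            )
    return docs
```

Each `body` returned by `extract_documents(r.text)` can then be passed to `BeautifulSoup(body, 'html.parser').get_text("\n", strip=True)` as in the snippet above, while `extract_header(r.text)` returns the header block as plain text. Filtering on the declared type keeps the image payload away from the parser entirely, at the cost of depending on the wrapper layout assumed here.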