Parsing XBRL using Python

I am working on parsing values from XBRL. My code is based on the python-xbrl package, which I changed a bit to suit my needs. The package uses beautifulsoup4.

I am using the code below to find one value that I am interested in. I use an if statement because different companies use different tag names for the same thing.

For example AAPL uses us-gaap:CostOfGoodsAndServicesSold, and ADBE uses us-gaap:CostOfRevenue.

This code works as intended, giving me the correct value, 29924000000:

    #COST_GOOD_SOLD
    COST_GOOD_SOLD = xbrl.find_all(name=re.compile("(us-gaap:CostOfGoodsAndServicesSold$)",
                                                   re.IGNORECASE | re.MULTILINE))
    gaap_obj.COST_GOOD_SOLD = self.data_processing(COST_GOOD_SOLD, xbrl, ignore_errors,
                                                   logger, context_ids)
    if gaap_obj.COST_GOOD_SOLD ==0 or gaap_obj.COST_GOOD_SOLD==None:
        COST_GOOD_SOLD = xbrl.find_all(name=re.compile("(us-gaap:CostOfRevenue$)",
                                                           re.IGNORECASE | re.MULTILINE))
        gaap_obj.COST_GOOD_SOLD = self.data_processing(COST_GOOD_SOLD, xbrl, ignore_errors,
                                                       logger, context_ids)

XBRL

<us-gaap:CostOfGoodsAndServicesSold contextRef="eol_PE2035----1510-Q0008_STD_91_20150627_0" unitRef="iso4217_USD" decimals="-6" id="id_5025426_2D2AD7F5-3575-48A0-9F08-7F1EBE173C23_1_1">29924000000</us-gaap:CostOfGoodsAndServicesSold>

This code returns zero, while I want -1808000000:

    #NET_CURR_DEBT
    NET_CURR_DEBT = xbrl.find_all(name = re.compile("(us-gaap:ProceedsFromRepaymentsOfCommercialPaper$)",
                                                    re.IGNORECASE | re.MULTILINE))
    gaap_obj.NET_CURR_DEBT = self.data_processing(NET_CURR_DEBT, xbrl, ignore_errors,
                                                  logger, context_ids)
    if NET_CURR_DEBT==0 or NET_CURR_DEBT==None:
        NET_CURR_DEBT = xbrl.find_all(name = re.compile("(us-gaap:RepaymentsOfLongTermDebtAndCapitalSecurities$)",
                                                        re.IGNORECASE | re.MULTILINE))
        gaap_obj.NET_CURR_DEBT = self.data_processing(NET_CURR_DEBT, xbrl, ignore_errors,
                                                      logger, context_ids)

XBRL

<us-gaap:ProceedsFromRepaymentsOfCommercialPaper contextRef="eol_PE2035----1510-Q0008_STD_273_20150627_0" unitRef="iso4217_USD" decimals="-6" id="id_5025426_049B4F11-216C-4D4B-A41F-32F1F55F967F_1_32">-1808000000</us-gaap:ProceedsFromRepaymentsOfCommercialPaper>

I have several other values that I am parsing, but they all have the same structure as the code I have attached. My output is a dataframe where the first column holds the value names (COST_GOOD_SOLD, NET_CURR_DEBT, etc.) and the second column holds the values from the XML file.

I can't figure out why two seemingly identical chunks of code behave differently. It seems that I am doing the same thing in both cases: finding a value and storing it.

Solution

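One difference is that the if statement checks gaap_obj.COST_GOOD_SOLD (the processed value) in the first case, but just NET_CURR_DEBT (the raw find_all result) in the second. The raw result is a list-like object, so it is never equal to 0 or None, and the fallback lookup in the second block never runs.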
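To make the fallback harder to get wrong, the "try one tag, then the next" logic could be pulled into a small helper that always tests the processed value. This is only a sketch: first_usable_value and extract are hypothetical names, and extract stands in for the find_all plus self.data_processing pair from the question, whose real behaviour is not shown.

```python
def first_usable_value(candidates, extract):
    """Return the first extracted value that is neither None nor 0, else 0."""
    for name in candidates:
        value = extract(name)
        if value is not None and value != 0:
            return value
    return 0


# Stand-in for "find_all + data_processing"; the value comes from the
# XBRL snippet in the question.
facts = {"us-gaap:ProceedsFromRepaymentsOfCommercialPaper": -1808000000}


def extract(name):
    return facts.get(name)


net_curr_debt = first_usable_value(
    [
        "us-gaap:ProceedsFromRepaymentsOfCommercialPaper",
        "us-gaap:RepaymentsOfLongTermDebtAndCapitalSecurities",
    ],
    extract,
)

# A raw search result is list-like, so it never compares equal to 0 or None,
# even when it is empty. Testing it instead of the processed value means the
# fallback branch can never be taken.
raw_result = []
fallback_would_run = raw_result == 0 or raw_result is None
```

With a helper like this, each field becomes a single call with its list of candidate tag names, so the condition cannot drift between copies of the block.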

It's hard to comment further without seeing what self.data_processing actually does, but does your code cope with the fact that the same element may appear multiple times in an XBRL document (differentiated by different contexts)?
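As a rough illustration of the context issue, the sketch below collects every occurrence of one element keyed by its contextRef, then filters by a set of wanted context ids. It uses the standard library's xml.etree.ElementTree rather than the question's code, and the namespace URI, the context ids, and the second value are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed instance document. The namespace URI, the
# context ids and the second value are invented for illustration only.
DOC = """<xbrl xmlns:us-gaap="http://fasb.org/us-gaap/2015-01-31">
  <us-gaap:ProceedsFromRepaymentsOfCommercialPaper contextRef="ctx_9_months" unitRef="iso4217_USD" decimals="-6">-1808000000</us-gaap:ProceedsFromRepaymentsOfCommercialPaper>
  <us-gaap:ProceedsFromRepaymentsOfCommercialPaper contextRef="ctx_prior_9_months" unitRef="iso4217_USD" decimals="-6">123000000</us-gaap:ProceedsFromRepaymentsOfCommercialPaper>
</xbrl>"""

NS = {"us-gaap": "http://fasb.org/us-gaap/2015-01-31"}


def facts_by_context(root, qname):
    """Map contextRef -> integer value for every occurrence of qname."""
    return {el.get("contextRef"): int(el.text) for el in root.findall(qname, NS)}


root = ET.fromstring(DOC)
values = facts_by_context(root, "us-gaap:ProceedsFromRepaymentsOfCommercialPaper")

# If the wanted context ids do not include the one this fact uses, the filter
# silently yields nothing, which a caller might then report as zero.
wanted = {"ctx_3_months"}
matched = {ctx: v for ctx, v in values.items() if ctx in wanted}
```

Note that the two facts in the question use different contexts (one id contains STD_91, the other STD_273), so it may be worth checking whether context_ids includes the second one.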

As I commented on your previous question (Reading xbrl with python), I wouldn't recommend BeautifulSoup for parsing XBRL, as its namespace support is incomplete. You'd be better off with a proper XBRL library, which will also take care of processing the contexts etc. for you.
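To show why namespace handling matters, here is a small sketch using the standard library's xml.etree.ElementTree (not an XBRL library). The prefix in a document is only an alias for a namespace URI, so a filer is free to write gaap: instead of us-gaap:, and a regex on the literal prefix would miss it. The namespace URI and both documents are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

US_GAAP = "http://fasb.org/us-gaap/2015-01-31"  # assumed taxonomy namespace

# Two hypothetical documents that bind different prefixes to the same namespace.
doc_a = f'<xbrl xmlns:us-gaap="{US_GAAP}"><us-gaap:CostOfRevenue>1</us-gaap:CostOfRevenue></xbrl>'
doc_b = f'<xbrl xmlns:gaap="{US_GAAP}"><gaap:CostOfRevenue>1</gaap:CostOfRevenue></xbrl>'


def qualified_tags(xml_text):
    """Return the namespace-qualified tag of each top-level child element."""
    return [child.tag for child in ET.fromstring(xml_text)]


# A namespace-aware parser resolves both prefixes to the same qualified name.
tags_a = qualified_tags(doc_a)
tags_b = qualified_tags(doc_b)
```

A dedicated XBRL library goes further than this by resolving contexts, units, and the taxonomy as well.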