I am working on parsing values from XBRL. My code is based on the python-xbrl package, which I changed a bit to suit my needs. The package uses beautifulsoup4.
I am using the code below to find one value that I am interested in. I use an if statement because different companies use different tag names for the same thing. For example, AAPL uses us-gaap:CostOfGoodsAndServicesSold, and ADBE uses us-gaap:CostOfRevenue.
#COST_GOOD_SOLD
COST_GOOD_SOLD = xbrl.find_all(name=re.compile("(us-gaap:CostOfGoodsAndServicesSold$)",
                               re.IGNORECASE | re.MULTILINE))
gaap_obj.COST_GOOD_SOLD = self.data_processing(COST_GOOD_SOLD, xbrl, ignore_errors,
                                               logger, context_ids)
if gaap_obj.COST_GOOD_SOLD ==0 or gaap_obj.COST_GOOD_SOLD==None:
    COST_GOOD_SOLD = xbrl.find_all(name=re.compile("(us-gaap:CostOfRevenue$)",
                                   re.IGNORECASE | re.MULTILINE))
    gaap_obj.COST_GOOD_SOLD = self.data_processing(COST_GOOD_SOLD, xbrl, ignore_errors,
                                                   logger, context_ids)
<us-gaap:CostOfGoodsAndServicesSold contextRef="eol_PE2035----1510-Q0008_STD_91_20150627_0" unitRef="iso4217_USD" decimals="-6" id="id_5025426_2D2AD7F5-3575-48A0-9F08-7F1EBE173C23_1_1">29924000000</us-gaap:CostOfGoodsAndServicesSold>
#NET_CURR_DEBT
NET_CURR_DEBT = xbrl.find_all(name = re.compile("(us-gaap:ProceedsFromRepaymentsOfCommercialPaper$)",
                              re.IGNORECASE | re.MULTILINE))
gaap_obj.NET_CURR_DEBT = self.data_processing(NET_CURR_DEBT, xbrl, ignore_errors,
                                              logger, context_ids)
if NET_CURR_DEBT==0 or NET_CURR_DEBT==None:
    NET_CURR_DEBT = xbrl.find_all(name = re.compile("(us-gaap:RepaymentsOfLongTermDebtAndCapitalSecurities$)",
                                  re.IGNORECASE | re.MULTILINE))
    gaap_obj.NET_CURR_DEBT = self.data_processing(NET_CURR_DEBT, xbrl, ignore_errors,
                                                  logger, context_ids)
<us-gaap:ProceedsFromRepaymentsOfCommercialPaper contextRef="eol_PE2035----1510-Q0008_STD_273_20150627_0" unitRef="iso4217_USD" decimals="-6" id="id_5025426_049B4F11-216C-4D4B-A41F-32F1F55F967F_1_32">-1808000000</us-gaap:ProceedsFromRepaymentsOfCommercialPaper>
I have several other values that I am parsing, but they all have the same structure as the code above.
My output is a dataframe where the first column holds the value names (COST_GOOD_SOLD, NET_CURR_DEBT, etc.) and the second column holds the values from the XML file.
I can't figure out why identical chunks of code don't work. It seems that I am doing the same thing in both cases: finding a value and storing it.
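One difference is that the if statement checks gaap_obj.COST_GOOD_SOLD (the processed value) in the first case, but just NET_CURR_DEBT (the raw find_all result) in the second. A find_all result is a list of tags, so it never equals 0 or None, and the fallback lookup in the second block never runs.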
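One way to avoid this kind of slip is to put the "try each tag name until one gives a usable value" logic in a single helper, so the check is always made on the processed value. The following is only a minimal sketch: first_usable_value and extract are hypothetical names, extract stands in for your find_all plus self.data_processing step, and the numbers in sample_facts are made up.

```python
from typing import Callable, Iterable, Optional


def first_usable_value(
    tag_names: Iterable[str],
    extract: Callable[[str], Optional[float]],
) -> Optional[float]:
    """Return the first processed value that is neither None nor 0."""
    for tag_name in tag_names:
        value = extract(tag_name)
        # Test the processed value, not the raw search result.
        if value is not None and value != 0:
            return value
    return None


# Made-up processed values, standing in for find_all + data_processing.
sample_facts = {
    "us-gaap:ProceedsFromRepaymentsOfCommercialPaper": 0,
    "us-gaap:RepaymentsOfLongTermDebtAndCapitalSecurities": 2500000000,
}


def extract(tag_name: str) -> Optional[float]:
    """Hypothetical lookup; returns None when the tag is absent."""
    return sample_facts.get(tag_name)


net_curr_debt = first_usable_value(
    [
        "us-gaap:ProceedsFromRepaymentsOfCommercialPaper",
        "us-gaap:RepaymentsOfLongTermDebtAndCapitalSecurities",
    ],
    extract,
)
print(net_curr_debt)
```

With a helper like this, each value you parse becomes one call with an ordered list of candidate tag names, and the check cannot drift between blocks.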
It's hard to comment further without seeing what self.data_processing actually does, but does your code cope with the fact that the same element may appear multiple times in an XBRL document (differentiated by different contexts)?
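To illustrate the point about contexts, here is a small sketch using only the standard library's xml.etree.ElementTree. The sample document, the context ids (Q3_2015, YTD_2015), the unit id and the second value are all made up for the example; the namespace URI is assumed to match the taxonomy version your filing declares. It shows one element name appearing twice and being told apart by contextRef.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; use whatever your filing declares for us-gaap.
US_GAAP_NS = "http://fasb.org/us-gaap/2015-01-31"

# Made-up instance fragment: one concept reported for two contexts.
SAMPLE = f"""<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:us-gaap="{US_GAAP_NS}">
  <us-gaap:CostOfGoodsAndServicesSold contextRef="Q3_2015"
      unitRef="USD" decimals="-6">29924000000</us-gaap:CostOfGoodsAndServicesSold>
  <us-gaap:CostOfGoodsAndServicesSold contextRef="YTD_2015"
      unitRef="USD" decimals="-6">100000000000</us-gaap:CostOfGoodsAndServicesSold>
</xbrl>"""

root = ET.fromstring(SAMPLE)

# Collect every occurrence of the element, keyed by its context.
facts_by_context = {}
for fact in root.iter(f"{{{US_GAAP_NS}}}CostOfGoodsAndServicesSold"):
    facts_by_context[fact.get("contextRef")] = int(fact.text)

for context_id, value in facts_by_context.items():
    print(context_id, value)
```

If your processing just takes the first match, or sums all matches, you can end up with the wrong period's figure, so it is worth selecting the fact by the context you actually want.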
As I commented on your previous question (Reading xbrl with python), I wouldn't recommend beautifulsoup for parsing XBRL, as its namespace support is incomplete. You'd be better off with a proper XBRL library, which will also take care of processing the contexts etc. for you.