I have data with 2 million rows and 7 columns. There are also thousands of different companies and indicators.
A sample of the data is:
company_id indicator_id X2011
1 1000045 AccountsPayableAndAccruedLiabilitiesCurrentAndNoncurrent 6612429
2 1000045 AccumulatedDepreciationDepletionAndAmortizationPropertyPlantAndEquipment NA
3 1000045 AdjustmentForAmortization NA
4 1000045 Assets 257236034
5 1000045 AssetsHeldForSaleAtCarryingValue 1373001
6 1000045 AssetsOfDisposalGroupIncludingDiscontinuedOperation NA
X2012 X2013 X2014 X2015
1 7405579 8924919 7841070 5839000
2 2111343 2242703 2236449 2462000
3 -11482251 -13490892 -13852305 -13811000
4 263835468 283429579 302528591 325309000
5 1203664 1696330 NA NA
6 NA NA 1746887 2148000
and some of the indicators I want are:
asset_indicators
[1] "DeferredIncomeTaxExpenseBenefit"
[2] "CashAndCashEquivalentsAtCarryingValue"
[3] "CashAndCashEquivalentsPeriodIncreaseDecrease"
[4] "NetCashProvidedByUsedInOperatingActivities"
[5] "NetCashProvidedByUsedInFinancingActivities"
[6] "NetCashProvidedByUsedInInvestingActivities"
[7] "PropertyPlantAndEquipmentNet"
I am trying to get a new dataframe with just certain rows. Currently, I have this code:
year <-rbind(statement[which(statement$indicator_id==asset_indicators[1]),],
statement[which(statement$indicator_id==asset_indicators[2]),],
statement[which(statement$indicator_id==asset_indicators[3]),],
statement[which(statement$indicator_id==asset_indicators[4]),],
statement[which(statement$indicator_id==asset_indicators[5]),],
statement[which(statement$indicator_id==asset_indicators[6]),],
statement[which(statement$indicator_id==asset_indicators[7]),] )
I'm wondering if there's a way to do this more efficiently? I tried making a for loop but it took hours to run while this takes just a few seconds. It would be nice if it updated automatically if I changed the number of indicators to use.
Does anyone have any ideas how to do this?
Try %in%
:
year <- statement[statement$indicator_id %in% asset_indicators,]