I have a data.frame with 11717 obs. of 15 variables. See below:
$ SCC : Factor w/ 11717 levels "10100101","10100102",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Data.Category : Factor w/ 6 levels "Biogenic","Event",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Short.Name : Factor w/ 11238 levels "","2,4-D Salts and Esters Prod /Process Vents, 2,4-D Recovery: Filtration",..: 3283 3284 3293 3291 3290 3294 3295 3296 3292 3289 ...
$ EI.Sector : Factor w/ 59 levels "Agriculture - Crops & Livestock Dust",..: 18 18 18 18 18 18 18 18 18 18 ...
$ Option.Group : Factor w/ 25 levels "","C/I Kerosene",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Option.Set : Factor w/ 18 levels "","A","B","B1A",..: 1 1 1 1 1 1 1 1 1 1 ...
$ SCC.Level.One : Factor w/ 17 levels "Brick Kilns",..: 3 3 3 3 3 3 3 3 3 3 ...
$ SCC.Level.Two : Factor w/ 146 levels "","Agricultural Chemicals Production",..: 32 32 32 32 32 32 32 32 32 32 ...
$ SCC.Level.Three : Factor w/ 1061 levels "","100% Biosolids (e.g., sewage sludge, manure, mixtures of these matls)",..: 88 88 156 156 156 156 156 156 156 156 ...
$ SCC.Level.Four : Factor w/ 6084 levels "","(NH4)2 SO4 Acid Bath System and Evaporator",..: 4455 5583 4466 4458 1341 5246 5584 5983 4461 776 ...
$ Map.To : num NA NA NA NA NA NA NA NA NA NA ...
$ Last.Inventory.Year: int NA NA NA NA NA NA NA NA NA NA ...
$ Created_Date : Factor w/ 57 levels "","1/27/2000 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Revised_Date : Factor w/ 44 levels "","1/27/2000 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Usage.Notes : Factor w/ 21 levels ""," ","includes bleaching towers, washer hoods, filtrate tanks, vacuum pump exhausts",..: 1 1 1 1 1 1 1 1 1 1 ...
I am trying to make a search for the words "Combustion" and "Coal" and create a subset showing only where "Combustion" and "Coal" are combined in the same sentence OR the same row anywhere in the data.frame:
example of the words used in same sentence:
Fuel Comb - Electric Generation - Coal.
example of the words used in same row / different columns:
see screenshot (I don't have enough creds to attach a img). [screenshot][1]
Using RStudio search shows: 675 results for "Comb" and 251 results for "Coal". So the final combination should be equal or less than 251 if I'm correct.
I tried using grep
and grepl
. However the only way for me to use these functions is to repeat the process across each column before creating the subset (using match
function for instance).
I find this to be a time consuming process. Would you have a better one?
[1]: https://i.sstatic.net/YJr5B.png
I am assuming that by "same sentence" you mean that the words your are searching for are within the same string in one column?
If so, given how I read your description of the problem, I am also assuming you just want to subset the data frame by the rows that contain both words whether they appear in the same sentence or in different columns of the same row. In either case it appears you just want to extract those rows that have both words.
If so, then one way you can do this is to concatenate all the columns together per row into one long string/sentence per row of the data frame and then grepl for your key words in the longer string (using one grepl per word/phrase). This worked for me on ~100k rows quickly (although I reduced the number of columns):
df <- data.frame(A=c("Comb","Comb Fuel","Comb",rep("None",1e5)),B=c("Fuel","Gas","None",rep("None",1e5)))
dfp <- do.call(paste,df)
df[grepl("Comb",dfp) & grepl("Fuel",dfp),]