Creating Multiple Covariates from Partial Matches of a Character Variable in R

I have a large data frame that contains an alphanumeric character variable, specifically it contains information on breed composition, from which I need to create covariates of breed fraction.

The breed composition column contains over 7000 combinations of breeds and is of varying length (i.e. some animals have 2 breeds, some have 10). The breeds are always identified by a two letter code and the fraction of that breed is the coefficient that follows it divided by the sum of all breed coefficients (coeftotal) for that animal.

I'm looking for a method to take the coefficients from this variable (breed) and make covariates corresponding to proportions of 7 specific breeds (SU,DP,RV,RI,CD,PO,HA). There are many more breed codes present in the data and some animals may even have none of the breeds of interest. The data frame contains over 1 million records, and I have not been able to find an efficient solution to my problem that does not involve an endless number of grepl /if else statements for each specific breed code and each coefficient of interest (ex. SU1 to SUx). Furthermore, the problem is complicated by the fact that the coefficients do not sum to the same number. An example of my dataframe and the desired output is below. Any ideas are appreciated!

   id <- c(1:8)
   breed <- c("SU1","DP1RI1","DP1RI1RV1SU1","DP3XX1","SU9RV7","XX1","DP7XX1","SU32RV16DP8RI8")
   sheep <- data.frame(id,breed)

   id    breed           coeftot     SU     DP     RV     RI     CD     PO     HA
   1     SU1             1           1      0      0      0      0      0      0
   2     DP1RI1          2           0      0.5    0      0.5    0      0      0
   3     DP1RI1RV1SU1    4           0.25   0.25   0.25   0.25   0      0      0
   4     DP3XX1          4           0      0.75   0      0      0      0      0
   5     SU9RV7          16          0.5625 0      0.4375 0      0      0      0
   6     XX1             1           0      0      0      0      0      0      0
   7     DP7XX1          8           0.875  0      0      0      0      0      0
   8     SU32RV16DP8RI8  64          0.5    0.125  0.25   0.125  0      0      0

Solution

If you need memory and speed efficiency, the data.table package is good. stringi helps a lot with string manipulation.

library(stringi)

breed_codes <- unique(unlist(stri_extract_all_regex(
  sheep[["breed"]], "[A-Z]+"
)))
breed_codes
# "SU" "DP" "RI" "RV" "XX"
patterns <- sprintf("(?<=%s)\\d+", breed_codes)
patterns
# "(?<=SU)\\d+" "(?<=DP)\\d+" "(?<=RI)\\d+" "(?<=RV)\\d+" "(?<=XX)\\d+"

First we use a regular expression to extract all breed codes in the subject's breed sets, which are sequential capital letters ([A-Z]+). Next, we'll make a regular expression to capture the coefficients for each of them.

We want to capture any number of digits (\\d+) preceded by each breed code ((?<=SU)). We'll go over each breed and assign it a column using the digits captured by the pattern. If the subject's breed set doesn't have a code, then we'll set it as 0.

library(data.table)

setDT(sheep)
set(
  sheep,
  j = breed_codes,
  value = lapply(
    patterns,
    function(pat) {
      digits <- stri_extract_first_regex(sheep[["breed"]], pat)
      digits[is.na(digits)] <- "0"
      breed_coef <- as.integer(digits)
      breed_coef
    }
  )
)

Finally, we just sum up each row's coefficients for the total.

sheep[, coeftot := rowSums(.SD), .SDcols = breed_codes]

If you want to collapse breeds other than the specific seven into a single "other" column, then we'll just identify them, sum them up by row, and remove them from the dataset.

special_breeds <- c("SU", "DP", "RV", "RI", "CD", "PO", "HA")
non_special_breeds <- setdiff(breed_codes, special_breeds)

sheep[, other := rowSums(.SD), .SDcols = non_special_breeds]
set(sheep, j = non_special_breeds, value = NULL)
sheep
#    id          breed SU DP RI RV coeftot other
# 1:  1            SU1  1  0  0  0       1     0
# 2:  2         DP1RI1  0  1  1  0       2     0
# 3:  3   DP1RI1RV1SU1  1  1  1  1       4     0
# 4:  4         DP3XX1  0  3  0  0       4     1
# 5:  5         SU9RV7  9  0  0  7      16     0
# 6:  6            XX1  0  0  0  0       1     1
# 7:  7         DP7XX1  0  7  0  0       8     1
# 8:  8 SU32RV16DP8RI8 32  8  8 16      64     0