Adjust factors in a dataset with a variable number of leading zeros


I have a big data.frame (1.9M records, 20 columns). One of the columns is a factor whose values are strings of digits of varying length (e.g. 567839, 234324324, 3243211). Note: these are numeric codes, not real quantities; for this example they could just as well be character strings of different lengths.

Now I want to convert those factors into 13-digit factors, so that a value gets leading zeros whenever it has fewer than 13 digits.

Example:

Old factor      Length  New factor
432543532532    12      0432543532532
3285087250932   13      3285087250932
464577534       9       0000464577534
2225324324324   13      2225324324324
864235325264    12      0864235325264

I tried different approaches, but now I'm stuck. The problem is that the length of the factor values differs throughout the dataset.

I tried the following, with an example.

First I create a data.frame with three different columns to run my code on, to identify the problem.

> df.test <- as.data.frame(cbind(c("432543532532", "3285087250932", "464577534", "2225324324324", "864235325264"), c("3285087250932", "132543532532", "464577534", "2225324324324", "864235325264"), c("164577534", "3285087250932", "432543532532", "2225324324324", "864235325264")))
> df.test
             V1            V2            V3
1  432543532532 3285087250932     164577534
2 3285087250932  132543532532 3285087250932
3     464577534     464577534  432543532532
4 2225324324324 2225324324324 2225324324324
5  864235325264  864235325264  864235325264

> levels(df.test$V1) <- paste(substr("0000000000000", 0, 13 - nchar(as.character(levels(df.test$V1)))), levels(df.test$V1), sep = '')
> levels(df.test$V2) <- paste(substr("0000000000000", 0, 13 - nchar(as.character(levels(df.test$V2)))), levels(df.test$V2), sep = '')
> levels(df.test$V3) <- paste(substr("0000000000000", 0, 13 - nchar(as.character(levels(df.test$V3)))), levels(df.test$V3), sep = '')
> df.test
             V1             V2                V3
1  432543532532 03285087250932     0000164577534
2 3285087250932  0132543532532 00003285087250932
3     464577534     0464577534  0000432543532532
4 2225324324324 02225324324324 00002225324324324
5  864235325264  0864235325264  0000864235325264

The problem seems to be that nchar(as.character(levels(df.test$V1))) does not use the lengths of all the values in df.test$V1 but just one value: the length of the first level of the factor (levels are sorted alphabetically/ascending). The number of leading zeros needed for that first level is then applied to all records. So the code is not vectorized!

Note: if I run the 'nchar' code separately, it gives me a vector with the lengths of all the levels, so I assumed it should work...

> nchar(as.character(levels(df.test$V1)))
[1] 13 13 12  9 12
> nchar(as.character(levels(df.test$V2)))
[1] 13 14 14 10 13
> nchar(as.character(levels(df.test$V3)))
[1] 13 17 17 16 16

Why isn't nchar(as.character(levels(df.test$V1))) behaving as a vectorized operation? Can anybody tell me how to change my code so that it gives the correct result?

Thanks in advance!

NB. In the real case I only need to perform this adjustment on one column of the data.frame.


Solution

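  • For zero padding you can use sprintf('%04d', 1:5), but the codes in your example need to be converted to numeric first. Because 13-digit values do not fit in R's integer type, use the '%.0f' conversion rather than '%d':

    max.nchar <- max(nchar(levels(df.test$V1)))
    
    # builds the format '%013.0f'; indexing the converted levels by the factor keeps the original row order
    sprintf(paste0('%0', max.nchar, '.0f'), as.numeric(levels(df.test$V1))[df.test$V1])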
    

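    If the codes have to stay character strings, you can combine gsub with sprintf (maybe there is a better way): pad to the target width first, then replace any padding spaces with zeros: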

    gsub(' ', '0', sprintf('%04s', levels(factor(10:15))))
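
As for why the original attempt pads every value by the same amount: the likely cause is substr() rather than nchar(). substr() returns a result of the same length as its first argument, and here that argument is the single string "0000000000000", so only the first stop value is used and one padding string is pasted onto every level. The sketch below illustrates this and shows one possible vectorized alternative using strrep() (available from R 3.3.0). The names codes, pad_once, width and lv are made up for the illustration, and the data is column V3 from the question.

```r
# Example codes as a factor (same values as V3 in the question).
codes <- factor(c("164577534", "3285087250932", "432543532532",
                  "2225324324324", "864235325264"))

# substr() recycles start/stop to the length of x. Here x has length 1,
# so only the first stop value is used and a single string comes back.
pad_once <- substr("0000000000000", 1, 13 - nchar(levels(codes)))
length(pad_once)

# Vectorized alternative: strrep() repeats "0" a different number of
# times for each level; pmax() guards against levels longer than width.
width <- 13
lv <- levels(codes)
levels(codes) <- paste0(strrep("0", pmax(width - nchar(lv), 0)), lv)

as.character(codes)
```

Working on the levels rather than on the 1.9M individual values keeps the amount of string manipulation small, since each distinct code is padded only once. This assumes no two codes become identical after padding.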