I have a big data.frame (1.9M records, with 20 columns). One of the columns is a factor column whose values are digit strings of different lengths (different numbers of characters/digits, e.g. 567839, 234324324, 3243211 etc.).
Note: these are numeric codes, not real numeric values; for this example they could just as well be character strings of different lengths.
Now I want to convert those factors to 13-digit factors, in such a way that a value gets leading zeros when its number of digits is less than 13.
Example:
Old factor      Length   New factor
432543532532    12       0432543532532
3285087250932   13       3285087250932
464577534       9        0000464577534
2225324324324   13       2225324324324
864235325264    12       0864235325264
I tried different approaches, but now I'm stuck. The problem is that the length of the factor values differs throughout the dataset.
I tried the following, with an example.
Create a data.frame with three different columns on which I run my code, to identify the problem.
> df.test <- as.data.frame(cbind(c("432543532532", "3285087250932", "464577534", "2225324324324", "864235325264"), c("3285087250932", "132543532532", "464577534", "2225324324324", "864235325264"), c("164577534", "3285087250932", "432543532532", "2225324324324", "864235325264")), stringsAsFactors = TRUE)
> df.test
             V1            V2            V3
1  432543532532 3285087250932     164577534
2 3285087250932  132543532532 3285087250932
3     464577534     464577534  432543532532
4 2225324324324 2225324324324 2225324324324
5  864235325264  864235325264  864235325264
> levels(df.test$V1) <- paste(substr("0000000000000", 0, 13 - nchar(as.character(levels(df.test$V1)))), levels(df.test$V1), sep = '')
> levels(df.test$V2) <- paste(substr("0000000000000", 0, 13 - nchar(as.character(levels(df.test$V2)))), levels(df.test$V2), sep = '')
> levels(df.test$V3) <- paste(substr("0000000000000", 0, 13 - nchar(as.character(levels(df.test$V3)))), levels(df.test$V3), sep = '')
> df.test
             V1             V2                V3
1  432543532532 03285087250932     0000164577534
2 3285087250932  0132543532532 00003285087250932
3     464577534     0464577534  0000432543532532
4 2225324324324 02225324324324 00002225324324324
5  864235325264  0864235325264  0000864235325264
The problem seems to be that the code nchar(as.character(levels(df.test$V1))) does not use the lengths of all values in df.test$V1, but just one value: the length of the first level of the factor (levels are sorted in ascending order). The number of leading zeros needed for that first level is then added to all records. So the operation is not vectorized!
Note: if I run the 'nchar' code separately, it gives me a vector of the lengths of all the levels as a result, so I assumed it should work...
> nchar(as.character(levels(df.test$V1)))
[1] 13 13 12 9 12
> nchar(as.character(levels(df.test$V2)))
[1] 13 14 14 10 13
> nchar(as.character(levels(df.test$V3)))
[1] 13 17 17 16 16
Why isn't nchar(as.character(levels(df.test$V1))) running as a vectorized operation?
Can anybody tell me how to change my code, so it will have the correct result?
Thanks in advance!
NB. Note that in the real case I only need to perform this adjustment on one column of the data.frame.
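For zero padding you can use sprintf('%04d', 1:5), but the codes in your example need to be numeric. Because 13-digit codes are too large for R's integer type, format them as doubles with '.0f' rather than 'd':
max.nchar <- max(nchar(levels(df.test$V1)))
sprintf(paste0('%0', max.nchar, '.0f'), as.numeric(levels(df.test$V1))[df.test$V1])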
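As a minimal sketch of the sprintf approach applied to the factor levels themselves (so each distinct code is padded only once, which matters with 1.9M rows), something like the following should work. The names codes, width and padded_levels are made up for this illustration, and the factor is built explicitly because recent R versions no longer create factors by default. It assumes the codes are at most 15 digits, so that they are represented exactly as doubles.

```r
# Example codes from the question, stored explicitly as a factor.
codes <- factor(c("432543532532", "3285087250932", "464577534",
                  "2225324324324", "864235325264"))

width <- 13  # target number of digits (hypothetical variable name)

# Pad each level once. "%013.0f" prints a double without decimals,
# zero-padded to 13 characters; "%d" is not usable here because the
# codes exceed R's integer range.
padded_levels <- sprintf(paste0("%0", width, ".0f"),
                         as.numeric(levels(codes)))

# Relabel the factor; the per-row integer codes are left untouched.
levels(codes) <- padded_levels

as.character(codes)
```

On the real data the same relabelling would be applied to the single column that needs it, e.g. by assigning to levels() of that column.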
Maybe there is a better way, but you can use gsub with sprintf:
gsub(' ', '0', sprintf('%04s', levels(factor(10:15))))
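Applied to the 13-character codes from the question, the gsub/sprintf idea might look like the sketch below. It works on the level strings directly, so no numeric conversion is needed; it assumes the codes themselves contain no spaces. The names codes and padded_levels are invented for this example.

```r
# Example codes from the question, stored explicitly as a factor.
codes <- factor(c("432543532532", "3285087250932", "464577534",
                  "2225324324324", "864235325264"))

# "%13s" right-justifies every level in a 13-character field, padding
# with spaces on the left; gsub() then replaces those spaces with zeros.
padded_levels <- gsub(" ", "0", sprintf("%13s", levels(codes)))

# Relabel the factor with the padded strings.
levels(codes) <- padded_levels

as.character(codes)
```

This also hints at why the attempt in the question padded every value by the same amount: nchar() is vectorized, but substr() returns a result only as long as its first argument, which there is the single string "0000000000000". Only the first level's length is therefore used, and paste() recycles that one prefix across all levels. sprintf(), by contrast, produces one padded string per level.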