I'm trying to split a column whose values are formatted very differently. For example:
pharma <- c("DOXORUBICINA CLORH. FAM 50MG POL O LIOF",
"DROSPIRENONA/ETINILESTR. 3/0,02MG CM REC",
"DROSPIRENONA/ETINILESTR. 3/0,03MG CM REC",
"ETRAVIRINA 100 MG CM",
"AGALSIDASA ALFA 1MG/ML X 3,5 ML FAM")
I'm using separate() to split it into two columns: I need to separate the product name (e.g. DOXORUBICINA CLORH. FAM) from the details (50MG POL O LIOF). The code is:
library(tidyr)
separate(data.frame(A = pharma), col = "A", into = c("x", "y"), sep = "(?<=[a-zA-Z])\\s*(?=[0-9])")
But I get the following output from R:
                                         x               y
1                  DOXORUBICINA CLORH. FAM 50MG POL O LIOF
2 DROSPIRENONA/ETINILESTR. 3/0,02MG CM REC            <NA>
3 DROSPIRENONA/ETINILESTR. 3/0,03MG CM REC            <NA>
4                               ETRAVIRINA       100 MG CM
5                          AGALSIDASA ALFA        1MG/ML X
Warning messages:
1: Expected 2 pieces. Additional pieces discarded in 1 rows [5].
2: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [2, 3].
I can't see what is happening.
Any help is highly appreciated. Thank you in advance!
The data in the second and third rows contain a dot between the last letter and the whitespace, but your pattern only accounts for zero or more whitespace chars between a letter and a digit.
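As a rough check of this, the sketch below counts how often the original pattern matches in each string, using base R's gregexpr() with perl = TRUE (needed for the lookarounds). The helper name count_matches is made up for illustration; pharma is the vector from the question. A split yields one more piece than there are matches, so any count other than 1 corresponds to a row named in the warnings.

```r
pharma <- c("DOXORUBICINA CLORH. FAM 50MG POL O LIOF",
            "DROSPIRENONA/ETINILESTR. 3/0,02MG CM REC",
            "DROSPIRENONA/ETINILESTR. 3/0,03MG CM REC",
            "ETRAVIRINA 100 MG CM",
            "AGALSIDASA ALFA 1MG/ML X 3,5 ML FAM")

# Hypothetical helper: number of separator matches in each string.
# gregexpr() reports -1 when nothing matches, so only positive positions count.
count_matches <- function(x, pattern) {
  vapply(gregexpr(pattern, x, perl = TRUE),
         function(m) sum(m > 0), integer(1))
}

counts <- count_matches(pharma, "(?<=[a-zA-Z])\\s*(?=[0-9])")
counts
# 1 0 0 1 2
```

Rows 2 and 3 have no match at all (hence the NA), and row 5 has two (hence the discarded piece).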
You may use
sep = "(?<=[a-zA-Z])\\W+(?=[0-9])"
or
sep = "(?<=[a-zA-Z])\\W*(?=[0-9])"
The \W pattern matches any non-word char, i.e. any char other than a letter, a digit, or _.
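The two variants differ only when a letter is immediately followed by a digit: \W* can match the empty string there, while \W+ needs at least one non-word char. Here is a minimal base R sketch of that difference. The sample string and the helper name count_matches are invented for illustration and are not part of the data above.

```r
# Hypothetical helper: number of separator matches in each string.
count_matches <- function(x, pattern) {
  vapply(gregexpr(pattern, x, perl = TRUE),
         function(m) sum(m > 0), integer(1))
}

# Made-up value with a letter directly followed by a digit ("A3").
s <- "OMEGA3 CAPS. 500MG"

n_plus <- count_matches(s, "(?<=[a-zA-Z])\\W+(?=[0-9])")  # 1: only the ". " before 500MG
n_star <- count_matches(s, "(?<=[a-zA-Z])\\W*(?=[0-9])")  # 2: also the empty spot inside "A3"
```

If product names can contain such letter-digit codes, \W+ is probably the safer choice.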
See the regex demo.
R test:
> separate(data.frame(A = pharma), col = "A", into = c("x", "y"), sep = "(?<=[a-zA-Z])\\W*(?=[0-9])")
                        x               y
1 DOXORUBICINA CLORH. FAM 50MG POL O LIOF
2 DROSPIRENONA/ETINILESTR 3/0,02MG CM REC
3 DROSPIRENONA/ETINILESTR 3/0,03MG CM REC
4              ETRAVIRINA       100 MG CM
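Row 5 of the question's data contains two split points ("ALFA 1MG" and "X 3,5"), so it should still trigger the "Additional pieces discarded" warning with either pattern. One possible way around that, sketched here on the assumption that everything after the first split belongs in the details column, is tidyr's extra = "merge" argument, which splits at most length(into) times:

```r
library(tidyr)

pharma <- c("DOXORUBICINA CLORH. FAM 50MG POL O LIOF",
            "DROSPIRENONA/ETINILESTR. 3/0,02MG CM REC",
            "DROSPIRENONA/ETINILESTR. 3/0,03MG CM REC",
            "ETRAVIRINA 100 MG CM",
            "AGALSIDASA ALFA 1MG/ML X 3,5 ML FAM")

# extra = "merge" keeps everything after the first split in the last column
# instead of discarding the surplus pieces.
res <- separate(data.frame(A = pharma), col = "A", into = c("x", "y"),
                sep = "(?<=[a-zA-Z])\\W+(?=[0-9])", extra = "merge")
res
```

With this, row 5 should come out as "AGALSIDASA ALFA" and "1MG/ML X 3,5 ML FAM".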