I want to separate a column of my dataset into different columns. However, the problem is that the column of interest is dirty and does not split the I want. This is the column I have.
notified_amount | nomenclature_of_loans |
---|---|
7000 | 6.64% GS 2029 |
8000 | 7.46%GS 2017 (1) |
25000 | 12.25 % GS, 2010 (1) (12) |
I have tried the following code:
govt_1 |>
separate_wider_delim(nomenclature_of_loans, delim = " ", names = c("rate", "type", "year"),
too_few = "align_end", too_many = "merge")
I tried the separate_wider_delim
function to extract the year
from the nomenclature_of_loans
column. Though I am successful in doing so for quite some rows, dirty rows are preventing me from extracting year
for all the rows and I am stuck.
Now I don't care about the rate
and type
columns. I want to successfully separate the year
from the nomenclature_of_loans
column. While the first column perfectly splits, I am having problems with the second and third type. Mind you, my dataset has 2,800 rows. And around 350 rows are throwing a similar problem. I am stuck and unsure on how to proceed.
Ideally, this is the output I want
notified_amount | rate | year |
---|---|---|
7000 | 6.64% | 2029 |
8000 | 7.46% | 2017 |
25000 | 12.25% | 2010 |
But, I am okay with extracting just the year
alone too.
If you have any information on how to pre-clean the data before splitting or how to further clean the after splitting the data, please let me know. Thanks in advance!
Using str_extract()
from the stringr
library, we can try:
govt_1$rate <- str_extract(govt_1$nomenclature_of_loans, "^\\d+(?:\\.\\d+)?\\s*%")
govt_1$year <- str_extract(govt_1$nomenclature_of_loans, "\\b\\d{4}\\b")