Search code examples
rstringr

Lost on how to separate columns


I want to separate a column of my dataset into different columns. However, the problem is that the column of interest is dirty and does not split the I want. This is the column I have.

notified_amount nomenclature_of_loans
7000 6.64% GS 2029
8000 7.46%GS 2017 (1)
25000 12.25 % GS, 2010 (1) (12)

I have tried the following code:

govt_1 |> 
    separate_wider_delim(nomenclature_of_loans, delim = " ", names = c("rate", "type", "year"),
                         too_few = "align_end", too_many = "merge")

I tried the separate_wider_delim function to extract the year from the nomenclature_of_loans column. Though I am successful in doing so for quite some rows, dirty rows are preventing me from extracting year for all the rows and I am stuck.

Now I don't care about the rate and type columns. I want to successfully separate the year from the nomenclature_of_loans column. While the first column perfectly splits, I am having problems with the second and third type. Mind you, my dataset has 2,800 rows. And around 350 rows are throwing a similar problem. I am stuck and unsure on how to proceed.

Ideally, this is the output I want

notified_amount rate year
7000 6.64% 2029
8000 7.46% 2017
25000 12.25% 2010

But, I am okay with extracting just the year alone too.

If you have any information on how to pre-clean the data before splitting or how to further clean the after splitting the data, please let me know. Thanks in advance!


Solution

  • Using str_extract() from the stringr library, we can try:

    govt_1$rate <- str_extract(govt_1$nomenclature_of_loans, "^\\d+(?:\\.\\d+)?\\s*%")
    govt_1$year <- str_extract(govt_1$nomenclature_of_loans, "\\b\\d{4}\\b")