Tags: r, regex, gsub, text-extraction

Extracting repeated characters


I am trying to extract artist names and titles from a list of strings, but it is a bit complicated. Here is the list:

nlist <- c(
"Lil' SlimLil' Slim feat. PxMxWxPxMxWx Where Your Ward At!!",               
"I Like It (Mannie Fresh Style)I Like It (Mannie Fresh Style)Ms. Tee",
"Bella VistaBella Vista Mister Wong",
"Tom WareTom WareChina Town",                                        
"Race 'N RhythmRace 'N Rhythm Teenage Girls",                                    
"Ronald MarquisseRonald MarquisseElectro Link 7",
"PleasurePleasure Thoughts Of Old Flames",
"OM, OM, Dom Um RomaoDom Um Romao Chipero",
"HookfaceHookface4 07 181221"
)

Here is the pattern in the strings.

[Image: the strings from the list with artist, title and conjunction parts color-coded]

Description:

  • There are three different patterns (1, 2-7, 8).
  • RED: Artist (repeated),
  • BLUE: Title (non-repeated),
  • GREEN: Conjunction (not repeated; sits between artist names)

Rows 1 and 8 are very hard and I couldn't solve them, but for rows 2 to 7 the code below solves my problem.

library(stringr)  # provides str_trim() and str_match()
title = str_trim(gsub('(.+?)\\1','', nlist))
artist = str_match(nlist, '(.+?)\\1')[,2]
data = cbind(title,artist);data

And here is the output of the code above.

     title                                     artist                          
[1,] "feat. PxMxWxPxMxWx Where Your Ward At!!" "Lil' Slim"                     
[2,] "Ms. Tee"                                 "I Like It (Mannie Fresh Style)"
[3,] "Mister Wong"                             "Bella Vista"                   
[4,] "China Town"                              "Tom Ware"                      
[5,] "Teenage Girls"                           "Race 'N Rhythm"                
[6,] "Electro Link 7"                          "Ronald Marquisse"              
[7,] "Thoughts Of Old Flames"                  "Pleasure"                      
[8,] "Chipero"                                 "OM, "  
[9,] "4 07 181221"                             "Hookface"

Problem: When there is a "feat." or a "," in the string, it cuts the repeated sequence short. Question: How can I extract the full artist names, as shown below?
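To make the problem concrete, here is a minimal sketch of what the lazy pattern captures on rows 1 and 8. The helper name `first_repeat` is made up for this illustration, and `perl = TRUE` is an assumption chosen so that the lazy quantifier behaves predictably.

```r
# Hypothetical helper: return the first repeated block at the start of a string.
# The lazy (.+?) stops at the SHORTEST prefix that is immediately repeated.
first_repeat <- function(x) sub("^(.+?)\\1.*", "\\1", x, perl = TRUE)

row8 <- first_repeat("OM, OM, Dom Um RomaoDom Um Romao Chipero")
row1 <- first_repeat("Lil' SlimLil' Slim feat. PxMxWxPxMxWx Where Your Ward At!!")

print(row8)  # "OM, "      (only the conjunction-like prefix is captured)
print(row1)  # "Lil' Slim" (the "feat. PxMxWx" part is never reached)
```

In both cases the match ends at the first doubled block, so anything that follows it (a second artist, or the featured artist) is left out of the capture.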

Here is my expected result (check rows 1 and 8):

     title                                     artist                          
[1,] "Where Your Ward At!!"                    "Lil' Slim feat. PxMxWx"                     
[2,] "Ms. Tee"                                 "I Like It (Mannie Fresh Style)"
[3,] "Mister Wong"                             "Bella Vista"                   
[4,] "China Town"                              "Tom Ware"                      
[5,] "Teenage Girls"                           "Race 'N Rhythm"                
[6,] "Electro Link 7"                          "Ronald Marquisse"              
[7,] "Thoughts Of Old Flames"                  "Pleasure"                      
[8,] "Chipero"                                 "OM, Dom Um Romao"                             
[9,] "4 07 181221"                             "Hookface"

Thanks...


Solution

  • Maybe the following extracts what you want. I remove everything up to and including the last repetition and store the remainder in title. To get the artist, I cut the length of the previously found title off the end of the string using substr and then collapse the repetitions of the artist using gsub with (.{2,})\\1, but this will also remove repetitions in the conjunction.

    title <- sub(".*(.{2,})\\1\\s*", "", nlist)
    artist <- trimws(gsub("(.{2,})\\1", "\\1"
                  , substr(nlist, 1, nchar(nlist) - nchar(title)), perl=TRUE))
    cbind(title,artist)
    #      title                    artist                          
    # [1,] "Where Your Ward At!!"   "Lil' Slim feat. PxMxWx"        
    # [2,] "Ms. Tee"                "I Like It (Mannie Fresh Style)"
    # [3,] "Mister Wong"            "Bella Vista"                   
    # [4,] "China Town"             "Tom Ware"                      
    # [5,] "Teenage Girls"          "Race 'N Rhythm"                
    # [6,] "Electro Link 7"         "Ronald Marquisse"              
    # [7,] "Thoughts Of Old Flames" "Pleasure"                      
    # [8,] "Chipero"                "OM, Dom Um Romao"              
    # [9,] "4 07 181221"            "Hookface"                      
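The caveat about repetitions can be illustrated with a small sketch. The helper name `collapse_repeats` and the featured name "ZaZa" are made up for this example; they are not part of the original data.

```r
# Hypothetical helper: collapse every doubled block of 2+ characters to one copy,
# using the same pattern as the artist step above.
collapse_repeats <- function(x) gsub("(.{2,})\\1", "\\1", x, perl = TRUE)

shortened <- collapse_repeats("feat. ZaZa")    # "Za" is doubled, so one copy is dropped
unchanged <- collapse_repeats("feat. PxMxWx")  # no doubled block of 2+ characters

print(shortened)  # "feat. Za"
print(unchanged)  # "feat. PxMxWx"
```

So a name that legitimately contains a doubled sequence of two or more characters would be shortened by this step and would need separate handling.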
    

    Another way might be:

    x <- sub("^(.*)\\1\\s*", "", nlist)     #Remove the first repetition of artist
    title <- sub(".*?(.{2,})\\1\\s*", "", x) #Remove Conjunction and repetition of Artist if there is one
    artist <- trimws(gsub("(.{2,})\\1", "\\1"
                  , substr(nlist, 1, nchar(nlist) - nchar(title)), perl=TRUE))
    cbind(title,artist)
    #      title                    artist                          
    # [1,] "Where Your Ward At!!"   "Lil' Slim feat. PxMxWx"        
    # [2,] "Ms. Tee"                "I Like It (Mannie Fresh Style)"
    # [3,] "Mister Wong"            "Bella Vista"                   
    # [4,] "China Town"             "Tom Ware"                      
    # [5,] "Teenage Girls"          "Race 'N Rhythm"                
    # [6,] "Electro Link 7"         "Ronald Marquisse"              
    # [7,] "Thoughts Of Old Flames" "Pleasure"                      
    # [8,] "Chipero"                "OM, Dom Um Romao"              
    # [9,] "4 07 181221"            "Hookface"
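
If this needs to be reused, the second approach could be wrapped in a function along these lines. This is only a sketch: the name `split_artist_title` is hypothetical, and `perl = TRUE` is set on every call as an assumption so that all three steps use the same regex engine.

```r
# Hypothetical wrapper around the second approach (not part of the original answer).
split_artist_title <- function(x) {
  # Drop the leading repeated block, e.g. "OM, OM, " or "Lil' SlimLil' Slim ".
  rest <- sub("^(.*)\\1\\s*", "", x, perl = TRUE)
  # Drop everything up to and including the next repeated block, if there is one.
  title <- sub(".*?(.{2,})\\1\\s*", "", rest, perl = TRUE)
  # The artist part is whatever precedes the title in the original string.
  head <- substr(x, 1, nchar(x) - nchar(title))
  # Collapse each doubled block to a single copy and trim whitespace.
  artist <- trimws(gsub("(.{2,})\\1", "\\1", head, perl = TRUE))
  data.frame(artist = artist, title = title, stringsAsFactors = FALSE)
}

res <- split_artist_title(c(
  "Lil' SlimLil' Slim feat. PxMxWxPxMxWx Where Your Ward At!!",
  "OM, OM, Dom Um RomaoDom Um Romao Chipero"
))
print(res)
```

Returning a data frame instead of the `cbind` matrix keeps the two columns named and makes it easier to join the result back onto other data.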