r machine-learning nlp data-analysis

How to group similar data in a column using NLP in R?


Complete dataset link: https://drive.google.com/open?id=12u0Ql1z5T2lzCXRVjp75i9ke9mNYrCWv

In this dataset you can see that the General Motors entries are not counted together because they fall into different categories. Many other manufacturers have the same problem. I want to group such entries together, as with General Motors. How can I group them using NLP in R?


Solution

  • Here is one way to achieve your goal:

    Your input data.frame:

    Vehicle_Manufacturer<-c("GENERAL MOTORS CORP.","FORD MOTOR COMPANY","CHRYSLER CORPORATION","PACCAR INCORPORATED","MACK TRUCKS, INCORPORATED","FOREST RIVER, INC.","BLUE BIRD BODY COMPANY","DAIMLER TRUCKS NORTH AMERICA","GENERAL MOTORS LLC","HONEYWELL INTERNATIONAL, INC.","WINNEBAGO INDUSTRIES, INC.","BMW OF NORTH AMERICA, LLC","NISSAN NORTH AMERICA, INC.","NAVISTAR INTL CORP.","INTERNATIONAL TRUCK AND ENGINE","FREIGHTLINER LLC","HONDA (AMERICAN HONDA MOTOR CO.)","NEWMAR CORPORATION","NAVISTAR, INC","INTERNATIONAL TRUCK & ENGINE CORPORATION","PIERCE MANUFACTURING","GULF STREAM COACH, INC.","FLEETWOOD ENTERPRISES, INC.","FREIGHTLINER CORPORATION","DAIMLER TRUCKS NORTH AMERICA LLC","PACCAR, INCORPORATED","WHITE MOTOR CORPORATION","BAYERISCHE MOTOREN WERKE","THOMAS BUILT BUSES, INC.","DAIMLERCHRYSLER CORPORATION","VOLKSWAGEN OF AMERICA,INC","SPARTAN MOTORS, INC.","VOLVO TRUCKS NORTH AMERICA INC","TOYOTA MOTOR ENGINEERING & MANUFACTURING","PREVOST CAR, INCORPORATED","CHAMPION BUS, INC.","ALTEC INDUSTRIES INC.","SABERSPORT","MERCEDES-BENZ USA, LLC.","HARLEY-DAVIDSON MOTOR COMPANY","COOPER TIRE & RUBBER CO.","KEYSTONE RV COMPANY","SUBARU OF AMERICA, INC.","CHRYSLER (FCA US LLC)","MONACO COACH CORPORATION","CHRYSLER GROUP LLC","JAYCO, INC.","MITSUBISHI FUSO TRUCK OF AMERICA, INC.","COLLINS BUS CORPORATION","PRO-A MOTORS, INC.","NAVISTAR, INC.")
    Recalls<-c(6228,5403,2787,2317,1988,1903,1898,1737,1620,1558,1353,1297,1174,1130,1055,987,985,980,955,950,925,922,918,896,835,824,818,801,797,794,749,731,724,709,694,669,641,623,616,613,599,586,582,578,578,572,569,568,559,549,511)
    df<-data.frame(Vehicle_Manufacturer,Recalls)
    

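    Using the stringdist package, compute the pairwise distances between the Vehicle_Manufacturer strings. This example uses method="jw", the Jaro-Winkler distance (with the package default p=0 this reduces to the plain Jaro distance):

    library(stringdist)
    # Pairwise distances between all manufacturer names (0 = identical, 1 = nothing in common)
    dist_matrix<-stringdistmatrix(as.character(df[,1]),as.character(df[,1]),method="jw")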
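
    To get a feel for the distance on a single pair, the sketch below compares two spellings of General Motors from the data above. In stringdist, method="jw" with the default p=0 returns the plain Jaro distance; a positive p adds Winkler's prefix bonus, which shrinks the distance further for names that start the same way. The value p=0.1 is only an illustrative assumption and does not come from the original answer.

```r
library(stringdist)

a <- "GENERAL MOTORS CORP."
b <- "GENERAL MOTORS LLC"

# Jaro distance (p = 0 is the package default)
d_jaro <- stringdist(a, b, method = "jw")

# Jaro-Winkler distance; p = 0.1 is an illustrative choice (assumption)
d_jw <- stringdist(a, b, method = "jw", p = 0.1)

print(c(jaro = d_jaro, jaro_winkler = d_jw))
```

    Because manufacturer names usually differ at the end (CORP., LLC, INC.), a prefix bonus may make true variants stand out more clearly from unrelated names, but it is worth checking on your own data.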
    

    Choose a threshold below which two strings are treated as similar, for example:

    thr<-quantile(dist_matrix,probs=0.025) #2.5% quantile
    

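    Find the strings to merge (a for-loop is used here; with a lot of data an lapply solution is better):

    to_merge<-vector("list",nrow(df))
    for(i in seq_len(nrow(df)))
    {
      # names whose distance to name i is below the threshold (this includes name i itself)
      to_merge[[i]]<-Vehicle_Manufacturer[dist_matrix[i,]<thr]
    }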
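
    The lapply variant mentioned above could look like the following sketch. It uses a three-name subset of the data so that it runs on its own, and a fixed threshold of 0.1 instead of the quantile; both are assumptions made for illustration.

```r
library(stringdist)

# Small subset of the names above, so the sketch runs on its own
manufacturers <- c("PACCAR INCORPORATED", "FORD MOTOR COMPANY", "PACCAR, INCORPORATED")
dist_matrix <- stringdistmatrix(manufacturers, manufacturers, method = "jw")

# Fixed threshold chosen for this tiny sample (assumption; the answer uses a quantile)
thr <- 0.1

# For each row, keep the names whose distance to that row's name is below thr
to_merge <- lapply(seq_along(manufacturers), function(i) {
  manufacturers[dist_matrix[i, ] < thr]
})

# Show only the groups with more than one name
print(to_merge[lengths(to_merge) > 1])
```

    lapply builds the whole list in one call, so there is no need to pre-allocate or grow to_merge by hand.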
    

    The result is stored in the list to_merge, with one entry per row of df.

    To see only the candidate merges (entries with more than one name):

    to_merge[sapply(to_merge, length) > 1]
    [[1]]
    [1] "GENERAL MOTORS CORP." "GENERAL MOTORS LLC"  
    
    [[2]]
    [1] "PACCAR INCORPORATED"  "PACCAR, INCORPORATED"
    
    [[3]]
    [1] "MACK TRUCKS, INCORPORATED" "PACCAR, INCORPORATED"     
    
    [[4]]
    [1] "DAIMLER TRUCKS NORTH AMERICA"     "DAIMLER TRUCKS NORTH AMERICA LLC"
    
    [[5]]
    [1] "GENERAL MOTORS CORP." "GENERAL MOTORS LLC"  
    
    [[6]]
    [1] "NAVISTAR INTL CORP." "NAVISTAR, INC"       "NAVISTAR, INC."     
    
    [[7]]
    [1] "NAVISTAR INTL CORP." "NAVISTAR, INC"       "NAVISTAR, INC."     
    
    [[8]]
    [1] "DAIMLER TRUCKS NORTH AMERICA"     "DAIMLER TRUCKS NORTH AMERICA LLC"
    
    [[9]]
    [1] "PACCAR INCORPORATED"       "MACK TRUCKS, INCORPORATED" "PACCAR, INCORPORATED"     
    
    [[10]]
    [1] "NAVISTAR INTL CORP." "NAVISTAR, INC"       "NAVISTAR, INC."
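
    The list above repeats each group once per member and does not yet add up the recalls. One possible way to get a single group label per manufacturer is single-linkage hierarchical clustering with base R's hclust and cutree. This is a swapped-in technique and not part of the original answer; the five-row sample, the fixed threshold of 0.15 and the column names group and label are assumptions for illustration.

```r
library(stringdist)

# Five rows taken from the data above, so the sketch runs on its own
df <- data.frame(
  Vehicle_Manufacturer = c("GENERAL MOTORS CORP.", "FORD MOTOR COMPANY",
                           "PACCAR INCORPORATED", "GENERAL MOTORS LLC",
                           "PACCAR, INCORPORATED"),
  Recalls = c(6228, 5403, 2317, 1620, 824),
  stringsAsFactors = FALSE
)

dist_matrix <- stringdistmatrix(df$Vehicle_Manufacturer, df$Vehicle_Manufacturer,
                                method = "jw")

# Fixed threshold for this small sample (assumption; the answer uses a quantile)
thr <- 0.15

# Single linkage joins names through chains of pairs closer than thr
hc <- hclust(as.dist(dist_matrix), method = "single")
df$group <- cutree(hc, h = thr)

# Sum the recalls per group and label each group with its first name
totals <- aggregate(Recalls ~ group, data = df, FUN = sum)
totals$label <- df$Vehicle_Manufacturer[match(totals$group, df$group)]
print(totals)
```

    Review the groups before trusting the totals: as the output above shows, "MACK TRUCKS, INCORPORATED" is matched with "PACCAR, INCORPORATED" although they are different manufacturers, and single linkage would chain such a false match into one group.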