Complete dataset: https://drive.google.com/open?id=12u0Ql1z5T2lzCXRVjp75i9ke9mNYrCWv
In this data, the General Motors recalls are not counted together because the company appears under several different names. Many other manufacturers have the same problem. I want to group these variants together, as for General Motors. How can I group them using NLP in R?
Here is one way to achieve your goal.
Your input data.frame:
Vehicle_Manufacturer<-c("GENERAL MOTORS CORP.","FORD MOTOR COMPANY","CHRYSLER CORPORATION","PACCAR INCORPORATED","MACK TRUCKS, INCORPORATED","FOREST RIVER, INC.","BLUE BIRD BODY COMPANY","DAIMLER TRUCKS NORTH AMERICA","GENERAL MOTORS LLC","HONEYWELL INTERNATIONAL, INC.","WINNEBAGO INDUSTRIES, INC.","BMW OF NORTH AMERICA, LLC","NISSAN NORTH AMERICA, INC.","NAVISTAR INTL CORP.","INTERNATIONAL TRUCK AND ENGINE","FREIGHTLINER LLC","HONDA (AMERICAN HONDA MOTOR CO.)","NEWMAR CORPORATION","NAVISTAR, INC","INTERNATIONAL TRUCK & ENGINE CORPORATION","PIERCE MANUFACTURING","GULF STREAM COACH, INC.","FLEETWOOD ENTERPRISES, INC.","FREIGHTLINER CORPORATION","DAIMLER TRUCKS NORTH AMERICA LLC","PACCAR, INCORPORATED","WHITE MOTOR CORPORATION","BAYERISCHE MOTOREN WERKE","THOMAS BUILT BUSES, INC.","DAIMLERCHRYSLER CORPORATION","VOLKSWAGEN OF AMERICA,INC","SPARTAN MOTORS, INC.","VOLVO TRUCKS NORTH AMERICA INC","TOYOTA MOTOR ENGINEERING & MANUFACTURING","PREVOST CAR, INCORPORATED","CHAMPION BUS, INC.","ALTEC INDUSTRIES INC.","SABERSPORT","MERCEDES-BENZ USA, LLC.","HARLEY-DAVIDSON MOTOR COMPANY","COOPER TIRE & RUBBER CO.","KEYSTONE RV COMPANY","SUBARU OF AMERICA, INC.","CHRYSLER (FCA US LLC)","MONACO COACH CORPORATION","CHRYSLER GROUP LLC","JAYCO, INC.","MITSUBISHI FUSO TRUCK OF AMERICA, INC.","COLLINS BUS CORPORATION","PRO-A MOTORS, INC.","NAVISTAR, INC.")
Recalls<-c(6228,5403,2787,2317,1988,1903,1898,1737,1620,1558,1353,1297,1174,1130,1055,987,985,980,955,950,925,922,918,896,835,824,818,801,797,794,749,731,724,709,694,669,641,623,616,613,599,586,582,578,578,572,569,568,559,549,511)
df<-data.frame(Vehicle_Manufacturer,Recalls)
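Using the stringdist package, compute the pairwise distances between the Vehicle_Manufacturer names. This example uses method="jw", which gives the Jaro distance by default and the Jaro-Winkler variant when the prefix factor p is set above 0: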
library(stringdist)
dist_matrix<-stringdistmatrix(as.character(df[,1]),as.character(df[,1]),method="jw")
Choose a threshold below which two strings are treated as the same manufacturer, for example:
thr<-quantile(dist_matrix,probs=0.025) #2.5% quantile
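Note that the quantile above is taken over the full matrix, which contains the zero diagonal and every pair twice. As a minimal sketch of an alternative, the threshold can be computed from the distinct pairs only. The `manufacturers` subset and the name `thr_pairs` are hypothetical and used only for illustration:

```r
library(stringdist)

# Small illustrative subset of the manufacturer names
manufacturers <- c("GENERAL MOTORS CORP.", "GENERAL MOTORS LLC",
                   "FORD MOTOR COMPANY", "NAVISTAR, INC", "NAVISTAR, INC.")

dist_matrix <- stringdistmatrix(manufacturers, manufacturers, method = "jw")

# upper.tri() keeps each unordered pair once and drops the diagonal
pair_dist <- dist_matrix[upper.tri(dist_matrix)]

# Same 2.5% probability, but taken over the distinct pairs only
thr_pairs <- quantile(pair_dist, probs = 0.025)
```

Whichever variant is used, it is worth inspecting the smallest distances by eye before fixing the probability, since the right value depends on how many true duplicates the data contains.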
Find the strings to merge (a for loop is used in this example; if you have a lot of data, an lapply solution is better):
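# Preallocate a list so that every element can hold a character vector
to_merge<-vector("list",nrow(df))
for(i in seq_len(nrow(df)))
{
  # All names whose distance to name i is below the threshold (includes name i itself)
  to_merge[[i]]<-Vehicle_Manufacturer[dist_matrix[i,]<thr]
}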
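The lapply alternative mentioned above could look like the following minimal sketch. The `manufacturers` subset and the fixed cut-off of 0.15 are assumptions made to keep the example small; with the full data you would use `df` and `thr` as defined earlier:

```r
library(stringdist)

# Small illustrative subset of the manufacturer names
manufacturers <- c("GENERAL MOTORS CORP.", "GENERAL MOTORS LLC",
                   "FORD MOTOR COMPANY", "NAVISTAR, INC", "NAVISTAR, INC.")

dist_matrix <- stringdistmatrix(manufacturers, manufacturers, method = "jw")
thr <- 0.15  # hypothetical cut-off chosen for this sketch

# One list element per name: every name closer than thr,
# which always includes the name itself (distance 0)
to_merge <- lapply(seq_along(manufacturers), function(i) {
  manufacturers[dist_matrix[i, ] < thr]
})

# Keep only the names that have at least one candidate partner
candidates <- to_merge[lengths(to_merge) > 1]
```

lapply returns a list of the right length directly, so no preallocation is needed.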
The result is stored in the to_merge list.
To see only the candidate merges:
to_merge[sapply(to_merge, length) > 1]
[[1]]
[1] "GENERAL MOTORS CORP." "GENERAL MOTORS LLC"
[[2]]
[1] "PACCAR INCORPORATED" "PACCAR, INCORPORATED"
[[3]]
[1] "MACK TRUCKS, INCORPORATED" "PACCAR, INCORPORATED"
[[4]]
[1] "DAIMLER TRUCKS NORTH AMERICA" "DAIMLER TRUCKS NORTH AMERICA LLC"
[[5]]
[1] "GENERAL MOTORS CORP." "GENERAL MOTORS LLC"
[[6]]
[1] "NAVISTAR INTL CORP." "NAVISTAR, INC" "NAVISTAR, INC."
[[7]]
[1] "NAVISTAR INTL CORP." "NAVISTAR, INC" "NAVISTAR, INC."
[[8]]
[1] "DAIMLER TRUCKS NORTH AMERICA" "DAIMLER TRUCKS NORTH AMERICA LLC"
[[9]]
[1] "PACCAR INCORPORATED" "MACK TRUCKS, INCORPORATED" "PACCAR, INCORPORATED"
[[10]]
[1] "NAVISTAR INTL CORP." "NAVISTAR, INC" "NAVISTAR, INC."
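The lists above overlap (for example [[1]] and [[5]] are the same pair), and they can contain false matches such as "MACK TRUCKS, INCORPORATED" with "PACCAR, INCORPORATED", so they should be reviewed before merging. One possible way to turn them into a single group per manufacturer and sum the recalls is single-linkage clustering with base R's hclust and cutree. This is only a sketch on a small subset. The cut-off of 0.15 and the column name `group` are assumptions, not part of the original answer:

```r
library(stringdist)

# Small subset of the data, with the recall counts from the original data.frame
df <- data.frame(
  Vehicle_Manufacturer = c("GENERAL MOTORS CORP.", "FORD MOTOR COMPANY",
                           "GENERAL MOTORS LLC", "NAVISTAR, INC", "NAVISTAR, INC."),
  Recalls = c(6228, 5403, 1620, 955, 511),
  stringsAsFactors = FALSE
)

d <- stringdistmatrix(df$Vehicle_Manufacturer, df$Vehicle_Manufacturer, method = "jw")

# Single linkage joins names that are connected through a chain of close pairs
hc <- hclust(as.dist(d), method = "single")

# Cut the tree at a hypothetical height: clusters merged at or below 0.15 stay together
df$group <- cutree(hc, h = 0.15)

# Total recalls per group, and the names that ended up in each group
totals  <- aggregate(Recalls ~ group, data = df, FUN = sum)
members <- split(df$Vehicle_Manufacturer, df$group)
```

Checking `members` by eye before trusting `totals` is advisable, because single linkage will also chain together any false matches that fall under the cut-off.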