For quite some time I have been trying to convert space-separated data into CSV.
The initial data table is given by:
Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment
Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment
Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment
It contains lots of spaces and unnecessary information throughout. The fields appear in roughly this order:
Doctor's name | Degree | Years of experience | Specialization | Hospital name | Address | Fees | Schedule | and an unnecessary "Book Appointment" field.
I want to convert it to the following format:
Doctor's name,Specialization,Hospital name,Address,Fees,Schedule
So the current data should look like this:
Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist,SHAKTHI E.N.T CARE,Malleswaram,INR 250,MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath,Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250,MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist,V2 E City Family Dental Center,Electronics City,INR 200,MON-SUN10:00AM-8:00PM
So far I have succeeded in removing the Book Appointment field.
However, I am having difficulty isolating the hospital's name, as the spacing in it varies a lot. Is this problem feasible?
The output of cat -A file is the following:
Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE ^I Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment $
Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic ^I Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment $
Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center ^I Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment
There's no straightforward way to separate the specialization from the hospital name, but with some assumptions, you could perhaps use perl to do this:
perl -pe 's/^(\S+\s+\S+\s+\S+).+experience\s([^\t]+?)\s+(\b[A-Z0-9]{2}[^\t]+?|(?:(?!\b[A-Z0-9]{2})[^\t])*)\s+\t\s+([^,]+,).+?(INR.+?PM)\s+.*/\1,\2,\3,\4\5/' file
Gives:
Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist,SHAKTHI E.N.T CARE,Malleswaram,INR 250 MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath,Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist,V2 E City Family Dental Center,Electronics City,INR 200 MON-SUN10:00AM-8:00PM
And since it's a Perl regex, you can use regex101 to get a glimpse of how it works through its regex debugger. The regex is quite straightforward, but the number of parts can make it appear daunting.
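Warning: The above is able to separate the specialization based on two things:
1. The hospital name begins with two consecutive uppercase letters or digits (as in SHAKTHI or V2), or
2. Failing that, the specialization is a single word and no word in the hospital name begins with two consecutive uppercase letters or digits.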
I know it might not solve the problem completely, as there will always be lines that don't fit these assumptions, but it should get you started on cleaning the data up. If a line is separated incorrectly (i.e. when the specialization consists of more than one word and the hospital name doesn't begin with two consecutive uppercase letters or digits), you will have the first word of the specialization correctly placed, and the rest in the hospital name.