Search code examples
regexbashcsvawksed

Remove comma from quoted text in first column of a csv with sed


I have big_file.csv containing a bunch of company information. Here's a snippet

CompanyName, CompanyNumber,RegAddress.CareOf,...
"! # 1 AVAILABLE LOCKSMITH LTD","05905727","",...
"!NSPIRED LIMITED","06019953",""...
"CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD","07981734",""...

I only need the CompanyName and CompanyNumber fields, so I did the following:

cut -d, -f 1,2 big_file.csv > big_file_names_codes_only.csv

As you can see tho (and I understand why) the third entry in the big_file.csv gets cut after the first comma which is actually part of CompanyName. I know how to remove in sed the first comma (but that would break the whole csv strucutre), so i was wondering if any of you knew how to remove the comma from the first (it's always on position 1) "string, with, commas, or not and non alphanum chars!".

So basically the intermediate output i am looking for is:

CompanyName, CompanyNumber
"! # 1 AVAILABLE LOCKSMITH LTD","05905727"
"!NSPIRED LIMITED","06019953"
"CENTRE FOR COUNSELLING PSYCHOTHERAPY AND TRAINING LTD","07981734"

But this last line becomes:

"CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD"

Once I get this intermediate output I need to clean the company of all non alpha num characters in the name and leading spaces - which works very well with this:

sed -i 's/[^a-zA-Z0-9 ,]//g; s/^[ \t]*//'

In the end my file should be:

CompanyName, CompanyNumber,RegAddress.CareOf,...
AVAILABLE LOCKSMITH LTD,05905727
NSPIRED LIMITED,06019953
CENTRE FOR COUNSELLING PSYCHOTHERAPY AND TRAINING LTD,07981734

Solution

  • It's always better to work with structured data like CSV files with embedded commas in fields using tools that are actually aware of the format instead of trying to hack something together with things like regular expressions (Same with XML, JSON, etc.). In the long run it's a lot easier and will save you a ton of pain dealing with edge cases and odd data that doesn't exactly match your expectation.

    The csvkit set of utilities has a bunch of useful command line tools and is commonly available via OS package managers:

    $ csvcut -c CompanyName,CompanyNumber blah.csv                                              
    CompanyName,CompanyNumber
    ! # 1 AVAILABLE LOCKSMITH LTD,05905727
    !NSPIRED LIMITED,06019953
    "CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD",07981734
    

    You can then continue to use sed to remove the characters you're not interested in.

    (Note: I had to get rid of extra spaces in the header line of your sample data for this to work)


    Edit: Also, perl version using the handy Text::AutoCSV module, that strips out characters:

    $ perl -MText::AutoCSV -e 'Text::AutoCSV->new(out_fields => [ "COMPANYNAME", "COMPANYNUMBER" ],
                   read_post_update_hr => sub {
                     my $hr = shift;
                     $hr->{"COMPANYNAME"} =~ s/[^[:alnum:]\s]+//g;
                     $hr->{"COMPANYNAME"} =~ s/^\s+//;
                   })->write();' < blah.csv | sed -e 's/"//g'
    CompanyName,CompanyNumber
    1 AVAILABLE LOCKSMITH LTD,05905727
    NSPIRED LIMITED,06019953
    CENTRE FOR COUNSELLING PSYCHOTHERAPY AND TRAINING LTD,07981734