Search code examples
miller

What is the Miller command for separating emails into their own rows, while also copying down other column data?


I have a very large csv file (213,265 rows) with many columns.

In one of those columns I have some emails seperated by commas. A trimmed down version of the csv file looks like this:

enter image description here

I would like to use Miller to seperate out those emails into their own rows, but also copy down ALL the other columns in the spreadsheet (many of which aren't shown here in this simple example).

Following on with this example, I would like to end with something like this. But keep in mind the real spreadsheet has many other columns before and after the email column:

enter image description here

Is that possible to do with Miller (or another similar tool)? What would the command look like?


Solution

  • The verb is nest. Starting from

    company,address,email
    anna,123 fake,"anna@ciao.it,annac@gfail.com,a@box.net"
    

    and running

    mlr --csv nest --explode --values --across-records --nested-fs "," -f email input.csv
    

    you will have

    +---------+----------+-----------------+
    | company | address  | email           |
    +---------+----------+-----------------+
    | anna    | 123 fake | anna@ciao.it    |
    | anna    | 123 fake | annac@gfail.com |
    | anna    | 123 fake | a@box.net       |
    +---------+----------+-----------------+
    

    If you have a "bad" CSV, you could have some problems and you should try to clean it. A generic clean command could be this one:

    mlr --csv -N clean-whitespace then remove-empty-columns then skip-trivial-records then cat -n sample.csv | mlr --csv nest --explode --values --across-records --nested-fs "," -f Email >output.csv
    

    It removes empty rows, empty columns and wrong white spaces.