I have a very large csv file (213,265 rows) with many columns.
In one of those columns I have some emails seperated by commas. A trimmed down version of the csv file looks like this:
I would like to use Miller to seperate out those emails into their own rows, but also copy down ALL the other columns in the spreadsheet (many of which aren't shown here in this simple example).
Following on with this example, I would like to end with something like this. But keep in mind the real spreadsheet has many other columns before and after the email column:
Is that possible to do with Miller (or another similar tool)? What would the command look like?
The verb is nest. Starting from
company,address,email
anna,123 fake,"anna@ciao.it,annac@gfail.com,a@box.net"
and running
mlr --csv nest --explode --values --across-records --nested-fs "," -f email input.csv
you will have
+---------+----------+-----------------+
| company | address | email |
+---------+----------+-----------------+
| anna | 123 fake | anna@ciao.it |
| anna | 123 fake | annac@gfail.com |
| anna | 123 fake | a@box.net |
+---------+----------+-----------------+
If you have a "bad" CSV, you could have some problems and you should try to clean it. A generic clean command could be this one:
mlr --csv -N clean-whitespace then remove-empty-columns then skip-trivial-records then cat -n sample.csv | mlr --csv nest --explode --values --across-records --nested-fs "," -f Email >output.csv
It removes empty rows, empty columns and wrong white spaces.