There is an array of data:
https://example.com:description of the site/application:category
http://example.com:description of the site/application:category
android://package name:description of the site/application:category
android://package name|description of the site/application|category
I want to split the data into 3 columns:
URL | Description | Category |
---|---|---|
https://example.com | description of the site/application | category |
http://example.com | description of the site/application | category |
android://package name | description of the site/application | category |
android://package name | description of the site/application | category |
As I understand it, I need a regex that ignores the first ":" and also accepts "|" as a second delimiter.
I tried this expression, but the output is incorrect:
cat * | awk -F["|"][:] '{print $1,$2, $3}'
Using any awk:
$ awk -F'[:|]' -v OFS='\t' '{sub(/:/,RS); sub(RS,":",$1)} 1' file
https://example.com description of the site/application category
http://example.com description of the site/application category
android://package name description of the site/application category
android://package name description of the site/application category
or, if the `OFS` character can't be present in the URL in the input:
$ awk -F'[:|]' -v OFS='\t' '{$1=$1; sub(OFS,":")} 1' file
https://example.com description of the site/application category
http://example.com description of the site/application category
android://package name description of the site/application category
android://package name description of the site/application category
Set `OFS` to something other than `\t` as you see fit.
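As a rough sketch of what the first command does, here it is applied to a single sample line with the two `sub()` calls commented. The variable names `line` and `out` are only for illustration, and `tr` turns the tabs into `|` purely so the column boundaries are visible.

```shell
line='https://example.com:description of the site/application:category'

out=$(printf '%s\n' "$line" | awk -F'[:|]' -v OFS='\t' '{
    # Swap the first ":", the one in "://", for a newline. A newline
    # cannot already be inside the record because RS ends every record.
    sub(/:/, RS)
    # $0 was re-split on [:|], so $1 is now "https<newline>//example.com".
    # Restoring the ":" in $1 also rebuilds $0 with OFS between fields.
    sub(RS, ":", $1)
    print
}')

# Show each tab as "|" so the column boundaries are visible.
printf '%s\n' "$out" | tr '\t' '|'
```

This should print `https://example.com|description of the site/application|category`.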
Please read the POSIX spec to learn what bracket expressions such as the ones you used, `["|"][:]`, and the one I used, `[:|]`, mean.
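To make the difference concrete, here is a small sketch that counts the fields awk sees under each separator. The sample line is made up, and `n_good`/`n_bad` are illustrative names. Once the shell strips the double quotes, `["|"][:]` reaches awk as `[|][:]`, which is two bracket expressions in a row: a `|` immediately followed by a `:`.

```shell
line='android://package name|description|category'

# One bracket expression: a single character that is either ":" or "|".
n_good=$(printf '%s\n' "$line" | awk -F'[:|]' '{print NF}')

# Two bracket expressions back to back: "|" directly followed by ":".
# That sequence never occurs in the line, so it stays as one field.
n_bad=$(printf '%s\n' "$line" | awk -F'[|][:]' '{print NF}')

echo "$n_good $n_bad"
```

This should print `4 1`: `[:|]` splits at the `:` in `://` and at both `|`s, while `[|][:]` never matches, which is why the original attempt printed whole lines.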
Having said that, I suspect the OP's real input probably looks something like this (where additional `:`s or `|`s can appear in the URL and/or description, but no literal blanks can be in the URL):
$ cat file
https://example.com:description of : the site/application:category
http://example.com:description: of the site/application:category
android://package%20name:description of the site/application:category
android://package%20name|description of the site/application|category
android://package_name:17:something:description of the :huge: site/application:category
and then you can get the output you want using the following `sed` script (using a sed that has `-E` to enable EREs, e.g. GNU and BSD seds):
$ sed -E 's/([^ ]+)[:|]([^ ].*)[:|]/\1\t\2\t/' file
https://example.com description of : the site/application category
http://example.com description: of the site/application category
android://package%20name description of the site/application category
android://package%20name description of the site/application category
android://package_name:17:something description of the :huge: site/application category
or using any sed (if yours does not interpret `\t` in the replacement, put a literal tab character there instead):
$ sed 's/\([^ ]*\)[:|]\([^ ].*\)[:|]/\1\t\2\t/' file
https://example.com description of : the site/application category
http://example.com description: of the site/application category
android://package%20name description of the site/application category
android://package%20name description of the site/application category
android://package_name:17:something description of the :huge: site/application category
Those sed commands assume a description contains at least 1 blank and doesn't start with a `:` or `word:word`. If that's not the case, there is no way to separate a description from a URL given what we know so far about the input.
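As an illustration of that caveat, the sketch below feeds the `sed -E` regexp a made-up line whose intended description, `note:see the docs`, starts with `word:word`. The replacement uses ` | ` instead of a tab only so the column boundaries are visible; `line` and `out` are illustrative names.

```shell
# Hypothetical input: the intended description is "note:see the docs".
line='http://example.com:note:see the docs:category'

# Same regexp as the sed -E command above, with " | " in place of the
# tabs in the replacement.
out=$(printf '%s\n' "$line" |
    sed -E 's/([^ ]+)[:|]([^ ].*)[:|]/\1 | \2 | /')

echo "$out"
```

This should print `http://example.com:note | see the docs | category`: the leading `note` is absorbed into the URL column, because nothing in the input distinguishes it from part of the URL.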