I'm still working with this huge list of URLs; all the help I have received so far has been great.
At the moment the list looks like this (17,000 URLs in total, though):
http://www.example.com/page?CONTENT_ITEM_ID=1
http://www.example.com/page?CONTENT_ITEM_ID=3
http://www.example.com/page?CONTENT_ITEM_ID=2
http://www.example.com/page?CONTENT_ITEM_ID=1
http://www.example.com/page?CONTENT_ITEM_ID=2
http://www.example.com/page?CONTENT_ITEM_ID=3
http://www.example.com/page?CONTENT_ITEM_ID=3
I can filter out the duplicates without a problem using a couple of methods (awk, etc.). What I am really looking to do is remove the duplicate URLs while also counting how many times each URL appears in the list, and print that count next to the URL with a pipe separator. After processing, the list should look like this:
http://www.example.com/page?CONTENT_ITEM_ID=1 | 2
http://www.example.com/page?CONTENT_ITEM_ID=2 | 2
http://www.example.com/page?CONTENT_ITEM_ID=3 | 3
What would be the fastest method to achieve this?
Are you going to do this over and over again? If not, then "fastest" as in fastest to implement is probably
sort </file/of/urls | uniq -c | awk '{ print $2 " | " $1 }'
(not tested, I'm not near a UNIX command line)
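If sorting ever becomes the bottleneck (it usually won't at 17,000 lines), the same result can be produced in a single pass with awk alone, counting each line in an associative array. This is only a sketch, assuming a POSIX awk and that /file/of/urls is your input file:

awk '{ count[$0]++ } END { for (url in count) print url " | " count[url] }' /file/of/urls

One trade-off: the sort | uniq -c version comes back sorted by URL, while the single-pass awk version prints in arbitrary order, so pipe its output through sort afterwards if ordering matters.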