
Count Duplicate URLs, fastest method possible


I'm still working with this huge list of URLs; all the help I have received so far has been great.

At the moment I have the list looking like this (17000 URLs though):

  • http://www.example.com/page?CONTENT_ITEM_ID=1
  • http://www.example.com/page?CONTENT_ITEM_ID=3
  • http://www.example.com/page?CONTENT_ITEM_ID=2
  • http://www.example.com/page?CONTENT_ITEM_ID=1
  • http://www.example.com/page?CONTENT_ITEM_ID=2
  • http://www.example.com/page?CONTENT_ITEM_ID=3
  • http://www.example.com/page?CONTENT_ITEM_ID=3

I can filter out the duplicates no problem with a couple of methods, awk etc. What I am really looking to do is remove the duplicate URLs while also counting how many times each URL appears in the list, and print the count next to the URL with a pipe separator. After processing, the list should look like this:

url | count
http://www.example.com/page?CONTENT_ITEM_ID=1 | 2
http://www.example.com/page?CONTENT_ITEM_ID=2 | 2
http://www.example.com/page?CONTENT_ITEM_ID=3 | 3

What method would be the fastest way to achieve this?


Solution

  • Are you going to do this over and over again? If not, then "fastest" as in fastest to implement is probably

    sort </file/of/urls | uniq -c | awk '{ print $2 " | " $1 }'
    

    (not tested, I'm not near a UNIX command line)
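
    For reference, run against the seven-line sample above, that pipeline should produce something along these lines (counts taken from the sample itself):

        http://www.example.com/page?CONTENT_ITEM_ID=1 | 2
        http://www.example.com/page?CONTENT_ITEM_ID=2 | 2
        http://www.example.com/page?CONTENT_ITEM_ID=3 | 3

    If the list will be processed repeatedly, one common alternative (a sketch only, not tested here either; /file/of/urls is a placeholder path) is a single pass in awk that counts into an associative array and skips the sort:

        # count each distinct line, then print "url | count"
        awk '{ count[$0]++ } END { for (url in count) print url " | " count[url] }' /file/of/urls

    The output order of this version is unspecified, so pipe it through sort afterwards if ordered output matters.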