python, python-3.x, web-crawler, google-crawlers

How to crawl multiple keywords with Python icrawler


I have an array with a lot of keywords:

array = ['table', 'chair', 'pen']

I want to crawl 5 images from Google Image Search for each item in my array, using the Python icrawler library.

Here is the initialization:

from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(
  parser_threads=2, 
  downloader_threads=4,
  storage={ 'root_dir': 'images' }
)

I use a loop to crawl each item:

for item in array:
  google_crawler.crawl(
    keyword=item, 
    offset=0, 
    max_num=5,
    min_size=(500, 500)
  )

However, I get the error log:

  File "crawler.py", line 20, in <module>
    min_size=(500, 500)
  File "/home/user/opt/miniconda3/envs/pak/lib/python3.6/site-packages/icrawler/builtin/google.py", line 83, in crawl
    feeder_kwargs=feeder_kwargs, downloader_kwargs=downloader_kwargs)
  File "/home/user/opt/miniconda3/envs/pak/lib/python3.6/site-packages/icrawler/crawler.py", line 166, in crawl
    self.feeder.start(**feeder_kwargs)                                   
  File "/home/user/opt/miniconda3/envs/pak/lib/python3.6/site-packages/icrawler/utils/thread_pool.py", line 66, in start
    worker.start()                                                       
  File "/home/user/opt/miniconda3/envs/pak/lib/python3.6/threading.py", line 842, in start
    raise RuntimeError("threads can only be started once")
RuntimeError: threads can only be started once

It seems that google_crawler.crawl cannot be called more than once. How can I fix that?


Solution

  • In the latest version, you can reuse the same crawler like this:

    from icrawler.builtin import GoogleImageCrawler
    
    google_crawler = GoogleImageCrawler(
        parser_threads=2,
        downloader_threads=4,
        storage={'root_dir': 'images'}
    )
    
    for keyword in ['cat', 'dog']:
        google_crawler.crawl(
            keyword=keyword, max_num=5, min_size=(500, 500), file_idx_offset='auto')
        # Setting `file_idx_offset` to 'auto' keeps the 5 dog images from being
        # named 000001.jpg to 000005.jpg again; they continue from 000006.jpg.
    

    Or, if you want to download the images for each keyword into its own folder, simply create a separate GoogleImageCrawler instance per keyword.

    from icrawler.builtin import GoogleImageCrawler
    
    for keyword in ['cat', 'dog']:
        google_crawler = GoogleImageCrawler(
            parser_threads=2,
            downloader_threads=4,
            storage={'root_dir': 'images/{}'.format(keyword)}
        )
        google_crawler.crawl(
            keyword=keyword, max_num=5, min_size=(500, 500))
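    To adapt the second approach to the original array, the per-keyword storage paths can be built up front. A minimal sketch (the `keyword_dirs` helper is hypothetical, not part of icrawler):

    ```python
    import os

    def keyword_dirs(root, keywords):
        """Map each keyword to its own storage directory under `root`."""
        return {kw: os.path.join(root, kw) for kw in keywords}

    dirs = keyword_dirs('images', ['table', 'chair', 'pen'])
    # dirs['chair'] is the path 'images/chair' (using the platform separator)
    ```

    Each path can then be passed as `storage={'root_dir': dirs[keyword]}` to a fresh GoogleImageCrawler inside the loop.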