I have an array with a lot of keywords:
array = ['table', 'chair', 'pen']
I want to crawl 5 images from Google Image Search for each item in my array, using the Python package icrawler.
Here is the initialization:
from icrawler.builtin import GoogleImageCrawler
google_crawler = GoogleImageCrawler(
    parser_threads=2,
    downloader_threads=4,
    storage={'root_dir': 'images'}
)
I use a loop to crawl each item:
for item in array:
    google_crawler.crawl(
        keyword=item,
        offset=0,
        max_num=5,
        min_size=(500, 500)
    )
However, I get the error log:
  File "crawler.py", line 20, in <module>
    min_size=(500, 500)
  File "/home/user/opt/miniconda3/envs/pak/lib/python3.6/site-packages/icrawler/builtin/google.py", line 83, in crawl
    feeder_kwargs=feeder_kwargs, downloader_kwargs=downloader_kwargs)
  File "/home/user/opt/miniconda3/envs/pak/lib/python3.6/site-packages/icrawler/crawler.py", line 166, in crawl
    self.feeder.start(**feeder_kwargs)
  File "/home/user/opt/miniconda3/envs/pak/lib/python3.6/site-packages/icrawler/utils/thread_pool.py", line 66, in start
    worker.start()
  File "/home/user/opt/miniconda3/envs/pak/lib/python3.6/threading.py", line 842, in start
    raise RuntimeError("threads can only be started once")
RuntimeError: threads can only be started once
which suggests that I cannot call google_crawler.crawl more than once on the same instance. How can I fix that?
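The traceback points at Python's threading module: a Thread object can only be started once, and calling crawl again on the same crawler tries to restart its already-used worker threads. A minimal sketch of that underlying behavior (no icrawler needed):

```python
import threading

worker = threading.Thread(target=lambda: None)
worker.start()   # first start: fine
worker.join()

try:
    worker.start()  # second start on the same Thread object
except RuntimeError as exc:
    print(exc)  # threads can only be started once
```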
In the latest version, you can use it like this:
from icrawler.builtin import GoogleImageCrawler
google_crawler = GoogleImageCrawler(
    parser_threads=2,
    downloader_threads=4,
    storage={'root_dir': 'images'}
)
for keyword in ['cat', 'dog']:
    google_crawler.crawl(
        keyword=keyword, max_num=5, min_size=(500, 500), file_idx_offset='auto')
    # Setting `file_idx_offset` to 'auto' prevents the 5 dog images from being
    # named 000001.jpg to 000005.jpg (which would overwrite the cat images);
    # instead they are named starting from 000006.jpg.
Or, if you want to download these images to different folders, you can simply create one GoogleImageCrawler instance per keyword.
from icrawler.builtin import GoogleImageCrawler
for keyword in ['cat', 'dog']:
    google_crawler = GoogleImageCrawler(
        parser_threads=2,
        downloader_threads=4,
        storage={'root_dir': 'images/{}'.format(keyword)}
    )
    google_crawler.crawl(
        keyword=keyword, max_num=5, min_size=(500, 500))