I currently have a Python script that scrapes data from a single URL.
To speed up the process, the script uses the Pool class from the multiprocessing module; I'll call the script "script_one.py" for the sake of explanation.
The script exclusively issues GET requests to collect the JSON/HTML results from the target URL, constantly switching proxy addresses, and saves the results to a text file.
My question is: if I run the same code (script_one.py) on multiple virtual machines, will I further speed up the process without running into any issues with the GIL?
Here below is my code:
import requests, time, random
from multiprocessing import Pool

def script_one(file_name, from_letter, to_letter):
    print('Here it does the get request and collects data')
    print('Here it saves on file')

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.starmap(script_one, [('r_ba', 'r', 'rba'), ('rbrca', 'rb', 'rca'), ('rcrda', 'rc', 'rda'),
                                     ('rdrea', 'rd', 'rea'), ('rerfa', 're', 'rfa'), ('rfrga', 'rf', 'rga'),
                                     ('rgrha', 'rg', 'rha'), ('rhria', 'rh', 'ria'), ('rirja', 'ri', 'rja'),
                                     ('rjrka', 'rj', 'rka'), ('rkrla', 'rk', 'rla'), ('rlrma', 'rl', 'rma'),
                                     ('rmrna', 'rm', 'rna'), ('rnroa', 'rn', 'roa'), ('rorpa', 'ro', 'rpa'),
                                     ('rprqa', 'rp', 'rqa'), ('rqrra', 'rq', 'rra'), ('rrrsa', 'rr', 'rsa'),
                                     ('rsrta', 'rs', 'rta'), ('rtrua', 'rt', 'rua'), ('rurva', 'ru', 'rva'),
                                     ('rvrwa', 'rv', 'rwa'), ('rwrxa', 'rw', 'rxa'), ('rxrya', 'rx', 'rya'),
                                     ('ryrza', 'ry', 'rza'), ('rzr0a', 'rz', 'r0a')]))
        p.close()
        p.join()
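For context, here is a hedged sketch of what the body of script_one might look like. The URL, the proxy addresses, and the per-range file layout below are all placeholders I made up for illustration; they are not taken from the actual script:

```python
import random
import time

import requests  # third-party: pip install requests

# Hypothetical proxy pool -- replace with your real proxy addresses.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
]

def pick_proxy():
    """Pick a random proxy from the pool so each request uses a different address."""
    addr = random.choice(PROXIES)
    return {"http": addr, "https": addr}

def script_one(file_name, from_letter, to_letter):
    """Sketch: fetch one letter range and append the raw response to a text file."""
    url = f"https://example.com/api?from={from_letter}&to={to_letter}"  # placeholder URL
    resp = requests.get(url, proxies=pick_proxy(), timeout=30)
    resp.raise_for_status()
    with open(f"{file_name}.txt", "a", encoding="utf-8") as fh:
        fh.write(resp.text + "\n")
    time.sleep(random.uniform(0.5, 1.5))  # polite delay between requests
```

Since each Pool worker is a separate process writing to its own file, the workers never contend for the same file handle.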
Currently there are multiple options available:
- Multi-processing
- Multi-threading
- Running multiple virtual machines in parallel
- For Windows users, running the script on multiple virtual desktops (I'm guessing the same should work for Linux users)
- Manually running multiple terminal windows at the same time

Multi-threading works here because, as @MatteoItalia pointed out, the GIL is released while a request is waiting on the socket.
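To see why threads help even with the GIL, here is a small self-contained demo. time.sleep stands in for a blocking GET request; like a socket wait, it releases the GIL, so the five simulated requests overlap instead of running one after another:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(i):
    # time.sleep stands in for a blocking GET: like a socket wait,
    # it releases the GIL, letting other threads run in the meantime.
    time.sleep(0.2)
    return i

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as ex:
    results = list(ex.map(fake_request, range(5)))
elapsed = time.perf_counter() - start
# Five 0.2 s "requests" overlap, so total wall time is ~0.2 s, not ~1 s.
```

For CPU-bound work the same test would show no speedup from threads, which is exactly when multiprocessing (or multiple machines) pays off.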