I wrote a multithreaded web crawler on Windows using the requests and threading libraries. The program becomes slower and slower after running for some time (about 500 pages). When I stop it and run it again, it is fast again. It seems that many pending connections are causing the slowdown. How should I handle this problem?
My code:
    import queue
    import threading

    import requests

    req = requests.Session()
    urlQueue = queue.Queue()
    pageList = []
    urlList = [url1, url2, ....url500]  # placeholder: the URLs to crawl
    for i in urlList:
        urlQueue.put(i)

    def parse(urlQueue):
        # Each worker keeps pulling URLs until the queue is empty.
        while True:
            try:
                url = urlQueue.get_nowait()
            except queue.Empty:
                break
            try:
                page = req.get(url)
                pageList.append(page)
            except requests.RequestException:
                # Skip URLs that fail and move on to the next one.
                continue

    if __name__ == '__main__':
        threadNum = 4
        threadList = []
        for i in range(threadNum):
            t = threading.Thread(target=parse, args=(urlQueue,))
            threadList.append(t)
        for thread in threadList:
            thread.start()
        for thread in threadList:
            thread.join()
I searched for the problem. One answer said it was a TCP connection reuse and recycling problem under Linux. I don't understand that answer very well. The answer is below; I translated it from Chinese.
    netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'

TIME_WAIT is nearly 20,000 ("2W"), so there must be many TCP connections.

    echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
    echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
That answer seems correct; it looks like a network problem. How should I solve this under Windows?
The multithreaded crawler exhausts the available TCP connections, because closed connections stay in TIME_WAIT and keep their ports occupied. Lowering the TcpTimedWaitDelay registry value releases those connections sooner so they can be reused. We can make the change either manually with regedit or with code.
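Before changing the setting, it may help to confirm that TIME_WAIT connections are actually piling up. The following is a minimal sketch of a Python counterpart to the netstat/awk one-liner quoted in the question. It assumes that `netstat -n` is available and prints one TCP connection per row with the state in the last column. The function names and the sample text are made up for illustration.

```python
import subprocess
from collections import Counter

# Hypothetical sample of `netstat -n` output, used only for illustration.
SAMPLE = """\
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    192.168.1.5:50123      203.0.113.10:80        TIME_WAIT
  TCP    192.168.1.5:50124      203.0.113.10:80        TIME_WAIT
  TCP    192.168.1.5:50125      203.0.113.10:443       ESTABLISHED
"""

def count_tcp_states(netstat_output):
    """Count TCP connections per state from `netstat -n` text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Keep only TCP rows; the state is assumed to be the last column.
        if len(fields) >= 4 and fields[0].lower().startswith("tcp"):
            states[fields[-1]] += 1
    return states

def current_tcp_states():
    """Run `netstat -n` on this machine and count the TCP states."""
    result = subprocess.run(
        ["netstat", "-n"], capture_output=True, text=True, check=True
    )
    return count_tcp_states(result.stdout)

if __name__ == "__main__":
    # Parse the sample text; call current_tcp_states() for live numbers.
    for state, count in count_tcp_states(SAMPLE).most_common():
        print(state, count)
```

If the TIME_WAIT count keeps climbing into the thousands while the crawler runs, that would be consistent with the explanation above.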
How to set TcpTimedWaitDelay on Windows with code (run it as an administrator, otherwise an error is raised):
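    import win32api  # from the pywin32 package
    import win32con

    # Open the TCP/IP parameters key with write access (requires administrator rights).
    key = win32api.RegOpenKey(win32con.HKEY_LOCAL_MACHINE, r'SYSTEM\CurrentControlSet\Services\Tcpip\Parameters', 0, win32con.KEY_SET_VALUE)
    # TcpTimedWaitDelay is a REG_DWORD holding a number of seconds.
    win32api.RegSetValueEx(key, 'TcpTimedWaitDelay', 0, win32con.REG_DWORD, 30)
    win32api.RegCloseKey(key)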
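To check that the value was written, it can be read back. This is a minimal sketch that uses the standard-library `winreg` module instead of pywin32. The helper name `read_timed_wait_delay` is made up for illustration, and the function returns None on non-Windows systems or when the value has not been created.

```python
import sys

KEY_PATH = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

def read_timed_wait_delay():
    """Return the stored TcpTimedWaitDelay value, or None if it is unavailable."""
    if sys.platform != "win32":
        return None  # the registry only exists on Windows
    import winreg  # standard library, Windows only
    try:
        # Reading needs only the default read access, not administrator rights.
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
            value, _value_type = winreg.QueryValueEx(key, "TcpTimedWaitDelay")
    except FileNotFoundError:
        return None  # the value has not been created
    return value

if __name__ == "__main__":
    print("TcpTimedWaitDelay:", read_timed_wait_delay())
```

A result of None on Windows would suggest that the value was never created, for example because the write was not run as an administrator.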
How to do it on Windows manually:
1. Open Run and type regedit.
2. Go to HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters.
3. Choose Edit - New - DWORD (32-bit) Value and name it TcpTimedWaitDelay (if this entry already exists, you do not need to create it).
4. Set its value to 30 (decimal), the same value used in the code, and restart Windows so the change takes effect.

Thank you all for your contributions to this question. This helps a lot of people.