python-3.x  tcp  python-requests  windows-10  python-multithreading

Multithreaded crawler gets slower and slower after running for some time


I wrote a multithreaded web crawler under Windows using the requests and threading libraries. The program gets slower and slower after running for some time (about 500 pages). When I stop it and run it again, it speeds up again. It seems that many pending connections are causing the slowdown. How should I handle this problem?

My code:

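import queue
import threading
import requests

# One shared session for all worker threads.
req = requests.Session()

urlQueue = queue.Queue()
pageList = []
urlList = [url1, url2, ..., url500]  # placeholder for the ~500 URLs to crawl
for u in urlList:
    urlQueue.put(u)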

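def parse(urlQueue):
    # Keep pulling URLs until the queue is empty.
    while True:
        try:
            url = urlQueue.get_nowait()
        except queue.Empty:
            break
        try:
            page = req.get(url)
            pageList.append(page)
        except requests.RequestException:
            # Skip URLs that fail and move on to the next one.
            continue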

if __name__ == '__main__':

    threadNum = 4
    threadList = []
    for i in range(threadNum):
        t = threading.Thread(target=parse, args=(urlQueue,))
        threadList.append(t)
    for thread in threadList:
        thread.start()
    for thread in threadList:
        thread.join()

I searched for the problem. One answer said it was caused by the TCP reuse and recycling settings under Linux. I don't understand that answer very well. It is reproduced below, translated from Chinese.

  1. Run this command in a Linux shell: netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
  2. The TIME_WAIT count was nearly 20,000, so a large number of closed TCP connections were still waiting to be released.
  3. Use the following commands to enable reuse and fast recycling of TIME_WAIT connections, respectively: echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse, echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
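For readers who cannot run awk (for example on Windows), the counting in step 1 can be sketched in Python. This is only an illustration: the function names count_tcp_states and print_live_counts are hypothetical, and the sketch assumes that the last column of each TCP row in the netstat -n output is the connection state, as it is in the default Linux and Windows output.

```python
import subprocess
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP connections per state, like the awk one-liner in step 1."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Linux rows start with "tcp" or "tcp6", Windows rows with "TCP".
        if fields and fields[0].lower().startswith("tcp"):
            counts[fields[-1]] += 1  # assumption: last column is the state
    return counts


def print_live_counts():
    """Run `netstat -n` on this machine and print one line per state."""
    result = subprocess.run(["netstat", "-n"], capture_output=True, text=True)
    for state, number in count_tcp_states(result.stdout).most_common():
        print(state, number)
```

Calling print_live_counts() while the crawler is running should show whether TIME_WAIT dominates the list, which is the symptom the translated answer describes.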

That answer seems correct, so it should be a network problem. How should I solve this under Windows?


Solution

  • A multithreaded crawler can exhaust the available TCP connections, because closed connections stay in the TIME_WAIT state for a while before they are released. Lowering TcpTimedWaitDelay lets Windows release them sooner. You can change it either by editing the registry manually or with code.

    How to do it on Windows with code (run it as an administrator; otherwise an error is raised):

    import win32api, win32con

    # Open the TCP/IP parameters key with write access.
    key = win32api.RegOpenKey(win32con.HKEY_LOCAL_MACHINE, r'SYSTEM\CurrentControlSet\Services\Tcpip\Parameters', 0, win32con.KEY_SET_VALUE)

    # TcpTimedWaitDelay is a DWORD value holding the delay in seconds.
    win32api.RegSetValueEx(key, 'TcpTimedWaitDelay', 0, win32con.REG_DWORD, 30)

    win32api.RegCloseKey(key)

    How to do it on Windows manually:

    1. Open Run and type regedit
    2. Find: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters
    3. Click Edit - New - DWORD (32-bit) Value
    4. Name it TcpTimedWaitDelay (if this entry already exists, you do not need to create it)
    5. Change the value to 30 (decimal). (The value ranges from 30 to 300 seconds, and the default is 120 seconds. The default is too long for a multithreaded crawler. A restart is generally needed before the new value takes effect.)
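To check what is currently stored, the value can be read back with Python's standard winreg module, so pywin32 is not required for this step. This is a minimal sketch, not part of the original answer: the helper names clamp_delay and read_timed_wait_delay are hypothetical, and the 30 to 300 second range is the one quoted in the steps above.

```python
import sys

TCPIP_PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def clamp_delay(seconds):
    """Clamp a requested delay to the 30-300 second range quoted above."""
    return max(30, min(300, int(seconds)))


def read_timed_wait_delay():
    """Return the stored TcpTimedWaitDelay, or None if unset or not on Windows."""
    if sys.platform != "win32":
        return None
    import winreg  # standard library, available on Windows only

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCPIP_PARAMS) as key:
            value, _value_type = winreg.QueryValueEx(key, "TcpTimedWaitDelay")
            return value
    except FileNotFoundError:
        # The entry has not been created, so Windows uses its default.
        return None
```

Reading the key does not normally need administrator rights. A result of None on Windows suggests that the entry has not been created yet.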

    Thank you all for contributing to this question. It will help a lot of people.
