Tags: c++, linux, linux-kernel, systemtap, apache-traffic-server

Why does copy_to_user() take several hundred milliseconds?


I have some Linux servers running Apache Traffic Server, and I've found that about 1 in 20 requests takes several hundred milliseconds, which is much longer than expected (it should be 1-2 ms).

After tracing with SystemTap, I found the time is spent in the kernel's copy_to_user() call (accept -> inet_csk_accept -> move_addr_to_user -> copy_to_user).

The server load is quite low (<100 qps), and copy_to_user() copies just 16 bytes of data (a struct sockaddr), yet it takes hundreds of milliseconds.

As I'm quite new to SystemTap and kernel tracing techniques, I can't investigate the cause any further. I have already checked CPU usage and swap usage.

Thanks for any advice.

Hardware:

  • cpu: Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
  • memory: 64 GB
  • disk: 11 × 6 TB HDD

Software:

  • CentOS 6, kernel 2.6.32-696.el6.x86_64 #1 SMP Tue Mar 21 19:29:05 UTC 2017 x86_64 GNU/Linux
  • Apache Traffic Server 5.3.2, configured with 10 GB of RAM
  • nginx 1.10.2, proxying user requests to trafficserver

    // systemtap
    global ts

    probe kernel.function("move_addr_to_user")
    {
        ts["move_addr_to_user", pid()] = gettimeofday_ms()
    }
    
    probe kernel.function("move_addr_to_user").return
    {
        printf("[%d]move_addr_to_user done:%d %s %d\n", gettimeofday_ms(),
               pid(), execname(), gettimeofday_ms()-ts["move_addr_to_user", pid()])
    }
    
    probe kernel.function("copy_to_user")
    {
        ts["copy_to_user", pid()] = gettimeofday_ms()
    }
    
    probe kernel.function("copy_to_user").return
    {
        printf("[%d]copy_to_user done:%d %s %d\n", gettimeofday_ms(),
               pid(), execname(), gettimeofday_ms()-ts["copy_to_user", pid()])
    }
    
    // output:
    [1495630190767] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 0
    [1495630191164]copy_to_user done:24145 [ACCEPT 0:8080] 0
    [1495630191164] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 0
    [1495630192172]copy_to_user done:24145 [ACCEPT 0:8080] 861
    [1495630192172] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 861
    [1495630192173]copy_to_user done:24145 [ACCEPT 0:8080] 0
    [1495630192173] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 0
    [1495630192173]copy_to_user done:24145 [ACCEPT 0:8080] 0
    [1495630192173] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 0
    [1495630192173]copy_to_user done:24145 [ACCEPT 0:8080] 0
    
    
    
    free -g
                 total       used       free     shared    buffers     cached
    Mem:            62         55          6          0          0         32
    -/+ buffers/cache:         23         39
    Swap:           15          0         15
    
    top - 20:57:39 up 24 days, 19:26,  2 users,  load average: 7.70, 9.43, 9.62
    Tasks: 643 total,   1 running, 642 sleeping,   0 stopped,   0 zombie
    Cpu(s):  0.1%us,  1.0%sy,  0.0%ni, 97.5%id,  1.1%wa,  0.0%hi,  0.3%si,  0.0%st
    Mem:  65560992k total, 58525192k used,  7035800k free,   365084k buffers
    Swap: 16777212k total,        0k used, 16777212k free, 33957572k cached
      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                         
    24145 traffics  20   0 21.7g  12g 6212 S 24.7 19.3 212:42.85 [ET_NET 0]                                                                                                                                                                     
    22173 root      20   0  677m 325m 1180 S  3.6  0.5   0:41.10 nginx                                                                                                                                                                           
    22161 root      20   0  677m 325m 1184 S  2.6  0.5   0:47.50 nginx                                                                                                                                                                           
    22168 root      20   0  677m 326m 2076 S  2.6  0.5   0:28.31 nginx                                                                                                                                                                           
    22150 root      20   0  677m 325m 1208 S  1.6  0.5   0:42.75 nginx                                                                                                                                                                           
    22165 root      20   0  677m 325m 1200 S  1.6  0.5   0:31.77 nginx
    

Update:

@employee of the month, thanks for your advice. After adding some probes on __do_page_fault, I found the time was spent in __do_page_fault -> down_read(&mm->mmap_sem);

    [1495677639202]copy_to_user done:24145 [ACCEPT 0:8080] 1
    [1495677639202] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 1
    [1495677639314]copy_to_user done:24145 [ACCEPT 0:8080] 0
    [1495677639314] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 0
    [1495677641329]do page fault done:24145 [ACCEPT 0:8080] 622
    [1495677641329]copy_to_user done:24145 [ACCEPT 0:8080] 622
    [1495677641329] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 622

@Ricardo Biehl Pasquali, thanks for your advice. Apache Traffic Server reads/writes to the HDDs in separate threads: it has 1 thread to accept connections and 88 threads (8 per HDD) doing blocking reads/writes (cache content). But I don't understand why blocking reads/writes in other threads could lead to high latency in __do_page_fault().

    Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sda               0.00    16.00    0.00    4.00     0.00   160.00    40.00     0.05   13.50    0.00   13.50   7.75   3.10
    sdi               0.00     0.00    0.00    0.00     0.00     0.00      0.00     0.00    0.00    0.00    0.00   0.00   0.00
    sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    sdj               0.00     0.00    6.00    0.00  4326.00     0.00   721.00     0.06   10.50   10.50    0.00   6.83   4.10
    sdc               0.00     0.00    2.00    0.00  1472.00     0.00   736.00     0.04   18.50   18.50    0.00   9.50   1.90
    sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    sdg               0.00     0.00   14.00    0.00 10464.00     0.00   747.43     0.04    2.71    2.71    0.00   1.79   2.50

Update:

Problem solved! The root cause is that trafficserver periodically calls system() to back up a file, in an ancient plugin we wrote. system() fork()s, and fork() holds the parent's mmap_sem for write while duplicating the address space; with ~12 GB resident that presumably takes long enough to block concurrent page faults (which need mmap_sem for read), stalling copy_to_user() in the accept thread.


Solution

  • Well, the reasonable explanation is that you are taking a page fault. One possible reason would be that the server forks/exits depending on how the load changes, so a new child has to fault its pages in; but as a minor fault that should not take long unless there is a memory shortage.

    Another is that there is a memory shortage and you are in fact swapping.

    Either way, given you have come this far, I don't see what stops you from going further. You want to put a probe on __do_page_fault and continue from there.
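    A minimal SystemTap sketch for that probe, in the same style as the script above (the function name and its probeability depend on the kernel version; it worked on this 2.6.32 kernel per the update):

    ```
    global pf

    probe kernel.function("__do_page_fault")
    {
        pf[tid()] = gettimeofday_ms()
    }

    probe kernel.function("__do_page_fault").return
    {
        t = gettimeofday_ms() - pf[tid()]
        if (t > 100)
            printf("[%d]do page fault done:%d %s %d\n",
                   gettimeofday_ms(), pid(), execname(), t)
        delete pf[tid()]
    }
    ```

    Keying by tid() rather than pid() avoids mixing up concurrent faults from different threads of the same process.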