I have some linux servers running apache trafficserver and have found about 1/20's requests take several milliseconds which is much longer than expected(should be 1-2ms).
After tracing by systemtap, I'v found the time is spent in linux kernel copy_to_user() call. (accept->inet_csk_accept->move_addr_to_user->copy_to_user)
The server load is quit low (<100qps) and copy_to_user() just copy 16 bytes' data(struct sockaddr) but spend hundreds of millisecond.
As I'm quit new to systemtap and kernel trace technique, I can't investigate further reason. I've checked cpu usage, swap usage
Thanks for any advice.
Hardware:
Software:
nginx 1.10.2, proxy user request to trafficserver
// systemtap
probe kernel.function("move_addr_to_user")
{
ts["move_addr_to_user", pid()] = gettimeofday_ms()
}
probe kernel.function("move_addr_to_user").return
{
printf("[%d]move_addr_to_user done:%d %s %d\n", gettimeofday_ms(), pid(), execname(), gettimeofday_ms()-ts["move_addr_to_user", pid()])
}
probe kernel.function("copy_to_user")
{
ts["copy_to_user", pid()] = gettimeofday_ms()
}
probe kernel.function("copy_to_user").return
{
printf("[%d]copy_to_user done:%d %s %d %d\n", gettimeofday_ms(), pid(), execname(), gettimeofday_ms()-ts["copy_to_user", pid()],
gettimeofday_ms()-ts["__copy_to_user", pid()])
}
// output:
[1495630190767] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 0
[1495630191164]copy_to_user done:24145 [ACCEPT 0:8080] 0
[1495630191164] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 0
[1495630192172]copy_to_user done:24145 [ACCEPT 0:8080] 861
[1495630192172] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 861
[1495630192173]copy_to_user done:24145 [ACCEPT 0:8080] 0
[1495630192173] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 0
[1495630192173]copy_to_user done:24145 [ACCEPT 0:8080] 0
[1495630192173] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 0
[1495630192173]copy_to_user done:24145 [ACCEPT 0:8080] 0
free -g
total used free shared buffers cached
Mem: 62 55 6 0 0 32
-/+ buffers/cache: 23 39
Swap: 15 0 15
top - 20:57:39 up 24 days, 19:26, 2 users, load average: 7.70, 9.43, 9.62
Tasks: 643 total, 1 running, 642 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 1.0%sy, 0.0%ni, 97.5%id, 1.1%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 65560992k total, 58525192k used, 7035800k free, 365084k buffers
Swap: 16777212k total, 0k used, 16777212k free, 33957572k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24145 traffics 20 0 21.7g 12g 6212 S 24.7 19.3 212:42.85 [ET_NET 0]
22173 root 20 0 677m 325m 1180 S 3.6 0.5 0:41.10 nginx
22161 root 20 0 677m 325m 1184 S 2.6 0.5 0:47.50 nginx
22168 root 20 0 677m 326m 2076 S 2.6 0.5 0:28.31 nginx
22150 root 20 0 677m 325m 1208 S 1.6 0.5 0:42.75 nginx
22165 root 20 0 677m 325m 1200 S 1.6 0.5 0:31.77 nginx
Update:
@employee of the month thanks for you advise, after add some probe on __do_page_fault, I found the time was spent on __do_page_fault -> down_read(&mm->mmap_sem);
[1495677639202]copy_to_user done:24145 [ACCEPT 0:8080] 1
[1495677639202] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 1
[1495677639314]copy_to_user done:24145 [ACCEPT 0:8080] 0
[1495677639314] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 0
[1495677641329]do page fault done:24145 [ACCEPT 0:8080] 622
[1495677641329]copy_to_user done:24145 [ACCEPT 0:8080] 622
[1495677641329] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 622
@Ricardo Biehl Pasquali thanks for your advise. Apache trafficserver read/write to hdd in seperate threads. trafficserver have 1 thread to accept connections, 88(8 threads per hdd)threads to do blocking read/write(cache content), but i don't understand why blocking read/write in other threads could lead to high latency of __do_page_fault().
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 16.00 0.00 4.00 0.00 160.00 40.00 0.05 13.50 0.00 13.50 7.75 3.10
sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdj 0.00 0.00 6.00 0.00 4326.00 0.00 721.00 0.06 10.50 10.50 0.00 6.83 4.10
sdc 0.00 0.00 2.00 0.00 1472.00 0.00 736.00 0.04 18.50 18.50 0.00 9.50 1.90
sdh 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 14.00 0.00 10464.00 0.00 747.43 0.04 2.71 2.71 0.00 1.79 2.50
Update:
Problem solved! The root cause is trafficserver periodly call system() to backup file in an ancient plugin we wrote.
Well, the reasonable explanation is that you are taking a page fault. One possible reason would be that the server forks/exits depending on how the load changes and the new child has to fault the page, but as a minor fault it should not take long unless there is memory shortage.
Another one is there is memory shortage and you are in fact swapping.
Either way, given you have come this far I don't see what's the problem going further. You want to put the probe on __do_page_fault and continue from there.