StormCrawler throws Halting due to Out Of Memory Error


I am working with StormCrawler 1.13 and Elasticsearch 6.5.2, crawling a website that has millions of documents. The crawler gives no errors when I restrict the crawl to a specific domain by applying fast.urlfilter.json. When I point it at the main domain with "ignoreOutsideHost": false, "ignoreOutsideDomain": true, it throws java.lang.OutOfMemoryError: Java heap space and "Halting due to Out Of Memory Error...FetcherThread #0". Is there a way to crawl smoothly without running out of memory? The detailed logs are below.

Thanks in advance, and apologies for the huge post.

worker.log:

2019-01-22 08:31:51.989 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched https://arts.test.edu/login/?next=/schools/film-animation/other-school-film-and-animation-festivals-and-awards/test-film-and-animation-awards-1998 with status 200 in msec 107

2019-01-22 08:31:56.815 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://portfolios.test.edu/search?tags=Othello with status 200 in msec 162

2019-01-22 08:32:01.862 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched https://campusgroups.test.edu/slu/members/ with status 200 in msec 229

2019-01-22 08:32:06.693 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://arts.test.edu/news/16 with status 200 in msec 119

2019-01-22 08:32:11.601 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] Crawl delay for queue: www.apply.test.edu  is set to 10000 as per robots.txt. url: https://www.apply.test.edu/news/testapply-holds-student-research-fair

2019-01-22 08:32:13.765 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched https://www.apply.test.edu/news/testapply-holds-student-research-fair with status 200 in msec 2164

2019-01-22 08:32:16.616 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://apps.test.edu/cos/scms/equipment/schedules.php?id=25&date=9-21-2019 with status 200 in msec 46

2019-01-22 08:32:21.780 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://edge.test.edu/edge/P19319/public/FILENAME.docx with status 200 in msec 156

2019-01-22 08:32:27.837 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://applywebdev.test.edu/news/booth-biography-selected-national-reading-project?page=6 with status 200 in msec 1231

2019-01-22 08:32:30.075 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://applywebdev.test.edu/news/grant-improve-problem-solving-skills-deaf-and-hard-hearing-students?page=6 with status 200 in msec 1235

2019-01-22 08:32:31.775 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://portfolios.test.edu/search?tags=feedback with status 200 in msec 197

2019-01-22 08:32:36.582 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] Crawl delay for queue: infoguides.test.edu  is set to 10000 as per robots.txt. url: http://infoguides.test.edu/c.php?g=357360&p=4416876

2019-01-22 08:32:36.693 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://infoguides.test.edu/c.php?g=357360&p=4416876 with status 200 in msec 111

2019-01-22 08:32:41.602 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] Crawl delay for queue: www.sic.test.edu  is set to 10000 as per robots.txt. url: https://www.sic.test.edu/news/sic-undergraduate-research-sparks-prestigious-professorship-astronomy?page=10

2019-01-22 08:32:42.455 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched https://www.sic.test.edu/news/sic-undergraduate-research-sparks-prestigious-professorship-astronomy?page=10 with status 200 in msec 853

2019-01-22 08:32:46.572 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched http://spiff.test.edu/richmond/testobs/jul25_2013/?C=S;O=A with status 200 in msec 3

2019-01-22 08:32:51.595 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] Crawl delay for queue: www.apply.test.edu  is set to 10000 as per robots.txt. url: https://www.apply.test.edu/news/testapply-students-graduate-accolades

2019-01-22 08:32:53.748 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched https://www.apply.test.edu/news/testapply-students-graduate-accolades with status 200 in msec 2152

2019-01-22 08:33:01.976 c.d.s.b.FetcherBolt FetcherThread #0 [INFO] [Fetcher #3] Fetched https://inside.test.edu/?date=2023-12-1&t=list with status 200 in msec 355

2019-01-22 08:33:11.957 STDIO FetcherThread #0 [ERROR] Halting due to Out Of Memory Error...FetcherThread #0

2019-01-22 08:33:11.960 STDERR Thread-2 [INFO] java.lang.OutOfMemoryError: Java heap space
2019-01-22 08:33:11.968 STDERR Thread-2 [INFO] Dumping heap to artifacts/heapdump ...
2019-01-22 08:33:11.968 STDERR Thread-2 [INFO] Unable to create artifacts/heapdump: File exists

supervisor.log:

2019-01-22 08:31:40.341 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Created Worker ID da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:31:40.341 o.a.s.d.s.Container SLOT_6700 [INFO] Setting up 164ddb0a-fcba-41e3-9a14-386248370bcf:da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:31:40.341 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:31:40.341 o.a.s.d.s.Container SLOT_6700 [INFO] SET worker-user da2944c7-cfd2-409a-856b-84f0a0014f56 testweb
2019-01-22 08:31:40.342 o.a.s.d.s.Container SLOT_6700 [INFO] Creating symlinks for worker-id: da2944c7-cfd2-409a-856b-84f0a0014f56 storm-id: www-staging-crawler-4-1548106042 for files(1): [resources]
2019-01-22 08:31:40.342 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Launching worker with assignment LocalAssignment(topology_id:www-staging-crawler-4-1548106042, executors:[ExecutorInfo(task_start:8, task_end:8), ExecutorInfo(task_start:2, task_end:2), ExecutorInfo(task_start:6, task_end:6), ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:3, task_end:3), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:9, task_end:9), ExecutorInfo(task_start:5, task_end:5)], resources:WorkerResources(mem_on_heap:0.0, mem_off_heap:0.0, cpu:0.0), owner:testweb) for this supervisor 164ddb0a-fcba-41e3-9a14-386248370bcf on port 6700 with id da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:31:40.342 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Launching worker with command: 'java' '-cp' '/home/testweb/apps/crawler/apache-storm-1.2.2/lib/*:/home/testweb/apps/crawler/apache-storm-1.2.2/extlib/*:/home/testweb/crawler/apache-storm-1.2.2/conf:/home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/supervisor/stormdist/www-staging-crawler-4-1548106042/stormjar.jar' '-Xmx64m' '-Dlogging.sensitivity=S3' '-Dlogfile.name=worker.log' '-Dstorm.home=/home/testweb/apps/crawler/apache-storm-1.2.2' '-Dworkers.artifacts=/home/testweb/var/logs/workers-artifacts' '-Dstorm.id=www-staging-crawler-4-1548106042' '-Dworker.id=da2944c7-cfd2-409a-856b-84f0a0014f56' '-Dworker.port=6700' '-Dstorm.log.dir=/home/testweb/var/logs' '-Dlog4j.configurationFile=/home/testweb/apps/crawler/apache-storm-1.2.2/log4j2/worker.xml' '-DLog4jContextSelector=org.apache.logging.log4j.core.selector.BasicContextSelector' '-Dstorm.local.dir=storm-local' 'org.apache.storm.LogWtester' 'java' '-server' '-Dlogging.sensitivity=S3' '-Dlogfile.name=worker.log' '-Dstorm.home=/home/testweb/apps/crawler/apache-storm-1.2.2' '-Dworkers.artifacts=/home/testweb/var/logs/workers-artifacts' '-Dstorm.id=www-staging-crawler-4-1548106042' '-Dworker.id=da2944c7-cfd2-409a-856b-84f0a0014f56' '-Dworker.port=6700' '-Dstorm.log.dir=/home/testweb/var/logs' '-Dlog4j.configurationFile=/home/testweb/apps/crawler/apache-storm-1.2.2/log4j2/worker.xml' '-DLog4jContextSelector=org.apache.logging.log4j.core.selector.BasicContextSelector' '-Dstorm.local.dir=storm-local' '-Xmx2048m' '-XX:+PrintGCDetails' '-Xloggc:artifacts/gc.log' '-XX:+PrintGCDateStamps' '-XX:+PrintGCTimeStamps' '-XX:+UseGCLogFileRotation' '-XX:NumberOfGCLogFiles=10' '-XX:GCLogFileSize=1M' '-XX:+HeapDumpOnOutOfMemoryError' '-XX:HeapDumpPath=artifacts/heapdump' 
'-Djava.library.path=/home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/supervisor/stormdist/www-staging-crawler-4-1548106042/resources/Linux-amd64:/home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/supervisor/stormdist/www-staging-crawler-4-1548106042/resources:/usr/local/lib:/opt/local/lib:/usr/lib' '-Dstorm.conf.file=' '-Dstorm.options=' '-Djava.io.tmpdir=/home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/workers/da2944c7-cfd2-409a-856b-84f0a0014f56/tmp' '-cp' '/home/testweb/apps/crawler/apache-storm-1.2.2/lib/*:/home/testweb/apps/crawler/apache-storm-1.2.2/extlib/*:/home/testweb/crawler/apache-storm-1.2.2/conf:/home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/supervisor/stormdist/www-staging-crawler-4-1548106042/stormjar.jar' 'org.apache.storm.daemon.worker' 'www-staging-crawler-4-1548106042' '164ddb0a-fcba-41e3-9a14-386248370bcf' '6700' 'da2944c7-cfd2-409a-856b-84f0a0014f56'. 
2019-01-22 08:31:40.344 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE KILL_AND_RELAUNCH msInState: 18 topo:www-staging-crawler-4-1548106042 worker:da2944c7-cfd2-409a-856b-84f0a0014f56 -> WAITING_FOR_WORKER_START msInState: 0 topo:www-staging-crawler-4-1548106042 worker:da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:31:45.350 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_WORKER_START msInState: 5006 topo:www-staging-crawler-4-1548106042 worker:da2944c7-cfd2-409a-856b-84f0a0014f56 -> RUNNING msInState: 0 topo:www-staging-crawler-4-1548106042 worker:da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:12.328 o.a.s.d.s.BasicContainer Thread-2505 [INFO] Worker Process da2944c7-cfd2-409a-856b-84f0a0014f56 exited with code: 255
2019-01-22 08:33:12.370 o.a.s.d.s.Slot SLOT_6700 [WARN] SLOT 6700: main process has exited
2019-01-22 08:33:12.370 o.a.s.d.s.Container SLOT_6700 [INFO] Killing 164ddb0a-fcba-41e3-9a14-386248370bcf:da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:12.380 o.a.s.u.Utils SLOT_6700 [INFO] Error when trying to kill 1554. Process is probably already dead.
2019-01-22 08:33:15.380 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE RUNNING msInState: 90030 topo:www-staging-crawler-4-1548106042 worker:da2944c7-cfd2-409a-856b-84f0a0014f56 -> KILL_AND_RELAUNCH msInState: 0 topo:www-staging-crawler-4-1548106042 worker:da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:15.381 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:15.394 o.a.s.d.s.Container SLOT_6700 [INFO] Cleaning up 164ddb0a-fcba-41e3-9a14-386248370bcf:da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:15.395 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:15.395 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/workers/da2944c7-cfd2-409a-856b-84f0a0014f56/pids/1554
2019-01-22 08:33:15.395 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/workers/da2944c7-cfd2-409a-856b-84f0a0014f56/heartbeats
2019-01-22 08:33:15.399 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/workers/da2944c7-cfd2-409a-856b-84f0a0014f56/pids
2019-01-22 08:33:15.399 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/workers/da2944c7-cfd2-409a-856b-84f0a0014f56/tmp
2019-01-22 08:33:15.400 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/workers/da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:15.400 o.a.s.d.s.Container SLOT_6700 [INFO] REMOVE worker-user da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:15.400 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/testweb/apps/crawler/apache-storm-1.2.2/storm-local/workers-users/da2944c7-cfd2-409a-856b-84f0a0014f56
2019-01-22 08:33:15.400 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Removed Worker ID da2944c7-cfd2-409a-856b-84f0a0014f56

gc.log.0.current:

  Java HotSpot(TM) 64-Bit Server VM (25.191-b26) for linux-amd64 JRE (1.8.0_191-b26), built on Oct  8 2018 13:54:08 by "java_re" with gcc 7.3.0
Memory: 4k page, physical 8168328k(1737328k free), swap 8387580k(8386288k free)
CommandLine flags: -XX:GCLogFileSize=1048576 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=artifacts/heapdump -XX:InitialHeapSize=130693248 -XX:MaxHeapSize=2147483648 -XX:NumberOfGCLogFiles=10 -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseGCLogFileRotation -XX:+UseParallelGC 
2019-01-22T08:31:41.541-0500: 1.028: [GC (Allocation Failure) [PSYoungGen: 32768K->5096K(37888K)] 32768K->6882K(123904K), 0.0098372 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2019-01-22T08:31:42.155-0500: 1.642: [GC (Allocation Failure) [PSYoungGen: 37864K->5110K(37888K)] 39650K->10524K(123904K), 0.0104951 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2019-01-22T08:31:42.557-0500: 2.044: [GC (Metadata GC Threshold) [PSYoungGen: 24280K->5094K(37888K)] 29694K->12912K(123904K), 0.0129743 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] 
2019-01-22T08:31:42.570-0500: 2.057: [Full GC (Metadata GC Threshold) [PSYoungGen: 5094K->0K(37888K)] [ParOldGen: 7817K->7345K(64000K)] 12912K->7345K(101888K), [Metaspace: 21023K->21023K(1067008K)], 0.0578299 secs] [Times: user=0.13 sys=0.01, real=0.06 secs] 
2019-01-22T08:31:42.858-0500: 2.344: [GC (Allocation Failure) [PSYoungGen: 32768K->2425K(48128K)] 40113K->9771K(112128K), 0.0039971 secs] [Times: user=0.00 sys=0.01, real=0.01 secs] 
2019-01-22T08:31:43.563-0500: 3.050: [GC (Allocation Failure) [PSYoungGen: 47993K->5099K(68096K)] 55339K->15796K(132096K), 0.0183739 secs] [Times: user=0.06 sys=0.00, real=0.02 secs] 
2019-01-22T08:31:44.248-0500: 3.735: [GC (Metadata GC Threshold) [PSYoungGen: 45605K->9669K(74752K)] 56303K->20375K(138752K), 0.0171562 secs] [Times: user=0.05 sys=0.00, real=0.02 secs] 
2019-01-22T08:31:44.266-0500: 3.752: [Full GC (Metadata GC Threshold) [PSYoungGen: 9669K->0K(74752K)] [ParOldGen: 10705K->14480K(108032K)] 20375K->14480K(182784K), [Metaspace: 34870K->34870K(1079296K)], 0.1069368 secs] [Times: user=0.36 sys=0.01, real=0.11 secs] 
2019-01-22T08:31:45.775-0500: 5.261: [GC (GCLocker Initiated GC) [PSYoungGen: 63488K->8826K(75776K)] 77975K->23321K(183808K), 0.0103824 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
2019-01-22T08:31:46.619-0500: 6.106: [GC (Allocation Failure) [PSYoungGen: 72314K->12264K(90624K)] 86844K->30380K(198656K), 0.0228691 secs] [Times: user=0.03 sys=0.00, real=0.03 secs] 
2019-01-22T08:31:47.414-0500: 6.901: [GC (Allocation Failure) [PSYoungGen: 90600K->15337K(93696K)] 108716K->33992K(201728K), 0.0215458 secs] [Times: user=0.05 sys=0.01, real=0.02 secs] 
2019-01-22T08:31:47.499-0500: 6.986: [GC (Allocation Failure) [PSYoungGen: 93636K->14043K(110080K)] 112291K->32707K(218112K), 0.0191082 secs] [Times: user=0.03 sys=0.01, real=0.02 secs] 
2019-01-22T08:31:47.565-0500: 7.052: [GC (Allocation Failure) [PSYoungGen: 106715K->13585K(111104K)] 125379K->32256K(219136K), 0.0110566 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] 
2019-01-22T08:31:47.975-0500: 7.461: [GC (Allocation Failure) [PSYoungGen: 106257K->9626K(148480K)] 124928K->37589K(256512K), 0.0329521 secs] [Times: user=0.07 sys=0.02, real=0.03 secs] 
2019-01-22T08:31:48.847-0500: 8.334: [GC (Metadata GC Threshold) [PSYoungGen: 120769K->5799K(149504K)] 148732K->123739K(344576K), 0.0346237 secs] [Times: user=0.07 sys=0.02, real=0.04 secs] 
2019-01-22T08:31:48.882-0500: 8.369: [Full GC (Metadata GC Threshold) [PSYoungGen: 5799K->0K(149504K)] [ParOldGen: 117940K->115617K(263680K)] 123739K->115617K(413184K), [Metaspace: 57889K->57857K(1099776K)], 0.2179918 secs] [Times: user=0.66 sys=0.01, real=0.21 secs] 
2019-01-22T08:31:56.805-0500: 16.291: [GC (Allocation Failure) [PSYoungGen: 131072K->4807K(189440K)] 246689K->120432K(453120K), 0.0092119 secs] [Times: user=0.03 sys=0.01, real=0.01 secs] 
2019-01-22T08:32:11.898-0500: 31.385: [GC (Allocation Failure) [PSYoungGen: 181447K->1713K(195072K)] 297072K->120453K(458752K), 0.0062305 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2019-01-22T08:32:26.904-0500: 46.391: [GC (Allocation Failure) [PSYoungGen: 178353K->981K(234496K)] 297093K->120609K(498176K), 0.0048011 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2019-01-22T08:32:47.815-0500: 67.302: [GC (Allocation Failure) [PSYoungGen: 223701K->1518K(241664K)] 343329K->121154K(505344K), 0.0102639 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] 
2019-01-22T08:33:07.716-0500: 87.203: [GC (Allocation Failure) [PSYoungGen: 194483K->1385K(262144K)] 314119K->121029K(525824K), 0.0059916 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2019-01-22T08:33:11.599-0500: 91.086: [GC (Allocation Failure) [PSYoungGen: 127845K->1390K(268288K)] 247489K->140704K(1666560K), 0.0107712 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
2019-01-22T08:33:11.610-0500: 91.097: [GC (Allocation Failure) [PSYoungGen: 1390K->1401K(294400K)] 140704K->140715K(1692672K), 0.0037587 secs] [Times: user=0.01 sys=0.01, real=0.01 secs] 
2019-01-22T08:33:11.614-0500: 91.100: [Full GC (Allocation Failure) [PSYoungGen: 1401K->0K(294400K)] [ParOldGen: 139314K->51057K(201728K)] 140715K->51057K(496128K), [Metaspace: 60831K->60827K(1101824K)], 0.0966803 secs] [Times: user=0.24 sys=0.01, real=0.09 secs] 
2019-01-22T08:33:11.712-0500: 91.199: [GC (Allocation Failure) [PSYoungGen: 0K->0K(293888K)] 51057K->51057K(1692160K), 0.0100144 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2019-01-22T08:33:11.723-0500: 91.209: [Full GC (Allocation Failure) [PSYoungGen: 0K->0K(293888K)] [ParOldGen: 51057K->48333K(224768K)] 51057K->48333K(518656K), [Metaspace: 60827K->60134K(1101824K)], 0.2302426 secs] [Times: user=0.67 sys=0.01, real=0.23 secs] 
Heap
 PSYoungGen      total 293888K, used 1071K [0x00000000d5580000, 0x00000000ee180000, 0x0000000100000000)
  eden space 275968K, 0% used [0x00000000d5580000,0x00000000d568bfb8,0x00000000e6300000)
  from space 17920K, 0% used [0x00000000e6300000,0x00000000e6300000,0x00000000e7480000)
  to   space 17408K, 0% used [0x00000000ed080000,0x00000000ed080000,0x00000000ee180000)
 ParOldGen       total 1398272K, used 48333K [0x0000000080000000, 0x00000000d5580000, 0x00000000d5580000)
  object space 1398272K, 3% used [0x0000000080000000,0x0000000082f335b0,0x00000000d5580000)
 Metaspace       used 60138K, capacity 60994K, committed 62464K, reserved 1101824K
  class space    used 9379K, capacity 9681K, committed 9984K, reserved 1048576K

worker.log.err:

java.lang.OutOfMemoryError: Java heap space
Dumping heap to artifacts/heapdump ...
Heap dump file created [965011634 bytes in 9.400 secs]
java.lang.OutOfMemoryError: Java heap space
Dumping heap to artifacts/heapdump ...
Unable to create artifacts/heapdump: File exists
java.lang.OutOfMemoryError: Java heap space
Dumping heap to artifacts/heapdump ...
Unable to create artifacts/heapdump: File exists
java.lang.OutOfMemoryError: Java heap space
Dumping heap to artifacts/heapdump ...
.

robots.txt:

User-agent: *
Crawl-delay: 10
# Directories


Solution

  • Have you tried analyzing the heap dump with jhat or VisualVM?

    UPDATE: The heap dump suggests that the memory is filled with content held by the fetcher threads. The fact that you do not get the error when you reduce the content limit would confirm that. Use more memory if you can, or keep restricting the maximum content length; you could also run fewer fetcher threads in parallel.

    Note: if you hit an endless stream, e.g. radio or video, the default HTTP protocol implementation will simply keep loading the content regardless of the limits set. The okhttp implementation is more reliable in that respect.
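
The suggestions above correspond to a few settings in the crawler configuration. The fragment below is a minimal sketch, assuming the usual StormCrawler YAML layout with a top-level config: key. All values are illustrative placeholders rather than tuned recommendations, and the heap size in particular has to fit within the 8 GB of physical memory reported in the GC log.

```yaml
config:
  # More heap for the worker JVM (the supervisor log shows -Xmx2048m).
  # The 4g figure is a hypothetical example, not a recommendation.
  topology.worker.childopts: "-Xmx4g"

  # Maximum number of bytes kept per fetched document.
  # 65536 is an example value; choose one that suits your content.
  http.content.limit: 65536

  # Fewer fetcher threads means fewer documents held in memory at once.
  # 5 is an example value.
  fetcher.threads.number: 5

  # Use the okhttp-based protocol implementation instead of the default one.
  http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
  https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
```

As a rough guide, the content buffered by the fetcher grows with both the number of threads and the content limit, so lowering either one reduces peak heap use. The topology has to be resubmitted for the changes to take effect.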