I'm trying to train a Detectron2 model on a COCO dataset. The dataset seems to load correctly, but when I train the model with DefaultTrainer, I get:
TypeError: Caught TypeError in DataLoader worker process 1.
This is my setup:
import os

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer
# TOTAL_NUM_IMAGES = 10531
cfg = get_cfg()
cfg.OUTPUT_DIR = './output'
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("my_dataset_train",)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml") # Let training initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025 # pick a good LR
# single_iteration = cfg.SOLVER.IMS_PER_BATCH
# iterations_for_one_epoch = TOTAL_NUM_IMAGES / single_iteration
# cfg.SOLVER.MAX_ITER = int(iterations_for_one_epoch) * 20
cfg.SOLVER.STEPS = [] # do not decay learning rate
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1 # only has one class (person). (see https://detectron2.readthedocs.io/tutorials/datasets.html#update-the-config-for-new-datasets)
# NOTE: this config means the number of classes, but a few popular unofficial tutorials incorrectly use num_classes+1 here.
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
The error appears partway through training (the last logged iteration is 125):
[01/06 15:14:00 d2.utils.events]: eta: 11:25:20 iter: 125 total_loss: 0.9023 loss_cls: 0.1827 loss_box_reg: 0.1385 loss_mask: 0.5601 loss_rpn_cls: 0.009945 loss_rpn_loc: 0.0023 time: 0.5232 data_time: 0.3085 lr: 3.1219e-05 max_mem: 3271M
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-17-8c48e6e17647> in <module>()
26 trainer = DefaultTrainer(cfg)
27 trainer.resume_or_load(resume=False)
---> 28 trainer.train()
8 frames
/usr/local/lib/python3.7/dist-packages/torch/_utils.py in reraise(self)
432 # instantiate since we don't know how to
433 raise RuntimeError(msg) from None
--> 434 raise exception
435
436
TypeError: Caught TypeError in DataLoader worker process 1.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
data.append(next(self.dataset_iter))
File "/usr/local/lib/python3.7/dist-packages/detectron2/data/common.py", line 201, in __iter__
yield self.dataset[idx]
File "/usr/local/lib/python3.7/dist-packages/detectron2/data/common.py", line 90, in __getitem__
data = self._map_func(self._dataset[cur_idx])
File "/usr/local/lib/python3.7/dist-packages/detectron2/utils/serialize.py", line 26, in __call__
return self._obj(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/detectron2/data/dataset_mapper.py", line 189, in __call__
self._transform_annotations(dataset_dict, transforms, image_shape)
File "/usr/local/lib/python3.7/dist-packages/detectron2/data/dataset_mapper.py", line 128, in _transform_annotations
for obj in dataset_dict.pop("annotations")
File "/usr/local/lib/python3.7/dist-packages/detectron2/data/dataset_mapper.py", line 129, in <listcomp>
if obj.get("iscrowd", 0) == 0
File "/usr/local/lib/python3.7/dist-packages/detectron2/data/detection_utils.py", line 297, in transform_instance_annotations
p.reshape(-1) for p in transforms.apply_polygons(polygons)
File "/usr/local/lib/python3.7/dist-packages/fvcore/transforms/transform.py", line 297, in <lambda>
return lambda x: self._apply(x, name)
File "/usr/local/lib/python3.7/dist-packages/fvcore/transforms/transform.py", line 291, in _apply
x = getattr(t, meth)(x)
File "/usr/local/lib/python3.7/dist-packages/fvcore/transforms/transform.py", line 150, in apply_polygons
return [self.apply_coords(p) for p in polygons]
File "/usr/local/lib/python3.7/dist-packages/fvcore/transforms/transform.py", line 150, in <listcomp>
return [self.apply_coords(p) for p in polygons]
File "/usr/local/lib/python3.7/dist-packages/detectron2/data/transforms/transform.py", line 150, in apply_coords
coords[:, 0] = coords[:, 0] * (self.new_w * 1.0 / self.w)
TypeError: can't multiply sequence by non-int of type 'float'
It turns out that some of the ids in "annotations" were written in scientific notation, so they were loaded as floats instead of integers. Converting them to integers solved the problem.
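
For anyone hitting the same thing, here is a minimal sketch of that conversion, assuming the annotations are in a standard COCO JSON file. The function name coerce_ids_to_int and the choice of which keys to check ("id", "image_id", "category_id") are my own for illustration, not part of Detectron2. It is only a rough outline of the fix and may need adapting to your file.

```python
import json

# Standard COCO ID fields; which of them are affected in your file is an assumption.
ID_KEYS = ("id", "image_id", "category_id")


def coerce_ids_to_int(coco):
    """Convert float-valued ID fields to int in place; return how many were changed."""
    changed = 0
    for section in ("images", "annotations", "categories"):
        for record in coco.get(section, []):
            for key in ID_KEYS:
                value = record.get(key)
                if isinstance(value, float):
                    # Refuse to silently truncate something like 12.5.
                    if not value.is_integer():
                        raise ValueError(f"{section}: {key}={value!r} is not a whole number")
                    record[key] = int(value)
                    changed += 1
    return changed


# Small in-memory example: the JSON parser reads 1e3 and 1.2e4 as floats.
sample = json.loads(
    '{"images": [{"id": 1e3}],'
    ' "annotations": [{"id": 1.2e4, "image_id": 1e3, "category_id": 1}],'
    ' "categories": [{"id": 1}]}'
)
n_fixed = coerce_ids_to_int(sample)
print(n_fixed, sample["annotations"][0])
```

For a real dataset, load the annotation file with json.load, run the function, write the result back with json.dump, and register the dataset from the cleaned file.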