python · tensorflow · computer-vision · object-detection

How do I train the DeepSORT tracker for a custom class?


I want to detect and count the number of vines in a vineyard using Deep Learning and Computer Vision techniques. I am using the YOLOv4 object detector and training on the darknet framework. I have been able to integrate the SORT tracker into my application and it works well, but I still have the following issues:

  • The tracker sometimes reassigns a new ID to the object
  • The detector sometimes misidentifies the object (which leads to incorrect tracking)
  • The tracker sometimes does not track a detected object.

You can see an example of the reassignment issue in the following image: in frame 40, ID 9 was assigned to a metal post, but from frame 42 onwards it is assigned to a tree.

In searching for the cause of these problems, I have learnt that DeepSORT is an improved version of SORT which aims to handle this problem by using a neural network to associate detections with existing tracks.

Problem:

The problem I am facing is with training this particular model for DeepSORT. I have seen that the authors used cosine metric learning to train their model, but I am not able to adapt that training to my custom classes. The questions I have are as follows:

  1. I have a dataset of annotated (YOLO TXT format) images which I have used to train the YOLOv4 model. Can I reuse the same dataset for the DeepSORT tracker? If so, then how? (I sketch the kind of conversion I have in mind after this list.)

  2. If I cannot reuse the dataset, then how do I create my own dataset for training the model?
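
For context, the kind of conversion I am imagining for question 1 looks roughly like the sketch below. It is based on my (possibly wrong) assumption that the appearance model wants one folder of cropped images per object identity, and it assumes I add a sixth, track-identity column to my label files myself, since plain YOLO TXT annotations carry no identity information; yolo_to_crops and the file layout are just my own guess:

    import os

    import cv2

    def yolo_to_crops(frames_dir, labels_dir, out_dir):
        """Cut out per-identity crops for appearance (re-ID) training.

        Each label line is assumed to be:
            class_id x_center y_center width height track_id
        with the box values normalized to [0, 1]; track_id is my own addition."""
        os.makedirs(out_dir, exist_ok=True)
        for label_name in sorted(os.listdir(labels_dir)):
            if not label_name.endswith(".txt"):
                continue
            frame = cv2.imread(os.path.join(frames_dir, label_name[:-4] + ".jpg"))
            if frame is None:
                continue
            h, w = frame.shape[:2]
            with open(os.path.join(labels_dir, label_name)) as f:
                for i, line in enumerate(f):
                    parts = line.split()
                    if len(parts) < 6:
                        continue  # skip boxes that have no identity label
                    _, xc, yc, bw, bh = (float(v) for v in parts[:5])
                    track_id = parts[5]
                    # normalized centre/size -> absolute pixel corners
                    x1 = int((xc - bw / 2) * w)
                    y1 = int((yc - bh / 2) * h)
                    x2 = int((xc + bw / 2) * w)
                    y2 = int((yc + bh / 2) * h)
                    crop = frame[max(y1, 0):y2, max(x1, 0):x2]
                    if crop.size == 0:
                        continue
                    ident_dir = os.path.join(out_dir, track_id)
                    os.makedirs(ident_dir, exist_ok=True)
                    cv2.imwrite(os.path.join(ident_dir, f"{label_name[:-4]}_{i}.jpg"), crop)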

Thanks in advance for the help!


Solution

  • You will probably get the YOLO output in a text file (class_id, x, y, width, height, confidence). You then want to tell DeepSORT to start tracking from the YOLO-detected coordinates, which means modifying the original DeepSORT implementation with something like the update method below (this implementation also uses DeepSORT's own track IDs, not any IDs coming from YOLO).

    def update(self, bbox_xywh, confidences, class_ids, ori_img):
        """Run one tracking step on the YOLO detections of a single frame.

        bbox_xywh, confidences and class_ids come from YOLO (boxes already
        converted to absolute pixel coordinates); ori_img is the frame itself.
        """
        self.height, self.width = ori_img.shape[:2]
        # generate appearance features for each detected box
        features = self._get_features(bbox_xywh, ori_img)

        # Prepare detections for the tracker.
        # class_ids is accepted to keep the interface uniform but is not used here;
        # every detection gets the same default oid (object ID) purely to satisfy
        # the Detection class signature.
        default_oid = 0
        detections = [Detection(bbox, conf, feature, default_oid)
                      for bbox, conf, feature in zip(bbox_xywh, confidences, features)]
    
        # update tracker
        self.tracker.predict()
        self.tracker.update(detections)
    
        # output bbox identities
        outputs = []
        for track in self.tracker.tracks:
            if not track.is_confirmed() or track.time_since_update > 1:
                continue
            box = track.to_tlwh()
            x1, y1, x2, y2 = self._tlwh_to_xyxy(box)
            track_id = track.track_id
            outputs.append(np.array([x1, y1, x2, y2, track_id], dtype=int))
        if len(outputs) > 0:
            outputs = np.stack(outputs, axis=0)
        return outputs
    
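    For example, the array returned by update() can then be drawn onto the frame and used for counting. This is only a rough sketch of my own, not part of the original DeepSORT code (draw_tracks, the seen_ids set and the OpenCV calls are purely illustrative):

    import cv2

    def draw_tracks(frame, outputs, seen_ids):
        """Draw tracker output on a frame and keep a rough running count.

        outputs is the N x 5 array of [x1, y1, x2, y2, track_id] rows returned
        by update() above; seen_ids is a set of track ids kept across frames,
        whose size gives a rough count of distinct tracked objects (vines)."""
        for x1, y1, x2, y2, track_id in outputs:
            seen_ids.add(int(track_id))
            cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
            cv2.putText(frame, f"id {int(track_id)}", (int(x1), int(y1) - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
        cv2.putText(frame, f"count: {len(seen_ids)}", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 0, 255), 2)
        return frame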

    Then, in your main class you will need to write methods that parse the YOLO data, load it, convert the coordinates to absolute values, draw the bounding boxes, and produce the output (you can get the output as text with absolute coordinates, as frames, or as a video). You can find all of those methods on the internet and modify them for your needs. Keep me posted; I hope this helps, because the step of changing the original implementation is what helped me.
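
    As a rough sketch of what that glue code might look like (read_yolo_txt and track_frames are names I made up; it assumes one YOLO-format TXT file per frame with lines of class_id x_center y_center width height confidence, coordinates normalized to [0, 1], and that update() above expects absolute centre-x, centre-y, width, height boxes, as the common deep_sort_pytorch wrapper does):

    import os

    import cv2
    import numpy as np

    def read_yolo_txt(txt_path, img_w, img_h):
        """Read one YOLO-format detection file and return absolute
        centre-x, centre-y, width, height boxes plus confidences and class ids."""
        boxes, confs, class_ids = [], [], []
        with open(txt_path) as f:
            for line in f:
                cls, xc, yc, bw, bh, conf = (float(v) for v in line.split()[:6])
                boxes.append([xc * img_w, yc * img_h, bw * img_w, bh * img_h])
                confs.append(conf)
                class_ids.append(int(cls))
        return np.array(boxes), np.array(confs), np.array(class_ids)

    def track_frames(frames_dir, labels_dir, tracker, out_txt):
        """Feed per-frame YOLO detections to the modified DeepSORT wrapper above
        and write the confirmed tracks as absolute coordinates to a text file."""
        with open(out_txt, "w") as out:
            for name in sorted(os.listdir(frames_dir)):
                frame = cv2.imread(os.path.join(frames_dir, name))
                if frame is None:
                    continue
                h, w = frame.shape[:2]
                txt = os.path.join(labels_dir, os.path.splitext(name)[0] + ".txt")
                if not os.path.exists(txt):
                    continue
                boxes, confs, class_ids = read_yolo_txt(txt, w, h)
                outputs = tracker.update(boxes, confs, class_ids, frame)
                for x1, y1, x2, y2, track_id in outputs:
                    out.write(f"{name} {track_id} {x1} {y1} {x2} {y2}\n")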