I understand that convolutional neural networks can be used to address this problem, but if you look at videos of self-driving cars, such as Tesla Autopilot, they still use vision detection and labeling systems as input to their neural networks. I am wondering how self-driving cars handle having a variable number N of detected objects, each with a varying amount of information to input about it. Since the structure of a neural network is very rigid, I would imagine that this causes a problem. Any explanation would be very helpful, and if you know of a scientific paper on this, that would be much appreciated!
These networks do not output a single class label such as car, person or sidewalk, but rather a probability distribution over N object classes. The final decision is made afterwards, essentially by taking the class with the highest probability as the prediction. The model is trained on lots of images and, as you said, these images contain varying numbers of objects. But since the model outputs probabilities for all N classes regardless of how many objects are in the input, this is exactly what the model is trained for. It learns to output probabilities close to 0 for object types that are not present in the image.
Since this is something they are trained for, they can also do it at inference time. Of course, problems can occur if a certain object type is very rare in the training data, but that is a class imbalance issue.
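To make the fixed-size output concrete, here is a minimal sketch in plain Python. The class list and the raw scores are made up for illustration, and the softmax step is my assumption about how the scores become a probability distribution. It is not how Tesla Autopilot or any particular system is implemented. The point is only that the output always has one entry per class, whatever the image contains.

```python
import math

# Hypothetical label set, chosen only for this example.
CLASSES = ["car", "person", "sidewalk", "traffic_light"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores):
    """Return the most probable class and the full per-class probabilities."""
    if len(scores) != len(CLASSES):
        raise ValueError("expected exactly one score per class")
    probs = softmax(scores)
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    return CLASSES[best], dict(zip(CLASSES, probs))


# Made-up scores standing in for the final layer's output on one image.
label, probs = predict([6.0, -2.0, 1.0, -3.0])
print(label)
for name, p in probs.items():
    print(f"{name}: {p:.4f}")
```

The output has the same length for every input. Classes that are not in the image simply end up with a probability close to 0, and the prediction is the entry with the highest probability.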