Tags: ios11, scenekit, augmented-reality, arkit, coreml

Is it possible to get the SCNVector3 position of a real-world object using Core ML and ARKit?


I am working on an AR-based solution in which I render some 3D models using SceneKit and ARKit. I have also integrated Core ML to identify objects and render the corresponding 3D objects in the scene.

But right now I am just rendering the model in the center of the screen as soon as I detect the object (only for the list of objects that I have). Is it possible to get the position of the real-world object so that I can show an overlay above it?

That is, if I have a water bottle scanned, I should be able to get the position of the water bottle. The position could be anywhere within the bottle, but it shouldn't fall outside of it. Is this possible using SceneKit?
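For reference, here is roughly what I am doing right now (simplified; `modelNode(for:)` is a placeholder for my own code that loads an SCNNode for a recognized label):

```swift
import ARKit
import SceneKit

// Current approach (simplified): once Core ML classifies the frame, place the
// corresponding model at whatever ARKit geometry sits behind the screen center.
// `modelNode(for:)` is a placeholder for my own SCNNode-loading helper.
func placeModel(for detectedLabel: String, in sceneView: ARSCNView) {
    let center = CGPoint(x: sceneView.bounds.midX, y: sceneView.bounds.midY)
    guard let hit = sceneView.hitTest(center, types: [.existingPlaneUsingExtent, .featurePoint]).first
        else { return }

    let node = modelNode(for: detectedLabel)
    let t = hit.worldTransform.columns.3
    node.position = SCNVector3(t.x, t.y, t.z)
    sceneView.scene.rootNode.addChildNode(node)
}
```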


Solution

  • All parts of what you ask are theoretically possible, but a) for several parts, there’s no integrated API to do things for you, and b) you’re probably signing yourself up for a more difficult problem than you think.

    What you presumably have with your Core ML integration is an image classifier, as that’s what most of the easy-to-find ML models do. Image classification answers one question: “what is this a picture of?”
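    For context, a plain classifier call with Vision looks roughly like this; it hands back labels and confidences but nothing about where the object sits in the frame (`MyClassifier` is a placeholder for whatever model you're using):

    ```swift
    import Vision
    import CoreML

    // Sketch: classification only. `MyClassifier` is a placeholder model class.
    func classify(_ pixelBuffer: CVPixelBuffer, completion: @escaping (String?) -> Void) {
        guard let vnModel = try? VNCoreMLModel(for: MyClassifier().model) else { return completion(nil) }
        let request = VNCoreMLRequest(model: vnModel) { request, _ in
            // Results are just labels + confidences -- no bounding boxes.
            let best = (request.results as? [VNClassificationObservation])?.first
            completion(best?.identifier)
        }
        try? VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:]).perform([request])
    }
    ```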

    What you’re looking for involves at least two additional questions:

    1. “Given that this image has been classified as containing (some specific object), where in the 2D image is that object?”
    2. “Given the position of a detected object in the 2D video image, where is it in the 3D space tracked by ARKit?”

    Question 1 is pretty reasonable. There are models that do both classification and detection (location/bounds within an image) in the ML community. Probably the best known one is YOLO — here’s a blog post about using it with Core ML.
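    A rough sketch of the detection side with Vision, assuming you've swapped in a detection model (say, a YOLO conversion) whose outputs Vision can expose as VNRecognizedObjectObservation; that wrapper arrived after iOS 11, so with an older conversion you'd parse the model's raw MLMultiArray output yourself, as the linked post does:

    ```swift
    import Vision

    // Sketch: object detection. `detector` is assumed to wrap a detection model
    // (e.g. a YOLO conversion), not a plain classifier.
    func detectObjects(in pixelBuffer: CVPixelBuffer,
                       using detector: VNCoreMLModel,
                       completion: @escaping ([(label: String, boundingBox: CGRect)]) -> Void) {
        let request = VNCoreMLRequest(model: detector) { request, _ in
            let observations = request.results as? [VNRecognizedObjectObservation] ?? []
            // Each observation pairs a class label with a *normalized* bounding box
            // (origin at the image's lower-left) -- the answer to question 1.
            completion(observations.compactMap { obs -> (label: String, boundingBox: CGRect)? in
                guard let label = obs.labels.first?.identifier else { return nil }
                return (label: label, boundingBox: obs.boundingBox)
            })
        }
        try? VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:]).perform([request])
    }
    ```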

    Question 2 is the “research team and five years” part. You’ll notice in the YOLO papers that it gives you only coarse bounding boxes for detected objects — that is, it’s working in 2D image space, not doing 3D scene reconstruction.

    Really knowing the shape, or even the 3D bounding box, of an object means integrating object detection with scene reconstruction. For example, if an object has some height in the 2D image, are you looking at a 3D object that’s tall with a small footprint, or one that’s long and low, receding into the distance? Such integration would require taking apart the inner workings of ARKit, which nobody outside Apple can do, or recreating an ARKit-alike from scratch.

    There might be some assumptions you can make to get very rough estimates of 3D shape from a 2D bounding box, though. For example, if you do AR hit tests on the lower corners of a box and find that they’re on a horizontal plane, you can guess that the 2D height of the box is proportional to the 3D height of the object, and that its footprint on the plane is proportional to the box’s width. You’d have to do some research and testing to see if assumptions like that hold up, especially in whatever use cases your app covers.
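    A minimal sketch of that kind of guess, assuming you've already converted the detection's normalized bounding box into the ARSCNView's coordinate space (that conversion depends on device orientation and on how Vision cropped the frame, so it's left out here):

    ```swift
    import ARKit
    import SceneKit
    import simd

    // Sketch: rough world position + footprint width for a detected object,
    // given its 2D bounding box expressed in the AR view's coordinates.
    func estimatePlacement(of boxInView: CGRect,
                           in sceneView: ARSCNView) -> (position: SCNVector3, footprintWidth: Float)? {
        // Hit-test the lower corners of the box against detected horizontal planes.
        // (In view coordinates, y grows downward, so "lower" means maxY.)
        let lowerLeft  = CGPoint(x: boxInView.minX, y: boxInView.maxY)
        let lowerRight = CGPoint(x: boxInView.maxX, y: boxInView.maxY)
        guard
            let left  = sceneView.hitTest(lowerLeft,  types: .existingPlaneUsingExtent).first,
            let right = sceneView.hitTest(lowerRight, types: .existingPlaneUsingExtent).first
        else { return nil }

        let l = left.worldTransform.columns.3
        let r = right.worldTransform.columns.3

        // Midpoint of the two corner hits ~ the object's position on the plane;
        // the distance between them ~ a rough footprint width.
        let position = SCNVector3((l.x + r.x) / 2, (l.y + r.y) / 2, (l.z + r.z) / 2)
        let width = simd_length(SIMD3<Float>(r.x - l.x, r.y - l.y, r.z - l.z))
        return (position: position, footprintWidth: width)
    }
    ```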