
Getting continuous head-pose information from Vision Framework


I'm trying to get head-pose information during head tracking from Vision.framework on macOS (Ventura). I'm able to get it on the first frame, but I'm unsure how to get it on subsequent frames. I'm creating a face-landmark detection request like this:

// Detect face landmarks; the completion handler seeds an object-tracking
// request from each detected face.
detectLandmarksRequest = [[VNDetectFaceLandmarksRequest alloc] initWithCompletionHandler:^(VNRequest * _Nonnull request, NSError * _Nullable error) {
    for (VNFaceObservation *observation in request.results) {
        VNTrackObjectRequest *nextTrackRequest = [[VNTrackObjectRequest alloc] initWithDetectedObjectObservation:observation];
        [_faceTrackingRequests addObject:nextTrackRequest];
        [nextTrackRequest release]; // the array retains it (MRC)
    }
}];
detectLandmarksRequest.constellation = VNRequestFaceLandmarksConstellation76Points;
detectLandmarksRequest.revision = VNDetectFaceLandmarksRequestRevision3;

I then detect faces using an image request handler:

    VNImageRequestHandler *detectionHandler =
        [[VNImageRequestHandler alloc] initWithCVPixelBuffer:pixelBuffer options:@{}];
    NSError *detectionError = nil;
    NSArray *landmarkRequest = @[detectLandmarksRequest];
    if (![detectionHandler performRequests:landmarkRequest
                                     error:&detectionError]) {
        // Handle errors
    }
    [detectionHandler release];

When the completion handler is called, each observation contains roll and yaw, but pitch is nil, despite the request revision being set to VNDetectFaceLandmarksRequestRevision3.
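
For illustration, roll, yaw, and pitch are optional NSNumber properties on VNFaceObservation, and this is what I see on that first detection:

    for (VNFaceObservation *observation in request.results) {
        NSNumber *roll  = observation.roll;   // populated
        NSNumber *yaw   = observation.yaw;    // populated
        NSNumber *pitch = observation.pitch;  // nil, despite Revision3
    }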

Because I'm doing face tracking, I do not create any further landmark requests; I'm following the same pattern as the VisionFaceTrack sample from Apple's developer site. In the completion block above, I take the results of the face detection and create object-tracking requests from them. I then use a sequence handler to track the faces across subsequent video frames, as shown below. (For what it's worth, I'm working with video from disk, not a live camera.)

    NSError *trackingError = nil;
    if (![sequenceHandler performRequests:_faceTrackingRequests
                          onCVPixelBuffer:pixelBuffer
                                    error:&trackingError]) {
        // Handle errors
    }
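
(The sequence handler itself isn't shown above; it's a single VNSequenceRequestHandler, created once and reused across frames, which is what lets Vision carry tracking state from one frame to the next:)

    // One sequence handler for the whole video, matching the
    // `sequenceHandler` used above.
    VNSequenceRequestHandler *sequenceHandler = [[VNSequenceRequestHandler alloc] init];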

I then take the results from each tracking request, pull out the detected-object observations, and build a new set of requests to continue tracking on the next frame:

    NSMutableArray<VNTrackObjectRequest*> *newRequests = [NSMutableArray array];
    for (VNTrackObjectRequest *nextTrackingRequest in _faceTrackingRequests) {
        // Feed each request's latest observation back in as the input for
        // the next frame. (Fast enumeration never yields nil, so no nil
        // check is needed here.)
        for (VNDetectedObjectObservation *observation in nextTrackingRequest.results) {
            nextTrackingRequest.inputObservation = observation;
            [newRequests addObject:nextTrackingRequest];
        }
    }
    [_faceTrackingRequests release];
    _faceTrackingRequests = [newRequests retain];

I get back all of the face landmarks, but at no subsequent point do I get results containing the roll, pitch, or yaw of the head. In fact, when I track the face landmarks, the API expects a roll, pitch, and yaw that I don't have:

        // Re-wrap each tracked bounding box as a VNFaceObservation; roll,
        // yaw, and pitch are nil because tracking never produced them.
        NSMutableArray<VNFaceObservation*> *observations = [NSMutableArray arrayWithCapacity:nextFaceTrackingRequest.results.count];
        for (VNDetectedObjectObservation *nextObservation in nextFaceTrackingRequest.results) {
            VNFaceObservation *faceObservation = [VNFaceObservation faceObservationWithRequestRevision:VNDetectFaceLandmarksRequestRevision3
                                                                                           boundingBox:nextObservation.boundingBox
                                                                                                  roll:nil
                                                                                                   yaw:nil
                                                                                                 pitch:nil];
            
            [observations addObject:faceObservation];
        }
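
For context, these synthesized observations then seed the per-frame landmarks request via inputFaceObservations, as in the VisionFaceTrack sample. A sketch of that step, reusing the names above:

        // Sketch: run a fresh landmarks request on this frame, seeded
        // with the synthesized observations.
        VNDetectFaceLandmarksRequest *frameLandmarksRequest = [[VNDetectFaceLandmarksRequest alloc] init];
        frameLandmarksRequest.inputFaceObservations = observations;
        NSError *landmarksError = nil;
        [sequenceHandler performRequests:@[frameLandmarksRequest]
                         onCVPixelBuffer:pixelBuffer
                                   error:&landmarksError];
        [frameLandmarksRequest release];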

How does one get the roll, pitch, and yaw from Vision.framework after the first face-detection call, and why am I only getting roll and yaw on that first call when I've requested the newer revision of the results?


Solution

  • From what I've been able to tell, face pose is not tracked in a sequence. To get face pose on every frame, you must create a new face-rectangles detection request for each frame, in addition to the other work you're doing. Calling this on every frame gets me the roll, pitch, and yaw:

    NSError *detectionError = nil;
    NSMutableArray<NSDictionary*> *poses = [NSMutableArray array];
    VNDetectFaceRectanglesRequest *facePoseRequest = [[VNDetectFaceRectanglesRequest alloc] initWithCompletionHandler:^(VNRequest * _Nonnull request, NSError * _Nullable error) {
        for (VNFaceObservation *nextObservation in request.results) {
            NSNumber *roll = nextObservation.roll;
            NSNumber *pitch = nextObservation.pitch;
            NSNumber *yaw = nextObservation.yaw;
            // Guard: inserting nil into a dictionary literal throws.
            if (roll == nil || pitch == nil || yaw == nil) {
                continue;
            }
            NSDictionary *pose = @{
                kKey_Roll : roll,
                kKey_Pitch : pitch,
                kKey_Yaw : yaw
            };
            [poses addObject:pose];
        }
    }];
    // Pitch is reported starting with revision 3 of this request, so pin it.
    facePoseRequest.revision = VNDetectFaceRectanglesRequestRevision3;
    VNImageRequestHandler *detectionHandler =
        [[VNImageRequestHandler alloc] initWithCVPixelBuffer:pixelBuffer options:@{}];
    NSArray *poseRequests = @[facePoseRequest];
    if (![detectionHandler performRequests:poseRequests
                                     error:&detectionError]) {
        NSLog(@"Unable to get face pose on frame %ld", (long)frameNumber);
    }
    [facePoseRequest release];
    [detectionHandler release];
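
Putting it together: on each decoded frame, run this fresh pose request alongside the existing tracking requests. A minimal per-frame sketch (the method name and the `_sequenceHandler` ivar are illustrative, not from the original code):

    // Illustrative per-frame driver: new pose detection plus ongoing
    // tracking on the same pixel buffer.
    - (void)processFrame:(CVPixelBufferRef)pixelBuffer frameNumber:(NSInteger)frameNumber {
        // 1. Fresh rectangles request per frame -> roll/pitch/yaw
        //    (hypothetical wrapper around the code above).
        [self detectFacePoseInPixelBuffer:pixelBuffer frameNumber:frameNumber];

        // 2. Existing tracking requests -> updated bounding boxes.
        NSError *trackingError = nil;
        [_sequenceHandler performRequests:_faceTrackingRequests
                          onCVPixelBuffer:pixelBuffer
                                    error:&trackingError];

        // 3. Rebuild _faceTrackingRequests from the new observations,
        //    as in the question's update loop.
    }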