ios · swift · computer-vision · apple-vision

How to use Apple's Vision framework for real-time text recognition?


I can't seem to find a way to avoid the document scanner and use AVFoundation instead. I'm trying to create a feature where the user can tap a button, scan text, and save it to a text view, without having to tap the camera button, keep the scan, save, and so on.

I've got it to work with object detection, but I can't get it to work for text recognition. So, is there any way to use Apple's Vision framework for real-time text recognition?


Solution

  • For performance reasons, I'd prefer not to convert the CMSampleBuffer to a UIImage, and would instead use the following view, backed by an AVCaptureVideoPreviewLayer, for live video:

    import AVFoundation
    import UIKit
    
    class CameraFeedView: UIView {
        private var previewLayer: AVCaptureVideoPreviewLayer!
        
        override class var layerClass: AnyClass {
            return AVCaptureVideoPreviewLayer.self
        }
        
        init(frame: CGRect, session: AVCaptureSession, videoOrientation: AVCaptureVideoOrientation) {
            super.init(frame: frame)
            previewLayer = layer as? AVCaptureVideoPreviewLayer
            previewLayer.session = session
            previewLayer.videoGravity = .resizeAspect
            previewLayer.connection?.videoOrientation = videoOrientation
        }
        
        required init?(coder: NSCoder) {
            fatalError("init(coder:) has not been implemented")
        }
    }
    

    Once you have this, you can work on the live video data using Vision:

    import AVFoundation
    import UIKit
    import Vision
    
    class CameraViewController: UIViewController, AVCaptureVideoDataOutputSampleBufferDelegate {
      
      private let videoDataOutputQueue = DispatchQueue(label: "CameraFeedDataOutput", qos: .userInitiated,
                                                       attributes: [], autoreleaseFrequency: .workItem)
      private var drawingView: UILabel = {
        let view = UILabel(frame: UIScreen.main.bounds)
        view.font = UIFont.boldSystemFont(ofSize: 30.0)
        view.textColor = .red
        view.translatesAutoresizingMaskIntoConstraints = false
        return view
      }()
      private var cameraFeedSession: AVCaptureSession?
      private var cameraFeedView: CameraFeedView! // Preview view wrapping the AVCaptureVideoPreviewLayer
    
      override func viewDidLoad() {
        super.viewDidLoad()
        do {
          try setupAVSession()
        } catch {
          print("setup av session failed")
        }
      }
    
      func setupAVSession() throws {
        // Create device discovery session for a wide angle camera
        let wideAngle = AVCaptureDevice.DeviceType.builtInWideAngleCamera
        let discoverySession = AVCaptureDevice.DiscoverySession(deviceTypes: [wideAngle], mediaType: .video, position: .back)
        
        // Select a video device, make an input
        guard let videoDevice = discoverySession.devices.first else {
          print("Could not find a wide angle camera device.")
          return
        }
        
        guard let deviceInput = try? AVCaptureDeviceInput(device: videoDevice) else {
          print("Could not create video device input.")
          return
        }
        
        let session = AVCaptureSession()
        session.beginConfiguration()
        // We prefer a 1080p video capture but if camera cannot provide it then fall back to highest possible quality
        if videoDevice.supportsSessionPreset(.hd1920x1080) {
          session.sessionPreset = .hd1920x1080
        } else {
          session.sessionPreset = .high
        }
        
        // Add a video input
        guard session.canAddInput(deviceInput) else {
          print("Could not add video device input to the session")
          return
        }
        session.addInput(deviceInput)
        
        let dataOutput = AVCaptureVideoDataOutput()
        if session.canAddOutput(dataOutput) {
          session.addOutput(dataOutput)
          // Add a video data output
          dataOutput.alwaysDiscardsLateVideoFrames = true
          dataOutput.videoSettings = [
            String(kCVPixelBufferPixelFormatTypeKey): Int(kCVPixelFormatType_420YpCbCr8BiPlanarFullRange)
          ]
          dataOutput.setSampleBufferDelegate(self, queue: videoDataOutputQueue)
        } else {
          print("Could not add video data output to the session")
        }
        let captureConnection = dataOutput.connection(with: .video)
        captureConnection?.preferredVideoStabilizationMode = .standard
        captureConnection?.videoOrientation = .portrait
        // Always process the frames
        captureConnection?.isEnabled = true
        session.commitConfiguration()
        cameraFeedSession = session
        
        // Get the interface orientation from the window scene to set the proper video orientation on the capture connection.
        let videoOrientation: AVCaptureVideoOrientation
        switch view.window?.windowScene?.interfaceOrientation {
          case .landscapeRight:
            videoOrientation = .landscapeRight
          default:
            videoOrientation = .portrait
        }
        
        // Create and setup video feed view
        cameraFeedView = CameraFeedView(frame: view.bounds, session: session, videoOrientation: videoOrientation)
        setupVideoOutputView(cameraFeedView)
        cameraFeedSession?.startRunning()
      }
    

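    The setupVideoOutputView(_:) call above isn't shown in the answer. A minimal sketch, assuming it only needs to add the camera feed view and the results label to the view controller's hierarchy, might look like this (still inside CameraViewController):

      // Hypothetical implementation of the helper referenced in setupAVSession():
      // it installs the preview view and centers the label that displays recognized text.
      func setupVideoOutputView(_ videoOutputView: UIView) {
        videoOutputView.frame = view.bounds
        view.addSubview(videoOutputView)
        
        view.addSubview(drawingView)
        NSLayoutConstraint.activate([
          drawingView.centerXAnchor.constraint(equalTo: view.centerXAnchor),
          drawingView.centerYAnchor.constraint(equalTo: view.centerYAnchor)
        ])
      }
    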
    The key functions to implement once you've got an AVCaptureSession set up are the delegate and request handler:

      func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        
        let requestHandler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer, orientation: .down)
        
        let request = VNRecognizeTextRequest(completionHandler: textDetectHandler)
        
        do {
          // Perform the text-detection request.
          try requestHandler.perform([request])
        } catch {
          print("Unable to perform the request: \(error).")
        }
      }
      
      func textDetectHandler(request: VNRequest, error: Error?) {
        guard let observations =
                request.results as? [VNRecognizedTextObservation] else { return }
        // Process each observation to find the recognized text.
        let recognizedStrings = observations.compactMap { observation in
            // Return the string of the top VNRecognizedText instance.
            return observation.topCandidates(1).first?.string
        }
        
        DispatchQueue.main.async {
          self.drawingView.text = recognizedStrings.first
        }
      }
    }
    

    Note that you will probably want to process each of the recognizedStrings in order to choose the one with the highest confidence, but this is a proof of concept. You could also draw a bounding box for each observation, and the docs have an example of that.
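    As a rough sketch of that post-processing (these helpers are hypothetical and not part of the original answer), you could pick the highest-confidence candidate and map an observation's normalized bounding box into a view's coordinate space. The simple mapping below assumes the view shows the full frame and ignores the preview layer's videoGravity:

    import Vision
    import CoreGraphics
    
    // Pick the single recognized string with the highest confidence across all observations.
    func bestString(from observations: [VNRecognizedTextObservation]) -> String? {
        let candidates = observations.compactMap { $0.topCandidates(1).first }
        return candidates.max(by: { $0.confidence < $1.confidence })?.string
    }
    
    // Convert a Vision bounding box (normalized, origin at the bottom-left) into a rect
    // in a view's coordinate space (origin at the top-left). A production version would
    // account for aspect-fit/fill cropping done by the preview layer.
    func viewRect(for observation: VNRecognizedTextObservation, in bounds: CGRect) -> CGRect {
        let box = observation.boundingBox
        return CGRect(x: box.minX * bounds.width,
                      y: (1 - box.maxY) * bounds.height,
                      width: box.width * bounds.width,
                      height: box.height * bounds.height)
    }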