amazon-web-services aws-lambda amazon-transcribe

Get result from Amazon Transcribe directly (serverless)


I use serverless Lambda functions to transcribe speech to text with Amazon Transcribe. My current scripts can transcribe a file from S3 and store the result as a JSON file, also in S3.

Is there a way to get the result directly? I want to store it in a database (PostgreSQL on AWS RDS).

Thank you for your hints

serverless.yml

...
provider:
  name: aws
  runtime: nodejs10.x
  region: eu-central-1
  memorySize: 128
  timeout: 30
  environment:
    S3_AUDIO_BUCKET: ${self:service}-${opt:stage, self:provider.stage}-records
    S3_TRANSCRIPTION_BUCKET: ${self:service}-${opt:stage, self:provider.stage}-transcriptions
    LANGUAGE_CODE: de-DE
  iamRoleStatements:
    - Effect: Allow
      Action:
        - s3:PutObject
        - s3:GetObject
      Resource:
        - 'arn:aws:s3:::${self:provider.environment.S3_AUDIO_BUCKET}/*'
        - 'arn:aws:s3:::${self:provider.environment.S3_TRANSCRIPTION_BUCKET}/*'
    - Effect: Allow
      Action:
        - transcribe:StartTranscriptionJob
      Resource: '*'

functions:

  transcribe:
    handler: handler.transcribe
    events:
      - s3:
          bucket: ${self:provider.environment.S3_AUDIO_BUCKET}
          event: s3:ObjectCreated:*

  createTextinput:
    handler: handler.createTextinput
    events:
      - http:
          path: textinputs
          method: post
          cors: true
...

resources:
  Resources:
    S3TranscriptionBucket:
      Type: 'AWS::S3::Bucket'
      Properties:
        BucketName: ${self:provider.environment.S3_TRANSCRIPTION_BUCKET}  
...

handler.js

const db = require('./db_connect');

const awsSdk = require('aws-sdk');

const transcribeService = new awsSdk.TranscribeService();

module.exports.transcribe = (event, context, callback) => {
  const records = event.Records;

  const transcribingPromises = records.map((record) => {
    const recordUrl = [
      'https://s3.amazonaws.com',
      process.env.S3_AUDIO_BUCKET,
      record.s3.object.key,
    ].join('/');

    // create random filename to avoid conflicts in amazon transcribe jobs

    function makeid(length) {
       var result           = '';
       var characters       = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
       var charactersLength = characters.length;
       for ( var i = 0; i < length; i++ ) {
          result += characters.charAt(Math.floor(Math.random() * charactersLength));
       }
       return result;
    }

    const TranscriptionJobName = makeid(7);

    return transcribeService.startTranscriptionJob({
      LanguageCode: process.env.LANGUAGE_CODE,
      Media: { MediaFileUri: recordUrl },
      MediaFormat: 'wav',
      TranscriptionJobName,
      //MediaSampleRateHertz: 8000, // normally 8000 if you are using wav file
      OutputBucketName: process.env.S3_TRANSCRIPTION_BUCKET,
    }).promise();
  });

  Promise.all(transcribingPromises)
    .then(() => {
      callback(null, { message: 'Start transcription job successfully' });
    })
    .catch(err => callback(err, { message: 'Error start transcription job' }));
};

module.exports.createTextinput = (event, context, callback) => {
  context.callbackWaitsForEmptyEventLoop = false;
  const data = JSON.parse(event.body);
  db.insert('textinputs', data)
    .then(res => {
      callback(null,{
        statusCode: 200,
        body: "Textinput Created! id: " + res
      })
    })
    .catch(e => {
      callback(null,{
        statusCode: e.statusCode || 500,
        body: "Could not create a Textinput " + e
      })
    }) 
};

Solution

  • I think your best option is to trigger a Lambda function from the S3 event that fires when a transcript is stored, and then write the data to your database. As Dunedan mentioned, you can't go directly from Transcribe to a database.

    You can attach the event to a Lambda function in serverless.yml like so:

    storeTranscriptInDB:
      # assumes the function is exported from handler.js, as in the question
      handler: handler.storeTranscriptInDB
      events:
        - s3:
            bucket: ${self:provider.environment.S3_TRANSCRIPTION_BUCKET}
            event: s3:ObjectCreated:*
            rules:
              - suffix: .json


    The S3 key of each transcript file is in event.Records[i].s3.object.key. I would loop through all records to be thorough and, for each one, do something like this:

    const awsSdk = require('aws-sdk');
    const s3 = new awsSdk.S3();

    module.exports.storeTranscriptInDB = async (event) => {
      for (const record of event.Records) {
        // Keys in S3 event notifications are URL-encoded, with spaces sent as '+'
        const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
        const params = {
          Bucket: record.s3.bucket.name,
          Key: key
        };
        // Download and parse the JSON file that Transcribe wrote to the bucket
        const transcriptFile = await s3.getObject(params).promise();
        const transcriptObject = JSON.parse(transcriptFile.Body.toString('utf-8'));
        // Join all transcript segments into a single string
        const transcript = transcriptObject.results.transcripts
          .map(result => result.transcript)
          .join(' ');
        // at this point you can post the transcript variable to your database
      }
    };
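
The handler above stops at the comment about posting the transcript to the database. A minimal sketch of that last step could look like the following. It assumes the database module exposes an insert(table, row) function that returns a Promise, like the db_connect module used in the question. The table name 'transcripts', the column names, and the helper names buildTranscriptRow and saveTranscript are hypothetical. The fields jobName and results.transcripts come from the JSON file that Transcribe writes.

```javascript
// Turn the text of a Transcribe output file into one row for the database.
// The row shape (job_name, transcript) is a hypothetical example.
function buildTranscriptRow(jsonText) {
  const parsed = JSON.parse(jsonText);
  const transcripts = (parsed.results && parsed.results.transcripts) || [];
  const text = transcripts.map((t) => t.transcript).join(' ').trim();
  return { job_name: parsed.jobName, transcript: text };
}

// `db` is assumed to offer insert(table, row) returning a Promise,
// like the db_connect module in the question.
async function saveTranscript(db, jsonText) {
  const row = buildTranscriptRow(jsonText);
  if (!row.transcript) {
    return null; // nothing was recognised, so skip the insert
  }
  return db.insert('transcripts', row);
}

// Usage with a stub standing in for the real database module:
const sample = JSON.stringify({
  jobName: 'aB3dE9x',
  results: { transcripts: [{ transcript: 'Guten Tag.' }], items: [] },
  status: 'COMPLETED',
});
const fakeDb = { insert: async (table, row) => `${table}:${row.job_name}` };
saveTranscript(fakeDb, sample).then(console.log);
```

Inside the Lambda handler you would pass the real db module and transcriptFile.Body.toString('utf-8') instead of the stub and the sample string. Keeping the parsing in its own function makes it easy to test without touching S3 or the database.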