Building a Serverless PDF Text Recognition Using Function Compute with Node.js in 10 Minutes

By Johnson Chiang, Solutions Architect

Alibaba Cloud Function Compute (FC) is a serverless FaaS offering with an event-driven programming model. This tutorial demonstrates how to develop a PDF-to-text conversion function with Function Compute, and shows FC's simple yet powerful paradigm for implementing such a helper service.

What You Will Learn

This tutorial is organized into the following sections. Each section represents a specific task when developing a Function Compute service:

  1. Write Function Codes

Prerequisites

Preparing OSS:

  1. Make sure OSS is activated.

Preparing FC:

  1. Make sure Function Compute service is activated.

Write Function Codes

FC currently supports runtimes including Java, Python, PHP, and Node.js. We will write the function in Node.js and use the pdfreader npm module to read text from PDF files.

  1. Under your project directory (for example, /tmp/pdf-to-text), install and test the pdfreader module using npm.
  2. Create index.js under the same directory and write the FC handler. The code of index.js is shown below; it implements the event handler that is invoked when the FC function is triggered.
    // required modules
    var OSS = require('ali-oss').Wrapper; // FC built-in module
    var PdfReader = require("pdfreader").PdfReader; // packaged 3rd-party PDF parser module

    console.log('Loading function');

    module.exports.handler = function (eventBuf, ctx, callback) {
      console.log('Received event:', eventBuf.toString());
      let eventObj = JSON.parse(eventBuf);
      let ossEvent = eventObj.events[0];
      let ossRegion = "oss-" + ossEvent.region;

      // Init oss client instance where credentials can be retrieved from context.
      let ossClient = new OSS({
        region: ossRegion,
        accessKeyId: ctx.credentials.accessKeyId,
        accessKeySecret: ctx.credentials.accessKeySecret,
        stsToken: ctx.credentials.securityToken
      });
      ossClient.useBucket(ossEvent.oss.bucket.name); // Bucket name is from OSS event

      // Source PDF from "in/<filename>.pdf", processed to "out/<filename>.txt"
      let newKey = ossEvent.oss.object.key.replace("in/", "out/").replace(".pdf", ".txt");

      // Parse PDF to text
      console.log("Getting object: " + ossEvent.oss.object.key);
      ossClient.get(ossEvent.oss.object.key).then(function (val) {
        let pdfBuf = val.content;
        let convertedTxt = "";
        console.log("Start parsing PDF buffer.");
        new PdfReader().parseBuffer(pdfBuf, function (err, item) {
          if (err) {
            console.error("Failed to read PDF binary");
            callback(err);
            return;
          }
          if (!item) {
            console.log("Done parsing text.");
            const outBuf = Buffer.from(convertedTxt, "utf8");
            // Upload converted text as buffer to "out" directory
            ossClient.put(newKey, outBuf).then(function (val) {
              console.log("Put object: ", val);
              callback(null, val);
              return;
            }).catch(function (err) {
              console.error("Failed to put object: %j", err);
              callback(err);
              return;
            });
            return;
          }
          if (item.text) {
            console.log("Continue parsed text: " + item.text);
            convertedTxt += item.text;
          }
        });
      }).catch(function (err) {
        console.error("Failed to get object: %j", err);
        callback(err);
        return;
      });
    };
  3. Typically, after you package index.js and node_modules into a ZIP file, you can upload the package directly through either the FC console or the fcli command-line tool. However, when the ZIP deployment package exceeds 50 MB, the maximum size FC allows, you need to trim it by identifying and removing unnecessarily large files. In this case, we delete the ./node_modules/pdf2json/test directory, which is not used at runtime, and then confirm that the repackaged ZIP file (pdf-to-text.zip) is small enough to upload. See Install third-party dependencies to learn more.
    $ ls -l; du -hs .
    total 8
    -rw-r--r--@ 1 owner staff 2600 Jan 21 20:00 index.js
    drwxr-xr-x 5 owner staff 170 Jan 21 20:00 node_modules
    180M .
    $ du -h -d3 | sort -nr | head -n8
    660K ./node_modules/pdf2json/node_modules
    180M ./node_modules
    180M .
    178M ./node_modules/pdf2json
    176M ./node_modules/pdf2json/test
    108K ./node_modules/pdf2json/lib
    88K ./node_modules/pdf2json/.idea
    28K ./node_modules/pdfreader/lib
    $ zip pdf-to-text.zip index.js node_modules/
    adding: index.js (deflated 63%)
    adding: node_modules/ (stored 0%)
    $ ls -lh pdf-to-text.zip
    -rw-r--r-- 1 owner staff 1.3K Jan 21 20:10 pdf-to-text.z

You can download the working ZIP deployment package to proceed to the next step.
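The handler in index.js reads only a few fields from the OSS event and derives the output key with two chained replace calls. The sketch below isolates that logic so it can be checked locally without any OSS access. The helper names parseOssEvent and toOutputKey, the bucket name my-bucket, and the simplified sample event are hypothetical; a real OSS event carries more fields than shown. The key mapping is also anchored to the start and end of the key, which is a small variation on the original code rather than a copy of it.

```javascript
'use strict';

// Hypothetical helper: extracts only the fields that index.js reads from the OSS event.
function parseOssEvent(eventBuf) {
  const eventObj = JSON.parse(eventBuf.toString());
  const ossEvent = eventObj.events[0];
  return {
    region: 'oss-' + ossEvent.region,
    bucket: ossEvent.oss.bucket.name,
    key: ossEvent.oss.object.key
  };
}

// Hypothetical helper: maps "in/<name>.pdf" to "out/<name>.txt".
// Both patterns are anchored, so "in/" or ".pdf" elsewhere in the key is left alone.
function toOutputKey(key) {
  return key.replace(/^in\//, 'out/').replace(/\.pdf$/i, '.txt');
}

// Simplified sample event (assumption): only the fields used above are included.
const sampleEvent = Buffer.from(JSON.stringify({
  events: [{
    region: 'ap-southeast-1',
    oss: {
      bucket: { name: 'my-bucket' },
      object: { key: 'in/pdf-sample.pdf' }
    }
  }]
}));

const parsed = parseOssEvent(sampleEvent);
console.log(parsed.region, parsed.bucket, toOutputKey(parsed.key));
// prints "oss-ap-southeast-1 my-bucket out/pdf-sample.txt"
```

Keeping this logic in small pure functions makes it possible to test the key handling before packaging and uploading the function.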

Configure Service and Function

We will primarily be using the Alibaba Cloud Console to complete this task. In our case, all Alibaba Cloud resources are in the same region, ap-southeast-1.

  1. Configure Service: Log on to the FC console and create a Service such as FileConvertService.
  2. Role Config: authorize policies including AliyunOSSFullAccess, AliyunLogFullAccess, and AliyunFCFullAccess.
  3. Log Configs (optional): bind a Log Project and Log Store. This is strongly recommended so that you can debug runtime errors using the Log function.

By completing the above configurations, you have created the PDF-to-Text function with an OSS event trigger.
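Because the function writes its result back into the same bucket, it can be worth guarding inside the handler against events for objects that are not source PDFs, in case the trigger is not limited to the in/ prefix and the .pdf suffix. The following is a minimal sketch of such a guard; the function name isConvertiblePdf is hypothetical and is not part of the original index.js.

```javascript
'use strict';

// Hypothetical guard: returns true only for keys shaped like "in/<name>.pdf".
// The length check rejects the degenerate key "in/.pdf" that has no file name.
function isConvertiblePdf(key) {
  return typeof key === 'string' &&
    key.startsWith('in/') &&
    key.toLowerCase().endsWith('.pdf') &&
    key.length > 'in/.pdf'.length;
}

console.log(isConvertiblePdf('in/pdf-sample.pdf'));  // prints "true"
console.log(isConvertiblePdf('out/pdf-sample.txt')); // prints "false"
```

In the handler, one option would be to return early with callback(null, 'skipped') when the guard is false, so that an object written to out/ never starts another conversion.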

Invoke Function

Next, to test the conversion function, upload the sample PDF file to <YOUR_BUCKET>/in in OSS to invoke the FC function.

Then check <YOUR_BUCKET>/out for the newly created pdf-sample.txt and view the text recognized from the PDF file. That's it.
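If the resulting text file runs words and lines together, that is because the handler appends each text item to convertedTxt without any separator. One possible refinement, sketched below, is to collect the items first and group them into lines by their vertical position. This assumes each text item exposes a numeric y coordinate alongside text, as in pdfreader's documented examples; the helper name itemsToLines and the sample items are hypothetical.

```javascript
'use strict';

// Hypothetical helper: groups text items into lines by their y coordinate.
// Assumes each item looks like { y: number, text: string }.
function itemsToLines(items) {
  const rows = new Map();
  for (const item of items) {
    if (!item || !item.text) continue; // skip non-text items
    if (!rows.has(item.y)) rows.set(item.y, []);
    rows.get(item.y).push(item.text);
  }
  // Sort rows top to bottom, join items in a row with spaces and rows with newlines.
  return Array.from(rows.keys())
    .sort((a, b) => a - b)
    .map((y) => rows.get(y).join(' '))
    .join('\n');
}

// Made-up sample items for illustration only.
const sample = [
  { y: 2.5, text: 'second line' },
  { y: 1.0, text: 'Hello' },
  { y: 1.0, text: 'world' }
];
console.log(itemsToLines(sample));
// prints "Hello world" and then "second line" on separate lines
```

This sketch keeps items within a row in arrival order and treats all items as belonging to one page; a multi-page document would need the rows grouped per page first.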

Troubleshooting

When you implement your own FC function, you will inevitably go through a cycle of testing and debugging. Listed here is a common class of error you may encounter, with the corresponding troubleshooting tip:

  1. Use FC logging to troubleshoot runtime execution errors: The log is the Swiss Army knife you will rely on to debug any runtime error. To use the FC log, you need to activate Log Service and enable the Log Configs of the FC service. You can then iteratively test the function and view the logs of each execution. For example, the logs can reveal a runtime error such as a Node.js ReferenceError. For more information, see the Log Service documentation.
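As an illustration of how a ReferenceError can surface, the sketch below wraps a handler so that a synchronous exception is written with console.error and passed to the callback instead of being thrown. The wrapper name withErrorLogging and the deliberately broken handler are hypothetical, and the exact format in which FC presents such an entry in Log Service is not shown here.

```javascript
'use strict';

// Hypothetical wrapper: turns a synchronous exception thrown by a handler
// into a logged error plus callback(err).
function withErrorLogging(handler) {
  return function (event, ctx, callback) {
    try {
      handler(event, ctx, callback);
    } catch (err) {
      console.error('Handler failed: %s: %s', err.name, err.message);
      callback(err);
    }
  };
}

// Deliberately broken handler: "undefinedVariable" is never declared,
// so reading it throws a ReferenceError.
const broken = withErrorLogging(function (event, ctx, callback) {
  callback(null, undefinedVariable.length);
});

broken(Buffer.from('{}'), {}, function (err, result) {
  console.log(err ? err.name : result); // prints "ReferenceError"
});
```

The wrapper only catches errors thrown synchronously; errors raised inside promise chains still need their own catch handlers, as index.js already does for the OSS get and put calls.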

What’s Next?

In this tutorial, you have built a quick yet powerful file conversion service using FC with an OSS trigger. Here is a suggestion for where to go next:

  1. Visit our product documentation of Function Compute.

Reference:https://www.alibabacloud.com/blog/building-a-serverless-pdf-text-recognition-using-function-compute-with-node-js-in-10-minutes_594429?spm=a2c41.12548475.0.0
