Building a Serverless PDF Text Recognition Service Using Function Compute with Node.js in 10 Minutes

What You Will Learn

  1. Write Function Codes
     • How to integrate a third-party Node.js library and the built-in OSS library into FC code.
     • How to package the integrated code as a ZIP deployment file.
  2. Configure Service and Function
     • How to create an FC function with an OSS trigger by deploying the packaged ZIP file.
  3. Invoke Function
     • Test the function by posting a sample PDF file to the source directory, and verify that the function is triggered to extract text from the PDF file and write the extracted text to the output directory.
  4. Troubleshooting
     • How to use handy tools to debug the problems you are likely to hit when developing an FC service.

Prerequisites

  1. Make sure OSS is activated.
  2. Log on to the OSS console and create a bucket <YOUR_BUCKET>.
  3. Under the bucket, create two directories, /in and /out. You will upload source PDF files to /in, and the converted output text files will be placed in /out.
  4. Make sure the Function Compute service is activated.
  5. Download the working ZIP deployment package.
  6. Download the sample PDF file.

Write Function Codes

  1. Under your working directory (for example, /tmp/pdf-to-text), install and test the pdfreader module using npm.
  2. Create index.js under your working directory and write the FC handler. The code of index.js is shown below; it implements the event handler that is invoked when the FC function is triggered.
    // required modules
    var OSS = require('ali-oss').Wrapper; // FC built-in module
    var PdfReader = require("pdfreader").PdfReader; // packaged 3rd-party PDF parser module

    console.log('Loading function');

    module.exports.handler = function (eventBuf, ctx, callback) {
        console.log('Received event:', eventBuf.toString());
        let eventObj = JSON.parse(eventBuf);
        let ossEvent = eventObj.events[0];
        let ossRegion = "oss-" + ossEvent.region;

        // Init oss client instance where credentials can be retrieved from context.
        let ossClient = new OSS({
            region: ossRegion,
            accessKeyId: ctx.credentials.accessKeyId,
            accessKeySecret: ctx.credentials.accessKeySecret,
            stsToken: ctx.credentials.securityToken
        });
        ossClient.useBucket(ossEvent.oss.bucket.name); // Bucket name is from OSS event

        // Source PDF from "in/<filename>.pdf", processed to "out/<filename>.txt"
        let newKey = ossEvent.oss.object.key.replace("in/", "out/").replace(".pdf", ".txt");

        // Parse PDF to text
        console.log("Getting object: " + ossEvent.oss.object.key);
        ossClient.get(ossEvent.oss.object.key).then(function (val) {
            let pdfBuf = val.content;
            let convertedTxt = "";
            console.log("Start parsing PDF buffer.");
            new PdfReader().parseBuffer(pdfBuf, function (err, item) {
                if (err) {
                    console.error("Failed to read PDF binary");
                    callback(err);
                    return;
                }
                if (!item) {
                    console.log("Done parsing text.");
                    const outBuf = Buffer.from(convertedTxt, "utf8");
                    // Upload converted text as buffer to "out" directory
                    ossClient.put(newKey, outBuf).then(function (val) {
                        console.log("Put object: ", val);
                        callback(null, val);
                        return;
                    }).catch(function (err) {
                        console.error("Failed to put object: %j", err);
                        callback(err);
                        return;
                    });
                    return;
                }
                if (item.text) {
                    console.log("Continue parsed text: " + item.text);
                    convertedTxt += item.text;
                }
            });
        }).catch(function (err) {
            console.error("Failed to get object: %j", err);
            callback(err);
            return;
        });
    };
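
The event parsing and key mapping done by the handler can also be tried in isolation, without OSS access. The sketch below assumes only the event fields that index.js reads (events[0].region, oss.bucket.name, oss.object.key); the helper names parseOssEvent and toOutputKey and all sample values are hypothetical. Unlike the chained replace() calls, which rewrite the first "in/" and the first ".pdf" found anywhere in the key, this variant rewrites only the leading prefix and the trailing extension.

```javascript
'use strict';

// Extract the fields the handler reads from an OSS trigger event.
// Only the fields used by index.js are assumed to exist.
function parseOssEvent(eventBuf) {
  const ossEvent = JSON.parse(eventBuf.toString()).events[0];
  return {
    region: 'oss-' + ossEvent.region,
    bucket: ossEvent.oss.bucket.name,
    key: ossEvent.oss.object.key
  };
}

// Map "in/<name>.pdf" to "out/<name>.txt", touching only the leading
// prefix and the trailing extension (hypothetical helper, not an SDK call).
function toOutputKey(srcKey) {
  if (!srcKey.startsWith('in/') || !srcKey.endsWith('.pdf')) {
    throw new Error('Unexpected source key: ' + srcKey);
  }
  const base = srcKey.slice('in/'.length, srcKey.length - '.pdf'.length);
  return 'out/' + base + '.txt';
}

// Synthetic event with made-up values, shaped like the fields used above.
const sampleEvent = Buffer.from(JSON.stringify({
  events: [{
    region: 'cn-shanghai',
    oss: {
      bucket: { name: 'my-cool-demo' },
      object: { key: 'in/report.pdf.v2.pdf' }
    }
  }]
}));

const parsed = parseOssEvent(sampleEvent);
console.log(parsed.region, parsed.bucket, toOutputKey(parsed.key));
```

For the sample key in/report.pdf.v2.pdf, the chained replace() calls would produce out/report.txt.v2.pdf, whereas toOutputKey produces out/report.pdf.v2.txt.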
  3. In most cases, after you package index.js and node_modules into a ZIP file, you can upload the ZIP package directly through either the FC console or the fcli command-line tool. However, when the ZIP deployment package exceeds 50 MB, the maximum file size FC allows, you need to trim it by identifying and removing unnecessary large files. In this case, we delete the ./node_modules/pdf2json/test directory, which is not used at runtime, and then confirm that the repackaged ZIP file (pdf-to-text.zip) is small enough to upload. When packaging, run zip with the -r option so that the contents of node_modules are included recursively. See Install third-party dependencies to learn more.
    $ ls -l; du -hs .
    total 8
    -rw-r--r--@ 1 owner staff 2600 Jan 21 20:00 index.js
    drwxr-xr-x 5 owner staff 170 Jan 21 20:00 node_modules
    180M .
    $ du -h -d3 | sort -nr | head -n8
    660K ./node_modules/pdf2json/node_modules
    180M ./node_modules
    180M .
    178M ./node_modules/pdf2json
    176M ./node_modules/pdf2json/test
    108K ./node_modules/pdf2json/lib
    88K ./node_modules/pdf2json/.idea
    28K ./node_modules/pdfreader/lib
    $ zip pdf-to-text.zip index.js node_modules/
      adding: index.js (deflated 63%)
      adding: node_modules/ (stored 0%)
    $ ls -lh pdf-to-text.zip
    -rw-r--r-- 1 owner staff 1.3K Jan 21 20:10 pdf-to-text.z
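
The same search for oversized directories can be done from Node.js itself, which may be handy on systems without du. This is a minimal sketch using only the built-in fs and path modules; the function names dirSize and largestChildren are hypothetical, and the sizes are sums of file lengths, so they will differ somewhat from the disk usage that du reports.

```javascript
'use strict';
const fs = require('fs');
const path = require('path');

// Recursively sum file sizes (in bytes) under a directory.
function dirSize(dir) {
  let total = 0;
  for (const name of fs.readdirSync(dir)) {
    const full = path.join(dir, name);
    const stat = fs.lstatSync(full); // lstat: do not follow symlinks
    total += stat.isDirectory() ? dirSize(full) : stat.size;
  }
  return total;
}

// List the immediate children of `root`, largest first, at most `limit` entries.
function largestChildren(root, limit) {
  return fs.readdirSync(root)
    .map(function (name) {
      const full = path.join(root, name);
      const stat = fs.lstatSync(full);
      return { path: full, bytes: stat.isDirectory() ? dirSize(full) : stat.size };
    })
    .sort(function (a, b) { return b.bytes - a.bytes; })
    .slice(0, limit);
}

// Report the largest entries under the given directory (default: ./node_modules).
const root = process.argv[2] || './node_modules';
if (fs.existsSync(root)) {
  largestChildren(root, 8).forEach(function (entry) {
    console.log((entry.bytes / 1024 / 1024).toFixed(1) + 'M', entry.path);
  });
}
```

Running it against ./node_modules/pdf2json instead of the default root should surface a large test directory like the one removed in this walkthrough.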

Configure Service and Function

  1. Configure Service: Log on to the FC console and create a Service such as FileConvertService.
     • Role Config: authorize policies including AliyunOSSFullAccess, AliyunLogFullAccess, and AliyunFCFullAccess.
     • Log Configs (optional): bind a Log Project and Log Store. This is strongly recommended so that you can debug runtime errors using the Log function.
  2. Configure Function: Under the Service, create a Function such as pdf2Text. The Function is configured as follows:
     • Code:
         • Runtime: nodejs6 (or nodejs8)
         • Code Configuration: upload the .zip file.
     • Trigger: Create an OSS trigger with the following configuration so that the function is triggered whenever a *.pdf file is posted or put into /in. For more information, see OSS event trigger.
         • Trigger Type: OSS
         • Trigger Name: (for example, newPDFTrigger)
         • Bucket: (for example, my-cool-demo)
         • Events: select oss:ObjectCreated:PostObject and oss:ObjectCreated:PutObject. When an object is uploaded to the specified bucket directory and matches the trigger rule, OSS publishes a trigger event to invoke the function code.
         • Trigger Rule: Prefix in/ with Suffix .pdf
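
To reason about which uploads will fire the trigger, the rule above can be modeled as a small predicate. This is only an illustration: it assumes the prefix and suffix are compared as plain, case-sensitive strings, and the function name matchesTrigger is hypothetical rather than part of any FC or OSS API.

```javascript
'use strict';

// Event names selected for the trigger in this walkthrough.
const TRIGGER_EVENTS = [
  'oss:ObjectCreated:PostObject',
  'oss:ObjectCreated:PutObject'
];

// Trigger rule from the configuration above.
const TRIGGER_RULE = { prefix: 'in/', suffix: '.pdf' };

// Returns true when an event name and object key would match the trigger,
// assuming plain case-sensitive prefix and suffix comparison.
function matchesTrigger(eventName, key, rule) {
  return TRIGGER_EVENTS.indexOf(eventName) !== -1 &&
    key.startsWith(rule.prefix) &&
    key.endsWith(rule.suffix);
}

// A PDF uploaded to in/ matches; the text file the function writes to out/ does not.
console.log(matchesTrigger('oss:ObjectCreated:PutObject', 'in/sample.pdf', TRIGGER_RULE));
console.log(matchesTrigger('oss:ObjectCreated:PutObject', 'out/sample.txt', TRIGGER_RULE));
```

Because the function writes its results under out/ with a .txt extension, its own output does not match the rule, so the function does not re-trigger itself.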

Invoke Function
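
Besides uploading a PDF file to the in/ directory, a handler with the (eventBuf, ctx, callback) signature can be exercised locally with a synthetic event before deployment. The sketch below is an assumption-laden illustration: invokeLocally and echoKeyHandler are hypothetical names, the event and credential values are placeholders, and the stand-in handler only echoes the object key so that the example runs without OSS access.

```javascript
'use strict';

// Wrap a callback-style FC handler (eventBuf, ctx, callback) in a Promise.
function invokeLocally(handler, eventObj, ctx) {
  return new Promise(function (resolve, reject) {
    const eventBuf = Buffer.from(JSON.stringify(eventObj));
    handler(eventBuf, ctx, function (err, result) {
      if (err) { reject(err); } else { resolve(result); }
    });
  });
}

// Stand-in handler so the sketch runs offline. To exercise the real handler,
// pass require('./index.js').handler instead and supply valid credentials.
function echoKeyHandler(eventBuf, ctx, callback) {
  const ossEvent = JSON.parse(eventBuf.toString()).events[0];
  callback(null, ossEvent.oss.object.key);
}

// Synthetic event and context with placeholder values.
const fakeEvent = {
  events: [{
    region: 'cn-shanghai',
    oss: { bucket: { name: 'my-cool-demo' }, object: { key: 'in/sample.pdf' } }
  }]
};
const fakeCtx = {
  credentials: {
    accessKeyId: 'placeholder',
    accessKeySecret: 'placeholder',
    securityToken: 'placeholder'
  }
};

invokeLocally(echoKeyHandler, fakeEvent, fakeCtx).then(function (result) {
  console.log('Handler returned:', result);
});
```

With the real handler, placeholder credentials will make the OSS calls fail, so a local run of this kind mainly checks event parsing and error paths.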

Troubleshooting

  1. Use FC logging to troubleshoot runtime execution errors: The Log is the Swiss Army knife you will rely on to debug any runtime error. To use the FC Log, you need to activate Log Service and enable the Log Configs of the FC service. You can then iteratively test the function and view the logs of each execution; for example, the log can reveal a runtime error such as a Node.js ReferenceError. For more information, see the Log Service documentation.
  2. Ensure permissions are authorized to FC: If you do not authorize FC with access rights to OSS, the function log shows the resulting error.
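
One detail worth knowing when reading these logs: the handler logs failures with the "%j" format specifier, which serializes its argument as JSON. A plain Error object has no enumerable properties of its own, so it serializes as {} and the message is lost. The sketch below shows the effect and one possible alternative; describeError is a hypothetical helper, and the code property is included only if the error happens to carry one.

```javascript
'use strict';
const util = require('util');

// "%j" uses JSON.stringify, which skips an Error's non-enumerable
// message and stack, so a plain Error is logged as {}.
const sampleErr = new Error('boom');
console.log(util.format('Failed to put object: %j', sampleErr));

// Build a one-line description that keeps the error name and message,
// plus a `code` property when the error carries one.
function describeError(err) {
  if (!(err instanceof Error)) {
    return util.inspect(err);
  }
  let line = err.name + ': ' + err.message;
  if (err.code) {
    line += ' (code=' + err.code + ')';
  }
  return line;
}

console.log('Failed to put object: ' + describeError(sampleErr));
```

Logging err.stack as well gives the line number, which is usually the quickest way to locate something like a ReferenceError.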

What’s Next?

  1. Visit the Function Compute product documentation.
  2. Learn more about FC from introductory articles on the Alibaba Cloud community: Serverless Computing with Alibaba Cloud Function Compute, and How to Use Function Compute on Alibaba Cloud.
  3. Intelligent Media Management (IMM), currently available on the Alibaba Cloud China site, is a more powerful SaaS tool provided by Alibaba Cloud for processing media data, for example, Office file format conversion and image and video processing. It provides a RESTful API for integration. FC in conjunction with IMM offers more powerful conversion capability.

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com

Alibaba Cloud
