Technology / Feb 10, 2021
Ankur Goyal
Document processing with OCR is still incredibly difficult in 2021. We explore why and introduce a disruptive solution through the use of AutoML technology.
Forms, invoices, screenshots, compiled reports, presentations, contracts, and many other documents are everywhere. They are rarely standardized and consume precious time to interpret, re-key, find, and analyze. Optical Character Recognition (OCR) has been around and evolving since the 1970s, but the problem of automating manual processes that revolve around documents is still very much unsolved.
We’ll take a brief look at the challenges involved with OCR and why, in 2021, it’s still so difficult to automate document workflows. At Impira, we are razor-focused on solving this problem, so after laying it out, we’ll also discuss our approach.
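Extracting usable structured data from documents involves multiple steps beyond just OCR. At a high level, there are three: OCR itself (recognizing the characters on the page), text extraction (locating the specific values you need within that text), and post-processing (cleaning and normalizing those values into usable data).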
Technically, OCR covers only the first step, even though the term is often used to describe the whole process. Historically, that first step has received most of the attention; as a result, text extraction and post-processing are underdeveloped, and they are where solutions struggle.
The original wave of OCR software (Xerox, Oracle, Kofax, ABBYY) used a geometric template approach to solve text extraction. This works if every document has the same layout and is scanned perfectly, but of course this is never true in practice.
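To make the limitation concrete, here is a minimal sketch of a geometric template in Python. It is not any vendor's implementation: the `Word` and `Region` types, the field names, and all coordinates are hypothetical, and the OCR step is assumed to have already produced words with bounding boxes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR'd word with its bounding box (hypothetical format)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


@dataclass(frozen=True)
class Region:
    """A fixed rectangle on the page where a field is expected to appear."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, word: Word) -> bool:
        # A word belongs to the region if its center point falls inside it.
        cx = word.x + word.w / 2
        cy = word.y + word.h / 2
        return self.x0 <= cx <= self.x1 and self.y0 <= cy <= self.y1


def extract_with_template(words, template):
    """Return, for each field, the text found inside its fixed region."""
    result = {}
    for field, region in template.items():
        hits = [w for w in words if region.contains(w)]
        hits.sort(key=lambda w: (w.y, w.x))  # reading order
        result[field] = " ".join(w.text for w in hits)
    return result


# Hypothetical template: coordinates chosen for one specific layout.
TEMPLATE = {
    "invoice_number": Region(400, 40, 560, 70),
    "total": Region(400, 700, 560, 730),
}

clean_scan = [
    Word("Invoice", 300, 45, 60, 14),
    Word("INV-1042", 410, 45, 80, 14),
    Word("Total", 300, 705, 40, 14),
    Word("$1,250.00", 410, 705, 90, 14),
]

# The same page, scanned with everything shifted 60 units down.
shifted_scan = [Word(w.text, w.x, w.y + 60, w.w, w.h) for w in clean_scan]

print(extract_with_template(clean_scan, TEMPLATE))
print(extract_with_template(shifted_scan, TEMPLATE))
```

On the clean scan both fields are found; on the shifted scan both come back empty, even though a person would read the two pages identically.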
More recently, modern artificial intelligence (AI) techniques have relaxed these constraints: computer vision can tolerate scanning issues and layout changes, and natural language processing (NLP) can mimic human interpretation of language.
A wave of AI-based API solutions has come to market, allowing you to upload an image or document and receive back a semi-structured representation of the underlying data. These APIs offer a dramatic improvement in ease of use and quality over their template-based predecessors; however, they come with a couple of key challenges:
Another flavor of pre-trained solution is an intelligent document processing (IDP) platform. These tools leverage the same underlying technologies as an AI-based API; however, they provide a suite of user interfaces and integrations that make them much easier to integrate into a broader workflow. Unfortunately, these tools also carry significant challenges:
What if, instead of preparing for every possible scenario, machine learning software could learn based on your documents? You’d get all of the benefits of an AI-driven approach, over any set of documents you want to work with. You would not be limited to the documents a product has encountered ahead of time and, as your own documents and data evolve, so would the algorithms that interpret them.
To date, the only way to accomplish this has been to build your own machine learning models, which is expensive and requires training, evaluating, and testing to get right. There is a new movement in machine learning, however, called AutoML, which is all about automating these steps. This is an ambitious idea that, if executed well, results in a user experience that is dead-simple for anyone to use. AutoML has been successfully applied to a number of problems, including image labeling, time series forecasting, and entity recognition. In the context of document processing, an AutoML-based approach would allow a tool to learn the nuances of your documents on the fly, bringing the benefits of an AI-based approach to any set of documents you want to wrangle.
This, of course, is no easy task, but it is our singular focus at Impira. Our product allows any user, technical or not, to upload a set of documents, highlight a few fields, and immediately extract clean, structured data to export and analyze. Under the hood, each time you create a new field, you are instantiating a new model, which retrains itself with each user-provided example. The model is automatically run across every document you upload into the platform, and its outputs are indexed for easy access. The result feels a lot like using a relational database: instead of loading tabular data, you load documents, which you can query just as easily.
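The field-as-model idea can be illustrated with a deliberately tiny Python sketch. This is not Impira's model or API: `FieldModel`, its methods, and the "learn the label that precedes the value" rule are all assumptions made for illustration, standing in for a real learned model. What it does show is the workflow described above: one model per field, retraining on every highlighted example, and running across every document to build an index.

```python
from collections import Counter


class FieldModel:
    """Toy stand-in for a per-field model (hypothetical, not Impira's)."""

    def __init__(self, name):
        self.name = name
        self.examples = []             # (tokens, highlighted value) pairs
        self.anchor_counts = Counter() # label tokens seen just before values

    def add_example(self, tokens, value):
        # Every user-provided example triggers a full retrain.
        self.examples.append((tokens, value))
        self._retrain()

    def _retrain(self):
        # "Training" here just counts which token precedes the value.
        self.anchor_counts = Counter()
        for tokens, value in self.examples:
            if value in tokens:
                idx = tokens.index(value)
                if idx > 0:
                    self.anchor_counts[tokens[idx - 1]] += 1

    def predict(self, tokens):
        # Return the token that follows the most common learned label.
        if not self.anchor_counts:
            return None
        anchor, _ = self.anchor_counts.most_common(1)[0]
        for i, tok in enumerate(tokens[:-1]):
            if tok == anchor:
                return tokens[i + 1]
        return None


# Two documents with different layouts (already OCR'd into tokens).
documents = {
    "a": "Invoice INV-1042 Total: $1,250.00".split(),
    "b": "Total: $87.10 Invoice INV-2001".split(),
}

# Creating a field instantiates a model; one highlight trains it.
total = FieldModel("total")
total.add_example(documents["a"], "$1,250.00")

# The model runs across every document and its outputs are indexed.
index = {doc_id: total.predict(tokens) for doc_id, tokens in documents.items()}
print(index)
```

A single example on document `a` is enough for this toy model to pick up the value in document `b`, despite the different layout; a real system would learn far richer signals than one neighboring token.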
Impira AutoML works with as few as one example, which is often enough for structured forms. Of course, the more data you provide, the more accurate the results, so Impira will prompt you to provide input when it’s not confident enough about its predictions. The platform also offers a number of tools to help you post-process data. It comes with a rich query language called Impira Query Language, or IQL, which you can use to join data across documents, split or transform fields, and even aggregate across documents. We find that these tools save our users from writing code to further normalize the data in their documents. Finally, everything is accessible through our UI and a simple REST API, which you can use to upload files and query extracted data.
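The confidence-driven prompting described above can be sketched generically in Python. The `Prediction` type, the sample values, and the `REVIEW_THRESHOLD` of 0.8 are all hypothetical; this shows the general pattern of routing low-confidence predictions to a person, not how Impira computes or uses confidence.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    doc_id: str
    field: str
    value: str
    confidence: float  # model's estimate, between 0 and 1


# Hypothetical cutoff; a real system would tune or learn this.
REVIEW_THRESHOLD = 0.8


def triage(predictions, threshold=REVIEW_THRESHOLD):
    """Split predictions into auto-accepted ones and ones needing user input."""
    accepted = [p for p in predictions if p.confidence >= threshold]
    needs_review = [p for p in predictions if p.confidence < threshold]
    return accepted, needs_review


predictions = [
    Prediction("a", "total", "$1,250.00", 0.97),
    Prediction("b", "total", "$87.10", 0.55),
    Prediction("c", "total", "$410.00", 0.91),
]

accepted, needs_review = triage(predictions)
for p in needs_review:
    print(f"Please confirm {p.field} for document {p.doc_id}: {p.value}")
```

Each confirmation or correction gathered this way becomes another training example, which is how accuracy improves as more data is provided.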
At Impira, we are driven to build the easiest way to work with unstructured data. AutoML is the technology at our core that makes this possible, and we are hard at work expanding the capabilities of our platform to support more types of documents, newer file types, higher accuracy rates, and more integrations with the ecosystem of tools. If you would like to learn more about AutoML, please reach out to us. We would love to discuss use cases or collaborate in other ways.