Smarter OCR with Impira AutoML

Document processing with OCR is still incredibly difficult in 2021. We explore why and introduce a disruptive solution through the use of AutoML technology.

Forms, invoices, screenshots, compiled reports, presentations, contracts, and many other documents are everywhere. They are rarely standardized and consume precious time to interpret, re-key, find, and analyze. Optical Character Recognition (OCR) has been around and evolving since the 1970s, but the problem of automating manual processes that revolve around documents is still very much unsolved.

We’ll take a brief look at the challenges involved with OCR and why, in 2021, it’s still so difficult to automate document workflows. At Impira, we are razor-focused on solving this problem, so after laying it out, we’ll also discuss our unique approach.

Framing the problem

Extracting usable structured data from documents involves multiple steps, beyond just OCR. At a high level, there are three:

  1. OCR: converting pixels to characters of text, with additional metadata including geometry, size, and page number.
Illustration of a quote on a piece of paper. OCR converts that quote into a string of text.
  2. Text extraction: identifying which text corresponds to a desired field; for example the total in an invoice, a named field in a form, or a term in a contract.
Illustration of a table from an invoice with an arrow pointing to a version of the same invoice, but with boxes and arrows labeling vendor info, total, and terms.
  3. Post-processing: processing the extracted text further, including normalization (e.g. converting “1,  23 4” to the number 1234), concatenation (e.g. combining “231” “Avery” “Lane” into “231 Avery Lane”), and interpretation (e.g. using named entity recognition to extract the diagnosis from a paragraph of text).
Illustration of how OCR detects letters and compares them to the alphabet to produce individual characters then links them into a word.
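
To make the three steps concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not a production pipeline: the hard-coded tokens stand in for the output of a real OCR engine, and the `Token` class and the helper names (`extract_field`, `normalize_number`, `concatenate`) are hypothetical. Text extraction is reduced to "take whatever sits to the right of a label on the same line," which is far simpler than what real documents require.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    """One OCR result: recognized text plus the metadata from step 1."""
    text: str
    x: float   # left edge of the bounding box
    y: float   # top edge of the bounding box
    page: int


def extract_field(tokens, label):
    """Step 2 (text extraction): tokens to the right of `label` on the same line."""
    anchor = next(t for t in tokens if t.text.lower() == label.lower())
    same_line = [
        t for t in tokens
        if t.page == anchor.page and abs(t.y - anchor.y) < 5 and t.x > anchor.x
    ]
    return sorted(same_line, key=lambda t: t.x)


def concatenate(tokens):
    """Step 3 (concatenation): join token texts in reading order."""
    return " ".join(t.text for t in tokens)


def normalize_number(raw):
    """Step 3 (normalization): drop commas and stray whitespace, then parse."""
    return int(re.sub(r"[,\s]", "", raw))


# Step 1 output, hard-coded here in place of a real OCR engine.
tokens = [
    Token("Address", 10, 100, 1), Token("231", 80, 100, 1),
    Token("Avery", 110, 101, 1), Token("Lane", 150, 99, 1),
    Token("Total", 10, 200, 1), Token("1,  23 4", 80, 200, 1),
]

address = concatenate(extract_field(tokens, "Address"))
total = normalize_number(concatenate(extract_field(tokens, "Total")))
print(address)  # 231 Avery Lane
print(total)    # 1234
```

Even in this toy version, most of the code belongs to steps 2 and 3 rather than to OCR itself.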

Technically, OCR covers only the first step, even though the term is often used to describe the whole process. Historically, that first step has received most of the attention; as a result, text extraction and post-processing are underdeveloped, and they are where solutions struggle.

The original wave of OCR software (Xerox, Oracle, Kofax, ABBYY) used a geometric template approach to solve text extraction. This works if every document has the same layout and is scanned perfectly, but of course this is never true in practice.

Image of two forms with terms highlighted in red. On the left, the form is shown straight on; on the right, the same form is slightly tilted, causing the red boxes to no longer align correctly.
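
The fragility of geometric templates can be shown with a few lines of Python. This is a sketch under stated assumptions: the box coordinates, the field position, and the 2-degree skew are made-up values, and rotating about the page origin is a simplification of what a real scanner does.

```python
import math

# Hypothetical template: the "total" field is expected inside this box
# (x_min, y_min, x_max, y_max), measured on a perfectly aligned scan.
TEMPLATE_BOX = (400, 700, 500, 720)


def in_box(point, box):
    """Return True if the point falls inside the template box."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def rotate(point, degrees):
    """Rotate a point about the page origin, roughly as a skewed scan would."""
    rad = math.radians(degrees)
    x, y = point
    return (x * math.cos(rad) - y * math.sin(rad),
            x * math.sin(rad) + y * math.cos(rad))


total_position = (450, 710)  # center of the value on the aligned scan

print(in_box(total_position, TEMPLATE_BOX))             # True
print(in_box(rotate(total_position, 2), TEMPLATE_BOX))  # False
```

A skew of only two degrees moves the value a few pixels outside the box, so the template silently reads the wrong region.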

Pre-trained AI solutions

More recently, modern artificial intelligence (AI) techniques have changed the picture: computer vision can tolerate scanning issues and layout changes, and natural language processing (NLP) can mimic human interpretation of language.

Intelligent APIs

A wave of AI-based API solutions has come to market, allowing you to upload an image or document and receive back a semi-structured representation of the underlying data. These APIs offer a dramatic improvement in ease of use and quality over their template-based predecessors; however, they come with a couple of key challenges:

  1. The API interface is very loosely defined and designed to support a wide variety of data, requiring a layer of code to make use of its outputs. In practice, this code is difficult to write and can break easily with subtle changes across documents.
  2. These solutions are pre-trained over a corpus ahead of time. This corpus covers a lot of the standard document types, like ACORD forms, but fails to perform well on the long tail of complex documents (like order forms, invoices, purchase orders, contracts, and reports) that show up in the real world.
a chart that describes the flexibility of forms.
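
The first challenge, brittle glue code, can be sketched as follows. The response shape below is invented for illustration and does not match any particular vendor's API; `get_total` is a hypothetical helper of the kind teams end up writing on top of such output.

```python
# Hypothetical semi-structured responses: a list of detected key/value pairs.
# This shape is invented for illustration, not taken from a real API.
response_vendor_a = {"pairs": [
    {"key": "Invoice #", "value": "1041"},
    {"key": "Total", "value": "$1,234.00"},
]}
response_vendor_b = {"pairs": [
    {"key": "Invoice No.", "value": "88-20"},
    {"key": "Amount Due:", "value": "USD 980.50"},
]}


def get_total(response):
    """Glue code: find the pair labeled 'Total' and parse its value."""
    for pair in response["pairs"]:
        if pair["key"] == "Total":
            return float(pair["value"].replace("$", "").replace(",", ""))
    return None  # label not found


print(get_total(response_vendor_a))  # 1234.0
print(get_total(response_vendor_b))  # None
```

The second document carries the same information under a different label and currency format, and the glue code returns nothing. Each new variation means another special case to write and maintain.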

Intelligent Document Processing (IDP) platforms

Another flavor of pre-trained solution is an intelligent document processing (IDP) platform. These tools leverage the same underlying technologies as an AI-based API; however, they provide a suite of user interfaces and integrations that make them much easier to integrate into a broader workflow. Unfortunately, these tools also carry significant challenges:

  1. They also leverage pre-trained models under the hood and struggle to adapt to new document characteristics that show up in the real world: varying layouts, fields, and even languages. Some vendors offer custom training as part of their offering, but these are expensive services that are slow and painful to implement.
  2. Paired with pre-trained models are pre-configured schemas that map to the fields in the underlying models. For example, a pre-trained 1040 form model would include a schema with each of the fields on a 1040 form (e.g. “Foreign country name”). While this works for standardized forms, most use cases involve non-standard forms and semi-structured documents like invoices, which have different fields across industries and even vendors.
1040 form from 2019 with terms highlighted, arrow pointing to a 1040 form from 2020, with highlights not matching the correct information, and an illustration of an engineer reprogramming the model, another arrow that points to a final 1040 form from 2020 with the correct info highlighted.
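
The schema mismatch described in the second point can be expressed as a simple set difference. All of the field names and customer types below are hypothetical; the point is only that a fixed schema cannot anticipate the fields each industry or vendor needs.

```python
# Hypothetical pre-configured schema shipped with a pre-trained invoice model.
PRECONFIGURED_SCHEMA = {"invoice_number", "vendor_name", "total", "due_date"}

# Fields two hypothetical customers actually need from their invoices.
needed_fields = {
    "freight_customer": {"invoice_number", "total", "bill_of_lading", "carrier"},
    "clinic_customer": {"invoice_number", "total", "patient_id", "procedure_code"},
}


def uncovered_fields(needed, schema):
    """Return the needed fields that the fixed schema has no slot for."""
    return sorted(needed - schema)


for customer, needed in needed_fields.items():
    print(customer, uncovered_fields(needed, PRECONFIGURED_SCHEMA))
# freight_customer ['bill_of_lading', 'carrier']
# clinic_customer ['patient_id', 'procedure_code']
```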

AutoML: putting learning in the hands of the user

What if, instead of preparing for every possible scenario, machine learning software could learn based on your documents? You’d get all of the benefits of an AI-driven approach, over any set of documents you want to work with. You would not be limited to the documents a product has encountered ahead of time and, as your own documents and data evolve, so would the algorithms that interpret them.

To date, the only way to accomplish this has been to build your own machine learning models, which is expensive and requires training, evaluating, and testing to get right. There is a new movement in machine learning, however, called AutoML, which is all about automating these steps. It is an ambitious idea that, if executed well, results in a user experience that is dead simple for anyone to use. AutoML has been successfully applied to a number of problems, including image labeling, time series forecasting, and entity recognition. In the context of document processing, an AutoML-based approach would allow a tool to learn the nuances of your documents on the fly, bringing the benefits of an AI-based approach to any set of documents you want to wrangle.
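
The core loop that AutoML automates (train each candidate, evaluate it on held-out data, keep the best) can be sketched in a few lines of Python. This is a toy under stated assumptions: the two candidate "models" are deliberately simple heuristics, the documents are short strings, and none of the names below come from a real AutoML library.

```python
from collections import Counter


class PositionModel:
    """Predicts the token at the index where the value most often appeared."""

    def fit(self, examples):
        counts = Counter(doc.split().index(value) for doc, value in examples)
        self.index = counts.most_common(1)[0][0]
        return self

    def predict(self, doc):
        tokens = doc.split()
        return tokens[self.index] if self.index < len(tokens) else None


class AnchorModel:
    """Predicts the token after the word that most often preceded the value."""

    def fit(self, examples):
        preceding = Counter()
        for doc, value in examples:
            tokens = doc.split()
            i = tokens.index(value)
            if i > 0:
                preceding[tokens[i - 1]] += 1
        self.anchor = preceding.most_common(1)[0][0]
        return self

    def predict(self, doc):
        tokens = doc.split()
        if self.anchor in tokens[:-1]:
            return tokens[tokens.index(self.anchor) + 1]
        return None


def accuracy(model, examples):
    """Fraction of examples where the prediction matches the labeled value."""
    return sum(model.predict(doc) == value for doc, value in examples) / len(examples)


def auto_select(candidates, train, validation):
    """Train every candidate, score it on held-out data, and keep the best."""
    fitted = [cls().fit(train) for cls in candidates]
    best = max(fitted, key=lambda m: accuracy(m, validation))
    return best, accuracy(best, validation)


train = [("Invoice 17 Total 120.00", "120.00"), ("Invoice 18 Total 75.50", "75.50")]
validation = [("Invoice 19 PO 4471 Total 310.00", "310.00"),
              ("Total 42.00 Invoice 20", "42.00")]

best, score = auto_select([PositionModel, AnchorModel], train, validation)
print(type(best).__name__, score)  # AnchorModel 1.0
```

The user supplies only labeled examples; choosing and validating the model happens without their involvement. Real AutoML systems search over far richer model families and hyperparameters.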

This is, of course, no easy task, but it is our singular focus at Impira. Our product allows any user, technical or not, to upload a set of documents, highlight a few fields, and immediately extract clean, structured data to export and analyze. Under the hood, each time you create a new field, you instantiate a new model that re-trains itself with each user-provided example. The model is automatically run across every document you upload to the platform, and its outputs are indexed for easy access. The result feels a lot like using a relational database, except that instead of loading tabular data, you load documents, which you can query just as easily.
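
The workflow described above (one model per field, re-trained on each highlighted example, re-run across all documents, with outputs indexed for querying) might be sketched like this. To be clear, this is not Impira's implementation: `FieldModel`, `Workspace`, and the "learn the preceding label" heuristic are hypothetical stand-ins chosen to keep the example short.

```python
from collections import Counter


class FieldModel:
    """Hypothetical per-field model: learns which label precedes the value."""

    def __init__(self, name):
        self.name = name
        self.examples = []  # (document text, highlighted value)
        self.anchor = None

    def add_example(self, doc, value):
        """Store a user-highlighted example, then re-train on all examples."""
        self.examples.append((doc, value))
        votes = Counter()
        for text, val in self.examples:
            tokens = text.split()
            i = tokens.index(val)
            if i > 0:
                votes[tokens[i - 1]] += 1
        self.anchor = votes.most_common(1)[0][0] if votes else None

    def predict(self, doc):
        tokens = doc.split()
        if self.anchor in tokens[:-1]:
            return tokens[tokens.index(self.anchor) + 1]
        return None


class Workspace:
    """Holds documents, one model per field, and an index of model outputs."""

    def __init__(self):
        self.documents = {}  # doc id -> text
        self.fields = {}     # field name -> FieldModel
        self.index = {}      # doc id -> {field name: extracted value}

    def upload(self, doc_id, text):
        self.documents[doc_id] = text
        self._reindex()

    def create_field(self, name):
        self.fields[name] = FieldModel(name)

    def label(self, field, doc_id, value):
        self.fields[field].add_example(self.documents[doc_id], value)
        self._reindex()

    def _reindex(self):
        """Run every field model over every document and store the results."""
        self.index = {
            doc_id: {name: m.predict(text) for name, m in self.fields.items()}
            for doc_id, text in self.documents.items()
        }

    def query(self, field):
        return {doc_id: row.get(field) for doc_id, row in self.index.items()}


ws = Workspace()
ws.upload("a.pdf", "Invoice 17 Total 120.00")
ws.upload("b.pdf", "Acme Corp Total 75.50 Net 30")
ws.create_field("total")
ws.label("total", "a.pdf", "120.00")  # the user highlights one value
print(ws.query("total"))  # {'a.pdf': '120.00', 'b.pdf': '75.50'}
```

A single highlighted example on one document is enough for the toy model to fill in the same field on the other, which is the database-like experience the paragraph describes.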

Impira AutoML works with as few as one example, which is often enough for structured forms. Of course, the more data you provide, the more accurate the results, so Impira will prompt you to provide input when it’s not confident enough about its predictions. The platform also offers a number of tools to help you post-process data. It comes with a rich query language called Impira Query Language, or IQL, which you can use to join data across documents, split up or transform fields, and even aggregate across documents. We find that these tools save our users from writing code to further normalize the data in their documents. And finally, everything is accessible through our UI and a simple REST API, which you can use to upload files and query extracted data.
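
Prompting for input only when the model is unsure can be sketched as a simple confidence threshold. The threshold value, the confidence scores, and the `triage` helper below are all assumptions made for illustration; they do not describe how Impira computes or uses confidence.

```python
REVIEW_THRESHOLD = 0.8  # hypothetical cutoff; a real system would tune this


def triage(predictions, threshold=REVIEW_THRESHOLD):
    """Split predictions into auto-accepted values and ones needing user input."""
    accepted, needs_review = {}, []
    for doc_id, (value, confidence) in predictions.items():
        if confidence >= threshold:
            accepted[doc_id] = value
        else:
            needs_review.append(doc_id)
    return accepted, needs_review


# Made-up (value, confidence) pairs for three documents.
predictions = {
    "a.pdf": ("120.00", 0.97),
    "b.pdf": ("75.50", 0.91),
    "c.pdf": ("30", 0.42),  # e.g. the model may have picked up "Net 30"
}

accepted, needs_review = triage(predictions)
print(accepted)      # {'a.pdf': '120.00', 'b.pdf': '75.50'}
print(needs_review)  # ['c.pdf']
```

Each document the user corrects from the review list becomes another training example, so the effort goes where the model is weakest.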

At Impira, we are driven to build the easiest way to work with unstructured data. AutoML is the technology at our core that makes this possible, and we are hard at work expanding the capabilities of our platform to support more types of documents, newer file types, higher accuracy rates, and more integrations with the ecosystem of tools. If you would like to learn more about AutoML, please reach out to us. We would love to discuss use cases or collaborate in other ways.

Sign up for Impira AutoML today.