
How we built the checkbox classifier

Automatically recognizing checkboxes and determining whether they are checked is necessary to fully automate data extraction from certain documents, like ACORD forms. As humans, we understand checkboxes instantly by sight, but doing this automatically requires combining computer vision with Impira's existing extraction algorithms.


Intro to checkbox extraction

Impira is a machine learning platform that extracts key data from documents like insurance certificates and bills of lading. After many customer requests, we built the checkbox classifier so users can automatically extract the status of checkboxes: checked, unchecked, or not present.

Figure 1: ACORD forms, used in insurance and healthcare, often have checkboxes.

Impira’s checkbox extraction feature relies on two machine learning (ML) models operating in series, which together produce more accurate results than either model alone. At evaluation time, the first model predicts an approximate bounding box for where it believes the checkbox might be in a given document, a technique typically referred to as region proposal. This region is then passed to the second model, the checkbox classifier: a convolutional neural network with two jobs: 1) determine the state of the checkbox in the region (i.e., checked or unchecked) and 2) find the precise bounding box around the checkbox.

The region proposal model that Impira uses is a probabilistic model that analyzes all of the documents in a given collection (both labeled and unlabeled) to find words or phrases that can serve as reliable reference points when making field predictions. This model is a foundational building block used in several other types of field extraction, and adequately covering it would warrant a blog post of its own. Essentially, the checkbox classifier builds on top of this model to classify the checkbox and correct the bounding box originally predicted by the region proposal model. The remainder of this post focuses on the checkbox classifier’s model architecture, outputs, and the data generation methods used during training.

Model architecture

The checkbox classifier, the second model in the checkbox extraction pipeline, is a two-headed convolutional neural network that takes an image crop as input and outputs two arrays, each containing three values. The first output array is a probability distribution across the possible checkbox states, and the second is a bounding box prediction of the checkbox location in the image. Example output is shown in figure 2 below.

Figure 2: The classifier consists of two heads connected to a shared convolutional feature map, and outputs predictions of the checkbox state as well as its bounding box.
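To make the two-headed architecture concrete, here is a minimal NumPy sketch: a shared feature extractor feeding a 3-way classification head and a 3-value box regression head. This is illustrative only; the real model uses a trained convolutional backbone, whereas the weights, input size, and feature dimension here are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class TwoHeadedClassifier:
    """Toy stand-in for the checkbox classifier: shared features feed
    a 3-way classification head and a 3-value bounding box head."""

    def __init__(self, feat_dim=64):
        # Random, untrained weights; a real model learns these.
        self.w_shared = rng.normal(size=(32 * 32, feat_dim)) * 0.01
        self.w_cls = rng.normal(size=(feat_dim, 3)) * 0.01  # unchecked / checked / no value
        self.w_box = rng.normal(size=(feat_dim, 3)) * 0.01  # (tx, ty, tw)

    def forward(self, crop):
        # crop: (32, 32) grayscale image crop
        feats = np.maximum(crop.reshape(-1) @ self.w_shared, 0.0)  # ReLU
        class_probs = softmax(feats @ self.w_cls)  # distribution over 3 states
        bbox = feats @ self.w_box                  # normalized box prediction
        return class_probs, bbox

model = TwoHeadedClassifier()
probs, bbox = model.forward(rng.normal(size=(32, 32)))
print(probs.shape, bbox.shape)  # two arrays of three values each
```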

Classification head

There are three possible classes that the classifier head of the model can choose from, and they correspond to the following labels:

  1. Unchecked

  2. Checked

  3. No value (i.e., no checkbox present)

See figure 4 for example images of each of the above three labels. The last category is essential for two reasons. First, it’s possible the checkbox is simply not present in some of the documents in a given collection, in which case “No value” is actually the correct prediction. Second, the region proposal model generating the input image crop may occasionally make an incorrect prediction, so it’s important for the classifier to recognize when this happens and convey that information to the user. No ML model will ever be perfect, and the capacity to inform the user of shaky predictions so they can correct them and further train their model is one of the most powerful features Impira offers.
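In code, interpreting the classification head's output might look like the sketch below. The `interpret` helper and the 0.8 confidence threshold are illustrative assumptions, not part of Impira's actual product; the point is that a probability distribution over the three states lets the app both pick a label and flag shaky predictions for review.

```python
LABELS = ["Unchecked", "Checked", "No value"]

def interpret(class_probs, confidence_threshold=0.8):
    """Map the classifier head's probabilities to a label, and flag
    whether the prediction is confident enough to trust unreviewed.
    (Threshold is an illustrative assumption.)"""
    best = max(range(len(LABELS)), key=lambda i: class_probs[i])
    return LABELS[best], class_probs[best] >= confidence_threshold

print(interpret([0.05, 0.90, 0.05]))  # confident "Checked"
print(interpret([0.40, 0.35, 0.25]))  # low-confidence prediction, flag for review
```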

Bounding box regressor head

The second head of the neural network is a bounding box regressor responsible for predicting the exact checkbox location within the image. The region proposal model that comes before the classifier can be thought of as a quick and dirty way to narrow in on the checkbox location. However, its predictions are a rough approximation and may be off-center, the wrong size, or contain a lot of extra whitespace. Thus the bounding box regressor head of the classifier is essentially a correction to the originally proposed bounding box.

The bounding box regressor outputs three values, defined by the following equations, respectively:

tx = (xa - xc) / wc
ty = (ya - yc) / wc
tw = wa / wc

Where xc and yc denote the x and y coordinates of the center of the image crop, respectively, and wc is the width of the image crop (which is always square). Similarly, the variables xa, ya, and wa are the x and y center coordinates and width of the predicted bounding box of the checkbox. Put another way, tx and ty can be thought of as the normalized predicted displacement of the checkbox’s center from the image center, and tw is the normalized predicted width of the checkbox.
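These definitions are straightforward to express in code. The helper names below are hypothetical, but the arithmetic follows the normalized-coordinate description above, including the inverse mapping used to recover the absolute box from a prediction.

```python
def encode_box(crop_cx, crop_cy, crop_w, box_cx, box_cy, box_w):
    """Normalized regression targets: displacement of the checkbox
    center from the crop center, and width, both scaled by crop width
    (the crop is always square)."""
    tx = (box_cx - crop_cx) / crop_w
    ty = (box_cy - crop_cy) / crop_w
    tw = box_w / crop_w
    return tx, ty, tw

def decode_box(crop_cx, crop_cy, crop_w, tx, ty, tw):
    """Invert the encoding to recover the absolute box center and width."""
    return crop_cx + tx * crop_w, crop_cy + ty * crop_w, tw * crop_w

# A 100px-wide crop centered at (50, 50); checkbox centered at (60, 45), 20px wide:
print(encode_box(50, 50, 100, 60, 45, 20))  # -> (0.1, -0.05, 0.2)
```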

Figure 3: The black variables denote the center and size of the original image crop, whereas the blue variables denote the ground truth bounding box of the checkbox. Note that (0, 0) is assumed to be in the top left of the image.

The above normalized coordinates are loosely based on the parameterized equations used in the Faster R-CNN paper, although the notation has been changed slightly. In addition, we’ve added the constraint that all bounding boxes should have matching widths and heights. While there are exceptions, this is generally true for most styles of checkboxes. Enforcing the bounding box to always have the same shape results in a cleaner appearance within the app (as opposed to differently shaped rectangles across a field’s predictions). Additionally, removing this degree of freedom from the output often allows the model to converge more quickly toward an optimal solution, resulting in faster training times.

Model training and data generation

Loss function

During training, the loss function is a linear combination of two separate losses, one for the classification head and one for the bounding box regressor. In particular, we use cross-entropy loss for the classification layer and mean squared error for the bounding box prediction:

L = -Σi yi log(ŷi) + λ Σi (ti - t̂i)²

Where yi is the ith component of the (one-hot encoded) class label, ŷi is the ith output of the softmax classifier layer, and ti and t̂i are the ith components of the ground truth and predicted bounding box parameterizations, respectively. Note that the entire loss function also gets averaged with respect to batch size. The hyperparameter λ is a constant that determines how much the model should prioritize training the bounding box regressor vs. the classifier.

Training data

At the heart of the model is the process by which new data is generated at training time. The raw training data consists of zoomed-out image crops centered on checkboxes, as well as other random crops that don’t contain checkboxes. Taking random sub-crops at training time enables the model not only to learn whether a checkbox is present anywhere in an image, but also to predict the correct bounding box location of the checkbox.

Figure 4: The raw training data consists of image crops centered on checked and unchecked checkboxes as well as crops not containing checkboxes, denoted by labels 0, 1, and 2, respectively. At training time, random subcrops of these images are taken and augmented.

During training, the model uses a data generator that takes random crops of the raw training images above (ensuring that each resulting crop at least partially overlaps with the ground truth bounding box of the checkbox). The normalized bounding box coordinates (tx, ty, tw) are then calculated and saved so that the loss can be computed after forward propagation. Various forms of image augmentation typical of real uploaded documents are also applied to the crops: variations in contrast; translation, scale, and rotation changes; and image compression noise artifacts. A new set of augmented image crops is generated after every epoch so that the model never sees the same image twice during training, resulting in a highly accurate and robust model.

Below are some examples of the random image crops generated during training. Notice that the data generator will sometimes create image crops which occlude part of the checkbox.

Figure 5: By taking random crops and applying several forms of image augmentation, a wide variety of training images can be generated from a single raw training image.
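A simplified version of such a generator (without augmentation) might look like the sketch below. The function name, uniform crop sampling, and rejection loop are assumptions for illustration, but it enforces the partial-overlap requirement and emits the (tx, ty, tw) targets described above.

```python
import random

def random_subcrop(img_w, img_h, box, crop_size, rng=random.Random(0)):
    """Sample a square sub-crop that at least partially overlaps the
    ground truth checkbox box = (x0, y0, x1, y1). Returns the crop
    origin and the normalized (tx, ty, tw) regression targets."""
    x0, y0, x1, y1 = box
    while True:
        cx0 = rng.uniform(0, img_w - crop_size)
        cy0 = rng.uniform(0, img_h - crop_size)
        # Reject crops that miss the checkbox entirely.
        if cx0 < x1 and cx0 + crop_size > x0 and cy0 < y1 and cy0 + crop_size > y0:
            break
    crop_cx, crop_cy = cx0 + crop_size / 2, cy0 + crop_size / 2
    box_cx, box_cy, box_w = (x0 + x1) / 2, (y0 + y1) / 2, x1 - x0
    tx = (box_cx - crop_cx) / crop_size
    ty = (box_cy - crop_cy) / crop_size
    tw = box_w / crop_size
    return (cx0, cy0), (tx, ty, tw)

# A 200x200 page with a 20px checkbox at (40, 40)-(60, 60), cropped at 100px:
origin, targets = random_subcrop(200, 200, (40, 40, 60, 60), 100)
```

Because each call samples a fresh crop (before augmentation is even applied), one labeled checkbox yields many distinct training examples, which is what makes the single-image-per-label approach in figure 5 viable.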

Fine-tuning the classifier on a per-field basis

Every checkbox field relies on a master pre-trained classifier that has been trained offline with large amounts of curated training data, using the training and data generation scheme described earlier. While this master model is good enough for most fields, with accuracies typically exceeding 99%, it occasionally encounters particular checkboxes that are difficult to classify correctly. In these situations we create a unique copy of the classifier which is further trained, or fine-tuned, on the user-provided labels to better learn that specific field. The algorithm and training scheme used for fine-tuning are much more advanced and outside the scope of the present discussion (they will likely be the subject of a future blog post!).


The classifier model discussed in this article is responsible for looking at an image crop and 1) determining whether a checkbox is present, as well as its state, and 2) outputting the bounding box region where the checkbox is located in the image. This is accomplished by a two-headed convolutional neural network, where one head performs classification across three states (checked, unchecked, and not a checkbox) and the other performs bounding box regression. The raw training data consists of zoomed-out image crops centered on checkboxes, and during training random sub-crops help train the bounding box regressor head. With a healthy amount of image augmentation, the result is a robust classifier model that can accurately locate and classify checkboxes across a wide variety of documents.
