Last updated

March 3, 2021

Getting started with extraction

In this article, we'll walk through the process of training machine learning (ML) models to recognize and extract text and checkboxes from any file.

We can do this by following four simple steps.

Step 1: Upload your files

You can add files by uploading them directly from your browser or importing them from an Amazon S3 bucket or Dropbox. Once uploaded, your files will appear in All Files.

Step 2: Create a collection

Next, select the files from which you want to extract data. To do this, select the files in All Files and add them to a new collection, which is a folder that holds and organizes your different file types. Name your collection and save it.

Step 3: Select data from a file

Double-click on a file to open the File View.

Text Extraction

Start by highlighting the text that you want to extract (e.g., name, date, or address). That'll bring up the add field panel on the right-hand side of the File View. You can choose between numeric, date, and text types for the column. 
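To make the three column types concrete, here is a minimal Python sketch of how a highlighted string might be converted to a numeric, date, or text value. This is an illustration only, not Impira's implementation: the function name `coerce`, the accepted date formats, and the handling of commas and dollar signs are all assumptions.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical date formats; the formats a real product accepts may differ.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def coerce(raw, field_type):
    """Convert highlighted text to the chosen column type.

    Returns None when the text does not fit the numeric or date type.
    """
    text = raw.strip()
    if field_type == "text":
        return text
    if field_type == "numeric":
        # Assumption: drop thousands separators and a leading dollar sign.
        cleaned = text.replace(",", "").lstrip("$")
        try:
            return Decimal(cleaned)
        except InvalidOperation:
            return None
    if field_type == "date":
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(text, fmt).date()
            except ValueError:
                continue
        return None
    raise ValueError(f"unknown field type: {field_type}")


print(coerce("$1,250.00", "numeric"))
print(coerce("03/03/2021", "date"))
print(coerce(" Jane Doe ", "text"))
```

The point of picking a type up front is that values which do not fit (for example, "n/a" in a numeric column) can be flagged instead of being stored as free text.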

Name your new field and select Add Field. Impira AutoML will immediately create a new text extraction column and attempt to identify and extract the same data across the rest of the files within that collection.

You can also highlight text and assign it to an existing field. This immediately retrains the ML model, which then applies what it learned to the other files in your collection.

Checkbox Extraction

To extract a checkbox, start by drawing a box around the checkbox that you want to extract. As with text extraction, this will bring up the add field panel on the right-hand side of the File View. Name the new field and select the checkbox extraction type.

Once you press Add Field, Impira will immediately create a new checkbox extraction column and attempt to identify and extract that checkbox in the rest of the files in the collection. 

Impira’s checkbox extraction is trained to correctly identify checkboxes, radio buttons, and other yes/no indicators. 

As with text extraction, you can also draw a box around a checkbox and assign it to an existing field. This immediately retrains the ML model, which then applies what it learned to the other files in your collection.

Step 4: Verify and confirm your extractions

Once you're done, you can close the File View and go back to the table view. (Don’t worry, your results are always saved.) The ML model uses the selections you made to learn what data to identify in the rest of your files. You can open File View for your other files to review and confirm those predictions. Confirming that they are correct also improves the accuracy of the ML models.

You will notice three colors of text in ML columns. Here's what they mean and what you can do about them:

  • Green: High confidence prediction. No further validation needed.
  • Red: Review-recommended prediction. You may need to manually check and confirm the value. Once you confirm it, the ML model will retrain itself and redo its search for the right data.
  • Black: An extraction you personally made. No further validation needed.
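
The color scheme above amounts to a simple triage rule: only red values need your attention. The Python sketch below models that rule with hypothetical names (`Status`, `Cell`, `review_queue`) and made-up sample rows. It is not Impira's data model or API, only a way to picture which cells end up in a review queue.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    HIGH_CONFIDENCE = "green"  # model prediction, no further validation needed
    NEEDS_REVIEW = "red"       # model prediction, review recommended
    USER_LABELED = "black"     # extraction made by the user


@dataclass
class Cell:
    file_name: str
    field: str
    value: str
    status: Status


def review_queue(cells):
    """Return only the cells that may need a manual check (the red ones)."""
    return [c for c in cells if c.status is Status.NEEDS_REVIEW]


# Made-up sample rows for illustration.
cells = [
    Cell("invoice_01.pdf", "date", "2021-03-03", Status.USER_LABELED),
    Cell("invoice_02.pdf", "date", "2021-03-04", Status.HIGH_CONFIDENCE),
    Cell("invoice_03.pdf", "date", "2O21-03-05", Status.NEEDS_REVIEW),
]

for cell in review_queue(cells):
    print(f"{cell.file_name}: check {cell.field} = {cell.value!r}")
```

In this sketch, confirming or correcting a red value would change its status to a user-made extraction, which mirrors how confirmed values appear in black.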

That's it. You can repeat this process as many times as you need to add more fields until you have everything you need. 

For more tips and tricks about how to use ML, read more here.