Impira has a unique approach to machine learning that allows users to provide feedback and see updated predictions in real time. Could you describe some of the challenges that came with creating that approach?
We think about the engineering challenges in terms of three main pillars: the machine learning (ML) itself, the user experience of the product, and the database technology required to power them. We designed these pillars to create a one-shot learning experience for our users. We’ve put a ton of investment into designing a user experience that puts the users’ data front and center.
Our team has built a lot of database technology as well, particularly incremental execution, which I can go into later. In essence, incremental execution helps our system operate in an ongoing and reactive manner, and it’s something not many databases support.
Could you expound on what incremental execution is and how Impira utilizes it?
Incremental execution is the notion that, instead of just being able to ask a database a query in SQL (or whatever query language it’s using) at a specific moment in time, you can ask the database how the answer to that query has changed over time. For example, you might ask, “From this set of invoices, how much money does each of my clients owe?” Incremental execution lets you ask a slightly different question: “How have the invoices uploaded in the last day affected the amounts my clients owe?” Instead of doing work proportional to all the files in the system, the database only needs to look at the data related to the invoices that came in within that period. In the real-time case, it looks at bite-sized, or incremental, sets of data at a time, within seconds of a write happening in the system.
Viewed a little bit differently, it means you can treat any query as a stream of input events (e.g., changes to the set of invoices and clients) that gets translated into a stream of output events (e.g., changes to how much each client owes). We use this capability to power all kinds of features in our infrastructure, from query materialization to continuous import/export to client-side live updates and more.
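To make the invoice example concrete, here is a minimal Python sketch of the idea. It is an illustration only, not Impira's implementation and not IQL; the names `InvoiceEvent`, `full_query`, and `incremental_query` are hypothetical, and amounts are assumed to be whole numbers (say, cents).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class InvoiceEvent:
    """Hypothetical input event: an invoice amount added for a client.

    A negative delta models a correction or a removed invoice.
    """
    client: str
    delta: int


def full_query(invoices):
    """Answer "how much does each client owe?" from scratch.

    The work is proportional to every invoice in the system.
    """
    totals = defaultdict(int)
    for client, amount in invoices:
        totals[client] += amount
    return dict(totals)


def incremental_query(totals, events):
    """Apply only the new events to an existing answer.

    The work is proportional to the number of new events. Returns the
    output event stream as (client, old_total, new_total) tuples.
    """
    output = []
    for event in events:
        old = totals.get(event.client, 0)
        new = old + event.delta
        totals[event.client] = new
        output.append((event.client, old, new))
    return output


existing = [("acme", 100), ("globex", 50)]
totals = full_query(existing)

# One new invoice arrives; only that event is examined.
changes = incremental_query(totals, [InvoiceEvent("acme", 25)])
```

After the last line, `changes` describes how the answer moved, and `totals` matches what a full recomputation over all three invoices would return.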
This is hugely helpful in terms of machine learning at Impira. There’s a continual flow of user feedback in the form of labels being confirmed or corrected in the product, new fields being added or removed, and new properties of uploaded files being discovered in the background, and incremental execution is what lets the system react to each of those changes as it happens. It gives us a solid base for feeding the stream of different kinds of events into the machine learning model, which can then update itself on the fly and go reprocess previous predictions based on new learnings. I don’t think we’d have been able to build such a flexible, reactive system without it.
Now that we know what it is, what does implementing incremental execution look like in Impira?
As some of our users know, Impira has its own query language, IQL, with a query processing system that extends all the way down to the storage layer. So, implementing incremental execution was a matter of extending IQL to take any query and answer questions like the ones I described earlier.
One of the biggest challenges with incremental execution is supporting queries that have joins or group expressions in them. Projections and filters are relatively straightforward because those can be run on the event stream itself. For example, if you have events coming in and let’s say you only care about files with a particular file name, you can just filter that event stream directly to only see events on the files with that file name.
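As a rough illustration of why filters and projections are the easy case, the sketch below applies both directly to a stream of events, one event at a time, with no state kept between events. The event shape (dictionaries with `file_name` and `op` keys) and the function names are assumptions made for this example, not part of IQL.

```python
def filter_events(events, predicate):
    """Pass through only the events that satisfy the predicate."""
    for event in events:
        if predicate(event):
            yield event


def project_events(events, fields):
    """Keep only the requested fields of each event."""
    for event in events:
        yield {field: event[field] for field in fields}


incoming = [
    {"file_name": "invoice.pdf", "op": "insert", "size": 10},
    {"file_name": "receipt.pdf", "op": "insert", "size": 7},
    {"file_name": "invoice.pdf", "op": "delete", "size": 10},
]

# Only events on files named "invoice.pdf", reduced to two fields.
matching = filter_events(incoming, lambda e: e["file_name"] == "invoice.pdf")
result = list(project_events(matching, ["file_name", "op"]))
```

Because each event is handled independently, the filtered stream can be produced without looking at anything that came before it.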
However, if you’re doing something like grouping by file name and running some sort of aggregate, then the event stream has to be grouped as well. Those events get funneled to the right aggregate, and the aggregates need to know how to update themselves as those events come in.
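A small sketch of that routing, under assumed names: each event is funneled to the aggregate for its group, and the aggregate keeps enough state to update itself on both inserts and deletes. An average is used here because it shows why the state matters (the running sum and count are kept, not just the last result). The classes `AvgAggregate` and `GroupedAverage` and the event fields are hypothetical.

```python
from collections import defaultdict


class AvgAggregate:
    """Running average that can absorb inserts and deletes."""

    def __init__(self):
        self.total = 0
        self.count = 0

    def apply(self, value, sign):
        # sign is +1 for an insert and -1 for a delete.
        self.total += sign * value
        self.count += sign

    def result(self):
        return self.total / self.count if self.count else None


class GroupedAverage:
    """Average of `size` grouped by `file_name`, maintained per event."""

    def __init__(self):
        self.groups = defaultdict(AvgAggregate)

    def on_event(self, event):
        key = event["file_name"]
        aggregate = self.groups[key]  # route the event to its group
        sign = 1 if event["op"] == "insert" else -1
        aggregate.apply(event["size"], sign)
        # Output event: the group that changed and its new value.
        return key, aggregate.result()


grouped = GroupedAverage()
first = grouped.on_event({"op": "insert", "file_name": "a.pdf", "size": 10})
second = grouped.on_event({"op": "insert", "file_name": "a.pdf", "size": 30})
third = grouped.on_event({"op": "delete", "file_name": "a.pdf", "size": 10})
```

Each call touches exactly one group, so the cost of an update does not depend on how many other groups or rows exist.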
If you’re looking at joins, it gets even more complicated. Let’s say you have events coming in on the inside of some join that’s running. That means any time an event on the inside comes in, you need to figure out which rows on the outside it actually affects. In essence, you’re running the join in reverse.
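One common way to run a join "in reverse" is to keep an index from the join key back to the rows on the other side, so that an incoming event can look up the rows it affects. The sketch below assumes a join of invoices to clients on a client ID. The class, its method names, and the row shapes are all hypothetical, and this is not a description of how IQL does it internally.

```python
from collections import defaultdict


class IncrementalJoin:
    """Maintains invoices JOIN clients ON invoices.client_id = clients.id."""

    def __init__(self):
        # Reverse index: client_id -> {invoice_id: invoice row}.
        self.invoices_by_client = defaultdict(dict)
        self.clients = {}

    def on_invoice(self, invoice):
        """An invoice arrives; emit its joined row if the client is known."""
        self.invoices_by_client[invoice["client_id"]][invoice["id"]] = invoice
        client = self.clients.get(invoice["client_id"])
        if client is None:
            return []
        return [{**invoice, "client_name": client["name"]}]

    def on_client(self, client):
        """A client arrives or changes; find every invoice row it affects."""
        self.clients[client["id"]] = client
        affected = self.invoices_by_client.get(client["id"], {})
        return [{**inv, "client_name": client["name"]} for inv in affected.values()]


join = IncrementalJoin()
join.on_invoice({"id": 1, "client_id": "c1", "amount": 100})
join.on_client({"id": "c1", "name": "Acme"})
join.on_invoice({"id": 2, "client_id": "c1", "amount": 50})

# Renaming the client changes both joined invoice rows.
updated = join.on_client({"id": "c1", "name": "Acme Corp"})
```

The reverse index is what keeps the client update cheap: the work is proportional to the invoices for that one client rather than to the whole invoice table.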
We’ve integrated incremental execution into the query processing runtime of IQL itself, all the way down to the storage layer. So, any IQL query you can run, you can run incrementally.
What does the team dynamic look like when you’re all working through these sorts of problems and implementations?
It’s a very collaborative process. We like to view everything as an experiment that we can chip away at day to day, and we stay flexible about reprioritizing as we make progress.
When we started out, we had no idea if incremental execution was going to work. It started off as a bit of a pilot project where we built it into IQL. After some time, it became clear that this was not only going to work, but it was going to help a lot in other areas of the code base. It was pretty exciting to brainstorm all these different uses for it and start to imagine how useful a building block it would become, so we expanded our effort from there. It blossomed into a tool that provides a lot of value across the engineering team.
How is creativity fostered and how are new ideas explored at Impira?
I love to give people space to test their ideas out, run benchmarks, and write prototypes. At some point, we sit down to evaluate the findings and choose the best route forward.
All the engineers at Impira have their own affinity and focus within the team. Some people really love working on machine learning, others on UX or databases. But one thing I really love about the team at Impira is that pretty much everyone shifts across those boundaries from time to time, and continually looks for opportunities to do that. Most of our machine learning engineers have put effort into the infrastructure and the UX, and vice versa.
We have a daily meeting called the “War Room,” where we discuss and think through these ideas on the database side. If an idea has merit, or someone is really passionate about something, we’ll set aside time to explore it and see how it works out.