Digitising Documents With AI

24 Feb 2020

One of the biggest challenges of digitising your organisation or business is deciding how to deal with old records and documents. You will most likely be required to keep the information in these documents for legal reasons, and you may want it stored on your new system to provide further insights. Although it would be ideal to add this historical information to your new system, it takes a massive amount of time and effort to go through the documents and manually capture everything they contain. Manual capture is also prone to human error, resulting in information being recorded incorrectly. To ease the process of capturing data from old records, we wanted to look into artificial intelligence solutions that could do the job faster and more accurately.

This led us to experiment with the Cognitive Services available on Azure, specifically the Form Recogniser service. This service lets us build a machine learning model to extract text, key-value pairs and tables from documents. The model is tailor-made for your document layout, allowing for fast and accurate data extraction. The service is still in Preview at the time of writing, but we were fortunate enough to be introduced to it at the Microsoft Ignite conference.

The Test Case

The first thing we need before we can begin using the service is a large number of documents containing valuable data. We decided to try extracting information from invoices, as they generally contain a lot of detailed information and are typically styled for the issuing organisation. We wrote a few short scripts to generate the details of twenty fake clients, each with its own contact person and company details, fake invoice information with randomised line items and rates, and finally the PDF versions of these fake invoices. In total we created 455 invoices, as we felt this would be enough to judge the ease of use of the service as well as its performance under a high data load. An example of the generated invoices can be seen in the image below.

An Example Invoice Used In This Experiment
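For readers who want to build a similar test set, here is a minimal sketch of how randomised invoice data could be generated. It is illustrative only and is not the scripts used for this experiment: the `LineItem` class, the `make_invoice` function, the hour and rate ranges and the last two service names are all assumptions, and the PDF rendering step is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class LineItem:
    description: str
    hours: int
    rate: float

    @property
    def total(self) -> float:
        return round(self.hours * self.rate, 2)


# The first three names appear in the example result at the end of this post;
# the last two are made up for illustration.
SERVICES = ["Security Update", "Deployment", "Development", "Consulting", "Support"]


def make_invoice(number: int, client: str, rng: random.Random) -> dict:
    """Build one fake invoice with randomised line items and rates."""
    chosen = rng.sample(SERVICES, k=rng.randint(1, len(SERVICES)))
    items = [
        LineItem(
            description=description,
            hours=rng.randint(1, 40),                  # assumed range of hours
            rate=float(rng.randrange(300, 1500, 50)),  # assumed range of rates
        )
        for description in chosen
    ]
    return {
        "invoice_no": f"#{number:06d}",
        "client": client,
        "items": items,
        "subtotal": round(sum(item.total for item in items), 2),
    }


rng = random.Random(42)  # fixed seed so the same data is generated on every run
# 455 invoices spread across twenty hypothetical clients.
invoices = [make_invoice(n, f"Client {n % 20 + 1}", rng) for n in range(1, 456)]
print(len(invoices), invoices[0]["invoice_no"])
```

Passing a seeded `random.Random` instance into the generator keeps the data set reproducible, which helps when comparing extraction results between runs.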

Training The Model

Now that we have our documents, it’s time to use the Form Recogniser service to build a model that extracts the information from them. To do this, we need to upload some training data to an Azure Storage Account for the Form Recogniser service to access. Amazingly, the service doesn’t require a lot of training data: only five documents are required to produce an accurate model. Once our data is uploaded, we simply send a request to our Form Recogniser’s API, and the model begins training on our documents. This produces a unique model which is specifically built to cater to the layout of the documents we uploaded. The service allows us to create several different models, each identified by a random key which is produced after training is completed.
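As a rough illustration of the training request, the sketch below prepares the HTTP call without sending it. The URL path, the `Ocp-Apim-Subscription-Key` header, the `source` body field and the `modelId` response field reflect our understanding of the v1.0 preview REST API and may differ in other versions. The endpoint, key and SAS URL are placeholders, and `build_train_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Placeholder values: replace with your own resource endpoint, key and
# the SAS URL of the storage container that holds the training documents.
ENDPOINT = "https://example-resource.cognitiveservices.azure.com"
API_KEY = "your-subscription-key"
TRAINING_DATA_SAS_URL = "https://example.blob.core.windows.net/training?sas-token"


def build_train_request(endpoint: str, api_key: str, source_url: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that starts model training."""
    # Assumed path for the v1.0 preview of the service.
    url = f"{endpoint.rstrip('/')}/formrecognizer/v1.0-preview/custom/train"
    body = json.dumps({"source": source_url}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": api_key,
        },
    )


request = build_train_request(ENDPOINT, API_KEY, TRAINING_DATA_SAS_URL)
print(request.get_method(), request.full_url)

# Sending is left commented out because it needs a live Azure resource:
# with urllib.request.urlopen(request) as response:
#     model_id = json.load(response)["modelId"]  # assumed response field
```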

Extracting Our Data

It is finally time to see the Form Recogniser service in action. We use the model key, produced when training our model, to request that the service extract all the information it can from a document. The document is sent to the Azure service, which returns all the data it was able to obtain. For our invoices, the service returns a list of key-value pairs as well as the data contained inside the table. From our document, the service returned:

Verbatic’s details.
The invoice number, invoice date, etc.
The client’s billing details.
The project details.
The various totals presented at the bottom right of the page.
All information regarding the table, such as column headings and cell values.
For those who are interested, we included an example of the results we received from the service at the end of this post.
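The extraction request can be sketched in the same way as the training request. As before, this is a hedged illustration: the URL path and header name are assumptions based on the v1.0 preview REST API, the endpoint, key and model ID are placeholders, and `build_analyze_request` is a hypothetical helper name. The request is only prepared here, not sent.

```python
import urllib.request


def build_analyze_request(
    endpoint: str, api_key: str, model_id: str, pdf_bytes: bytes
) -> urllib.request.Request:
    """Prepare (but do not send) a request asking a trained model to analyse one PDF."""
    # Assumed path for the v1.0 preview of the service.
    url = (
        f"{endpoint.rstrip('/')}/formrecognizer/v1.0-preview"
        f"/custom/models/{model_id}/analyze"
    )
    return urllib.request.Request(
        url,
        data=pdf_bytes,  # the raw document is sent as the request body
        method="POST",
        headers={
            "Content-Type": "application/pdf",
            "Ocp-Apim-Subscription-Key": api_key,
        },
    )


# Placeholder values for illustration only.
pdf_bytes = b"%PDF-1.4 placeholder content"
request = build_analyze_request(
    "https://example-resource.cognitiveservices.azure.com",
    "your-subscription-key",
    "00000000-0000-0000-0000-000000000000",
    pdf_bytes,
)
print(request.get_method(), request.full_url)

# With a live resource, the JSON result could then be read like this:
# import json
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```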

Now, we must admit that the service is not perfect. It made one mistake when analysing our documents: it treated the first line of our company’s address, “14 Panfluit Street, Unit G409”, as the identifier for the rest of our company’s information below it. This is not a major issue, however, as we can do some data manipulation to correct the mistake.
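One possible way to do that correction is sketched below. It assumes the set of genuine labels on the form is known in advance, so any pair whose key is not a known label has its key text moved back into the values. The `repair_pair` helper, the `COMPANY DETAILS` fallback label and the placeholder value are hypothetical.

```python
# Labels that appear in the example result at the end of this post.
KNOWN_LABELS = {"BILL TO", "Invoice No:"}


def repair_pair(pair: dict, known_labels: set, fallback_key: str) -> dict:
    """Move a wrongly detected key back into the values and relabel the pair."""
    key_text = " ".join(part["text"] for part in pair["key"])
    if key_text in known_labels:
        return pair  # the key is a genuine label, nothing to fix
    return {
        "key": [{"text": fallback_key}],
        "value": [{"text": key_text}, *pair["value"]],
    }


# The first address line was returned as the key; the value is a placeholder.
misread = {
    "key": [{"text": "14 Panfluit Street, Unit G409"}],
    "value": [{"text": "(remaining company details)"}],
}
fixed = repair_pair(misread, KNOWN_LABELS, fallback_key="COMPANY DETAILS")
print(fixed["key"][0]["text"], "|", fixed["value"][0]["text"])
```

A whitelist of labels is a simple choice for a fixed invoice layout; documents with many varying labels would probably need a different rule.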

To finish off this experiment, we built a small data pipeline in Python, which sends each file to our Form Recogniser model and saves the results in a MySQL database. The data can now be used by any reporting or dashboarding tool to extract valuable insights into the organisation’s performance over time.
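The overall shape of such a pipeline might look like the sketch below. To keep it self-contained and runnable, it swaps in the standard-library `sqlite3` module for MySQL and a stub function for the call to the service. The table name, column names, `run_pipeline` and `fake_analyse` are all hypothetical and are not taken from the pipeline we built.

```python
import json
import sqlite3
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def run_pipeline(
    pdf_paths: Iterable[Path],
    analyse: Callable[[bytes], dict],
    connection: sqlite3.Connection,
) -> int:
    """Analyse each file and store the raw JSON result, keyed by file name."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS invoice_results ("
        "file_name TEXT PRIMARY KEY, result_json TEXT NOT NULL)"
    )
    saved = 0
    for path in pdf_paths:
        result = analyse(path.read_bytes())
        connection.execute(
            "INSERT OR REPLACE INTO invoice_results VALUES (?, ?)",
            (path.name, json.dumps(result)),
        )
        saved += 1
    connection.commit()
    return saved


def fake_analyse(pdf_bytes: bytes) -> dict:
    """Stand-in for the real service call, so the sketch runs offline."""
    return {"keyValuePairs": [], "tables": [], "bytes_received": len(pdf_bytes)}


with tempfile.TemporaryDirectory() as folder:
    # Create three placeholder files in place of real invoices.
    for n in range(3):
        Path(folder, f"invoice_{n}.pdf").write_bytes(b"%PDF-1.4 placeholder")
    connection = sqlite3.connect(":memory:")
    saved = run_pipeline(sorted(Path(folder).glob("*.pdf")), fake_analyse, connection)
    row_count = connection.execute("SELECT COUNT(*) FROM invoice_results").fetchone()[0]

print(saved, row_count)
```

Passing the analysis step in as a function keeps the storage logic independent of the service, so the stub can be replaced by a real request without touching the database code.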

Final Thoughts

The Form Recogniser is an amazing tool for extracting data from “difficult to handle” file types, allowing us to unlock a new wealth of information. The service was quick to set up and use, allowing us to set up this experiment in a matter of hours. In fact, the hardest part of this exercise was creating all the fake PDFs we needed.

We did not come close to utilising the full power of this service in the Azure ecosystem. For example, by combining this service with some of the other tools Azure has to offer, we could fully automate this process so that it runs whenever a new file is stored in the cloud. This means you could digitise your business without too much disruption to your current workflow.

We look forward to working with our clients on this technology, helping them along their digital transformation journey and getting the most out of their historical data.

Example Result

For the fellow developers reading this, we have included a slimmed-down version of the output produced by the service. The model returns the result in JSON format, with the data split into key-value pairs and table objects. We have removed some of the properties, such as the bounding boxes for each data point as well as the model’s confidence scores, just to reduce the size of the object for display purposes.

{
  "keyValuePairs": [
    {
      "key": [{
        "text": "BILL TO"
      }],
      "value": [
        {
          "text": "Stephanie Mtshali"
        },
        {
          "text": "Ochse CC"
        },
        {
          "text": "31689 Dlamini Lodge"
        },
        {
          "text": "Sonyastad"
        },
        {
          "text": "Eastern Cape, 8318"
        },
        {
          "text": "+27 22 766 3730"
        },
        {
          "text": "2013/134274/87"
        }
      ]
    },
    {
      "key": [{
        "text": "Invoice No:"
      }],
      "value": [{
        "text": "#001062"
      }]
    },
    // Remaining Key-Value Pairs
  ],
  "tables": [{
    "columns": [
      {
        "header": [{
          "text": "DESCRIPTION"
        }],
        "entries": [
          [{
            "text": "Security Update"
          }],
          [{
            "text": "Deployment"
          }],
          [{
            "text": "Development"
          }]
        ]
      },
      // Remaining Table Columns
    ]
  }]
}
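To make a result of this shape easier to work with, it can be flattened into plain dictionaries and table rows. The sketch below assumes the structure shown above (`keyValuePairs`, then `tables` with `columns`, `header` and `entries`). The helper names are hypothetical, and the sample is a trimmed copy of the values shown in this post.

```python
def join_text(parts: list) -> str:
    """Join the text fragments of one key, header or cell."""
    return " ".join(part["text"] for part in parts)


def flatten_key_values(result: dict) -> dict:
    """Map each key's text to the list of its value lines."""
    return {
        join_text(pair["key"]): [value["text"] for value in pair["value"]]
        for pair in result["keyValuePairs"]
    }


def table_rows(table: dict) -> list:
    """Turn the column-oriented table into a list of row dictionaries."""
    headers = [join_text(column["header"]) for column in table["columns"]]
    columns = [
        [join_text(cell) for cell in column["entries"]]
        for column in table["columns"]
    ]
    return [dict(zip(headers, row)) for row in zip(*columns)]


# Trimmed sample using values from the example result above.
sample = {
    "keyValuePairs": [
        {"key": [{"text": "Invoice No:"}], "value": [{"text": "#001062"}]},
    ],
    "tables": [{
        "columns": [{
            "header": [{"text": "DESCRIPTION"}],
            "entries": [
                [{"text": "Security Update"}],
                [{"text": "Deployment"}],
                [{"text": "Development"}],
            ],
        }],
    }],
}

fields = flatten_key_values(sample)
rows = table_rows(sample["tables"][0])
print(fields)
print(rows)
```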