Intelligent Document Processing

Intelligent Document Processing (IDP) applies cognitive techniques based on Artificial Intelligence and Machine Learning to complement more traditional Robotic Process Automation (RPA). These techniques provide automation capabilities that go beyond the simple, routine and stable processes that RPA solutions streamline today, and create genuine additional business value for clients. SuccessData is at the forefront of innovation in the IDP space.

Extract complex relationships

In contrast to more traditional approaches focused on text only, SuccessData understands relations conveyed jointly via textual, structural, tabular, and even visual expressions. It uses deep-learning techniques to automatically capture the representation (in other words, the features) needed to learn how to extract those relationships from richly formatted data. We turn domain expertise and document understanding based on multiple modalities of information first into meaningful supervision signals, and then into predictive extraction results.

No hand-labelling

SuccessData changes the paradigm from labeling by hand to labeling automatically, making AI more broadly practical. We use programmatic supervision to build training sets with heuristic functions. This mitigates the key pain point of most ML implementations, as we need up to 100x less training data than traditional supervised machine learning solutions. The approach allows a fundamentally faster, more flexible, and higher-quality end-to-end ML development and deployment process.

Get more than raw data

SuccessData’s unique model retrieves not only predefined data points but also contextual information about each extracted data point, such as where it was found in the original document and a confidence level.
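As a rough illustration, one extracted data point and its context could be modelled as shown below. This is a minimal sketch and not SuccessData's actual schema: the class name `ExtractedPoint`, all of its field names, the `needs_review` helper and the 0.9 threshold are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ExtractedPoint:
    """One extracted value plus its context (all field names are hypothetical)."""
    name: str          # which predefined data point this is, e.g. "invoice_date"
    value: str         # the extracted value as text
    page: int          # 1-based page where the value was found
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) location on the page
    confidence: float  # confidence level between 0.0 and 1.0


def needs_review(point: ExtractedPoint, threshold: float = 0.9) -> bool:
    """Flag data points whose confidence falls below a chosen threshold."""
    return point.confidence < threshold


# Illustrative values only, not real extraction output.
points = [
    ExtractedPoint("invoice_date", "2021-03-15", 1, (72.0, 96.5, 140.0, 108.0), 0.98),
    ExtractedPoint("total_amount", "1,250.00", 2, (400.0, 610.0, 460.0, 622.0), 0.71),
]
to_review = [p.name for p in points if needs_review(p)]
print(to_review)
```

Keeping the location and the confidence level next to each value lets a downstream workflow route only the low-confidence data points to a human reviewer and point that reviewer to the right place in the document.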

Integrate your reference data

SuccessData exposes a set of APIs to facilitate the integration of your own reference data so that the output data can be enriched, cross-referenced and/or reconciled.
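To make the idea of reconciliation concrete, the sketch below matches an extracted entity name against a small reference list. It only illustrates the matching logic on the client side and does not use the SuccessData APIs, whose endpoints are not described here. The `REFERENCE` table, the `reconcile` function and the 0.8 similarity cutoff are all hypothetical.

```python
from difflib import get_close_matches

# Hypothetical reference data: canonical entity names mapped to internal IDs.
REFERENCE = {
    "Acme Corporation": "C-001",
    "Globex Industries": "C-002",
}


def reconcile(extracted_name, reference=REFERENCE, cutoff=0.8):
    """Return (canonical_name, reference_id) for the closest match, or None."""
    match = get_close_matches(extracted_name, list(reference), n=1, cutoff=cutoff)
    if not match:
        return None  # nothing in the reference data is similar enough
    return match[0], reference[match[0]]


print(reconcile("Acme Corporaton"))  # misspelt variant of a known entity
print(reconcile("Unknown Ltd"))      # no sufficiently close reference entry
```

A fuzzy match is used here because extracted names often differ slightly from the canonical spelling. An exact lookup or a domain-specific matching rule may be more appropriate depending on the reference data.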

How does SuccessData create a new extraction model?

Define the specific data points (name, date, entity, tables, etc.) that you need to retrieve

Train a Machine Learning model on a subset of the documents (text, PDFs, articles, web pages, etc.)

Once the ML model is ready and deployed, send the documents to our hosted infrastructure or process the documents locally

Retrieve a JSON/XML result file containing the extracted data points in a structured form via an API call

Behind the scenes

In a traditional supervised machine learning approach, the input data fed to the system has to be hand-labeled by subject-matter experts. The human-crafted labels help the machine learn to interpret and classify data, but this approach has three drawbacks. The cost of labeling training sets has become very significant, if not prohibitive in some cases. The task is extremely time-consuming, taking weeks or months of human work. And as applications and use cases shift, the relevance of existing training sets depreciates. SuccessData instead lets a team of subject-matter experts write functions that automatically assign labels to datasets.

A generative neural network then compares the labels that multiple functions generate for the same data and assigns each candidate label a probability of being true. That data and its probabilistic labels are then used to train a predictive model, instead of using hand-labeled data. The approach is known as “weak supervision”, in contrast to more traditional supervised machine learning techniques.
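The idea can be sketched in a few lines of code. The example below is a deliberately simplified illustration, not SuccessData's implementation: the three labeling functions and the labels "DATE" and "OTHER" are hypothetical, and an unweighted vote share stands in for the generative model, which would additionally estimate how accurate each function is.

```python
from collections import Counter

ABSTAIN = None

# Hypothetical labeling functions: each encodes one expert heuristic and
# returns a label ("DATE" or "OTHER") or abstains when it has no opinion.
def lf_has_two_slashes(text):
    return "DATE" if text.count("/") == 2 else ABSTAIN


def lf_digits_and_separators(text):
    return "DATE" if text and all(c.isdigit() or c in "/-." for c in text) else ABSTAIN


def lf_contains_letters(text):
    return "OTHER" if any(c.isalpha() for c in text) else ABSTAIN


LABELING_FUNCTIONS = [lf_has_two_slashes, lf_digits_and_separators, lf_contains_letters]


def probabilistic_label(text, lfs=LABELING_FUNCTIONS):
    """Combine labeling-function votes into label probabilities.

    Simplification: the share of votes per label replaces the learned
    generative model described in the text.
    """
    votes = Counter(v for v in (lf(text) for lf in lfs) if v is not ABSTAIN)
    total = sum(votes.values())
    if total == 0:
        return {}  # every function abstained: no training signal for this item
    return {label: count / total for label, count in votes.items()}


for candidate in ["12/03/2021", "Total due", "n/a/x"]:
    print(candidate, probabilistic_label(candidate))
```

When the functions agree, the label is confident; when they disagree, as for "n/a/x", the probability is split. These probabilistic labels, rather than hand-assigned ones, would then form the training set of the predictive model.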


Which languages does the solution handle?

Technically, our platform supports any language. By default, our models have been trained on languages written in the Latin alphabet (English of course, French, German, etc.).

Can you automatically process complex tables?

Absolutely - this is actually what the system has been designed for. We are able to understand the context of a table in a document (footnotes, units, currency, etc.) and to accurately recognise aggregations or sub-totals, for example.

Do you handle checkboxes?

Yes - we can recognise whether a checkbox is ticked or not and integrate the result in our processing.

Which output formats do you support?

We can produce JSON or XML as well as simpler CSV files.
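For readers who receive JSON but need a flat file, the sketch below converts a result into CSV with the Python standard library. The JSON layout shown (`document`, `data_points`, `name`, `value`, `confidence`) is an assumed example and may not match the actual result schema. `result_to_csv` is a hypothetical helper.

```python
import csv
import io
import json

# Hypothetical JSON result; the real schema may differ.
result_json = """
{
  "document": "invoice-001.pdf",
  "data_points": [
    {"name": "invoice_date", "value": "2021-03-15", "confidence": 0.98},
    {"name": "total_amount", "value": "1250.00", "confidence": 0.71}
  ]
}
"""


def result_to_csv(raw_json):
    """Flatten the data points of one result into CSV text, one row per point."""
    result = json.loads(raw_json)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["document", "name", "value", "confidence"],
        lineterminator="\n",
    )
    writer.writeheader()
    for point in result["data_points"]:
        # Repeat the document name on every row so each row stands alone.
        writer.writerow({"document": result["document"], **point})
    return buffer.getvalue()


csv_text = result_to_csv(result_json)
print(csv_text)
```

The nested JSON form suits richer structures such as tables, while the flattened CSV form is easier to load into a spreadsheet.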

Where can this be deployed?

The solution can either be hosted by us on a public cloud (currently AWS and Azure, with others in the pipeline; the code has no dependencies on any cloud provider’s infrastructure) or deployed “on-prem” (client’s datacenter or hybrid cloud). The choice depends on the client’s confidentiality and security constraints: publicly available documents can be processed in the cloud, while for confidential documents that cannot leave the client’s premises, we can deploy the solution locally.

Do you store documents and processing results?

We only store data during processing and for a short time afterwards so that you can access the results directly. We do not provide long-term storage, and we usually purge all data every 12 or 24 hours.

Can I train the models myself?

Not currently: the set of programmatic supervision techniques that we use is not client-facing yet. However, these techniques allow us to update, adjust and re-train models extremely quickly and efficiently.

Does the model learn by itself?

Yes - we capture every modification or adjustment made manually to the results and use this information to either improve our internal classifiers or re-train the model itself.

Discover how SuccessData automates your manual processes and integrates seamlessly with your existing workflows

SuccessData abstracts away the complexity of the actual extraction process and offers a scalable infrastructure that delivers speed and lower costs.