Easy and fast extraction,
just 4 steps away

How it works

Define the specific data points (names, dates, entities, tables, etc.) that you need to retrieve

Train an ML model on a subset of the documents (text, PDFs, articles, web pages, etc.)

Once the ML model is ready and deployed, send the documents to our hosted infrastructure or process them locally

Retrieve a JSON/XML result file containing the extracted data points in a structured form via an API call

Behind the scenes

In traditional supervised machine learning, the input data fed to the system has to be hand-labeled by subject-matter experts. These human-crafted labels help the machine learn to interpret and classify data, but they come with three problems. The cost of labeling training sets has become very significant, if not prohibitive in some cases. The task is extremely time-consuming, taking weeks or months of human work. And as applications and use cases shift, the training sets lose their relevance. SuccessData instead lets a team of subject-matter experts write functions that automatically assign labels to datasets.
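To make the idea of label-assigning functions concrete, here is a minimal sketch in Python. It is not SuccessData's actual interface: the function names, the label constants and the date-detection task are all assumptions chosen for illustration. Each function either votes for a label or abstains when its rule does not apply.

```python
import re

# Hypothetical label values for a toy task: "does this snippet contain a date?"
ABSTAIN, NOT_DATE, DATE = -1, 0, 1

# Matches dates written like "March 3, 2020".
MONTH_DATE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
)

def lf_iso_date(text):
    """Vote DATE when the text contains an ISO-style date such as 2021-03-15."""
    return DATE if re.search(r"\b\d{4}-\d{2}-\d{2}\b", text) else ABSTAIN

def lf_month_name(text):
    """Vote DATE when the text contains a date such as 'March 3, 2020'."""
    return DATE if MONTH_DATE.search(text) else ABSTAIN

def lf_no_digits(text):
    """Vote NOT_DATE when the text contains no digits at all."""
    return NOT_DATE if not re.search(r"\d", text) else ABSTAIN

LABELING_FUNCTIONS = [lf_iso_date, lf_month_name, lf_no_digits]

def apply_labeling_functions(texts):
    """Return one row of votes per text, one column per labeling function."""
    return [[lf(text) for lf in LABELING_FUNCTIONS] for text in texts]
```

Each rule is cheap to write and individually imperfect, which is the point: experts encode their knowledge once as functions instead of labeling every document by hand.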

A generative neural network then compares the labels that the different functions generate for the same data and assigns each candidate label a probability of being correct. That data and its probabilistic labels are then used to train a predictive model, instead of using hand-labeled data. The approach is known as “weak supervision”, in contrast to more traditional supervised machine learning techniques.
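The step from conflicting votes to probabilistic labels can be illustrated with a much simpler stand-in for the generative network described above: an accuracy-weighted vote. In this sketch the accuracy of each function is assumed to be known and the functions are assumed to be independent, whereas a real label model would estimate such quantities from the agreements and disagreements between functions. The name `probabilistic_label` and the convention that -1 means "abstain" are hypothetical.

```python
import math

ABSTAIN = -1  # assumed convention: a function that does not vote returns -1

def probabilistic_label(votes, accuracies, num_classes=2):
    """Turn one row of labeling-function votes into class probabilities.

    votes:      one vote per function (a class index, or ABSTAIN)
    accuracies: assumed accuracy of each function, strictly between 0 and 1
    """
    # Start from a uniform prior, working in log space for numerical stability.
    log_scores = [0.0] * num_classes
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue  # an abstaining function carries no evidence
        for k in range(num_classes):
            # The function is right with probability acc; the remaining
            # probability is spread evenly over the other classes.
            p = acc if k == vote else (1.0 - acc) / (num_classes - 1)
            log_scores[k] += math.log(p)
    # Normalize the scores into probabilities (softmax).
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]
```

When every function abstains, the result stays at the uniform prior; when functions disagree, the more accurate ones pull the probability toward their vote. A downstream model can then be trained on these soft labels rather than on hard, hand-assigned ones.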

SuccessData lets you control where your data is processed – either locally in containers, so your data stays private, or in the cloud for publicly available data if that is more convenient.

SuccessData achieves better-than-human extraction

SuccessData attains outstanding precision and quality while scaling up to very large numbers of documents. We leverage the latest advances in ML models to extract complex document-level information expressed not only as free text, but also in tables or in visually distinctive layouts. The product is designed as an integrated workflow, from data collection to the production of structured results, accessible via simple REST API calls.