Learning Structured Information from Small Datasets of Heterogeneous Unstructured Multipage Invoices

David Emmanuel Katz; Christophe Guyeux; Ariel Haimovici; Bastian Silva; Lionel Chamorro; Raul Barriga Rubio; Mahuna Akplogan; smartlayers.io

doi:10.5121/csit.2022.121220

Volume 12, Number 12, July 2022

Authors

David Emmanuel Katz¹, Christophe Guyeux², Ariel Haimovici¹, Bastian Silva¹, Lionel Chamorro¹, Raul Barriga Rubio¹ and Mahuna Akplogan¹, ¹smartlayers.io, ²Université de Bourgogne Franche-Comté, France

Abstract

We propose an end-to-end approach that uses graph construction and semantic representation learning to extract structured information from heterogeneous, semi-structured, high-noise human-readable documents. Our system first converts each PDF document into a single connected graph in which every token on the page is a node and the edges are weighted by the inverse Euclidean distances between tokens. Token, line, and individual character nodes are augmented with dense text model vectors. We then represent each node as a vector using a tailored GraphSAGE algorithm; these vectors are used downstream by a simple feedforward network. Using this approach, we achieve state-of-the-art results when benchmarked on our dataset of 205 PDF invoices. Along with commonly published metrics, we introduce a highly punitive yet informative application-specific metric that we use to further measure the performance of our model.
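
As a rough illustration of the graph construction described in the abstract, the following minimal Python sketch links every pair of tokens with a weight equal to the inverse Euclidean distance between them, then performs one weighted-mean neighbour aggregation in the spirit of GraphSAGE. It is not the authors' tailored algorithm: the token coordinates, the `eps` smoothing term, the toy feature vectors, and the function names `inverse_distance_edges` and `aggregate_neighbours` are all assumptions made for this example.

```python
import math

# Hypothetical tokens: (text, x, y), where (x, y) is the token's position on the page.
TOKENS = [("Invoice", 0.0, 0.0), ("Total", 3.0, 4.0), ("42.00", 6.0, 8.0)]


def inverse_distance_edges(tokens, eps=1e-6):
    """Return {(i, j): weight} for every token pair i < j.

    The weight is 1 / (Euclidean distance + eps); eps is an assumed
    smoothing term that avoids division by zero for overlapping tokens.
    """
    edges = {}
    for i, (_, xi, yi) in enumerate(tokens):
        for j, (_, xj, yj) in enumerate(tokens):
            if i < j:
                dist = math.hypot(xi - xj, yi - yj)
                edges[(i, j)] = 1.0 / (dist + eps)
    return edges


def aggregate_neighbours(features, edges):
    """One aggregation step: concatenate each node's own feature vector
    with the edge-weighted mean of its neighbours' feature vectors."""
    n = len(features)
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(n)]
    totals = [0.0] * n
    for (i, j), w in edges.items():
        # Edges are undirected, so each one contributes in both directions.
        for node, neighbour in ((i, j), (j, i)):
            totals[node] += w
            for k in range(dim):
                sums[node][k] += w * features[neighbour][k]
    out = []
    for v in range(n):
        mean = [s / totals[v] if totals[v] else 0.0 for s in sums[v]]
        out.append(list(features[v]) + mean)
    return out


edges = inverse_distance_edges(TOKENS)
# Toy one-dimensional features standing in for dense text model vectors.
embeddings = aggregate_neighbours([[1.0], [0.0], [0.0]], edges)
```

In this sketch, closer tokens receive larger edge weights and therefore contribute more to each other's aggregated representation. A full system would replace the toy features with text embeddings and learn the aggregation weights, which this example does not attempt.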
