Academy & Industry Research Collaboration Center (AIRCC)

Volume 12, Number 12, July 2022

Learning Structured Information from Small Datasets of Heterogeneous Unstructured Multipage Invoices


David Emmanuel Katz1, Christophe Guyeux2, Ariel Haimovici1, Bastian Silva1, Lionel Chamorro1, Raul Barriga Rubio1 and Mahuna Akplogan1, 2Université de Bourgogne Franche-Comté, France


We propose an end-to-end approach that uses graph construction and semantic representation learning to extract structured information from heterogeneous, semi-structured, and high-noise human-readable documents. Our system first converts each PDF document into a single connected graph in which every token on the page is a node and the edges are weighted by the inverse Euclidean distance between tokens. Token, line, and individual character nodes are augmented with dense text-model vectors. We then represent each node as a vector using a tailored GraphSAGE algorithm, and these vectors are used downstream by a simple feedforward network. Using our approach, we achieve state-of-the-art results when benchmarked on our dataset of 205 PDF invoices. Along with commonly published metrics, we introduce a highly punitive yet application-specific and informative metric that we use to further measure the performance of our model.
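As a rough illustration of the graph construction step described above, the following minimal sketch builds a fully connected token graph with inverse-Euclidean-distance edge weights and then performs one neighbour-aggregation pass in the general spirit of GraphSAGE. It is not the authors' implementation: the function names `build_token_graph` and `aggregate_neighbours`, the smoothing constant `eps`, the toy coordinates and feature vectors, and the choice of a weighted mean with concatenation are all assumptions made for the example, and the learned weight matrices of a real GraphSAGE layer are omitted.

```python
import math


def build_token_graph(centers, eps=1e-6):
    """Return a dense weight matrix with w[i][j] = 1 / (distance(i, j) + eps).

    `centers` holds one (x, y) page coordinate per token. The diagonal stays
    at 0.0 so that a node is not treated as its own neighbour. `eps` is an
    assumed guard against division by zero for overlapping tokens.
    """
    n = len(centers)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = 1.0 / (math.dist(centers[i], centers[j]) + eps)
            weights[i][j] = weights[j][i] = w  # undirected edge
    return weights


def aggregate_neighbours(features, weights):
    """Concatenate each node vector with the weighted mean of its neighbours.

    This mirrors only the aggregation idea of GraphSAGE; closer tokens
    contribute more because their edge weights are larger.
    """
    out = []
    for i, own in enumerate(features):
        total = sum(weights[i])
        mean = [0.0] * len(own)
        if total > 0.0:
            for j, other in enumerate(features):
                share = weights[i][j] / total
                for k, value in enumerate(other):
                    mean[k] += share * value
        out.append(list(own) + mean)
    return out


# Three hypothetical tokens with made-up positions and 2-d text vectors.
centers = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = build_token_graph(centers)
node_inputs = aggregate_neighbours(features, weights)
```

In this sketch each row of `node_inputs` could serve as the input to a downstream classifier such as the feedforward network mentioned in the abstract; a dense matrix is used only because the example is tiny, and a sparse or thresholded graph may be preferable for full pages.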