Chen Lin, Piush Kumar Singh, Yourong Xu, Eitan Lees, Rachna Saxena, Sasidhar Donaparthi and Hui Su, Fidelity Investments, USA
In this paper, we propose using domain adaptation to improve the generalizability and performance of LayoutLM, a pre-trained language model that incorporates the layout information of document images. Our approach uses topic modelling to automatically discover the underlying domains in a document image dataset for which domain information is unavailable. We evaluate our approach on the challenging RVL-CDIP dataset and demonstrate that it significantly improves the performance of LayoutLM. The approach can also be applied to other NLP models to improve their generalization, making them more applicable in real-world scenarios where data is often collected from a variety of domains.
LayoutLM, Domain Adaptation, Automatic Domain Discovery, Topic Modelling, RVL-CDIP