Parisa Safikhani and David Broneske, DZHW, Germany
Recent advancements in Automated Machine Learning (AutoML) have led to the emergence of Automated Natural Language Processing (AutoNLP), a subfield focused on automating NLP model development. Existing NLP toolkits provide various tools and modules but lack a free AutoNLP version. To this end, architecting the design decisions and tuning knobs of AutoNLP is still essential for enhancing performance in various industries and applications. Therefore, analyzing how different text representation methods affect the performance of AutoML systems is an essential starting point for investigating AutoNLP. In this paper, we present a comprehensive study on the performance of AutoPyTorch, an open-source AutoML framework with various text representation methods for binary text classification tasks. The novelty of our research lies in investigating the impact of different text representation methods on AutoPyTorch's performance, which is an essential step toward transforming AutoPyTorch to also support AutoNLP tasks. We conduct experiments on five diverse datasets to evaluate the performance of both contextual and noncontextual text representation methods, including one-hot encoding, BERT (base uncased), fine-tuned BERT, LSA, and a method with no explicit text representation. Our results reveal that, depending on the tasks, different text representation methods may be the most suitable for extracting features to build a model with AutoPyTorch. Furthermore, the results indicate that fine-tuned BERT models consistently outperform other text representation methods across all tasks. However, during the fine-tuning process, the fine-tuned model had the advantage of benefiting from labels. Hence, these findings support the notion that integrating fine-tuned models or a model fine-tuned on open source large dataset, including all binary text classification tasks as text representation methods in AutoPyTorch, is a reasonable step toward developing AutoPyTorch for NLP tasks.
Automated Machine Learning (AutoML), Automated Natural Language Processing (AutoNLP), Contextual- and non-contextual text representation, AutoPyTorch, Binary Text Classification, One-hot encoding, BERT, Fine-tuned BERT, Latent Semantic Analysis (LSA).