Building and Validating a Silver-Standard Bilingual Dataset for Multilabel Software Requirements Classification

Volume 15, Number 1/2

Authors

Truong Dac Dien and Nguyen Thi Xuan Huong, University of Information Technology – VNUHCM, Vietnam

Abstract

This paper presents a reproducible workflow for building and validating a silver-standard bilingual dataset for multilabel software requirements classification. The corpus contains 8,832 aligned English-Vietnamese requirement pairs with 12 labels, while a separate gold benchmark contains 622 aligned pairs. A blind audit of 500 silver instances gives a macro-averaged Cohen's kappa of 0.7427, indicating generally reliable annotations, with lower agreement for broader labels such as Look & Feel and Operability. Classical and transformer-based models are trained on the silver corpus, with threshold tuning performed only on silver validation folds. On the English gold benchmark, RoBERTa-base obtains the best micro-F1 without tuning (0.7766 ± 0.0040). On Vietnamese gold, PhoBERT-base-v2 performs best after tuning (0.7433 ± 0.0091). Threshold tuning benefits classical models more than transformers. Overall, silver data supports scalable training, but gold data remains necessary for reliable comparison.
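
As an illustration of the audit statistic mentioned above, the following is a minimal sketch of how a macro-averaged Cohen's kappa could be computed for multilabel annotations: kappa is computed separately for each label as a binary agreement problem, and the per-label values are averaged. The function names, the 0/1 matrix layout, and the convention for constant ratings are assumptions made for this example; the abstract does not specify the authors' implementation.

```python
from typing import Sequence


def cohen_kappa_binary(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two binary (0/1) ratings of the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items on which both ratings coincide.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal rate of assigning the label.
    pa = sum(a) / n
    pb = sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        # Both ratings are constant and identical; kappa is undefined here.
        # Returning 1.0 is an assumed convention, not taken from the paper.
        return 1.0
    return (observed - expected) / (1 - expected)


def macro_kappa(silver: Sequence[Sequence[int]],
                audit: Sequence[Sequence[int]]) -> float:
    """Average the per-label kappa over all label columns."""
    n_labels = len(silver[0])
    kappas = [
        cohen_kappa_binary([row[j] for row in silver],
                           [row[j] for row in audit])
        for j in range(n_labels)
    ]
    return sum(kappas) / n_labels


# Toy data: 4 instances, 2 labels (rows are instances, columns are labels).
silver_labels = [[1, 0], [1, 1], [0, 0], [0, 1]]
audit_labels = [[1, 0], [0, 1], [0, 0], [0, 1]]
result = macro_kappa(silver_labels, audit_labels)
```

Because every label contributes equally to the macro average, a few broad labels with low agreement can pull the overall value down even when most labels agree well, which is why per-label values are worth inspecting alongside the average.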

Keywords

Requirements engineering, multilabel classification, silver-standard dataset, bilingual dataset, threshold calibration.