Volume 15, Number 2/3

A Comparative Analysis of Data Mining Methods and Hierarchical Linear Modeling Using PISA 2018 Data


Wenting Weng (Johns Hopkins University, USA) and Wen Luo (Texas A&M University, USA)


Educational research often encounters clustered datasets, in which lower-level units (individuals) are nested within higher-level units (clusters). However, many studies in education apply tree-based methods such as Random Forest without considering this hierarchical structure. Neglecting the clustered structure can produce biased or inaccurate results. To address this issue, this study conducted a comprehensive survey of three tree-based data mining algorithms and hierarchical linear modeling (HLM). The study used the Programme for International Student Assessment (PISA) 2018 data to compare non-mixed-effects tree models (e.g., Random Forest), mixed-effects tree models (e.g., the random effects expectation-maximization recursive partitioning method and mixed-effects Random Forest), and the HLM approach. In this study, mixed-effects Random Forest demonstrated the highest prediction accuracy, while the random effects expectation-maximization recursive partitioning method had the lowest. However, tree-based methods limit deep interpretation of the results, so further analysis is needed to gain a more comprehensive understanding. In comparison, the HLM approach retains its value in terms of interpretability. Overall, this study offers insights for selecting and applying suitable methods when analyzing clustered educational datasets.


Data Mining, Clustered Data, Mixed-effects, Random Forest, HLM, Hierarchical Linear Modeling, PISA