Academy & Industry Research Collaboration Center (AIRCC)

Volume 11, Number 12, August 2021

Leveraging of Weighted Ensemble Technique for Identifying Medical Concepts
from Clinical Texts at Word and Phrase Level

  Authors

Dipankar Das and Krishna Sharma, Jadavpur University, India

  Abstract

Concept identification from medical texts becomes important due to digitization. However, it is not always feasible to identify all such medical concepts manually. Thus, in the present attempt, we have applied five machine learning classifiers (Support Vector Machine, K-Nearest Neighbours, Logistic Regression, Random Forest and Naïve Bayes) and one deep learning classifier (Long Short Term Memory) to identify medical concepts by training a total of 27.383K sentences. In addition, we have also developed a rule based phrase identification module to help the existing classifiers for identifying multi- word medical concepts. We have employed word2vec technique for feature extraction and PCA and T- SNE for conducting ablation study over various features to select important ones. Finally, we have adopted two different ensemble approaches, stacking and weighted sum to improve the performance of the individual classifier and significant improvements were observed with respect to each of the classifiers. It has been observed that phrase identification module plays an important role when dealing with individual classifier in identifying higher order ngram medical concepts. Finally, the ensemble approach enhances the results over SVM that was showing initial improvement even after the application of phrase based module.

  Keywords

Medical Concepts, Phrase Identification, Ensemble, Machine Learning.