The Implementation of Smart NLP-ML Educational Theoretical Content Grading System
Abstract
Introduction: The education landscape is evolving rapidly with technological advancements, prompting a paradigm shift in assessment practices. Key challenges in the existing literature include the absence of a unified approach to seamlessly integrate features from ML and NLP, and a lack of empirical evaluations comparing the performance of different feature sets.
Objectives: The study focused on the design of a hybrid grading model that combines topic-based and word-embedding-based features for the development of an NLP-ML theoretical content grading system, coined “SMART-CGS”.
Methods: Primary data were obtained from an undergraduate lecturer and consist of students’ responses to an open-ended question in the computer science domain. The dataset was preprocessed with the Natural Language Toolkit (NLTK); Bidirectional Encoder Representations from Transformers (BERT) was used for word embedding, while topic modelling was done with Latent Dirichlet Allocation (LDA). XLNet, a generalized autoregressive pretrained transformer, was used to benchmark BERT. The Random Forest (RF) algorithm was used for grade prediction. The entire implementation was done in Python.
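A minimal sketch of the hybrid feature pipeline described above, using scikit-learn's LDA and Random Forest implementations. The answers, grades, and embedding vectors here are hypothetical stand-ins: in the actual system the embeddings would come from a BERT (or XLNet) encoder, typically with dimension 768.

```python
# Hypothetical sketch of the SMART-CGS hybrid grading pipeline:
# topic-based features (LDA) + word-embedding features -> Random Forest.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in data: short student answers and lecturer-assigned grades.
answers = [
    "A stack is a LIFO data structure",
    "Stacks follow last in first out ordering",
    "A queue is a FIFO data structure",
    "Queues process elements first in first out",
]
grades = [2, 2, 1, 1]

# Topic-based features: LDA over a bag-of-words representation.
bow = CountVectorizer().fit_transform(answers)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(bow)

# Word-embedding features: random stand-in vectors; in the real system
# these would be BERT or XLNet sentence embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(answers), 8))

# Hybrid features: concatenate topic distribution and embedding per answer.
features = np.hstack([topics, embeddings])

# Random Forest grade prediction.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(features, grades)
predicted = rf.predict(features)
```

The key design point is the feature concatenation step: each answer is represented by its topic distribution plus its embedding, so the RF model can draw on both coarse thematic coverage and fine-grained semantic similarity.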
Results: The RF model's predictions generally show a trend of underestimating the actual grades, suggesting a conservative approach in its evaluations. Precision, recall, F1 score, and accuracy all range around 0.92 for BERT and 0.94 for XLNet, indicating a relatively high level of overall model performance. While there is a degree of alignment between the actual and predicted grades, the XLNet model performs better than BERT. On scalability, the BERT- and XLNet-based models took 47 and 78 seconds respectively when tested over batches of students’ responses, showing a higher throughput for BERT over XLNet at a batch size of 500.
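The aggregate metrics reported above can be computed from a set of actual and predicted grades as follows; the grade values here are hypothetical, chosen to illustrate the conservative (downward-biased) prediction pattern, not taken from the study's data.

```python
# Computing precision, recall, F1, and accuracy for predicted grades
# (hypothetical grade lists, weighted-averaged across grade classes).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

actual    = [3, 2, 2, 1, 3, 1, 2, 3]
predicted = [3, 2, 1, 1, 2, 1, 2, 3]  # two answers graded one level too low

precision, recall, f1, _ = precision_recall_fscore_support(
    actual, predicted, average="weighted", zero_division=0
)
accuracy = accuracy_score(actual, predicted)
print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

With weighted averaging, recall equals accuracy when every sample receives exactly one prediction, which is why the two figures track each other closely in multi-class grading setups like this one.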
Conclusions: The observed conservative grading pattern, potentially influenced by the dominance of lower grades in the dataset, suggests the need for ongoing refinement in feature engineering, which prompted the use of XLNet and yielded a better result (0.94). Given these results, the choice of model depends on the speed-accuracy trade-off between the two models.