Analysis of DNA Sequence Classification Using Machine Learning Techniques

Madhushree Meghavath K B, Indrakumar K, Sharath H S, Mohammed A.S Al-Mohamadi, Vighnesh H Y, Chiranjeevi

Abstract

Introduction: DNA serves as the fundamental genetic blueprint of life. Extracting meaningful patterns and information from DNA sequences is essential for advancements in genomics and comparative biology.


Objectives: This study aims to classify and distinguish DNA sequences from three species (human, chimpanzee, and dog) using machine learning techniques, and to evaluate the effectiveness of those techniques in genomic sequence analysis.


Methods: Three datasets of DNA sequences from humans, chimpanzees, and dogs were used. Preprocessing included k-mer analysis and the CountVectorizer technique. Several machine learning algorithms (Naive Bayes, SVM, KNN, Decision Trees, and Random Forests) were employed, alongside a Convolutional Neural Network (CNN) for deep learning-based classification. Values of k from 5 to 8 were tested, with 6-mers yielding the best results.
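
The k-mer preprocessing step can be illustrated with a minimal sketch. It is not the authors' code: the function names `kmers` and `count_matrix` and the toy sequences are hypothetical, and a plain `collections.Counter` stands in for CountVectorizer so that the example needs only the Python standard library. Each sequence is split into overlapping k-mers ("words"), and the per-sequence word counts form the feature matrix that a classifier such as Naive Bayes would consume.

```python
from collections import Counter


def kmers(sequence, k=6):
    """Return all overlapping k-mers of a DNA sequence, lowercased."""
    seq = sequence.lower()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_matrix(sequences, k=6):
    """Build a bag-of-k-mers count matrix.

    Rows correspond to sequences, columns to the sorted k-mer vocabulary.
    This is a stand-in for CountVectorizer, not a reimplementation of it.
    """
    counts = [Counter(kmers(s, k)) for s in sequences]
    vocab = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocab] for c in counts]
    return vocab, matrix


# Toy sequences, made up for illustration only (not from the study's data).
toy = ["ATGCATGCA", "ATGCATTTT"]
vocab, X = count_matrix(toy, k=6)
```

Changing `k` here corresponds to the k-mer length that the study varied from 5 to 8; longer k-mers give a larger, sparser vocabulary.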


Results: Using 6-mers, Naive Bayes achieved accuracies of 80% for human DNA, 87% for chimpanzee DNA, and 68% for dog DNA. The CNN achieved 90.76% accuracy for human data, 83.64% for chimpanzee data, and 76.53% for dog data, improving on Naive Bayes for the human and dog datasets but not for the chimpanzee dataset.


Conclusions: Among the k-mer lengths tested, 6-mers gave the best classification accuracy across species. The CNN outperformed Naive Bayes on the human and dog datasets, though not on the chimpanzee dataset, demonstrating the potential of deep learning in genomic sequence analysis.
