Analysis of DNA Sequence Classification Using Machine Learning Techniques
Abstract
Introduction: DNA serves as the fundamental genetic blueprint of life. Extracting meaningful patterns and information from DNA sequences is essential for advancements in genomics and comparative biology.
Objectives: This study aims to classify and distinguish DNA sequences from three species (human, chimpanzee, and dog) using machine learning techniques, and to evaluate the effectiveness of those techniques in genomic sequence analysis.
Methods: Three datasets consisting of DNA sequences from humans, chimpanzees, and dogs were used. Preprocessing included k-mer analysis and the CountVectorizer technique. Several machine learning algorithms (Naive Bayes, SVM, KNN, Decision Trees, and Random Forests) were employed, alongside a Convolutional Neural Network (CNN) for deep learning-based classification. Values of k from 5 to 8 were tested, with 6-mers yielding the best results.
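As an illustration of the k-mer preprocessing step, the following is a minimal sketch using only the Python standard library. It splits each sequence into overlapping 6-mers and builds a count matrix, which is roughly what a count vectorizer produces from k-mer "words". The function names (kmers, count_matrix), the lowercasing, the overlapping window, and the toy sequences are all assumptions made for this example and are not taken from the study.

```python
from collections import Counter


def kmers(sequence, k=6):
    """Return all overlapping k-mers of a DNA sequence, lowercased.

    Overlapping windows and lowercasing are assumptions of this sketch.
    """
    seq = sequence.lower()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_matrix(sequences, k=6):
    """Build a bag-of-k-mers count matrix with one row per sequence.

    Returns the sorted k-mer vocabulary and the rows of counts,
    where each column corresponds to one vocabulary entry.
    """
    counts = [Counter(kmers(s, k)) for s in sequences]
    vocab = sorted(set().union(*counts))
    rows = [[c[kmer] for kmer in vocab] for c in counts]
    return vocab, rows


# Toy sequences invented for illustration only; not data from the study.
toy = ["ATGCATGCA", "ATGCATTTT"]
vocab, rows = count_matrix(toy, k=6)

for kmer, column in zip(vocab, zip(*rows)):
    print(kmer, column)
```

Each row of the resulting matrix could then be passed, together with a class label, to a classifier such as Naive Bayes. Changing k in the calls above would correspond to testing the other k-mer lengths.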
Results: Using 6-mers, Naive Bayes achieved accuracies of 80% for human DNA, 87% for chimpanzee DNA, and 68% for dog DNA. The CNN achieved 90.76% accuracy for human data, 83.64% for chimpanzee data, and 76.53% for dog data, improving on Naive Bayes for the human and dog datasets.
Conclusions: Among the k-mer lengths tested, 6-mers gave the best classification accuracy across species. The CNN outperformed Naive Bayes on the human and dog datasets, though not on the chimpanzee dataset, demonstrating the potential of deep learning in genomic sequence analysis.