Performance analysis of Distributed Least Square Twin Support Vector Machine Learning Algorithms for Big Data Analysis
Abstract
Recently reported Big Data use cases deal with datasets that have a large number of data samples, features, and classes. Modern technological advancements in computational and data storage resources have contributed heavily to the development of Big Data science. Companies such as Google, Yahoo, Microsoft, and Amazon collect and maintain data at exabyte scale or larger every day and analyze it intensively to find underlying patterns. Moreover, social media organizations such as Facebook, YouTube, and Twitter have billions of users who constantly generate very large quantities of data. Well-known application domains involving large-scale datasets include recommendation systems used by companies such as Netflix, Amazon, Flipkart, YouTube, Facebook, and Twitter; web-scale text mining and document categorization; web-scale image classification and annotation; web click-through pattern mining; fraud detection; spam detection; web log mining; and many more. Traditional machine learning techniques are not suited to million- and billion-scale datasets and cause storage and computational bottlenecks in the training task. The existing non-parallel version of the Least Squares Twin Support Vector Machine (LSTSVM) is not a much faster variant of the well-known machine learning technique Support Vector Machine (SVM). The proposed design, distributed LSTSVM (DLSTSVM), exploits distributed computation on a cluster of machines to provide a scalable solution to LSTSVM. Very large datasets are partitioned and distributed in the form of resilient distributed datasets on top of the Spark cluster computing engine. LSTSVM is trained to generate two nonparallel hyperplanes. These hyperplanes are obtained by solving two systems of linear equations, each of which involves data instances from either class. While designing DLSTSVM, we employed distributed matrix operations using the MapReduce paradigm of computing to distribute the tasks over multiple machines in the cluster. Thus, memory constraints with extremely large datasets are averted. 
Experimental results show a reduction in time complexity compared to existing scalable solutions to SVM and its variants. Moreover, detailed experiments demonstrate the scalability of the proposed design with respect to large datasets.
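
The training scheme summarized in the abstract (two linear systems whose matrices are assembled from partitioned data by map and reduce steps) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard linear LSTSVM formulation, in which the augmented class matrices E = [A e] and F = [B e] yield the two systems (F^T F + (1/c1) E^T E) u = -F^T e and (E^T E + (1/c2) F^T F) v = E^T e. Plain Python lists stand in for Spark partitions, and every function name and the penalty parameters c1 and c2 are hypothetical choices made for illustration.

```python
from functools import reduce

def partial_gram(partition):
    """Map step: Gram matrix G^T G and column sums G^T e for one partition,
    where each row x is augmented to [x, 1] to absorb the bias term."""
    d = len(partition[0]) + 1
    gram = [[0.0] * d for _ in range(d)]
    sums = [0.0] * d
    for row in partition:
        g = list(row) + [1.0]
        for i in range(d):
            sums[i] += g[i]
            for j in range(d):
                gram[i][j] += g[i] * g[j]
    return gram, sums

def combine(a, b):
    """Reduce step: element-wise sum of two partial results."""
    gram = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    sums = [x + y for x, y in zip(a[1], b[1])]
    return gram, sums

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting on a small dense system."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def train(parts_a, parts_b, c1=1.0, c2=1.0):
    """Assemble both (d+1) x (d+1) systems from partitions and solve them."""
    ete, et1 = reduce(combine, map(partial_gram, parts_a))  # E^T E, E^T e
    ftf, ft1 = reduce(combine, map(partial_gram, parts_b))  # F^T F, F^T e
    d = len(et1)
    m1 = [[ftf[i][j] + ete[i][j] / c1 for j in range(d)] for i in range(d)]
    m2 = [[ete[i][j] + ftf[i][j] / c2 for j in range(d)] for i in range(d)]
    u = solve(m1, [-t for t in ft1])  # [w1, b1]: hyperplane close to class A
    v = solve(m2, et1)                # [w2, b2]: hyperplane close to class B
    return u, v

def predict(x, u, v):
    """Assign x to the class whose hyperplane is nearer."""
    def distance(plane):
        w, b = plane[:-1], plane[-1]
        norm = sum(wi * wi for wi in w) ** 0.5
        return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
    return "A" if distance(u) <= distance(v) else "B"

# Toy data: two well-separated classes, each split into two partitions.
class_a = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
class_b = [[4.0, 4.0], [5.0, 4.0], [4.0, 5.0], [5.0, 5.0]]
u, v = train([class_a[:2], class_a[2:]], [class_b[:1], class_b[1:]])
print(predict([0.5, 0.5], u, v), predict([4.5, 4.5], u, v))
```

Only the small Gram matrices and column sums cross partition boundaries, so no single worker ever needs the full dataset in memory. In this sketch the final systems have d + 1 unknowns for d features and are solved on one machine.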