Cluster Validity Indices for Detect the Optimal Clusters Feature Sets in Directive Elbow and Cluster-Wise Feature Selection

Suman Laha, Utpal Roy, Amit Kumar Saxena, Damodar Patel, Rajeshwar Prasad

Published: Jul 23, 2023

Updated: 2023-07-23

Versions:

2023-07-23 (2)

2023-07-20 (1)

Keywords:

machine learning, unsupervised feature selection (UFS), feature clustering, K-means clustering, Laplacian score, dimensionality reduction, Cluster Validity Indices (CVI), distortion index, inertia index, Silhouette index, elbow method

Abstract

In this empirical model, we first detect the optimal number of clusters in a feature set, in an unsupervised setting, using three Cluster Validity Indices (CVIs) computed over a range of values of K in K-means. To determine the learned optimal number of clusters in the feature set, we use the directive elbow method, a graphical procedure that locates the optimal point on the curve of K versus index value. The three CVIs used in the directive elbow method are Distortion (DIS), Inertia (INE), and Silhouette (SIL); each yields its own optimal number of clusters for a feature set.
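As a rough illustration of reading an elbow off a curve of K versus index value, the sketch below picks the K whose point lies farthest below the straight line joining the first and last points of a decreasing curve. This chord-distance heuristic is a generic stand-in and is not claimed to be the directive elbow method of the paper; the function name `elbow_point` and the sample values are hypothetical. In practice the values would come from K-means runs on the features (for example, on the transposed data matrix).

```python
def elbow_point(ks, values):
    """Return the K whose (K, value) point lies farthest below the chord
    joining the first and last points of a decreasing curve.

    Generic knee heuristic (an assumption), not the paper's procedure.
    """
    if len(ks) != len(values) or len(ks) < 3:
        raise ValueError("need at least three (K, value) pairs")
    k_span = ks[-1] - ks[0]
    v_span = values[0] - values[-1]
    if k_span == 0 or v_span == 0:
        raise ValueError("curve is flat or K does not vary")

    best_k, best_gap = ks[0], -1.0
    for k, v in zip(ks, values):
        x = (k - ks[0]) / k_span        # 0 at the first K, 1 at the last K
        y = (v - values[-1]) / v_span   # 1 at the first value, 0 at the last
        gap = 1.0 - x - y               # proportional to distance below x + y = 1
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


# Made-up inertia-like values for K = 1..8 (illustrative only).
ks = [1, 2, 3, 4, 5, 6, 7, 8]
inertia = [100, 40, 15, 12, 10, 9, 8, 7]
print(elbow_point(ks, inertia))  # 3
```

The same function can be applied to any index that decreases with K; an index that is maximised at the best K, as the Silhouette coefficient usually is, would need a different rule.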

Once we have the learned optimal number of clusters in a feature set, we select the most significant feature from each cluster of features using an unsupervised feature selection method based on the locality-preserving power of each feature, measured by its Laplacian Score [1].
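A minimal sketch of cluster-wise selection by Laplacian Score follows. It assumes the commonly used formulation: a k-nearest-neighbour graph over the data points with heat-kernel weights S, degree matrix D, graph Laplacian L = D - S, and a score f̃ᵀLf̃ / f̃ᵀDf̃ for each degree-centred feature f̃, where a lower score means better locality preservation. The function names, `n_neighbors`, and `t` are illustrative assumptions, not settings reported by the authors; `feature_labels` is assumed to hold the cluster index of each feature.

```python
import numpy as np


def laplacian_scores(X, n_neighbors=5, t=1.0):
    """Laplacian Score of every column of X (rows = samples); lower is better."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between samples.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Indices of the nearest neighbours (column 0 is normally the point itself).
    nbrs = np.argsort(sq, axis=1)[:, 1:n_neighbors + 1]
    rows = np.repeat(np.arange(n), nbrs.shape[1])
    cols = nbrs.ravel()
    S = np.zeros((n, n))
    S[rows, cols] = np.exp(-sq[rows, cols] / t)  # heat-kernel edge weights
    S = np.maximum(S, S.T)                       # symmetrise the graph
    d = S.sum(axis=1)                            # node degrees (diagonal of D)
    L = np.diag(d) - S                           # graph Laplacian
    F = X - (d @ X) / d.sum()                    # degree-weighted centring
    num = np.einsum("ir,ij,jr->r", F, L, F)      # f~^T L f~ per feature
    den = np.einsum("ir,i,ir->r", F, d, F)       # f~^T D f~ per feature
    # Constant features carry no information, so give them the worst score.
    return np.where(den > 1e-12, num / np.maximum(den, 1e-12), np.inf)


def select_per_cluster(scores, feature_labels):
    """Pick the lowest-scoring feature index from each feature cluster."""
    selected = []
    for c in np.unique(feature_labels):
        members = np.flatnonzero(feature_labels == c)
        selected.append(int(members[np.argmin(scores[members])]))
    return sorted(selected)
```

With one representative per cluster, the number of selected features equals the learned number of feature clusters, so neither quantity has to be fixed in advance.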

To measure the efficiency of this model, we also determine the learned optimal number of clusters among the data points using the directive elbow method. K-means, run with this learned number of clusters, provides new class labels; the initial class labels (supplied with the dataset source) are preserved. We then compare the classification accuracy obtained with all features against that obtained with the selected features, under both the initial and the new class labels, using two well-known machine learning classifiers, KNN and SVM.
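The comparison described above could be sketched with scikit-learn as follows. The helper name `compare_accuracy`, the 5-fold cross-validation, the feature scaling step, and the classifier settings (5 neighbours, RBF kernel) are assumptions made for illustration; the abstract does not state the evaluation protocol or hyperparameters. The label vector `y` may hold either the initial class labels or the labels produced by K-means.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_accuracy(X, y, selected, cv=5):
    """Mean cross-validated accuracy on all features vs. the selected subset."""
    models = {
        "KNN": KNeighborsClassifier(n_neighbors=5),  # assumed setting
        "SVM": SVC(kernel="rbf"),                    # assumed setting
    }
    results = {}
    for name, model in models.items():
        pipe = make_pipeline(StandardScaler(), model)
        results[name] = {
            "all": float(cross_val_score(pipe, X, y, cv=cv).mean()),
            "selected": float(
                cross_val_score(pipe, X[:, selected], y, cv=cv).mean()
            ),
        }
    return results
```

Calling the helper once with the initial labels and once with the K-means labels gives the four accuracy pairs (two classifiers, two label sets) that the comparison relies on.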

This empirical model selects non-redundant and highly relevant features in an unsupervised scenario. It does not require any predefined parameter values, such as the number of clusters or the number of selected features. The model was tested on twelve heterogeneous data sets of medium to high dimensionality (9 to 11,000 features), and classification accuracy on the selected features improved or showed no significant drop.

Issue

Vol. 44 No. 7 (2023): Issue 7

Section

Articles
