Application of K-Means Clustering on Apache Flume-based Multimedia Data Collection System
Abstract
The exponential growth of heterogeneous multimedia data (audio, video, images, PDFs, and other documents) across distributed systems has made real-time data ingestion and classification increasingly essential. This research explores an integrated framework that combines Apache Flume and K-Means clustering for efficient collection and intelligent categorization of multimedia files. Apache Flume ingests data from various sources, while K-Means groups the ingested files based on extracted metadata features. A synthetic dataset was generated to simulate real-world data streams and evaluate clustering efficiency. The paper presents the proposed system architecture, methodology, results, and visual analytics, demonstrating how lightweight metadata can be used to automate the preprocessing and organization of diverse data types. Future improvements include support for real-time streaming analytics and integration with deep learning models.
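To make the clustering step concrete, the following is a minimal sketch of K-Means (Lloyd's algorithm) applied to synthetic file metadata. It is not the authors' implementation: the abstract does not name the metadata features, so the two features used here (log10 of file size in bytes and media duration in minutes), the per-type profiles, and the function names `kmeans` and `synthetic_metadata` are all assumptions chosen for illustration. The Flume ingestion stage is not modelled; the sketch starts from metadata that is assumed to have been collected already.

```python
import math
import random
from collections import Counter


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length feature tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: index of the nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # assignments stable: converged
            break
        centroids = new_centroids
    return labels, centroids


def synthetic_metadata(n_per_type=30, seed=1):
    """Hypothetical metadata: (file type, (log10 size in bytes, duration in minutes))."""
    rng = random.Random(seed)
    # Assumed per-type profiles, invented for this illustration only.
    profiles = {"document": (4.5, 0.0), "audio": (6.5, 4.0), "video": (8.5, 20.0)}
    rows = []
    for kind, (log_size, minutes) in profiles.items():
        for _ in range(n_per_type):
            features = (rng.gauss(log_size, 0.3), max(0.0, rng.gauss(minutes, 1.0)))
            rows.append((kind, features))
    return rows


rows = synthetic_metadata()
labels, centroids = kmeans([features for _, features in rows], k=3)

# Summarise which file types ended up in each cluster.
for cluster in range(3):
    kinds = Counter(kind for (kind, _), lab in zip(rows, labels) if lab == cluster)
    print(cluster, dict(kinds))
```

Because the two features are on different scales, the duration column dominates the Euclidean distance in this sketch; standardizing the features before clustering is a common remedy. The result also depends on the random initialisation, so a practical system would typically run several initialisations and keep the best one.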