Development of Text-based Big Data Collection Technology

Seung-Yeon Hwang, Jeong-Joon Kim

Abstract

Big data collection technology has become an important part of the cyber strategy field as a result of the technological development of the Fourth Industrial Revolution. Available collection technologies include Flume, crawling, Sqoop, Scribe, and Kafka, and many of them are widely commercialized. In this work, a collection solution was developed in-house using crawling technology and was used to collect security-related papers within the cyber strategy field. The collected data consists of metadata for each paper and the raw data, which is the PDF file of the paper. The metadata is stored in MongoDB, and the PDF files are stored in distributed form in GridFS. A total of 20,775 paper PDF files were collected, covering the period from the first publication date of each conference and journal to August 2020. The collected data provides a basis for securing technology for the efficient collection and long-term preservation of public data in the cyber strategy field, and the metadata allows the actual data gathered from the various collection sites to be accessed quickly.
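
The storage split described in the abstract (metadata in MongoDB, PDF files in GridFS) can be illustrated with a minimal sketch. It is not the authors' implementation: the field names, the `papers` collection name, and the helper functions `build_metadata`, `store_paper`, and `load_pdf` are assumptions made for illustration, and the database calls assume the PyMongo driver and its bundled `gridfs` package.

```python
import hashlib
from datetime import datetime, timezone


def build_metadata(title, venue, year, source_url, pdf_bytes):
    """Build the metadata document for one paper.

    The field names are hypothetical; the abstract does not list the schema.
    """
    return {
        "title": title.strip(),
        "venue": venue,
        "year": int(year),
        "source_url": source_url,
        # A checksum makes it possible to detect duplicate or corrupted files.
        "pdf_sha256": hashlib.sha256(pdf_bytes).hexdigest(),
        "pdf_size": len(pdf_bytes),
        "collected_at": datetime.now(timezone.utc),
    }


def store_paper(db, metadata, pdf_bytes):
    """Store the PDF in GridFS and the metadata in a regular collection.

    `db` is assumed to be a pymongo Database object.
    """
    import gridfs  # ships with PyMongo; imported lazily so the helpers above work without it

    bucket = gridfs.GridFSBucket(db)
    # GridFS splits the file into chunks and returns the id of the stored file.
    file_id = bucket.upload_from_stream(metadata["title"] + ".pdf", pdf_bytes)
    # The metadata document keeps a reference to the stored file.
    document = dict(metadata, pdf_file_id=file_id)
    db["papers"].insert_one(document)
    return file_id


def load_pdf(db, title):
    """Look up a paper by its metadata and return the PDF bytes, or None."""
    import gridfs

    document = db["papers"].find_one({"title": title})
    if document is None:
        return None
    bucket = gridfs.GridFSBucket(db)
    return bucket.open_download_stream(document["pdf_file_id"]).read()
```

With a running MongoDB instance, `store_paper(pymongo.MongoClient()["paper_archive"], metadata, pdf_bytes)` would write both parts, where `paper_archive` is a placeholder database name. Keeping a file reference inside the metadata document is one way to get the quick metadata-based access the abstract mentions, since a query on the small metadata collection resolves directly to the stored file.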
