CLUSTERING AND INDEXING OF MULTIPLE DOCUMENTS USING FEATURE EXTRACTION THROUGH APACHE HADOOP ON BIG DATA

Authors

  • E. Laxmi Lydia Computer Science and Engineering, Vignan's Institute of Information Technology, India
  • G. Jose Moses Computer Science and Engineering, Raghu Engineering College (Autonomous), Visakhapatnam (Andhra Pradesh), India
  • Vijayakumar Varadarajan School of Computer Science and Engineering, The University of New South Wales, Australia
  • Fredi Nonyelu Briteyellow Ltd, United Kingdom
  • Andino Maseleno STMIK Pringsewu, Lampung, Indonesia
  • Eswaran Perumal Department of Computer Applications, Alagappa University, Karaikudi, India
  • K. Shankar Department of Computer Applications, Alagappa University, Karaikudi, India

DOI:

https://doi.org/10.22452/mjcs.sp2020no1.8

Keywords:

Text Mining, Hadoop MapReduce, Indexing, Lucene, Clustering, NMF, K-means

Abstract

Big data poses challenges for data processing, since information is retrieved from various search engines over the internet. Many large organizations that use document clustering fail to arrange documents sequentially on their machines. Across the globe, advanced technology has contributed to high-speed internet access, yet useful but unorganized information in machine files complicates the retrieval process, and manual ordering of files has its own complications. In this paper, application software such as Apache Lucene and Hadoop take the lead in text mining for indexing and the parallel implementation of document clustering. Within organizations, the approach identifies the structure of text data in computer files and its arrangement from files to folders, folders to subfolders, and up to higher-level folders. A deeper analysis of document clustering was performed by considering efficient algorithms such as LSI and SVD, which were compared with the newly proposed updated model of Non-negative Matrix Factorization (NMF). The parallel implementation on Hadoop produced automatic clusters of similar documents, with the MapReduce framework applying the K-means algorithm to all incoming documents. The final clusters were then automatically organized into folders on the machines using Apache Lucene. The model was tested on the Newsgroup20 text-document dataset. Thus, this paper demonstrates the processing of large-scale document collections using the parallel performance of MapReduce and Lucene to generate an automatic arrangement of documents, which reduces computational time and enables quick retrieval of documents in any scenario.
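The K-means step described above can be illustrated with a minimal, sequential sketch: each document becomes a term-frequency vector, and vectors are grouped by nearest centroid. This is an illustration only, not the authors' system; the paper runs this step in parallel via Hadoop MapReduce on Newsgroup20, whereas the toy corpus, deterministic initialization, and `k=2` below are assumptions chosen for clarity.

```python
import math


def tf_vectors(docs):
    """Turn raw documents into term-frequency vectors over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    idx = {w: i for i, w in enumerate(vocab)}
    vecs = []
    for d in docs:
        v = [0.0] * len(vocab)
        for w in d.lower().split():
            v[idx[w]] += 1.0
        vecs.append(v)
    return vecs


def kmeans(vecs, k, iters=20):
    """Plain Lloyd's K-means; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vecs[:k]]
    assign = [0] * len(vecs)
    for _ in range(iters):
        # assignment step: nearest centroid by Euclidean distance
        for i, v in enumerate(vecs):
            assign[i] = min(range(k), key=lambda c: math.dist(v, centroids[c]))
        # update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [vecs[i] for i in range(len(vecs)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


# Toy corpus: two Hadoop-themed and two Lucene-themed documents, interleaved.
docs = [
    "hadoop mapreduce parallel cluster",
    "lucene index search documents",
    "mapreduce hadoop cluster jobs",
    "search lucene document index",
]
labels = kmeans(tf_vectors(docs), k=2)
print(labels)  # → [0, 1, 0, 1]: thematically similar documents share a label
```

In the paper's parallel setting, the assignment step maps naturally onto MapReduce: mappers assign each document to its nearest centroid, and reducers recompute centroids from the assigned vectors before the next iteration.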


Published

2020-11-26

How to Cite

Laxmi Lydia, E., Jose Moses, G., Varadarajan, V., Nonyelu, F., Maseleno, A., Perumal, E., & Shankar, K. (2020). CLUSTERING AND INDEXING OF MULTIPLE DOCUMENTS USING FEATURE EXTRACTION THROUGH APACHE HADOOP ON BIG DATA. Malaysian Journal of Computer Science, 108–123. https://doi.org/10.22452/mjcs.sp2020no1.8