Improving Data Collection on Article Clustering by Using Distributed Focused Crawler
DOI:
https://doi.org/10.32734/jocai.v1.i1-82
Keywords:
data collection, cpu utilization, distributed web crawler, distributed focused crawler, focused crawler, memory utilization, multithread, web crawler
Abstract
Data is often collected, or harvested, from the Internet by using a web crawler. A general web crawler can be specialized to focus on a certain topic; this type of web crawler is called a focused crawler. To improve data collection performance, building a focused crawler is not enough, because a focused crawler only makes efficient use of network bandwidth and storage capacity. This research proposes a distributed focused crawler that improves crawling performance while remaining efficient in network bandwidth and storage capacity. The distributed focused crawler implements crawl scheduling, site ordering to determine the URL queue, and focused crawling using a Naïve Bayes classifier. This research also tests crawling performance by running the crawler with multiple threads and observing CPU and memory utilization. The conclusion is that crawling performance decreases when too many threads are used: CPU and memory utilization become very high, while the performance of the distributed focused crawler becomes low.
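As an illustration of the focused-crawling step, the following is a minimal sketch of a Naïve Bayes relevance classifier of the kind a focused crawler could use to decide whether a fetched page is on topic. It is not the authors' implementation: the class name `NaiveBayes`, the `tokenize` helper, the labels `relevant` and `irrelevant`, and the four training sentences are all assumptions made up for this example, and Laplace (add-one) smoothing is assumed as the smoothing scheme.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace, keep alphabetic words.
    return [w for w in text.lower().split() if w.isalpha()]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # all words seen during training

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_score(self, text, label):
        # log P(label) + sum of log P(word | label), with add-one smoothing.
        total_docs = sum(self.doc_counts.values())
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        for tok in tokenize(text):
            score += math.log((counts[tok] + 1) / (total_words + len(self.vocab)))
        return score

    def classify(self, text):
        # Pick the label with the highest log score.
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Made-up training data, for illustration only.
classifier = NaiveBayes()
classifier.train("distributed web crawler schedules url queue", "relevant")
classifier.train("focused crawler downloads pages about one topic", "relevant")
classifier.train("football team wins the league final", "irrelevant")
classifier.train("new recipe for chocolate cake", "irrelevant")

print(classifier.classify("a crawler for web pages"))
print(classifier.classify("chocolate cake recipe"))
```

In a crawler built along these lines, only pages classified as `relevant` would be stored and have their outgoing links added to the URL queue, which is one way the bandwidth and storage savings described above could arise.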
Copyright (c) 2017 Journal of Computing and Applied Informatics
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
The Authors submitting a manuscript do so on the understanding that if accepted for publication, copyright of the article shall be assigned to Data Science: Journal of Informatics Technology and Computer Science (JoCAI) and Faculty of Computer Science and Information Technology as well as TALENTA Publisher Universitas Sumatera Utara as publisher of the journal.
Copyright encompasses exclusive rights to reproduce and deliver the article in all forms and media. The reproduction of any part of this journal, its storage in databases, and its transmission by any form or media will be allowed only with written permission from Data Science: Journal of Informatics Technology and Computer Science (JoCAI).
The Copyright Transfer Form can be downloaded here.
The copyright form should carry an original signature and be sent to the Editorial Office either by post or as a scanned document.