Volume 9, Number 2/3

Cost-Sensitive Topical Data Acquisition from the Web

  Authors

Mahdi Naghibi¹, Reza Anvari¹, Ali Forghani¹ and Behrouz Minaei², ¹Malek-Ashtar University of Technology, Iran and ²Iran University of Science and Technology, Iran

  Abstract

The cost of acquiring training data instances for the induction of data mining models is one of the main concerns in real-world problems. The web is a comprehensive source of many types of data that can be used for data mining tasks, but its distributed and dynamic nature calls for solutions that can handle these characteristics. In this paper, we introduce an automatic method for topical data acquisition from the web. We propose a new type of topical crawler that uses a hybrid link context extraction method to acquire on-topic web pages with minimum bandwidth usage and at the lowest cost. The new link context extraction method, called Block Text Window (BTW), combines a text window method with a block-based method and overcomes the challenges of each method by using the advantages of the other. Experimental results based on standard metrics show that BTW outperforms state-of-the-art automatic topical web data acquisition methods.

  Keywords

Cost-Sensitive Learning, Data Acquisition, Topical Crawler, Link Context, Web Data Mining