Mining the World Wide Web. An information search approach (Q2734774)

The book deals with the challenges of information search on the World Wide Web (WWW). In the first part of the book, the authors take an information retrieval perspective on the WWW, with a focus on information encoded in textual data (written documents). They start with an overview of the techniques underlying standard, keyword-based Web search engines for unstructured textual data (including Web crawlers and meta-search engines). Then they discuss query-based search systems and Web query languages for structured data, with reference to database management system methods. Finally, the authors turn to mediator, data warehouse and wrapper architectures, which integrate structured databases with the mostly unstructured or semi-structured data available on the WWW. Since the WWW also contains a wide variety of non-textual data (image, video and audio data), the authors briefly survey some methods underlying multimedia search engines.

In the second part of the book, the focus shifts to data mining on the Web. First, a survey of basic concepts and methods underlying data mining is given. Since data mining focuses on the extraction of information from structured data, this view has to be complemented by methods that deal with the unstructured data stored in documents on the WWW. This leads to an overview of text mining methods, i.e., knowledge discovery in documents in terms of association, trend and event discovery. With the growth of online data on the Web, the application of data mining techniques to pattern discovery in Web data has emerged under the name of Web mining. Three trends are discussed: Web content mining (the automatic discovery of Web document content patterns, i.e., text mining), Web usage mining (the automatic discovery of Web user behavior patterns), and Web structure mining (the automatic discovery of hypertext and linking structure patterns by connectivity and link topology analysis). Combining the notions of Web crawlers and agent technology has recently led to the concept of autonomous and intelligent Web crawling agents which gather information from the WWW. A case study of the architecture underlying Envirodaemon, an information search engine operating on the WWW which deals with the environmental domain, concludes the book.

reviewed by

Udo Hahn

Identifiers

zbMATH Open document ID

0989.68001

Mathematics Subject Classification ID
