To many, the Internet is a valuable source of all sorts of information. Never before has there been an information source through which such massive amounts of information, covering such a broad range of subjects, can be gathered. Moreover, this information can be obtained conveniently and at very low cost, which also seems unprecedented.
The same story can be told for those wanting to offer information or services through the Internet. The barriers, as well as the investments needed to do this, are very low (at least compared to other media). Literally anyone who wants to can get their fifteen minutes of fame in "Cyberspace".
And although this has been, and still is, one of the Internet's greatest strengths, it also has an important downside: there is very little supervision of the way in which information is offered, i.e. there are no rules stating the form in which information should be offered, the way in which it should be described or typified, and so on. Organisations such as the Internet Engineering Task Force (IETF) and the W3C have designed various guidelines and standards to which information or documents should adhere (and they continue to do research in this and related areas), but to seemingly little avail. For instance, standard HTML offers the possibility to add META tags to HTML documents. These tags can be used to convey information such as the author of the document, the date of creation, the document type, and one or more keywords that best describe the document's contents. At this moment, only a small percentage of all documents available on the World Wide Web (WWW) part of the Internet uses these tags.
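As an illustration, the following minimal Python sketch shows how an indexing program might read such META tags. It uses only the standard library; the document, its tag names and its tag values are invented for the example, although "author", "keywords" and "description" follow the conventional usage described above.

```python
from html.parser import HTMLParser

# A hypothetical Web document carrying the kind of META tags described above.
DOC = """<html><head>
<title>An example document</title>
<meta name="author" content="J. Doe">
<meta name="keywords" content="search engines, indexing, crawlers">
<meta name="description" content="A note on document classification.">
</head><body>Some body text.</body></html>"""

class MetaExtractor(HTMLParser):
    """Collects the name/content pairs of META tags, as a crawler might."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser hands us lowercased tag names and decoded attributes.
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta[d["name"].lower()] = d["content"]

parser = MetaExtractor()
parser.feed(DOC)
print(parser.meta)
```

A search engine that encounters such tags can index the document on the author's own description of it, instead of having to guess at its contents.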
Search engines, currently the most popular way to search for information on the Internet [1], will use these tags (when available) to classify and typify a document. But as these tags are usually missing from a document, a set of heuristics has to be used instead to classify it; besides standard data, such as the date of creation and the URL, a document is usually classified by a list of the most frequently used terms in it, or by extracting its first 50-100 words.
The advantage of using such heuristics is that the whole process of locating, classifying and indexing documents on the Internet can be carried out automatically and quickly [2] by small programs called crawlers or spiders.
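These two heuristics can be sketched in a few lines of Python. The stop-word list, the number of terms and the excerpt length are invented for the example; a real crawler would use far more elaborate versions of both.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list, invented for this example.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "for", "by", "that"}

def classify(text, top_terms=3, excerpt_words=10):
    """Crude stand-in for the heuristics described above: the most
    frequently used (non-stop-word) terms, plus an opening excerpt."""
    words = re.findall(r"[a-z]+", text.lower())
    freq = Counter(w for w in words if w not in STOPWORDS)
    terms = [w for w, _ in freq.most_common(top_terms)]
    excerpt = " ".join(text.split()[:excerpt_words])
    return {"terms": terms, "excerpt": excerpt}

doc = ("Search engines index documents by scanning each document for "
       "frequently used terms; the terms found describe the document.")
result = classify(doc)
print(result["terms"])
print(result["excerpt"])
```

Note that no interpretation of the text takes place: the classification is purely statistical, which is precisely what makes it fast and automatic.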
Today, however, this way of working is increasingly showing its downside. While it makes it possible to index thousands of documents a day, there is a price to be paid: the loss of detail and the lack of a comprehensive summary of a document's contents. As a result, search engines return huge result lists in answer to a query, lists which also contain a lot of noise, such as irrelevant, duplicate or outdated document links.
Attempts to provide more personalised, more up-to-date or even real-time information (e.g. by using databases to store and present a site's contents) have amplified, and continue to amplify, this problem; at this moment, there are Web services consisting of thousands of pages, many of which are created 'on-the-fly' so that they can be filled with up-to-date or real-time information. However, sites that use such dynamic documents cannot be properly indexed by the crawlers that most search engines use to gather data about the information available on a site, as there are no complete or static documents that can be scanned and indexed. The information in those documents is hidden away from, or unavailable to, the indexing programs (see [LYNC97]).
Apart from a look at a prominent development such as "Push Technology", in this chapter we will also look at precursors of important (near-)future concepts and applications in the context of the online marketplace (e.g. agent-like applications and intermediary services).
[1] Usually, search engines are used to search for information on the WWW, which is why in this section we talk about "documents". Yet most search engines can be used to search for other items as well, such as files or Usenet articles. To keep things simple, we will continue to talk about (Web) documents, but what is said is in most cases just as valid for those other types of information.
[2] The documents being scanned do not have to be interpreted or comprehended: applying the heuristics mentioned above is all that needs to be done.