A fundamental requirement for usability and scalability of indexing and searching on the web seems to be that there can be no single index of all information, and no single, central top to a hierarchy of all information. Any system built around a dependence cannot succeed.
There continues to be commercial value in attempting to index everything on the web, but estimates are that only a modest percentage of the actual documents have been indexed, while growth continues at an exponential rate. Meanwhile, users around the world search in these centralized indexes, or replicas of them, and the better search servers quickly become overloaded, thus reducing their effectiveness. Furthermore, the indexing that is done is very generic, and not at all specific to the domain areas of the documents, thus reducing their usefulness. All these market pressures will dictate that search servers will eventually need to specialize in various ways to hold on to an increasingly narrow niche.
It doesn't make logical sense to structure all information in a single hierarchy because the nature of information is very interconnected with no center. Furthermore, it doesn't make practical sense because the higher nodes in the hierarchy will tend to be accessed too often. Massive replication of higher nodes could alleviate the latter problem but only with a severe space and time cost.
On the other hand, a web of index nodes is a challenge to search because no one index has all the information. There is no single, central top node, though there may be several relatively local top nodes about different topics. There should be no requirement that some neighbors of a node must have more general information because this would imply the existence of a most general node which gets us back to a single, central top.
An automatic search through a web starting at some node requires that a query may be automatically continued on neighboring nodes. Neither a depth-first nor breadth-first search is appropriate because we should not attempt to search the whole web blindly. Rather, we should search only the parts of the web that have some promise of satisfying the search, so a measure of relevance for each match is required. Given a set of nodes with varying degrees of relevance, the neighbors of the most relevant nodes should be searched next. It will help if each link to a neighboring node has some hints as to what the relationship with the neighbor is, but this is not strictly necessary.
This technique is very similiar to the search algorithm presented in "How to interact with a Whois++ mesh" [Faltstrom, Schoultz, Weider]. The difference is that getting a "hit" does not mean the search is done, and so which index servers to forward queries to is based on how relevant the hits are for each referring index.
Obviously, this technique is aided by having organized information. No one can hope to organize all information even for a fairly focused subject, although the attempt to do so has immediate benefits. In addition to manual organization, we should explore how we can facilitate self-organization of the web.
In previous work, I described centralized searching of a distributed index. The need for distributed indexes is strongly argued, but the assumption of centralized searching is questioned here.
Centralized searching seems to be required to at least coordinate the concurrent searching that could be done. But can the searching be even more distributed to the point of having no center?
A query by a single user starts one place, at the originator, and the result should be returned to that one place, but what happens in the meantime can vary. Centralized control of the searching process will direct where to search next along the frontier, and determine when the search is completed. With a fully distributed search, there is no central control. Instead, there might be several second level control centers, each searching their separate domains or areas of expertise. Or the control could be distributed recursively to ever more domains. The results could be accumulated by second level control centers, or recursively accumulated from each neighbor in the web.
The advantage of a centralized search is that we avoid many of the problems to be overcome of a decentralized search. The advantages of a distributed search, that make it worthwhile to overcome the problems, involve scaling. We gain much greater processing power with each remote server performing part of the search, and we avoid the bottleneck of a centralized controller managing the entire process. The results still have to come back to the originator, so this argues in favor of decentralized accumulation of results as well.
One problem with no central control is that the interrelated, interlinked web is likely to search the same area multiple times, at least once for every interconnection to neighboring related areas. An "area" is a document or collection of semantically related material. Interconnections are to more distantly related material in other areas.
One possible solution to this is identifiers for queries. Each query that is started by an individual is identified when it is created. This identifier is passed with the query wherever it goes. That way, a collection will know that it has processed the query already by comparing to its cache of recent queries. Queries should also have a time to live on them - the time during which the query must be satisfied or all activations of the query should die.
Queries might be identified by their representation, and results from similar queries, or component queries, if the query is a composite of simpler queries, could be used to satisfy the current query.
The way a query is propagated through the web of interrelated nodes is much like propagation for replication of new information to interested subscribers. Here the query is replicated to "interested" information nodes that wish to be discovered by relevant new queries. The nodes could remember the queries as a knowledge representation if they match the information based on time consuming searches through the information. The user who created a query and subsequent users who happen upon the information by other means could provide feedback as to whether the information actually does match the query and the feedback could be used to weight the knowledge representation and modify the matching algorithm.
Results from matching a query could be sent back to the originator directly from the server doing the matching. This requires that the originator be available as part of the query, just as the identifier is. Or results could be propagated back along the same path that the query took. Either the query remembers the path (which is more info) or each node along the path is waiting for a return and it knows the next step back. The waiting could be active with an open connection, or it could be passive, storing an identifier for the session.
The back propagation and subsequent feedback could be used to modify the stored knowledge representation of each of the links along the path as well as the source of the information. A match that was judged valuable would strengthen the connection all along the path to the source that matched, at least regarding the parameters and algorithms used at each node in the path that contributed to the eventual match.
This seems to be looking something like the Ingrid system. Keywords are used to represent each node and relationships to other nodes. But I think the system I am designing is more open-ended. I don't want to constrain the form of the knowledge representations because that needs to be explored, probably forever. Furthermore, Ingrid does centralized searching of distributed indexes.
Another problem with propagating queries is that a query will need to be transformed into a query that the subsequent node understands. This is normally called query forwarding, but in this context of decentralize searching, the forwarding agent does not necessarily wait for the result to come back.
There are two alternatives for how to do the transforming. Each connection could know what type of query needs to be passed on and do the transformation before forwarding, or the query could be forwarded first and each node itself would have to know how to transform the query into its native language. The former requires no standard query language, whereas the latter does. Not requiring a standard is a big win.
In addition to passing on a translated query, the original query, and other translations along the way could be passed as well, in an accumulation of alternative representations of the query. That way, the translation time could be saved, and the query might be able to be passed on to more areas than would otherwise be possible due to limited knowledge about how to translate to a particular related nodes query language.
A big danger of decentralized searching is queries getting out of control, spreading further than they should, and consuming more and more resources. It is to each nodes advantage to keep this from happening, since an out of control query damages the nodes that contribute to its occurrence. How can queries be controlled?
One way to control queries is a time-to-live. A real-time, termination time, absolute or relative, for the query specifies when the query processing should end, after which the best results from each node are returned. Differences in time across machines will extend the search slightly differently in various parts of the web. A mistake in the time control might cause large increases in the life of the query.
Another way to control queries is a decay factor or maximum distance. Each step along the path from the originator would decrement a counter that goes along with the query. When the counter reaches zero, that's the end of the path for that branch of the query. Other branches could continue. It is difficult to make a mistake in decrementing a counter or checking for zero, but it could happen. Each node might have a counter maximum such that if a query's counter were greater than the maximum, it would be reset to the maximum. (Similar constraints could be placed on a time-to-live.) A mistake would tend to die off except in areas of the web that did not have such constraints.
An advantage of the time-to-live method is that the fastest paths would tend to be rewarded, despite how long they were. An advantage of the maximum distance method is that the shortest paths would tend to be rewarded, despite how long they took. The faster paths are better for use in future searches, and the shorter paths are better for organization and browsing. A hybrid of the time-to-live and maximum distance could be used to get the best of both.
Another problem of distributed searching is uniformity of evaluation. With several search engines each performing their searches using their own criterion for ranking the results, the synthesis of the various results is a difficult problem. Accuracy would appear to be sacrificed. However, with feedback from the user as to the value of the results, the search engines themselves have the incentive to give accurate results, otherwise the paths to them will be weakened until the faulty search engines do not get used any more. This would take care of the situation where someone wants to advertise by returning their ad as a result of more queries than is appropriate.
What counts as accurate results? How are results compared? Just as queries need to be transformed, so do results. When a result set is returned, it is transformed into a set that is understood by whatever is combining the results. Part of the transformation could be a mapping of the ranking value into a different scale, say from -10 to 10 into 0 to 1000. Another way to return a result is that all possible information might be returned, with very negative rankings for irrelevant info. So the issue is what is the threshold used to determine what is relevant enough to include in the result set. This threshold could be specified as part of the query.
Decentralized query forwarding and accumulation of results could be done two different ways. Either the original path is remembered while forwarding, and reused while accumulating results, or certain nodes within the web could take on greater responsibility to manage forwarding and accumulation. The latter seems to make more sense and be more scalable since many paths might just be enroute to better or worse places and should not be held fully accountable just for having provided a link. These remote centers would specialize in various topic areas or search techniques. Paths to valuable centers should be shortened by providing short cuts rather than only strengthened.
The difference in these two ways of forwarding and accumulating is the similar to the "iterative" vs "recursive" queries of DNS. But instead of merely being iterative, which assumes centralized control, we also have asynchronous queries which pass on control to the next node and let that node relay the results back to the originator, or pass on control to yet another node. Hierarchy is still appropriate sometimes, but delegation is also important.
With this asynchronous mode of returning results, how does the originator know when the results are all in?
A more condensed version of some of the above ideas is my position paper, and slides for a presentation, for a W3C Distributed Indexing and Searching workshop.
A SEARCH method for HTTP would facilitate distributed searching.