Reasons to consider Distributed Searching
- Reasons for Distributed Indexing
- No single organization can maintain high quality for everything
- Specialists can do better - knowledge is greater closer
to its source (but greater diversity means fewer standards)
- Indexing already needs to be distributed for global coverage
and to reduce redundant network and server load - crawlers are
searching just-in-case.
- Enables a free market of multiple classifications (ontologies)
(but fewer standards means more chaos)
- Successful central servers are swamped until they fail.
(but supporters fund growth to the limit of technology - how far?)
- Replication of central servers is more expensive and wasteful.
I.e. just-in-time is sometimes cheaper than just-in-case.
- Reduce server load by distributing searching over several servers.
(but are network load and total server overhead greater?)
- Alternative search services even for same speciality provide
incentive to improve quality. (but also applies to central servers)
- Full content searches feasible on small scale, not on global scale.
- Smaller search service has lower barrier to entry.
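
As a rough illustration of distributing one search over several servers, here is a minimal Python sketch. It assumes two hypothetical specialist "servers" held in memory (the names SERVERS, search_server and distributed_search are invented for this example); a real deployment would make network calls instead, which is where the extra network load noted above comes from.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for two specialist search servers.
# Each maps a document id to its text.
SERVERS = {
    "botany": {"b1": "oak tree leaves", "b2": "pine tree needles"},
    "zoology": {"z1": "tree frog habitat", "z2": "river otter diet"},
}

def search_server(name, query):
    """Return (server, doc_id) pairs whose text contains every query term."""
    terms = query.lower().split()
    docs = SERVERS[name]
    return [(name, doc_id) for doc_id, text in docs.items()
            if all(term in text.split() for term in terms)]

def distributed_search(query):
    """Send the query to every server in parallel and combine the hits."""
    with ThreadPoolExecutor(max_workers=len(SERVERS)) as pool:
        per_server = list(pool.map(lambda name: search_server(name, query),
                                   sorted(SERVERS)))
    # Flatten and sort so the combined answer does not depend on timing.
    return sorted(hit for hits in per_server for hit in hits)
```

Calling distributed_search("tree") returns hits from both servers, while each server only scans its own small collection.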
Distributed Searching Requirements
- Distributed Indexing
(consider distributed searching of central index)
- Organized Semantic Structure - otherwise we're flying blind
- Need either a Single Standard for Query Language and Result Data or
mappings between a few standards
- difficult either way
- How to do query refinement and relevance feedback with
distributed search?
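
The "mappings between a few standards" option can be sketched as follows. This is only an illustration: the two query dialects and the names to_prefix_syntax, to_boolean_syntax, MAPPINGS and translate are hypothetical and do not correspond to any real query language.

```python
def to_prefix_syntax(terms):
    """Hypothetical dialect A: '+term' marks each required term."""
    return " ".join("+" + term for term in terms)

def to_boolean_syntax(terms):
    """Hypothetical dialect B: required terms are joined with AND."""
    return " AND ".join(terms)

# One mapping function per supported dialect. With a single standard
# this table would be unnecessary; with no standards it could never be
# completed.
MAPPINGS = {
    "dialect_a": to_prefix_syntax,
    "dialect_b": to_boolean_syntax,
}

def translate(terms, dialect):
    """Translate a neutral list of required terms into a server's dialect."""
    return MAPPINGS[dialect](terms)
```

Each additional standard costs one more mapping function, which is why this approach is only practical for a small number of standards.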
Alternative Architectures for Distributed Searching
- Centralized vs Distributed across many processes
- Immediate vs Distributed (delayed) across time (e.g. agents)
- Collection issues:
- Search server vs collection
- Localized vs Distributed collections
- Provider-specified vs User-specified collections
- Small (<100) vs Large (>100K) collections
- Flat vs Structured collections
- Static vs Dynamic
- Search index (preprocessed metadata) vs content (raw data)
- Flat list of search engines vs 2-levels (e.g. mediators) vs
General Hierarchy vs Lattice vs Web (with cycles)
- Single vs Multiple indexes per domain area
- Centralized vs Distributed (delegated) control of many processes
- Communication issues:
- Synchronous vs Asynchronous communication between processes
- Connection vs Connectionless
- Client-Server vs Register-Notify
- Stateful vs Stateless
- Resident Search Software vs Uploaded Search Applets.
- Single Standard Queries vs Mapping between query schemes
vs no standards
- Single Standard Results (metadata, ranking) vs
Mapping between a few result schemes vs no standards
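
To make the result-mapping alternative concrete, here is a minimal Python sketch of merging ranked results from servers that score on different scales. Min-max normalization is an assumption chosen for simplicity, not a method taken from these notes, and the names normalize and merge are hypothetical.

```python
def normalize(results):
    """Min-max scale one server's (doc_id, score) list into [0, 1]."""
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # If every score is equal, treat all hits as equally (fully) relevant.
    return [(doc_id, 1.0 if span == 0 else (score - lo) / span)
            for doc_id, score in results]

def merge(result_lists):
    """Merge per-server result lists into one ranking on a common scale."""
    merged = [hit
              for results in result_lists if results  # skip empty lists
              for hit in normalize(results)]
    # Highest score first; ties are broken by document id.
    return sorted(merged, key=lambda hit: (-hit[1], hit[0]))
```

For example, one server might score from 0 to 10 and another from 0 to 1; after normalization the top hit of each server ranks equally. Whether that is fair to the user is exactly the kind of question a shared result standard would have to settle.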