A simplification of URN and URC issues

The simplification boils down to the following claims:

  1. URNs and URLs are not generally distinguishable.
  2. URCs are just structured documents.

First of all, Tim Berners-Lee has argued convincingly (to me anyway) in "Naming -- /DesignIssues" that names have to be addresses in order to be useful. In other words, URNs must "behave" at least as well as URLs: it must be possible to look them up and find out what they resolve to.

So if URNs are really URLs regarding the addressing aspect of their use, what else is different about them? URNs are required to be globally unique, at least within the naming scheme. A DNS name and an ISBN name might refer to the same object, though each name is unique in its own name space. This requirement can be met by other URL schemes, so global uniqueness is not specific to URNs.

URNs are required to outlive the objects they refer to. The main feature here is intentional design to support forwarding of references when objects move. But other URL schemes *could* be augmented or retrofitted to support this forwarding. HTTP already supports forwarding of http references, at least as long as the server is still around. If the server moves, then the parent server (whatever that is) would have to be made responsible for forwarding requests. The difference here between URNs and URLs is that URNs would be designed from the outset to support longevity.

Therefore, URNs have requirements (for global uniqueness, longevity, etc.) that could be met by other URL schemes.

There are (at least) two other areas of commonality that are not discovered by looking at requirements alone (unless I am missing something in the requirements). First, there are multiple URL schemes, and there very likely could be multiple URN schemes. What distinguishes all the possible URN schemes from the non-URN schemes? Some have suggested that URIs that are URNs must have a "URN:" prefix. This is possible but, for one thing, it changes the URL syntax a bit: clients would have to pull off the "URN:" prefix to get to the scheme that follows (e.g. "dns:"). A larger issue, and the second area of commonality, concerns what a client is expected to do with a URN that is different from a URL. I used to believe that the resolution of a URN is always a URC (which may contain URLs), whereas the resolution of a URL is always some other kind of document or service. This distinction (and only this distinction, since none of the others is sufficient, as argued above) means that URNs would have to be identified as such. But the problem is more complex, and this relates to the second claim: URCs are just structured documents.

URCs are thought of as collections of metadata about some data, but there is a lot of controversy about the distinction between metadata and data. I claim the distinction is mostly arbitrary; the only difference concerns how the metadata is used. More needs to be said on this.

If we can generalize from metadata to structured data, then it is conceivable to return structured data in response to either a URN or URL request. If that is the case, then we need to identify the type of the structured data much as the type of other documents is identified: with a MIME content-type header. The subtype would specify the particular syntax of the URC, e.g. urc/rfc822 or urc/sgml. There is another kind of type information that corresponds to the content of the data. A bibliographic record has different content than a spatial metadata record, for example. Perhaps there is another way to use MIME headers that makes more sense here.
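As a rough illustration of the content-type idea, the sketch below checks whether a Content-Type value announces a URC and, if the syntax is rfc822, reads the body as "Field: value" lines. It uses Python's standard email package only as a convenient header parser. The "urc" media type is the proposal from the text, not a registered type, and the field names in the sample body are assumptions made up for the example.

```python
from email import message_from_string
from email.message import Message


def classify(content_type_header):
    """Return (is_urc, syntax) for a Content-Type header value."""
    holder = Message()
    holder["Content-Type"] = content_type_header
    if holder.get_content_maintype() == "urc":
        # The subtype names the concrete syntax, e.g. "rfc822" or "sgml".
        return True, holder.get_content_subtype()
    return False, None


def parse_rfc822_urc(body):
    """Collect 'Field: value' lines into a dict of lists (fields may repeat)."""
    fields = {}
    for name, value in message_from_string(body).items():
        fields.setdefault(name.lower(), []).append(value)
    return fields


# Hypothetical URC body; the field names are illustrative only.
body = (
    "Title: An Example Report\n"
    "URL: http://a.example/report\n"
    "URL: http://b.example/report\n"
)
is_urc, syntax = classify("urc/rfc822; charset=us-ascii")
if is_urc and syntax == "rfc822":
    print(parse_rfc822_urc(body)["url"])
```

The second kind of type information mentioned above (bibliographic versus spatial content, say) is deliberately left out; the sketch only covers the syntax carried by the subtype.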

The unification of URCs and other document types also means that management of URCs would be much like management of other types of documents. They both refer to other documents, via URL fields or embedded links, and they both need to be notified when URLs become obsolete etc. Because URCs are structured, they would generally have more specialized uses than other document types. The structure of URCs would allow clients to automatically process the contents as described above. In fact, particular types of URCs might contain (or be) forwarding information to facilitate automatic update of links.

The combination of these two claims, that URNs are the same as URLs and URCs are just structured documents, means that URN resolution happens much like any other URL resolution (which is different for each scheme, by the way) and it results in a document. If the document returned by any scheme happens to be structured data, then the client may choose to automatically process the data and perhaps transparently fetch URLs referenced by it, etc. This should be a user-configurable option.
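A minimal sketch of that client behaviour follows, assuming an in-memory table in place of real scheme-specific resolution. The URIs, the "urc/x-example" subtype, and the "take the first URL" policy are all hypothetical; the auto_follow flag stands in for the user-configurable option.

```python
# Hypothetical in-memory stand-in for the network: URI -> (content type, body).
# "urc/x-example" is an invented subtype; here the body is already parsed.
WEB = {
    "urn:example:report": ("urc/x-example", {"url": ["http://a.example/report"]}),
    "http://meta.example/report": ("urc/x-example", {"url": ["http://a.example/report"]}),
    "http://a.example/report": ("text/plain", "the report itself"),
}


def resolve(uri, web, auto_follow=True, max_hops=5):
    """Resolve a URI; optionally follow structured results to the referenced document."""
    for _ in range(max_hops):
        content_type, body = web[uri]
        if not (auto_follow and content_type.startswith("urc/")):
            return content_type, body
        uri = body["url"][0]  # simplest policy: take the first referenced URL
    raise RuntimeError("too many indirections")


# The same code path serves a URN and a URL that both resolve to a URC.
print(resolve("urn:example:report", WEB))
print(resolve("http://meta.example/report", WEB, auto_follow=False))
```

Nothing in resolve looks at whether the identifier is a URN or a URL; only the type of the returned document matters.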

From mail to the urc list

Different types of elements in URCs can be identified, such as higher-level bibliographic elements and lower-level structural elements such as URLs, Content-type, etc. They are different, but that doesn't lead me to want to segregate them. Instead, they should be usable throughout the URC, which leads to another possible kind of confusion: grabbing the wrong URL just because a piece of data is labelled as a URL. For example, the publisher might be referenced in a URC not with literal text but with a URL to a page about the publisher.

If URCs are just structured data, making any finer distinction about how they will be used should be avoided at this stage. Structures in a typical programming language use embedded structures and references to build higher-level structures out of several components. URCs should be just as general, and that is not difficult to achieve. Embedding is greatly simplified by using a nesting notation rather than faking it with flat structures. References on the web are URIs.
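To make the nesting and reference idea concrete, here is a small sketch that models a URC as nested Python dictionaries and lists, with a reference written as {"ref": uri}. This representation, the field names, and the URIs are assumptions chosen for the example and are not a proposed URC syntax. Because every reference is reported together with its path in the structure, a client can tell a location of the resource from a link to the publisher's page.

```python
# Hypothetical nested URC; {"ref": uri} marks a reference rather than literal text.
urc = {
    "title": "An Example Report",
    "publisher": {"ref": "http://publisher.example/about"},
    "author": {
        "name": "A. Author",
        "homepage": {"ref": "http://people.example/author"},
    },
    "locations": [
        {"ref": "http://a.example/report"},
        {"ref": "urn:example:report"},
    ],
}


def references(node, path=()):
    """Yield (path, uri) for every reference found in a nested structure."""
    if isinstance(node, dict):
        if set(node) == {"ref"}:
            yield path, node["ref"]
        else:
            for key, value in node.items():
                yield from references(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from references(value, path + (index,))


# Only references under "locations" point at the resource itself.
locations = [uri for path, uri in references(urc) if path[0] == "locations"]
print(locations)
```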

From mail to the urc list

That a URL may be contained in a URC is quite different from the URC itself being *identified by* a URL. Resolving a URI (whether URL or URN) that identifies a URC should result in the URC as an object. What a client does with that URC object is up to the client.

You can also take a URI (whether URL or URN) that identifies a non-URC object and either resolve it to that object or, by a different resolution process, look up a URC for it.

What several people have in mind is, I think, very confusing. They want URNs to identify the object but be resolved to URCs that are always known to be metadata, with URLs contained in URCs or found separately being the only thing that can be used to look up the actual object. So we would have two classes of identifiers (URLs and URNs) and two classes of resources (data and metadata).

The problem is that people are not getting the idea that URLs can be made persistent, and that resolving a URN to a URC means using the URN as a locator for the URC. In that role it is a URL, with all the same problems of how to make it persistent.

More Email

One thing people want of URCs is support for looking up the nearest replica of a document.

To get the nearest copy of a resource, we need several things, but URNs are not necessarily one of them. First, we need replicas and/or caches; they are easy to create.

The second thing we need is a format to package up the known replica locations - this could be the URC in some form. HTTP headers could also be used, e.g. multiple URI lines for a redirect. HTTP headers could be one of the manifestations of URCs.

The third thing we need is clients that can find and use local caches (most can) or remote caches (currently only via other caches) or can ask for URCs and know what to do with them. The easiest way to do the last (in the short term) is via a proxy.

Remote caches are more difficult than replicas because not even the server can refer the client to them. Who tells the client about replicas could be the server of the original document or anyone else, but how does the client find anyone else?

Finally, the fourth thing we need is for the client to be able to figure out which of several alternatives is closest. It could pick randomly and use timeouts to give up quickly and try elsewhere. Or it could use a new service such as SONAR from the University of Tennessee. Servers might be able to help the client figure this out too. Both the server and the client have their own views of their places in the world, neither of which is likely to be complete; perhaps a hybrid would be best.
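One simple client-side approach can be sketched as follows. Note that this is a variant of the timeout idea, not the random-pick strategy itself and not how SONAR works: it probes every candidate, skips those that time out, and keeps the fastest. The probe function is an assumption supplied by the caller; a real one would time a small network request. The replica URLs and timings are invented so that the example runs without a network.

```python
def pick_replica(candidates, probe, timeout=2.0):
    """Return the candidate with the smallest measured response time, or None."""
    best_uri, best_time = None, None
    for uri in candidates:
        try:
            elapsed = probe(uri, timeout)
        except TimeoutError:
            continue  # give up on this replica and try elsewhere
        if best_time is None or elapsed < best_time:
            best_uri, best_time = uri, elapsed
    return best_uri


# Hypothetical measurements in seconds; None means the replica never answered.
FAKE_TIMES = {
    "http://a.example/report": None,
    "http://b.example/report": 0.30,
    "http://c.example/report": 0.05,
}


def fake_probe(uri, timeout):
    """Stand-in for a real probe: look up a canned response time."""
    elapsed = FAKE_TIMES[uri]
    if elapsed is None or elapsed > timeout:
        raise TimeoutError(uri)
    return elapsed


print(pick_replica(list(FAKE_TIMES), fake_probe))
```

Response time is only a proxy for "closer"; a hybrid that also uses hints from the server, as suggested above, would feed extra information into the same choice.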

None of the above requires URNs. URNs are useful as persistent identifiers, but persistence is not required for redirecting to replicas. Furthermore, I don't believe URNs should have any special role within URCs, even if we had URNs.

More Email

The location independence people hope for in a URN itself is not really there since you have to look up the URN; that is, you have to locate a service that will do something with the URN. What that service provides you might be an indirection to another URI (URN or URL) and so the indirection provides you the only true location independence. URCs are another way to get that indirection. URNs by themselves do not necessarily provide you an explicit indirection, and in fact, resolution of any URL can also return an indirection, or a URC.

The persistence of a URN is more a function of this same indirection than anything else. (There is also another kind of persistence, the promise not to reuse the same URN for another purpose - but it is only a promise.) URNs can have indirection built into how you resolve the URN, and this is how the path scheme works. You might say that resolving a URN into a URC gives you that indirection too, and that is true. But so does resolving a URL into a URC. What's special about a URN then?

More Email

URNs are intended to be persistent, but the question is how do they get that persistence? Do they get it only by being mapped to a URC which provides the indirection to the resource, or is the persistence inherent in the process of resolving the name, no matter what it resolves to, whether a URC or the resource itself? I prefer the latter (i.e. persistence is in the name resolution), but I don't think we should dictate which it should be for all URNs.

Another assumption people are making is that a URC can contain a bunch of URLs for the resource. That is true, but instead of a bunch of URLs, it could be a bunch of URNs for the resource, or any mix of URIs.

More Email

In mail to the URN list in fall 1995, I defined resolution as whatever you do automatically to get to some result. The overall resolution process may be composed of a sequence of smaller resolutions, such as getting the DNS name and resolving it to an IP number (which involves several intermediate steps), asking a server to resolve a path and getting a URC, picking out a URL from the URC, resolving a URL to a redirect to another URL, etc. But getting and processing a URC doesn't need to be part of a URN resolution specifically. You could get a URC as a result of resolving a URL too.

More Email

The requirement for URNs is for persistence, not specifically that the location is not given. In fact, a URN must be used as a location to find any information associated with it, whether that is a URC, a URL, a list of URLs, or the resource itself. The path scheme allows you to reference the resource itself, and it does so in a way that lets the resource move. The reason this works is that the resolution process involves following a sequence of indirections. So each indirection is similar to what people are thinking of as a URL or URC, but it is built into the path scheme resolution process.
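The way indirection lets a resource move while its name stays valid can be sketched with a toy forwarding table. This is not the path scheme's actual resolution algorithm, which is not described here; the class, its methods, and the names used are all hypothetical.

```python
class Resolver:
    """Toy name service: a name maps to ('resource', data) or ('forward', other name)."""

    def __init__(self):
        self.table = {}

    def bind(self, name, data):
        """Make the name refer directly to the resource."""
        self.table[name] = ("resource", data)

    def forward(self, old_name, new_name):
        """Leave an indirection behind when a resource moves."""
        self.table[old_name] = ("forward", new_name)

    def resolve(self, name, max_hops=10):
        """Follow indirections until a resource is reached."""
        for _ in range(max_hops):
            kind, value = self.table[name]
            if kind == "resource":
                return value
            name = value
        raise RuntimeError("forwarding chain too long or circular")


names = Resolver()
names.bind("path:/org-a/report", "the report")

# The resource moves; the old name keeps working through an indirection.
names.bind("path:/org-b/archive/report", "the report")
names.forward("path:/org-a/report", "path:/org-b/archive/report")
print(names.resolve("path:/org-a/report"))
```

The persistence here comes entirely from someone maintaining the forwarding entry, which is the same obligation whether the identifier is called a URN or a URL.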

How you resolve a URN depends on what the URN points to. The path scheme lets you point at the resource "directly" (although the resolution process goes through indirections) or you can return a URC as metadata to let the client decide what to do. You can also point at a URC as the data itself rather than as metadata for something else.

More Email

What makes a URN a location is that there is a defined resolution mechanism that looks somewhere specifically (the location) for some information. This is not necessarily the location of the resource itself but it is a location all the same with the same problems of making *that* persistent. We could take this to an extreme and consider an identifier that had no defined resolution mechanism associated with it other than "search the entire internet for it since you have no hint as to where to look otherwise". This would be an identifier devoid of "location", but it would be useless for resolution.

There is another sense in which both URNs and URLs are *not* locations. Clients can choose to not resolve the identifier by the defined resolution mechanism at the usual location, and instead look up the identifier (but again as a location) in a cache, or an annotation server. Both URNs and URLs are equally able to be used this way.

Now given that the name should *be* a location in the sense described above, the question is how do we make that location persistent? It can be done simply by providing a service that will last as long as any of the resources that you want to provide access to.

URLs can be used for the same things that are planned for URNs. Note that I am not talking about using URLs in the usual way. Rather than giving a different URL to each replica of a document, we can use the same URL but ask different servers to resolve it. Caches work this way now, and replicas can work the same way. Another unusual way to use URLs is to have the URL resolve to a URC as metadata, just as URNs are intended to do, with the URC referencing each of the replicas of the document via their own URLs. There is nothing about URLs that prohibits them from being used this way to gain all the benefits planned for URNs.
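The first of these uses, asking different servers to resolve the same URL, can be sketched in a few lines. Each server is modelled as a plain dictionary keyed by URL; the server names, the lookup order, and the URL are assumptions for the example.

```python
def lookup(url, servers):
    """Ask each server in turn to resolve the same URL; return (server name, document)."""
    for name, store in servers:
        if url in store:
            return name, store[url]
    return None, None


URL = "http://origin.example/report"

# Hypothetical servers, nearest first. All are asked about the identical URL.
servers = [
    ("local cache", {}),
    ("regional replica", {URL: "the report (replica copy)"}),
    ("origin server", {URL: "the report (original)"}),
]

print(lookup(URL, servers))
```

The identifier never changes; only the party asked to resolve it does, which is the sense in which a URL here is used as a name.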

Control of name spaces is important to get the desired integrity, but URLs can be controlled the same way as is planned for URNs. On the other hand, centralized control for the whole world will never work for social and political reasons, so we are forced to relinquish control to designers of new name spaces and the naming authorities established under them. They can really do whatever they like no matter what rules people might try to impose on them, but people will not want to use a system unless it works reasonably for them.

PURLs, designed by OCLC, could have a high level of control given OCLC's organizational skills. On the other hand, some URN schemes might be very lax in general, as the path scheme would be (while allowing each naming authority to impose the degree of control it desires).

Guaranteeing uniqueness is pretty trivial, by the way, if you decide to do it. It comes down to human organizations deciding to manage a namespace one way or another. Technical infrastructure cannot solve the problem but it can help or hinder.

URC identification

At the time data is returned from a request, it should be identified either as metadata for the requested object or as the requested object itself. If the request is *for* metadata, then the result may be just the requested metadata as data or, in fact, metadata for the metadata may be returned.


Daniel LaLiberte (liberte@hypernews.org)
Last modified: Wed Jul 10 11:33:08 CDT 1996