Anatomy of a URI (What is a URL, URN, etc.?)
This is part of a larger series titled “How To Program Anything: The Internet”.
What is a URI, URL, URN, etc.?
The year is 2017 and the internet has become a ubiquitous part of everyday life for a lot of people. Whether their only interaction is through a streaming service such as Netflix or Hulu to watch their favorite shows, or they’re tweeting about their lives multiple times a day, the internet has its hold on most people. So it might be easy to forget the early days of the internet and the problems that faced early adopters.
One of those problems was the ability to reference resources that could be found on the internet. It used to be that if you wanted to describe how to access something on the internet you had to tell people to open a program, log in as this user, navigate to this place, enter that information, etc. There was no real way of addressing a resource, such as a picture, an e-mail address, or a MUCK, like we might do on an envelope using “snail mail.” Yes, that’s what it used to be called. It would be like writing a letter and then giving your postman directions to your friend’s house in person for it to be delivered. That’s hardly practical.
Luckily, a number of ideas rose to prominence that allowed the same type of addressing you find in the real world to be used on the “information superhighway.” Yes, that’s what it used to be called. One of those ideas was the Uniform Resource Identifier, otherwise known as a URI. You may be familiar with the term URL, or Uniform Resource Locator, which is a form of URI. There is also the lesser known URN, or Uniform Resource Name, which we’ll also cover.
In essence, a URI is a way to “tag” some kind of “resource” on the internet. This is a little abstract, but I shall explain. Unlike a URL, a URI does not necessarily specify a location on a network; rather, it is a format specification for what you might use to tag a resource. A resource here can be anything: a PDF, a JPG, an HTML file, or even just a connection, such as a telnet connection. Resources can even be data objects obtained through a REST API, such as a person or a payroll record. The main idea here is that you can assign an “identifier” to a resource with a URI. That URI then “refers” to that resource, and when the URI is resolvable, the act of “resolving” a non-relative URI (see below) retrieves a representation of that resource, that is, the actual data, such as an image file.
So say, for example, I upload an image to Instagram. On the Instagram servers this particular image may be located in all sorts of places. It could be duplicated across multiple servers around the world, and is probably located in a deeply nested folder on a server’s file system. However, Instagram provides me with a URI for that particular image in the form of https://www.instagram.com/p/xxxxxxxxxx, where “xxxxxxxxxx” is a string of seemingly random characters. This particular bit of technical-ese is a URI, that is, it identifies a particular resource (an image) on the internet. This URI also happens to be a URL, in that it also specifies a network location, but regardless it is a URI.
Another example is the ISBN of a book. ISBNs are unique identifiers assigned to books by various country-specific organizations. They are used for various commercial purposes, and in the United States the privately held company R. R. Bowker issues them. “The Neverending Story” by Michael Ende, one of my favorite books, is currently assigned the ISBN 0140386335. A URN is a form of URI that identifies a resource by name inside a “namespace”. It is read from left to right in order of increasing specificity. Namespace identifiers, such as “isbn”, must be registered with the Internet Assigned Numbers Authority (IANA). In this case the URN for our book would be urn:isbn:0140386335.
Both of these are examples of URIs, that is, identifiers that refer to specific resources. However, it should be pointed out, at the risk of complicating things, that a URI does not have to “refer” to a tangible document or resource; it must simply provide a means of identification. For example, in what has become known as the “semantic web”, URIs are used to identify not only documents, but also abstractions, concepts, and entities found in the real world. The Resource Description Framework uses URIs in a way that doesn’t have to imply retrieval of resource representations over the internet, and in fact may not even indicate a network-based resource. The key here is that a URI identifies a resource.
For the more technically inclined, the documents I’m drawing from for the rest of this article are the RFCs that define the URI. These are RFC2396 and RFC3986 (the latter replaces the former and finalizes the structure of a URI). These are documents called “Requests for Comments” that help define the mechanics and standards of the internet.
The Meanings of Uniform, Resource, and Identifier
URI stands for Uniform Resource Identifier. It is called this because of its three namesake characteristics: uniformity, its referent resources, and its nature of identification.
Uniformity is the fact that each identifier follows a particular design. That means that URIs follow a particular syntax, and can be “parsed” and understood by various programs and users, giving them a good idea of the resource in question. This gives URIs a certain context of their own, which allows them to be used alongside each other in other contexts, that is, it gives us the ability to identify disparate resources with one format. I quote RFC3986:
Uniformity provides several benefits. It allows different types of resource identifiers to be used in the same context, even when the mechanisms used to access those resources may differ. It allows uniform semantic interpretation of common syntactic conventions across different types of resource identifiers. It allows introduction of new types of resource identifiers without interfering with the way that existing identifiers are used. It allows the identifiers to be reused in many different contexts, thus permitting new applications or protocols to leverage a pre-existing, large, and widely used set of resource identifiers.
Resources, as discussed in the introduction, can be just about anything. We are most familiar with resources that we access over the internet, such as a PDF or an image. That can actually be abstracted a bit, so that a URI points to a resource that serves a “consistent purpose,” such as “the traffic on I-25 today”. However, as noted in the last paragraph of the introduction, a resource is not necessarily always accessible in this traditional sense, and in fact resources may be people, books, corporations, abstract concepts, etc.
So, what exactly is an identifier? I again quote RFC3986 as they put it succinctly:
… An identifier embodies the information required to distinguish what is being identified from all other things within its scope of identification. Our use of the terms “identify” and “identifying” refer to this purpose of distinguishing one resource from all other resources, regardless of how that purpose is accomplished (e.g., by name, address, or context). …
That is, an identifier is basically a form of identity reference: the name of a given resource. For example, your identity is that of a human being, presumably, but your identifier would be your name, such as mine, Asher Wolfstein. We can call any thing or group of things by any name, and that would be its identifier. This means that we cannot assume that the identifier or name defines what is referenced, since we can give anything any name we want. For example, the House Atreides from the book series Dune does not identify a single actual home or domicile, but an entire group of individuals. And as noted before, since a resource may not necessarily be “accessible”, say over the internet, an identifier may be defined without any means or intention to “access” it. Thus I could create an identifier for the number zero, or the color red. You wouldn’t “access” the color red on your web browser!
Globality and “End-User Context”
URIs are usually meant to be end-all identifiers for whatever resource they’re referencing. What this means is that no additional context is needed to understand a URI. For example, the URI “https://archive.wunk.me/” identifies the homepage of my blog; it does not matter how you access it, it identifies my homepage. It doesn’t identify something different depending on what you use to access it; that would be an example of additional context.
However, the result of a particular URI’s interpretation may depend on an end-user’s context. The RFC mentions as an example the URI “http://localhost/” which usually points to the end-user’s own host machine:
… For example, “http://localhost/” has the same interpretation for every user of that reference, even though the network interface corresponding to “localhost” may be different for each end-user: interpretation is independent of access. …
In this case localhost is meant to indicate the local host machine the user is running, not any particular specific computer out there in the world. As well, the specification implores that “URIs that identify a resource in relation to the end-user’s local context should only be used when the context itself is a defining aspect of the resource.” You can see this in the case of a filesystem URI that begins with “file://”. This type of URI identifies a file in a particular folder on a user’s machine, and its very definition relies on the filesystem of that machine. I might use, as the specification suggests, this type of URI in a help manual when the file is installed on a user’s machine, but I wouldn’t use it in an on-line help manual which resides on a website to reference other documents on the website.
Structure of a Uniform Resource Identifier (URI)
The URI specification hopes to provide a “general syntax” for the parsing and understanding of URIs. This means that specific “types” of URIs, that being URIs of different schemes (see below), may further restrict or redefine the syntax or format for themselves, but in general the guidelines and general format remain the same throughout. For example, the URI mailto:firstname.lastname@example.org and the URI ftp://ftp.test.com/remotefolder are of two different formats for two different schemes, but rely on an underlying general format.
The general syntax of a URI can be summarized in the following string:
scheme:[//[user[:password]@]host[:port]][/path][?query][#fragment]
This is taken from Wikipedia, which has an excellent article on URIs. The idea is that each word represents the name of that particular component of the URI, and the brackets indicate nested components that may or may not be present. For example, you don’t have to have a user:password component present to access every webpage.
As you can see from the general syntax above, there are certain characters that are set aside in general for URIs to use, and that systems implementing the specific URIs should avoid using. These characters are defined in the RFC as follows:
reserved    = gen-delims / sub-delims
gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="
Anything above contained in quotes should not be used as plain data in a URI unless otherwise specified by the format of the scheme. You should also, by convention, avoid spaces in your URI, as a space can be read as the end of one URI and the start of a different one. If you must use these characters in the components of a URI, such as in the path component, you would use the “percent encoded” representation of that character.
There are ways to include characters in a URI that an algorithm deciphering that URI can use without infringing on any reserved characters, and that is through percent encoding. To quote the RFC:
A percent-encoding mechanism is used to represent a data octet in a component when that octet’s corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component.
This usually takes the form of:
pct-encoded = "%" HEXDIG HEXDIG
What this translates to is a percent sign followed by a hexadecimal digit (0-F) and another hexadecimal digit (0-F). Hexadecimal is a base-16 number system using the numbers 0-9 and the letters A-F. For more information on hexadecimal please see my tutorial on the subject.
In essence you specify a US-ASCII encoded character through hexadecimal instead of typing it outright. For example, to represent a space, which in US-ASCII is represented by the binary sequence “00100000” (hexadecimal 20), you would use “%20”.
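As a rough illustration of percent encoding, here is a minimal sketch using Python’s standard urllib.parse module. The file name and the “a/b?c” string are made-up examples, not anything taken from the RFC.

```python
from urllib.parse import quote, unquote

# A space is octet 0x20 in US-ASCII, so it becomes "%20".
encoded = quote("my file.txt")
print(encoded)  # my%20file.txt

# Reserved characters can be encoded too when they are meant as data.
# safe="" tells quote() not to leave "/" untouched (its default).
encoded_reserved = quote("a/b?c", safe="")
print(encoded_reserved)  # a%2Fb%3Fc

# unquote() reverses the process.
print(unquote(encoded))  # my file.txt
```

Note that quote() leaves “/” alone by default because it is usually wanted as a path delimiter; passing safe="" is only appropriate when the slash is data rather than structure.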
Syntax and Components
The RFC actually defines the general syntax of a URI using the following:
URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part   = "//" authority path-abempty
            / path-absolute
            / path-rootless
            / path-empty
Here you can see that [//[user[:password]@]host[:port]][/path] would actually correspond to the “hier-part”, or hierarchical part, of a URI. The hierarchical part is made up of an authority component and a path component. When there is no double slash, you can see the authority part is left out and we rely only on the path component.
The RFC defines the authority element of a URI using the following:
authority = [ userinfo "@" ] host [ ":" port ]
The authority element is so called because it delegates to a “naming authority”. For example, for much of the internet, rather than rely on mysterious and esoteric numbers such as 22.214.171.124 to address servers, we use domain names. Domain names are coordinated by the Internet Corporation for Assigned Names and Numbers (ICANN), and the Domain Name System (DNS) translates a domain name into a network address behind the curtains. That naming system is then the “authority” which directs the resolution of a URI that relies on it. That is, in my URI https://archive.wunk.me/home/, “archive.wunk.me” is the authority element of the URI, and resolution of the authority element would go through DNS, or whatever is appropriate, when the URI is resolved (that is, in this case, accessed).
Many services on the internet are run by servers, which are just computers set up to “serve” resources, such as web page documents, over the network. So, for many authorities it’s a matter of where a registered name leads to, then which port that service is accessed on, and sometimes which user is accessing that port.
So for example, if I were user “asher” at the domain name archive.wunk.me accessing server port 80 (that’s the port web servers generally use), I would use the following as the authority element in my URI: “asher@archive.wunk.me:80”. My URI in full might then read: “http://asher@archive.wunk.me:80/home/”.
If an authority element is present in a URI it is preceded by a double slash “//”, and is terminated by the next slash, question mark, or number sign, or by the end of the URI. If an authority is not present in a URI, the leading double slash is left out as well.
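To see these delimiters at work, here is a small sketch using urlsplit from Python’s standard library, which follows the general syntax closely enough for illustration. The user name, host, and port in the sample URI are this article’s running example, used here as assumptions rather than a real account.

```python
from urllib.parse import urlsplit

# "//" introduces the authority; the next "/" ends it and starts the path.
parts = urlsplit("http://asher@archive.wunk.me:80/home/")
print(parts.scheme)    # http
print(parts.netloc)    # asher@archive.wunk.me:80  (the whole authority)
print(parts.username)  # asher
print(parts.hostname)  # archive.wunk.me
print(parts.port)      # 80
print(parts.path)      # /home/

# No "//" means no authority: everything after the scheme is the path.
no_authority = urlsplit("urn:isbn:0140386335")
print(repr(no_authority.netloc))  # ''
print(no_authority.path)          # isbn:0140386335
```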
So, to build on our Wikipedia-supplied syntax you could read the URI as follows:
                     hierarchy
        _________________|____________________
       /          authority                   \
        ______________|________________
       /                               \
scheme:[//[user[:password]@]host[:port]][/path][?query][#fragment]
RFC3986 makes a mention of “hierarchy” in the URI format:
The URI syntax is organized hierarchically, with components listed in order of decreasing significance from left to right.
What this amounts to is that a URI generally defines the most significant delineations first, such as the scheme, then authority (such as a domain name in HTTP), then a path to a particular item such as a file. On the internet files and resources are often organized in hierarchical fashions, which a URI can take advantage of depending on its scheme. Many popular URI formats, such as the web’s HyperText Transfer Protocol (HTTP) or the file transfer protocol (FTP), are based on abstracted filesystem mechanics. That is, each resource is enclosed in a folder which may be enclosed in further folders and so on. In general, most filesystems allow you to trace up and down the file structure using indicators such as “..” (move up one folder), and these kinds of URIs are the same in their path component. This means that it’s possible a URI might be relative in a system of documents or resources. That is, one HTML page in a website may refer to “../about/index.html” which would translate essentially to “move up one folder, go into the about folder, and access index.html”
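Resolving a relative reference like the one above can be sketched with urljoin from Python’s standard library. The base page (/blog/post.html) is a hypothetical location chosen only to give the “..” something to climb out of.

```python
from urllib.parse import urljoin

# Hypothetical page that contains the relative link.
base = "https://archive.wunk.me/blog/post.html"

# ".." moves up out of /blog/, then we descend into /about/.
resolved = urljoin(base, "../about/index.html")
print(resolved)  # https://archive.wunk.me/about/index.html
```

The scheme and authority are inherited from the base URI; only the path is recomputed.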
The reason this is important is that we can somewhat mix and match path components with more fundamental components such as scheme and authority. What this means is that we might access a web page through HTTP to view in a web browser, such as https://archive.wunk.me/about/asher.html, but we might also access it through FTP in a client, such as ftp://ftp.archive.wunk.me/about/asher.html. We could even access a copy on our local computer with file:///home/asher/wunk/about/asher.html (note the third slash: the authority is empty, so the path begins right after the “//”). The /about/asher.html portion of the path remains the same while the scheme and the authority change.
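As a quick sketch of this mixing and matching, the snippet below swaps the scheme and authority of the web address while leaving the path alone. It only rewrites the identifier; whether an FTP server actually exists at that host is an assumption of the example.

```python
from urllib.parse import urlsplit

web = urlsplit("https://archive.wunk.me/about/asher.html")

# _replace() returns a copy with the named components changed.
ftp = web._replace(scheme="ftp", netloc="ftp.archive.wunk.me")

print(ftp.geturl())          # ftp://ftp.archive.wunk.me/about/asher.html
print(web.path == ftp.path)  # True
```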
In these cases, depending on the scheme and specific format of the URI, there are particular characters that are useful in distinguishing the hierarchy of specificity: “/”, “?” and “#”. How these exact characters are used depends on the URI format and scheme, but these characters are common throughout URIs in helping differentiate hierarchical relationships in the resource systems. For example, the “/” character is useful in file, ftp, and http schemes for indicating that a particular folder or directory is being accessed.
Starting from left-to-right we can examine each of the specific components of a URI:
In the RFC the syntax of a scheme is defined using:
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
What this basically means is that a scheme starts with a letter, which may then be followed by any number of letters, digits, or the characters “+”, “-” and “.”. Not defined above, but further up in the other definitions, the scheme in a URI is followed by a colon “:”.
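The scheme rule translates almost directly into a regular expression. Here is a minimal sketch; the helper name is_valid_scheme is made up for this example.

```python
import re

# ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

def is_valid_scheme(name: str) -> bool:
    """Return True if `name` matches the RFC's scheme grammar."""
    return SCHEME_RE.fullmatch(name) is not None

print(is_valid_scheme("http"))     # True
print(is_valid_scheme("svn+ssh"))  # True
print(is_valid_scheme("3d"))       # False, must start with a letter
```

This only checks the syntax; it says nothing about whether the scheme is actually registered or understood by any program.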
Note that the scheme component in a URI refers to a specification for assigning identifiers within that scheme. This is important to note, and what it means is that a scheme may share the same name as a protocol (such as HTTP) but that doesn’t mean it has equivalence with a protocol. The scheme in relation to the URI defines the format of the URI, such as how the “urn” scheme changes the formatting (as we’ll see below). This means that a scheme name may not be the same as the protocol used in its method of retrieval. A URI starting with mailto: may indicate an e-mail address, but doesn’t define a mailto protocol for accessing that e-mail address.
User and Password
This is part of the authority element of the URI and is provided as a means to authenticate oneself to the service in question. The password part of this component is deprecated, meaning that it is no longer meant to be used, as it is very insecure (anything that prints passwords out in the open, let alone transmits them is insecure). This is useful for services and servers that operate with user accounts. It is followed by the “@” reserved character.
The host is the main part of the authority element of a URI. The host may take many forms: an IPv4 address (such as 10.0.0.1), an IPv6 address (which must be enclosed in square brackets), or a registered name such as a domain name that resolves to an address. Its main job is to point towards a service or server that can further resolve the URI into a particular resource. For URIs that don’t map to specific network resources, the host name can serve as more of a URN (see below), or perhaps point to a document explaining the use of the URI in general.
The port is an optional part of the authority element of a URI, and is preceded by a colon when present. Most computer systems these days talk to each other over networks by using what are called ports. A port is essentially an identification of a “channel” over which two entities communicate. Just like I might tune my radio to 94.3 to listen to a particular Front Range radio station, a program on a computer can instruct the computer to “bind” it to a specific port (say, port 8080), and any network traffic that computer receives that is intended for port 8080 will be directed to that program. So, for example, say I run a web server on a computer and bind it to port 80. Now any browser that asks for http://mycomputer:80/ will send an HTTP GET request to port 80 of my computer. It just so happens that port 80 is the default port for web servers, and oftentimes when you type a bare domain name into a web browser it assumes you mean the “http” scheme and port 80.
Other services run on other default ports. Different protocols have different default port numbers, and server programs running those services will often bind to those ports by default. You can also have server programs for specialized services, such as a MUD or MUCK, running on custom ports such as 23791 (port numbers only go up to 65535). Say I was running some such MUD on my server; you might access it like so: telnet://archive.wunk.me:23791/.
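The idea of a default port can be sketched as a simple lookup that falls back on the scheme when the URI names no port. The DEFAULT_PORTS table and the effective_port helper are assumptions written for this example, covering only a handful of well-known schemes.

```python
from urllib.parse import urlsplit

# A few well-known default ports; a real client would know many more.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21, "telnet": 23}

def effective_port(uri):
    """Return the explicit port if present, else the scheme's default."""
    parts = urlsplit(uri)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)

print(effective_port("http://mycomputer/"))         # 80
print(effective_port("http://mycomputer:8080/"))    # 8080
print(effective_port("telnet://archive.wunk.me/"))  # 23
```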
The path component typically contains hierarchical information, such as a file located in sub-folders in a filesystem. That is why it is part of the hierarchy element in the URI specification. The path does not necessarily have to be hierarchical, or a path to a specific file. However, usually you read the path information from left to right, from most significant/least specific to most specific/least significant. That is, the path component whittles down specificity until you are pointing at one particular thing. Combining the path with the query component (see below) you have a specific URI.
The path is terminated by the first question mark or number sign character, or the end of the URI. If the URI has an authority element then the path element must begin with a slash or be empty. If the authority element is missing in a URI then the path cannot begin with a double slash. If a URI reference is a relative-path reference, then the first path segment cannot contain a colon character. If you need a colon, this is where percent encoding can prove useful.
The query is where things get much more interesting in regards to the URI. A query is basically non-hierarchical information encoded into a URI. This information is usually passed on to a service that further processes it to render something specific.
Take Google, for example. When you type a search into Google, it goes to a URL like this:
https://www.google.com/search?q=your+search+here
As you can see, https is the scheme, www.google.com is the authority (host), and /search is the path, that is, our resource. Everything after the question mark is the query. Your search words are not a hierarchical part of the information structure; instead they are parameters that you are passing into the /search resource. q=your+search+here is processed by the search resource at the Google domain to produce the resulting HTML page you see in your browser.
Queries don’t have to be used solely as parameters to services; they can represent any non-hierarchical data that is pertinent to the identifier. There is a de facto “standard” for formatting query parameters, however, which is not part of the RFC specification for URIs. It is generally understood that query parameters come in the form:
key1=value1&key2=value2
Essentially this assigns a value to a key, multiple times (thus to multiple different keys). The ampersand is the separator between key-value pairs. This is the generally accepted notion, though, depending on how your server interprets query values, it is not set in stone. In addition, query values in non-accessible URIs (strange as it may seem) don’t really have any particular formatting convention.
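Here is a short sketch of reading and writing key=value queries with Python’s standard library. The search URL mirrors the Google example, and the extra “page” key is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = "https://www.google.com/search?q=your+search+here"

# Everything after "?" (and before any "#") is the query.
query = urlsplit(url).query
print(query)  # q=your+search+here

# parse_qs() splits on "&" and "=", and turns "+" back into spaces.
params = parse_qs(query)
print(params)  # {'q': ['your search here']}

# urlencode() goes the other way, building key=value pairs joined by "&".
built = urlencode({"q": "your search here", "page": 2})
print(built)  # q=your+search+here&page=2
```

Each value comes back as a list because the same key may legitimately appear more than once in a query.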
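The RFC also notes that the slash (“/”) and question mark (“?”) characters may appear as data within a query, and has this to say about percent-encoding them: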
However, as query components are often used to carry identifying information in the form of “key=value” pairs and one frequently used value is a reference to another URI, it is sometimes better for usability to avoid percent-encoding those characters.
The fragment identifier component of a URI is a way to indirectly identify a secondary resource within a primary resource, or other additional identifying information. If you want your resource URI to have a context, this is the way to do it. The URI before the fragment identifies the primary resource, while the fragment is the “secondary” or subordinate identifier. The fragment could refer to a portion of the primary resource, or maybe a particular “view” on representations of the primary resource. Theoretically it could even be some other resource that is further defined by the primary resource.
This identifier is introduced by a number sign (“#”) character and terminated by the end of the URI. It always comes at the end of the URI.
Due to its nature as a secondary resource identifier, a fragment identifier is defined semantically in accord with the primary resource. That is, it is “defined by the set of representations that might result from a retrieval action on the primary resource.” Thus, the fragment’s format is dependent on the media type of a retrieved representation, not on the formatting of the scheme itself.
You can think of the fragment as a form of document sub-referencing, or a query applied to the resource in question. Whereas a query is part of what is used to retrieve the resource, the fragment is set aside before retrieval and applied afterwards to the representation that comes back, after the “dereferencing” so to speak. If no such representation of a resource exists, the fragment identifier is considered unknown and is “unconstrained.” Fragment identifiers in this way are independent of the URI scheme, as they are tied to the document format, such as HTML, and cannot be redefined by the scheme.
Thus it’s up to the media types to define their own restrictions or structures for the fragment identifier. For example, in HTML a fragment points to an element of the document with a matching id attribute (historically, a named “a” anchor element). The http: scheme has nothing to do with this, as an http: URI might reference an image just as well as an HTML document.
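Separating the fragment from the rest of a URI, as a client does before retrieving anything, can be sketched with urldefrag from Python’s standard library. The “#contact” fragment is a hypothetical anchor, not a real one on that page.

```python
from urllib.parse import urldefrag

# The part before "#" names the primary resource; the rest is the fragment.
primary, fragment = urldefrag(
    "https://archive.wunk.me/about/asher.html#contact"
)

print(primary)   # https://archive.wunk.me/about/asher.html
print(fragment)  # contact
```

Only the primary part would be used to fetch the document; what “contact” means is then up to the media type of whatever comes back.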
Uniform Resource Locator
A Uniform Resource Locator (URL) is a type of URI. Though the term is considered obsolete now by many technicians, it still maintains wide popular usage. A URL is essentially what we’ve covered as to what constitutes a URI, but its scope is slightly different.
A URL specifies that what you are identifying is reachable and applicable across a network, particularly the internet. That is, it implies that whatever the URI is identifying is accessible and retrievable from across the internet.
There is one peculiarity of URLs, and that is the scheme-less or protocol-less URL. That is a URL that starts with a double slash, with an authority element, but is missing the scheme. When a web browser finds that particular kind of URL it will usually access it using the same scheme it used to access the parent or containing resource.
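The way a scheme-less URL borrows the scheme of its containing page can be sketched with urljoin. The cdn.example.com host and the stylesheet name are placeholders invented for this example.

```python
from urllib.parse import urljoin

scheme_less = "//cdn.example.com/site.css"  # placeholder host and file

# Referenced from a page loaded over https, it resolves to https.
from_https = urljoin("https://archive.wunk.me/about/asher.html", scheme_less)
print(from_https)  # https://cdn.example.com/site.css

# Referenced from the same page loaded over http, it resolves to http.
from_http = urljoin("http://archive.wunk.me/about/asher.html", scheme_less)
print(from_http)   # http://cdn.example.com/site.css
```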
Uniform Resource Name
Uniform Resource Names, or URNs, were first described and defined in RFC1737 and RFC2141. They’ve been updated for modern times in RFC8141. In the earlier days of the internet people were still struggling with exactly how the whole thing was going to work. URNs were originally conceived as one part of a three-part architecture for the general internet, and even the web. The three parts were to be URLs, URNs, and Uniform Resource Characteristics (URCs). In contrast to URLs, which specify the location of a resource and how to access it (via a mapping of the scheme to a protocol, such as HTTP or FTP), URNs are identifiers defined through the use of globally unique and persistent namespaces. For those interested, URCs never made it past the idea phase, with other technologies, such as the Resource Description Framework, taking their place.
URNs can be very much likened to domain names, though domain names are not technically URNs. What I mean by this analogy however is that each specifies an identifier through the use of “namespaces.” That identifier refers or “points” to something permanent that may or may not currently exist or be accessible. That is what makes them persistent and by that nature global.
For example, you read a domain name from right to left, in order of increasing specificity. www.archive.wunk.me is read as, first, the me namespace (top-level domain), then wunk as a particular name in the me namespace, then archive as a name within wunk.me, and finally www as the particular service running at archive.wunk.me. Likewise, from left to right, you would read a URN in the same type of increasing specificity, such as urn:isbn:0140386335. First we identify the namespace of isbn, and within the namespace isbn we identify the numerical code. In this case this is a URN for the book “The Neverending Story” by Michael Ende. We can’t necessarily pull up, retrieve, or access the book using that URN (which is a URI), but we can “identify” the book, in whatever form it may exist at any time, by another name: urn:isbn:0140386335.
In this sense “The Neverending Story” by Michael Ende and urn:isbn:0140386335 are two “names” for the same entity. In the context of the above information regarding URIs then you can see that “urn:” is another scheme by which we can format URIs. The latest version of URN specification includes some elements that URIs were later defined to have, including q-components and f-components (see below).
URNs, like URLs, can be “resolved” given some kind of resolution service. The resolution service may be the purveyor of the namespace id (see below), or it may be a general purpose service such as a library. For example, a library might resolve the URN of the above book to a URL pointing to a listing of the book in their archives, or if that book doesn’t exist in the archives perhaps a URL to a listing of the book with an online bookseller. Likewise, as with other URIs, some URNs are not resolvable nor meant to be resolvable, and exist purely as identifiers that reference a, perhaps, abstract entity.
Structure of a Uniform Resource Name (URN)
RFC8141 defines the URN syntax and format using Augmented Backus-Naur Form (ABNF) as follows:
namestring    = assigned-name
                [ rq-components ]
                [ "#" f-component ]
assigned-name = "urn" ":" NID ":" NSS
NID           = (alphanum) 0*30(ldh) (alphanum)
ldh           = alphanum / "-"
NSS           = pchar *(pchar / "/")
rq-components = [ "?+" r-component ]
                [ "?=" q-component ]
r-component   = pchar *( pchar / "/" / "?" )
q-component   = pchar *( pchar / "/" / "?" )
f-component   = fragment
You can distill this down to something more like this (like we previously did for URIs):
urn:nid:nss[?+r-component][?=q-component][#f-component]
“nid” is the namespace identifier, and may include letters, digits, and the “-” dash character. The namespace-specific string, “nss”, is a string of characters that are formatted according to the namespace identifier. Think of the namespace identifier as somewhat the “scheme” of the URN, while the NSS is the specific identifier in whatever format is specific to that “scheme.” For example, in the URN urn:isbn:0140386335, “isbn” is the namespace identifier, and the namespace-specific string is 0140386335, or a string of 10 digits.
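Pulling the namespace identifier and namespace-specific string out of a URN can be sketched with a regular expression. This is a loose approximation written for this article, not a full RFC8141 parser: the split_urn helper is hypothetical, and the NSS pattern simply stops at the first “?” or “#” instead of checking every allowed character.

```python
import re

# NID: 2 to 32 characters, alphanumeric at both ends, "-" allowed inside.
# NSS: approximated here as everything up to the first "?" or "#".
URN_RE = re.compile(
    r"urn:(?P<nid>[a-z0-9][a-z0-9-]{0,30}[a-z0-9]):(?P<nss>[^?#]+)",
    re.IGNORECASE,
)

def split_urn(urn):
    """Return (nid, nss) for a URN, or raise ValueError if it doesn't match."""
    match = URN_RE.match(urn)
    if match is None:
        raise ValueError(f"not a URN: {urn!r}")
    # The "urn" prefix and the NID are case-insensitive, so normalize the NID.
    return match.group("nid").lower(), match.group("nss")

print(split_urn("urn:isbn:0140386335"))  # ('isbn', '0140386335')
```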
Official namespaces, the “nid” part of the urn URI format, are required to be registered with the IANA. Namespaces may be “formal” or “informal”. A formal namespace is actually an identifier, such as “isbn”. There are about sixty formal URN namespace identifiers, and cover various topics and grounds. The general idea is that users may “benefit from their publication.” Formal namespaces are restricted in several ways, most of which are obvious. They can not already be registered, they cannot start with “urn-“, they must be more than two letters long (including the dash, that is “ab-” or “xy-” is not allowed), and they may not, for deprecated reasons, cannot start with an “x-” prefix.
“Informal” namespaces are simply a number and don’t betray their purpose on surface inspection. Informal namespace identifiers take the form “urn-#”, that is, the “urn-” prefix followed by a number. The number is assigned on a first-come, first-served basis by the IANA.
The deprecated kind of URN namespace is the experimental namespace, which started with “x-”. It has been deprecated per RFC 8141, whose authors would instead prefer you use the urn:example namespace.
All namespaces, formal, informal, and experimental, can be employed anywhere a URI is expected as an identifier for a particular resource.
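The naming rules for the three kinds of namespace can be sketched as a small classifier. This is only an illustration under my own assumptions: `classify_nid` and its return labels are hypothetical names, the check is purely syntactic, and it cannot tell you whether an identifier is actually registered with the IANA.

```python
import re

# General NID shape: 2 to 32 letters, digits, or dashes, with no dash at
# either end.
NID_SYNTAX = re.compile(r"[a-z0-9][a-z0-9-]{0,30}[a-z0-9]")


def classify_nid(nid: str) -> str:
    """Label an NID as 'informal', 'experimental', 'formal', 'reserved', or 'invalid'."""
    nid = nid.lower()
    if not NID_SYNTAX.fullmatch(nid):
        return "invalid"        # breaks the basic NID syntax
    if re.fullmatch(r"urn-[0-9]+", nid):
        return "informal"       # "urn-" followed by an assigned number
    if nid.startswith("x-"):
        return "experimental"   # deprecated "x-" prefix
    if len(nid) <= 2 or nid.startswith("urn-") or re.match(r"[a-z]{2}-", nid):
        return "reserved"       # not allowed as a formal namespace
    return "formal"             # shaped like a formal NID (registration not checked)


for candidate in ("isbn", "urn-7", "x-test", "ab-cd"):
    print(candidate, classify_nid(candidate))
```

A result of “formal” here only means the identifier has the right shape for a formal namespace.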
Namespace-specific strings, or the “nss” part of the URN specification, are determined by the namespace in question. As per the running example, a ten-digit number makes up the namespace-specific string for the “isbn” namespace. Other namespaces allow widely varying formats in the namespace-specific string. The NSS may contain ASCII letters, digits, and many punctuation or other special characters. As with any URI, any character not allowed in an NSS can be percent-encoded (see above). RFC 8141 increased the number of characters allowed in an NSS to include the slash character “/”. This slash character doesn’t necessarily indicate hierarchical data, like a path element might; it simply allows names from non-URN systems that use slashes to be included.
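As a rough illustration of percent-encoding an NSS, the sketch below leans on Python’s standard `urllib.parse.quote`. The helper names `encode_nss` and `make_urn` are hypothetical, and the list of “safe” characters is my reading of what the grammar permits (the slash plus the punctuation allowed in a pchar), so treat it as an assumption rather than a definitive list.

```python
from urllib.parse import quote

# Punctuation left untouched in an NSS (assumed from the pchar rule, plus "/").
# quote() already leaves ASCII letters, digits, and "-", ".", "_", "~" alone.
NSS_SAFE = "/:@!$&'()*+,;="


def encode_nss(raw: str) -> str:
    """Percent-encode every character that is not allowed in an NSS."""
    return quote(raw, safe=NSS_SAFE)


def make_urn(nid: str, raw_nss: str) -> str:
    """Build a URN from a namespace identifier and an unencoded name."""
    return f"urn:{nid.lower()}:{encode_nss(raw_nss)}"


print(make_urn("example", "reports/2017 q1?"))
```

The slash passes through unchanged, while the space and the question mark are escaped. Characters outside ASCII are encoded as their UTF-8 bytes.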
RFC 8141 added what is known as the “q-component”. This can be likened to the query component of a generic URI as above. In fact, in some instances, such as when the URN is resolvable to another URI, the resolver ends up copying the q-component into the query of that URI. A URN cannot be made to rely on a q-component, meaning the identifier should be able to stand on its own, without a q-component, to encapsulate what it represents. This is what the RFC has to say:
The URN q-component has the same syntax as the URI query component but is introduced by “?=”, not “?” alone. For a URN that may be resolved to a URI that is a locator, the semantics of the q-component are identical to those for the query component of that URI. Thus, URN resolvers returning a URI that is a locator for a URN with a q-component do this by copying the q-component from the URN to the query component of the URI.
As with the query component in the general URI syntax, the RFC also notes:
The characters slash (“/”) and question mark (“?”) may represent data within the q-component. Note that characters outside the ASCII range MUST be percent-encoded
For more information on percent encoding, see the section above in the general URI syntax.
The f-component is much like the fragment component of a general syntax URI. In fact, when a URN might resolve to a URI, the f-component can be applied as the fragment of that URI. The f-component, like the fragment component, is used to refer to constituent parts of a resource identified by the URN. For example, a URN for a book might split up chapters using an f-component, such as urn:isbn:0140386335#chapter5 to refer to the fifth chapter of The Neverending Story.
To quote the RFC regarding the similarity of the f-component to the fragment component of a general syntax URI:
The URN f-component has the same syntax as the URI fragment component. If a URN containing an f-component resolves to a single URI that is a locator associated with the named resource, the f-component from the URN can be applied (usually by the client) as the fragment of that URI. If the URN does not resolve to a URI that is a locator, the interpretation of the f-component is undefined by this specification. Thus, for URNs that may be resolved to a URI that is a locator, the semantics of f-components are identical to those of fragments for that resource.
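The copying of the q-component and the f-component described above can be sketched with a toy resolver. Everything concrete here is hypothetical: the lookup table stands in for a real resolution service, the library URL and the `format=pdf` parameter are made up, and the `resolve` function is only one way this might be arranged.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping standing in for a real resolution service.
HYPOTHETICAL_LOCATIONS = {
    "urn:isbn:0140386335": "https://library.example.org/books/0140386335",
}


def resolve(urn: str) -> str:
    """Resolve a URN to a URL, carrying over its q- and f-components."""
    rest, _, f_component = urn.partition("#")    # fragment comes last
    rest, _, q_component = rest.partition("?=")  # q-component follows "?="
    assigned_name = rest.partition("?+")[0]      # any r-component is ignored here
    target = urlsplit(HYPOTHETICAL_LOCATIONS[assigned_name])
    # The q-component becomes the query and the f-component the fragment.
    return urlunsplit(
        (target.scheme, target.netloc, target.path, q_component, f_component)
    )


print(resolve("urn:isbn:0140386335?=format=pdf#chapter5"))
```

Note that the lookup uses only the bare name, without any components, which matches the idea that the identifier must stand on its own. A URN the table does not know about simply raises a `KeyError` in this sketch.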
The “r-component” is not currently standardized, but the intent is that it contains parameters you would pass to the “resolver” of a URN. Usage of the r-component should be delayed until its format and usage are better understood, but the idea is that it would be used to “control” the resolver, as opposed to controlling the retrieved representation of a particular URN. This sounds abstract, but luckily RFC 8141 gives an example, which I quote:
Consider the hypothetical example of passing parameters to a resolution service (say, an ISO alpha-2 country code … in order to select the preferred country in which to search for a physical copy of a book). This could perhaps be accomplished by specifying the country code in the r-component, resulting in URNs such as:
urn:example:foo-bar-baz-qux?+CCResolve:cc=uk
While the above should serve as a general explanation and illustration of the intent for r-components, there are many open issues with them, including their relationship to resolution mechanisms associated with the particular URN namespace at registration time. Thus, r-components SHOULD NOT be used for URNs before their semantics have been standardized.
Other URI and URN Schemes
There are more URI schemes in use today than just the general syntax and the “urn:” scheme. These other URI schemes include tag:, info:, and ni:, and they generally act as names, like URNs, rather than as locators, like URLs. “tag:” URIs are commonly used in YAML, whereas “info:” is a scheme very similar to “urn:” but addressing other needs. “info:” was deprecated in 2010 and users were encouraged to adopt other formats for their needs.
So there you have it: Uniform Resource Identifiers (URIs) are strings of characters that “identify” a particular “resource” as defined above. URIs may or may not be “resolvable” in the sense of being able to locate or access a particular resource, and indeed may even signify something abstract that may not “physically” exist. Uniform Resource Locators are examples of URIs that are used across the internet, by web browsers for instance, to “locate” actual resources that are accessible over networks. The term URL is considered deprecated by technical professionals, but is still in wide popular use. Uniform Resource Names are at the other extreme, identifying a resource persistently without necessarily acting as a “locator” for that resource. For example, we might take the Dewey Decimal notation or the ISBN of a book and encapsulate it in a URN such as urn:isbn:0140386335. This URN refers to a particular book whether or not it exists at a particular website, network, library, etc.
With URIs we can reference and locate just about everything we need to on the internet, from books and videos to sites and games. Without URIs we’d still be stuck with clunky, specific ways of accessing specific things using specific clients and methods, the kind of instructions I followed in many a book before URIs became a big thing. Now, it’s simple: https://archive.wunk.me/