URL


Definition

A URL is an Internet address for a resource.

A Uniform Resource Locator (URL) is a compact string representation for a resource available via the Internet.

URLs are just one kind of Uniform Resource Identifier (URI). Formally speaking, the URL specification is obsolete and has been replaced by the URI specification (RFC 3986). In practical terms, however, it is still useful, since it is much easier to understand than the URI specification.

This piece is just a short (cut and paste) summary of the obsolete RFC 1738 specification.

Formal Syntax

According to the RFC 1738 specification, URLs are written as follows:

 <scheme>:<scheme-specific-part>

A URL contains the name of the scheme being used (<scheme>) followed by a colon and then a string (the <scheme-specific-part>) whose interpretation depends on the scheme.

A scheme usually refers to an Internet protocol such as HTTP, Telnet, or e-mail. This is why one could also write:

 <protocol>:<protocol-specific-part>

Scheme names consist of a sequence of characters. The lower case letters "a"--"z", digits, and the characters plus ("+"), period ("."), and hyphen ("-") are allowed. For resiliency, programs interpreting URLs should treat upper case letters as equivalent to lower case in scheme names (e.g., allow "HTTP" as well as "http").

Each scheme (protocol) further defines its own specific parts; see The HTTP Scheme below for an example.
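The generic <scheme>:<scheme-specific-part> form can be illustrated with a minimal Python sketch. The function name split_url and the example addresses are hypothetical and only serve as an illustration; the sketch checks the scheme against the allowed characters and folds upper case to lower case, as recommended above.

```python
import re

# Allowed scheme characters as described above: letters, digits, "+", ".", "-".
# Upper case is accepted on input and normalised to lower case.
SCHEME_RE = re.compile(r"[A-Za-z0-9+.-]+")


def split_url(url: str) -> tuple[str, str]:
    """Split a URL into (scheme, scheme-specific-part). Hypothetical helper."""
    scheme, colon, rest = url.partition(":")
    if not colon or not SCHEME_RE.fullmatch(scheme):
        raise ValueError(f"no valid scheme in {url!r}")
    return scheme.lower(), rest


print(split_url("HTTP://example.org/index.html"))
# ('http', '//example.org/index.html')
print(split_url("mailto:someone@example.org"))
# ('mailto', 'someone@example.org')
```

How the second element is interpreted is left to the scheme, so this sketch deliberately does not look inside it.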

Unsafe characters

Do not use the following characters unencoded (unless you know what you are doing):

The space (" "), because significant spaces may disappear
"<" and ">", because they are used as delimiters around URLs in free text
The quote mark ("""), because it is used to delimit URLs in some systems
"#", because it is used in the World Wide Web and in other systems to delimit a URL from a fragment/anchor
"%", because it is used to encode other characters
The following characters are unsafe because some gateways and other transport agents may modify them: "{", "}", "|", "\", "^", "~", "[", "]", and "`".
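As a rough illustration, the unsafe characters listed above can be replaced by their "%" plus two hexadecimal digit encodings. The following Python sketch is only an assumption about how one might do this by hand; the helper name encode_unsafe is hypothetical, and it only handles the ASCII characters from the list (anything else is passed through unchanged).

```python
from urllib.parse import unquote

# The unsafe characters from the list above (space, <, >, ", #, %, {, }, |, \, ^, ~, [, ], `).
UNSAFE = set(' <>"#%{}|\\^~[]`')


def encode_unsafe(text: str) -> str:
    """Replace each unsafe character with '%' followed by its hex code."""
    return "".join(f"%{ord(ch):02X}" if ch in UNSAFE else ch for ch in text)


encoded = encode_unsafe("my file {v1}.txt")
print(encoded)           # my%20file%20%7Bv1%7D.txt
print(unquote(encoded))  # my file {v1}.txt
```

In practice, the standard library function urllib.parse.quote does a similar job. It follows the newer URI rules, so its set of encoded characters differs slightly from the list above (for example, recent Python versions leave "~" unencoded).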

Reserved characters

Many URL schemes reserve certain characters for a special meaning, e.g. ";", "/", "?", ":", "@", "=" and "&". If such a character is needed as plain data rather than in its reserved role, it must be encoded.
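To show why reserved characters matter, here is a small Python sketch that builds a query string whose values contain "&" and "=". The parameter names and values are made up for the example. The standard library function urlencode encodes the reserved characters inside the values, so they cannot be confused with the separators between parameters.

```python
from urllib.parse import parse_qs, quote, urlencode

# Hypothetical parameters whose values contain the reserved "&" and "=".
params = {"q": "fish & chips", "page": "a=b"}

# quote_via=quote encodes spaces as %20 instead of the form-style "+".
query = urlencode(params, quote_via=quote)
print(query)            # q=fish%20%26%20chips&page=a%3Db

# Decoding restores the original values.
print(parse_qs(query))  # {'q': ['fish & chips'], 'page': ['a=b']}
```

The unencoded "&" and "=" that remain in the result are the ones acting in their reserved role as separators.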

Major Schemes

http                    Hypertext Transfer Protocol
ftp                     File Transfer Protocol
mailto                  Electronic mail address
news                    USENET news
nntp                    USENET news using NNTP access
telnet                  Reference to interactive sessions
file                    Host-specific file names

Past schemes (popular in the early nineties)

prospero                Prospero Directory Service
gopher                  The Gopher protocol
wais                    Wide Area Information Servers

The HTTP Scheme

An HTTP URL takes the form:

 http://<host>:<port>/<path>?<searchpart>

If :<port> is omitted, the port defaults to 80. No user name or password is allowed. <path> is an HTTP selector, and <searchpart> is a query string. The <path> is optional, as is the <searchpart> and its preceding "?". If neither <path> nor <searchpart> is present, the "/" may also be omitted.

Within the <path> and <searchpart> components, "/", ";", "?" are reserved. The "/" character may be used within HTTP to designate a hierarchical structure.
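The parts of an HTTP URL can be pulled apart with Python's standard urllib.parse.urlsplit. The sketch below is an illustration only: the function name describe_http_url, the dictionary keys and the example host are hypothetical, and filling in port 80 and the "/" path is done by the sketch itself, following the defaults described above.

```python
from urllib.parse import urlsplit


def describe_http_url(url: str) -> dict:
    """Return host, port, path and searchpart of an http URL (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        raise ValueError(f"not an http URL: {url!r}")
    return {
        "host": parts.hostname,
        # An omitted port defaults to 80.
        "port": parts.port if parts.port is not None else 80,
        # An omitted path is treated as "/".
        "path": parts.path or "/",
        "searchpart": parts.query,
    }


print(describe_http_url("http://www.example.org:8080/docs/intro.html?lang=en"))
# {'host': 'www.example.org', 'port': 8080, 'path': '/docs/intro.html', 'searchpart': 'lang=en'}
print(describe_http_url("http://www.example.org"))
# {'host': 'www.example.org', 'port': 80, 'path': '/', 'searchpart': ''}
```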

Links

The Beginners Guide to URLs (Investintech, a page with some good links)

References

Standards

RFC 1738 - URL Syntax

Related standards