URI normalization is the process by which URIs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URI into a normalized URI so it is possible to determine if two syntactically different URIs may be equivalent.
Search engines employ URI normalization in order to correctly rank pages that may be found with multiple URIs, and to reduce indexing of duplicate pages. Web crawlers perform URI normalization in order to avoid crawling the same resource more than once. Web browsers may perform normalization to determine if a link has been visited or to determine if a page has been cached. Web servers may also perform normalization for many reasons (e.g., to more easily intercept security risks in client requests, or to use a single absolute file name for each resource stored in their caches or named in log files).
There are several types of normalization that may be performed. Some of them are always semantics-preserving; others may not be.
The following normalizations are described in RFC 3986 [1] to result in equivalent URIs:
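Converting percent-encoded triplets to uppercase. The hexadecimal digits within a percent-encoding triplet (e.g., %3a versus %3A) are case-insensitive and therefore should be normalized to use uppercase letters for the digits A-F.[2] Example: http://example.com/foo%2a → http://example.com/foo%2A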
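Converting the scheme and host to lowercase. The scheme and host components are case-insensitive and therefore should be normalized to lowercase. Example: HTTP://User@Example.COM/Foo → http://User@example.com/Foo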
Decoding percent-encoded triplets of unreserved characters. Percent-encoded triplets in the ranges of ALPHA (%41–%5A and %61–%7A), DIGIT (%30–%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) do not require percent-encoding and should be decoded to their corresponding unreserved characters.[4] Example: http://example.com/%7Efoo → http://example.com/~foo
Removing dot-segments. The dot-segments . and .. in the path component of the URI should be removed by applying the remove_dot_segments algorithm[5] described in RFC 3986 to the path.[6] Example: http://example.com/foo/./bar/baz/../qux → http://example.com/foo/bar/qux
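Converting an empty path to a "/" path. When an authority component is present, an empty path should be normalized to a path of "/". Example: http://example.com → http://example.com/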
Removing the default port. An empty or default port component of the URI (port 80 for the http scheme) with its ":" delimiter should be removed.[8] Example: http://example.com:80/ → http://example.com/
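The semantics-preserving steps listed above can be sketched in Python roughly as follows. This is a minimal illustration rather than a reference implementation: the function names (normalize, normalize_escapes, remove_dot_segments) and the DEFAULT_PORTS table are choices made for this sketch, and only the http and https default ports are assumed. The dot-segment routine follows the step-by-step procedure in RFC 3986, section 5.2.4.

```python
import re
import string
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters per RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
# Assumption for this sketch: only these two schemes have known default ports.
DEFAULT_PORTS = {"http": "80", "https": "443"}

_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")
_HOSTPORT = re.compile(r"^(.*?)(?::(\d*))?$")


def _fix_triplet(match):
    """Decode a triplet of an unreserved character, else uppercase its hex digits."""
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    return "%" + match.group(1).upper()


def normalize_escapes(text):
    return _TRIPLET.sub(_fix_triplet, text)


def remove_dot_segments(path):
    """Remove "." and ".." segments following RFC 3986, section 5.2.4."""
    inp, out = path, []
    while inp:
        if inp.startswith("../"):
            inp = inp[3:]
        elif inp.startswith("./"):
            inp = inp[2:]
        elif inp.startswith("/./"):
            inp = inp[2:]
        elif inp == "/.":
            inp = "/"
        elif inp.startswith("/../"):
            inp = inp[3:]
            if out:
                out.pop()
        elif inp == "/..":
            inp = "/"
            if out:
                out.pop()
        elif inp in (".", ".."):
            inp = ""
        else:
            # Move the first path segment (with its leading "/") to the output.
            start = 1 if inp.startswith("/") else 0
            end = inp.find("/", start)
            if end == -1:
                end = len(inp)
            out.append(inp[:end])
            inp = inp[end:]
    return "".join(out)


def normalize(uri):
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    # Keep the userinfo untouched; only the host is case-insensitive.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, port = _HOSTPORT.match(hostport).groups()
    host = normalize_escapes(host.lower())
    if port and port != DEFAULT_PORTS.get(scheme):
        host += ":" + port  # keep only a non-empty, non-default port
    netloc = userinfo + at + host
    path = remove_dot_segments(normalize_escapes(parts.path))
    if netloc and not path:
        path = "/"  # empty path with an authority becomes "/"
    return urlunsplit((scheme, netloc, path,
                       normalize_escapes(parts.query),
                       normalize_escapes(parts.fragment)))


print(normalize("HTTP://User@Example.COM/Foo"))
print(normalize("http://example.com/foo/./bar/baz/../qux"))
```

With this sketch, the two calls at the end print http://User@example.com/Foo and http://example.com/foo/bar/qux, matching the examples in the list. Depending on the Python version, urlsplit and urlunsplit may also drop an empty trailing "?" or "#", which goes slightly beyond the steps listed here.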
For http and https URIs, the following normalizations listed in RFC 3986 may result in equivalent URIs, but are not guaranteed to by the standards:
Adding a trailing "/" to a non-empty path. Example: http://example.com/foo → http://example.com/foo/
Applying the following normalizations results in a semantically different URI, although it may refer to the same resource:
Removing the directory index. Examples: http://example.com/a/index.html → http://example.com/a/ and http://example.com/default.asp → http://example.com/
Removing the fragment. Example: http://example.com/bar.html#section1 → http://example.com/bar.html
Replacing an IP address with its domain name. Example: http://208.77.188.166/ → http://example.com/
Replacing the https scheme with http. Example: https://example.com/ → http://example.com/
Removing duplicate slashes. Example: http://example.com/foo//bar.html → http://example.com/foo/bar.html
Removing or adding "www" as the first domain label. For example, http://www.example.com/ and http://example.com/ may access the same website. Many websites redirect the user from the www to the non-www address or vice versa. A normalizer may determine if one of these URIs redirects to the other and normalize all URIs appropriately. Example: http://www.example.com/ → http://example.com/
Sorting the query parameters. Example: http://example.com/display?lang=en&article=fred → http://example.com/display?article=fred&lang=en
Removing query parameters that the page does not use. Example: http://example.com/display?id=123&fakefoo=fakebar → http://example.com/display?id=123
Removing query parameters that carry empty or default values. Example: http://example.com/display?id=&sort=ascending → http://example.com/display
Removing the "?" when the query is empty. Example: http://example.com/display? → http://example.com/display
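A few of the semantics-changing normalizations above (removing the fragment, collapsing duplicate slashes, sorting query parameters, dropping chosen parameters, and removing an empty "?") can be sketched as below. The function name loosely_normalize and its drop_params and keep_fragment arguments are hypothetical names chosen for this illustration, and the sketch assumes a URI with an authority component, such as an http or https URI. Which parameters are safe to drop is site-specific knowledge that the caller must supply.

```python
import re
from urllib.parse import urlsplit


def loosely_normalize(uri, drop_params=frozenset(), keep_fragment=False):
    """Apply a handful of normalizations that may change URI semantics."""
    parts = urlsplit(uri)

    # Collapse runs of "/" in the path into a single "/".
    path = re.sub(r"/{2,}", "/", parts.path)

    # Keep non-empty "name=value" pairs whose name is not in drop_params,
    # then sort the raw pairs lexicographically (no re-encoding is done).
    kept = [pair for pair in parts.query.split("&")
            if pair and pair.partition("=")[0] not in drop_params]
    query = "&".join(sorted(kept))

    # Rebuild by hand so that an empty query never leaves a dangling "?".
    result = f"{parts.scheme}://{parts.netloc}{path}"
    if query:
        result += "?" + query
    if keep_fragment and parts.fragment:
        result += "#" + parts.fragment
    return result


print(loosely_normalize("http://example.com/display?lang=en&article=fred"))
print(loosely_normalize("http://example.com/display?id=123&fakefoo=fakebar",
                        drop_params={"fakefoo"}))
```

The two calls print http://example.com/display?article=fred&lang=en and http://example.com/display?id=123. Steps that need outside knowledge, such as resolving an IP address to a domain name, detecting a www redirect, or recognizing default parameter values, are deliberately left out of this sketch.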
Some normalization rules may be developed for specific websites by examining URI lists obtained from previous crawls or web server logs. For example, if the URI http://example.com/story?id=xyz appears in a crawl log several times along with http://example.com/story_xyz, we may assume that the two URIs are equivalent and can be normalized to one of the URI forms.
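Once such a site-specific rule has been identified, applying it can be as simple as a pattern rewrite. The sketch below hard-codes the story rule from the example as a regular expression; the names STORY_RULE and apply_rules are hypothetical, and the sketch does not attempt to discover rules from logs.

```python
import re

# Hypothetical rule: rewrite /story?id=<value> to /story_<value> on any host.
STORY_RULE = (re.compile(r"^(https?://[^/]+)/story\?id=([^&#]+)$"), r"\1/story_\2")


def apply_rules(uri, rules):
    """Apply each (pattern, replacement) rule in order; non-matching URIs pass through."""
    for pattern, replacement in rules:
        uri = pattern.sub(replacement, uri)
    return uri


print(apply_rules("http://example.com/story?id=xyz", [STORY_RULE]))
```

This prints http://example.com/story_xyz, while a URI that does not match the pattern is returned unchanged.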
Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URIs with similar text) rules that can be applied to URI lists. They showed that once the correct DUST rules were found and applied with a normalization algorithm, they were able to find up to 68% of the redundant URIs in a URI list.
Original source: https://en.wikipedia.org/wiki/URI normalization.