PublicShow sourceuri.pl -- Process URIs

This library provides high-performance C-based primitives for manipulating URIs. We decided for a C-based implementation for the much better performance on raw character manipulation. Notably, URI handling primitives are used in time-critical parts of RDF processing. This implementation is based on RFC-3986:

http://labs.apache.org/webarch/uri/rfc/rfc3986.html

The URI processing in this library is rather liberal. That is, we break URIs according to the rules, but we do not validate that the components are valid. Also, percent-decoding for IRIs is liberal. It first tries UTF-8; then ISO-Latin-1 and finally accepts %-characters verbatim.

Earlier experience has shown that strict enforcement of the URI syntax results in many errors that are accepted by many other web-document processing tools.

Source uri_components(+URI, -Components) is det
uri_components(-URI, +Components) is det
Break a URI into its 5 basic components according to the RFC-3986 regular expression:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
Arguments:
Components- is a term uri_components(Scheme, Authority, Path, Search, Fragment). If a URI is parsed, i.e., using mode (+,-), components that are not found are left uninstantiated (variable). See uri_data/3 for accessing this structure.
Source uri_data(?Field, +Components, ?Data) is semidet
Provide access the uri_component structure. Defined field-names are: scheme, authority, path, search and fragment
Source uri_data(+Field, +Components, +Data, -NewComponents) is semidet
NewComponents is the same as Components with Field set to Data.
Source uri_normalized(+URI, -NormalizedURI) is det
NormalizedURI is the normalized form of URI. Normalization is syntactic and involves the following steps:
Source iri_normalized(+IRI, -NormalizedIRI) is det
NormalizedIRI is the normalized form of IRI. Normalization is syntactic and involves the following steps:
See also
- This is similar to uri_normalized/2, but does not do normalization of %-escapes.
Source uri_normalized_iri(+URI, -NormalizedIRI) is det
As uri_normalized/2, but percent-encoding is translated into IRI Unicode characters. The translation is liberal: valid UTF-8 sequences of %-encoded bytes are mapped to the Unicode character. Other %XX-sequences are mapped to the corresponding ISO-Latin-1 character and sole % characters are left untouched.
See also
- uri_iri/2.
Source uri_is_global(+URI) is semidet
True if URI has a scheme. The semantics is the same as the code below, but the implementation is more efficient as it does not need to parse the other components, nor needs to bind the scheme. The condition to demand a scheme of more than one character is added to avoid confusion with DOS path names.
uri_is_global(URI) :-
        uri_components(URI, Components),
        uri_data(scheme, Components, Scheme),
        nonvar(Scheme),
        atom_length(Scheme, Len),
        Len > 1.
Source uri_resolve(+URI, +Base, -GlobalURI) is det
Resolve a possibly local URI relative to Base. This implements http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-transform
Source uri_normalized(+URI, +Base, -NormalizedGlobalURI) is det
NormalizedGlobalURI is the normalized global version of URI. Behaves as if defined by:
uri_normalized(URI, Base, NormalizedGlobalURI) :-
        uri_resolve(URI, Base, GlobalURI),
        uri_normalized(GlobalURI, NormalizedGlobalURI).
Source iri_normalized(+IRI, +Base, -NormalizedGlobalIRI) is det
NormalizedGlobalIRI is the normalized global version of IRI. This is similar to uri_normalized/3, but does not do %-escape normalization.
Source uri_normalized_iri(+URI, +Base, -NormalizedGlobalIRI) is det
NormalizedGlobalIRI is the normalized global IRI of URI. Behaves as if defined by:
uri_normalized(URI, Base, NormalizedGlobalIRI) :-
        uri_resolve(URI, Base, GlobalURI),
        uri_normalized_iri(GlobalURI, NormalizedGlobalIRI).
Source uri_query_components(+String, -Query) is det
uri_query_components(-String, +Query) is det
Perform encoding and decoding of an URI query string. Query is a list of fully decoded (Unicode) Name=Value pairs. In mode (-,+), query elements of the forms Name(Value) and Name-Value are also accepted to enhance interoperability with the option and pairs libraries. E.g.
?- uri_query_components(QS, [a=b, c('d+w'), n-'VU Amsterdam']).
QS = 'a=b&c=d%2Bw&n=VU%20Amsterdam'.

?- uri_query_components('a=b&c=d%2Bw&n=VU%20Amsterdam', Q).
Q = [a=b, c='d+w', n='VU Amsterdam'].
Source uri_authority_components(+Authority, -Components) is det
uri_authority_components(-Authority, +Components) is det
Break-down the authority component of a URI. The fields of the structure Components can be accessed using uri_authority_data/3. This predicate deals with IPv6 addresses written as [ip], returning the ip as host, without the enclosing []. When constructing an authority string and the host contains :, the host is embraced in []. If [] is not used correctly, the behavior should be considered poorly defined. If there is no balancing `]` or the host part does not end with `]`, these characters are considered normal characters and part of the (invalid) host name.
Source uri_authority_data(+Field, ?Components, ?Data) is semidet
Provide access the uri_authority structure. Defined field-names are: user, password, host and port
Source uri_encoded(+Component, +Value, -Encoded) is det
uri_encoded(+Component, -Value, +Encoded) is det
Encoded is the URI encoding for Value. When encoding (Value->Encoded), Component specifies the URI component where the value is used. It is one of query_value, fragment, path or segment. Besides alphanumerical characters, the following characters are passed verbatim (the set is split in logical groups according to RFC3986).
query_value, fragment
"-._~" | "!$'()*,;" | "@" | "/?"
path
"-._~" | "!$&'()*,;=" | "@" | "/"
segment
"-._~" | "!$&'()*,;=" | "@"
Source uri_iri(+URI, -IRI) is det
uri_iri(-URI, +IRI) is det
Convert between a URI, encoded in US-ASCII and an IRI. An IRI is a fully expanded Unicode string. Unicode strings are first encoded into UTF-8, after which %-encoding takes place.
Errors
- syntax_error(Culprit) in mode (+,-) if URI is not a legally percent-encoded UTF-8 string.
Source uri_file_name(+URI, -FileName) is semidet
uri_file_name(-URI, +FileName) is det
Convert between a URI and a local file_name. This protocol is covered by RFC 1738. Please note that file-URIs use absolute paths. The mode (-, +) translates a possible relative path into an absolute one.
Source uri_edit(+Actions, +URI0, -URI) is det
Modify a URI according to Actions. Actions is either a single action or a (nested) list of actions. Defined primitive actions are:
scheme(+Scheme)
Set the Scheme of the URI (typically http, https, etc.)
user(+User)
Add/set the user of the authority component.
password(+Password)
Add/set the password of the authority component.
host(+Host)
Add/set the host (or ip address) of the authority component.
port(+Port)
Add/set the port of the authority component.
path(+Path)
Set/extend the path component. If Path is not absolute it is taken relative to the path of URI0.
search(+KeyValues)
Extend the Key=Value pairs of the current search (query) component. New values replace existing values. If KeyValues is written as =(KeyValues) the current search component is ignored. KeyValues is a list, whose elements are one of Key=Value, Key-Value or `Key(Value)`.
fragment(+Fragment)
Set the Fragment of the uri.

Components can be removed by using a variable as value, except from path which can be reset using path(/) and query which can be dropped using query(=([])).

Arguments:
URI0- is either a valid uri or a variable to start fresh.