1/* Part of SWI-Prolog 2 3 Author: Jan Wielemaker 4 E-mail: J.Wielemaker@vu.nl 5 WWW: http://www.swi-prolog.org 6 Copyright (c) 2009-2023, University of Amsterdam 7 VU University Amsterdam 8 CWI, Amsterdam 9 SWI-Prolog Solutions b.v. 10 All rights reserved. 11 12 Redistribution and use in source and binary forms, with or without 13 modification, are permitted provided that the following conditions 14 are met: 15 16 1. Redistributions of source code must retain the above copyright 17 notice, this list of conditions and the following disclaimer. 18 19 2. Redistributions in binary form must reproduce the above copyright 20 notice, this list of conditions and the following disclaimer in 21 the documentation and/or other materials provided with the 22 distribution. 23 24 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 25 "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 26 LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 27 FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE 28 COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, 29 INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 30 BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 31 LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 32 CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 33 LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN 34 ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 35 POSSIBILITY OF SUCH DAMAGE. 36*/ 37 38:- module(xpath, 39 [ xpath/3, % +DOM, +Spec, -Value 40 xpath_chk/3, % +DOM, +Spec, -Value 41 42 op(400, fx, //), 43 op(400, fx, /), 44 op(200, fy, @) 45 ]). 46:- use_module(library(record),[record/1, op(_,_,record)]). 47:- use_module(library(debug),[assertion/1]). 48 49:- autoload(library(error),[instantiation_error/1,must_be/2]). 50:- autoload(library(lists),[member/2]). 51:- autoload(library(sgml),[xsd_number_string/2]).
68:- record
69 element(name, attributes, content).
75xpath_chk(DOM, Spec, Content) :-
76 xpath(DOM, Spec, Content),
77 !.
//
Term/
Term
The Terms above are of type callable. The functor specifies
the element name. The element name '*' refers to any element.
The name self
refers to the top-element itself and is often
used for processing matches of an earlier xpath/3 query. A term
NS:Term refers to an XML name in the namespace NS. Optional
arguments specify additional constraints and functions. The
arguments are processed from left to right. Defined conditional
argument values are:
last
last
- IntExprlast-1
is the element directly preceding the last one.index(Integer)
.last
index(last)
.last
- IntExprindex(last-IntExpr)
.Defined function argument values are:
self
content
text
text(As)
atom
or string
.normalize_space
text
, but uses normalize_space/2 to normalise
white-space in the outputnumber
@
Attributenumber
, but subsequently transform the value
into an integer using the round/1 function.number
, but subsequently transform the value
into a float using the float/1 function.@href
and
@href(atom)
are equivalent. The SGML parser
can return attributes as strings using the
attribute_value(string)
option.In addition, the argument-list can be conditions:
content = content
defines that the content
of the element is the atom content
.
The functions lower_case
and upper_case
can be applied
to Right (see example below).contains(Haystack, Needle)
h3
element inside a div
element, where the div
element itself contains an h2
child with a strong
child.
//div(h2/strong)/h3
This is equivalent to the conjunction of XPath goals below.
..., xpath(DOM, //(div), Div), xpath(Div, h2/strong, _), xpath(Div, h3, Result)
Examples:
Match each table-row in DOM:
xpath(DOM, //tr, TR)
Match the last cell of each tablerow in DOM. This example illustrates that a result can be the input of subsequent xpath/3 queries. Using multiple queries on the intermediate TR term guarantee that all results come from the same table-row:
xpath(DOM, //tr, TR), xpath(TR, /td(last), TD)
Match each href
attribute in an <a> element
xpath(DOM, //a(@href), HREF)
Suppose we have a table containing rows where each first column is the name of a product with a link to details and the second is the price (a number). The following predicate matches the name, URL and price:
product(DOM, Name, URL, Price) :- xpath(DOM, //tr, TR), xpath(TR, td(1), C1), xpath(C1, /self(normalize_space), Name), xpath(C1, a(@href), URL), xpath(TR, td(2, number), Price).
Suppose we want to select books with genre="thriller" from a
tree containing elements <book genre=...>
thriller(DOM, Book) :- xpath(DOM, //book(@genre=thiller), Book).
Match the elements <table align="center">
and <table
align="CENTER">
:
//table(@align(lower) = center)
Get the width
and height
of a div
element as a number,
and the div
node itself:
xpath(DOM, //div(@width(number)=W, @height(number)=H), Div)
Note that div
is an infix operator, so parentheses must be
used in cases like the following:
xpath(DOM, //(div), Div)
269xpath(DOM, Spec, Content) :- 270 in_dom(Spec, DOM, Content). 271 272in_dom(//Spec, DOM, Value) :- 273 !, 274 element_spec(Spec, Name, Modifiers), 275 sub_dom(I, Len, Name, E, DOM), 276 modifiers(Modifiers, I, Len, E, Value). 277in_dom(/Spec, E, Value) :- 278 !, 279 element_spec(Spec, Name, Modifiers), 280 ( Name == self 281 -> true 282 ; element_name(E, Name) 283 ), 284 modifiers(Modifiers, 1, 1, E, Value). 285in_dom(A/B, DOM, Value) :- 286 !, 287 in_dom(A, DOM, Value0), 288 in_dom(B, Value0, Value). 289in_dom(A//B, DOM, Value) :- 290 !, 291 in_dom(A, DOM, Value0), 292 in_dom(//B, Value0, Value). 293in_dom(Spec, element(_, _, Content), Value) :- 294 element_spec(Spec, Name, Modifiers), 295 count_named_elements(Content, Name, CLen), 296 CLen > 0, 297 nth_element(N, Name, E, Content), 298 modifiers(Modifiers, N, CLen, E, Value). 299 300element_spec(Var, _, _) :- 301 var(Var), 302 !, 303 instantiation_error(Var). 304element_spec(NS:Term, NS:Name, Modifiers) :- 305 !, 306 callable_name_arguments(Term, Name0, Modifiers), 307 star(Name0, Name). 308element_spec(Term, Name, Modifiers) :- 309 !, 310 callable_name_arguments(Term, Name0, Modifiers), 311 star(Name0, Name). 312 313callable_name_arguments(Atom, Name, Arguments) :- 314 atom(Atom), 315 !, 316 Name = Atom, Arguments = []. 317callable_name_arguments(Compound, Name, Arguments) :- 318 compound_name_arguments(Compound, Name, Arguments). 319 320 321star(*, _) :- !. 322star(Name, Name).
334sub_dom(1, 1, Name, DOM, DOM) :- 335 element_name(DOM, Name0), 336 \+ Name \= Name0. 337sub_dom(N, Len, Name, E, element(_,_,Content)) :- 338 !, 339 sub_dom_2(N, Len, Name, E, Content). 340sub_dom(N, Len, Name, E, Content) :- 341 is_list(Content), 342 sub_dom_2(N, Len, Name, E, Content). 343 344sub_dom_2(N, Len, Name, Element, Content) :- 345 ( count_named_elements(Content, Name, Len), 346 nth_element(N, Name, Element, Content) 347 ; member(element(_,_,C2), Content), 348 sub_dom_2(N, Len, Name, Element, C2) 349 ).
356count_named_elements(Content, Name, Count) :- 357 count_named_elements(Content, Name, 0, Count). 358 359count_named_elements([], _, Count, Count). 360count_named_elements([element(Name,_,_)|T], Name0, C0, C) :- 361 \+ Name \= Name0, 362 !, 363 C1 is C0+1, 364 count_named_elements(T, Name0, C1, C). 365count_named_elements([_|T], Name, C0, C) :- 366 count_named_elements(T, Name, C0, C).
373nth_element(N, Name, Element, Content) :- 374 nth_element_(1, N, Name, Element, Content). 375 376nth_element_(I, N, Name, E, [H|T]) :- 377 element_name(H, Name0), 378 \+ Name \= Name0, 379 !, 380 ( N = I, 381 E = H 382 ; I2 is I + 1, 383 ( nonvar(N), I2 > N 384 -> !, fail 385 ; true 386 ), 387 nth_element_(I2, N, Name, E, T) 388 ). 389nth_element_(I, N, Name, E, [_|T]) :- 390 nth_element_(I, N, Name, E, T).
397modifiers([], _, _, Value, Value). 398modifiers([H|T], I, L, Value0, Value) :- 399 modifier(H, I, L, Value0, Value1), 400 modifiers(T, I, L, Value1, Value). 401 402modifier(M, _, _, _, _) :- 403 var(M), 404 !, 405 instantiation_error(M). 406modifier(Index, I, L, Value0, Value) :- 407 implicit_index_modifier(Index), 408 !, 409 Value = Value0, 410 index_modifier(Index, I, L). 411modifier(index(Index), I, L, Value, Value) :- 412 !, 413 index_modifier(Index, I, L). 414modifier(Function, _, _, In, Out) :- 415 xpath_function(Function), 416 !, 417 xpath_function(Function, In, Out). 418modifier(Function, _, _, In, Out) :- 419 xpath_condition(Function, In), 420 Out = In. 421 422implicit_index_modifier(I) :- 423 integer(I), 424 !. 425implicit_index_modifier(last). 426implicit_index_modifier(last-_Expr). 427 428index_modifier(Var, I, _L) :- 429 var(Var), 430 !, 431 Var = I. 432index_modifier(last, I, L) :- 433 !, 434 I =:= L. 435index_modifier(last-Expr, I, L) :- 436 !, 437 I =:= L-Expr. 438index_modifier(N, I, _) :- 439 N =:= I. 440 441xpath_function(self, DOM, Value) :- % self 442 !, 443 Value = DOM. 444xpath_function(content, Element, Value) :- % content 445 !, 446 element_content(Element, Value). 447xpath_function(text, DOM, Text) :- % text 448 !, 449 text_of_dom(DOM, atom, Text). 450xpath_function(text(As), DOM, Text) :- % text(As) 451 !, 452 text_of_dom(DOM, As, Text). 453xpath_function(normalize_space, DOM, Text) :- % normalize_space 454 !, 455 text_of_dom(DOM, string, Text0), 456 normalize_space(atom(Text), Text0). 457xpath_function(number, DOM, Number) :- % number 458 !, 459 text_of_dom(DOM, string, Text0), 460 normalize_space(string(Text), Text0), 461 catch(xsd_number_string(Number, Text), _, fail). 462xpath_function(@Name, element(_, Attrs, _), Value) :- % @Name 463 !, 464 ( atom(Name) 465 -> memberchk(Name=Value, Attrs) 466 ; compound(Name) 467 -> compound_name_arguments(Name, AName, AOps), 468 memberchk(AName=Value0, Attrs), 469 translate_attribute(AOps, Value0, Value) 470 ; member(Name=Value, Attrs) 471 ). 472xpath_function(quote(Value), _, Value). % quote(Value) 473 474xpath_function(self). 475xpath_function(content). 476xpath_function(text). 477xpath_function(text(_)). 478xpath_function(normalize_space). 479xpath_function(number). 480xpath_function(@_). 481xpath_function(quote(_)). 482 483translate_attribute([], Value, Value). 484translate_attribute([H|T], Value0, Value) :- 485 translate_attr(H, Value0, Value1), 486 translate_attribute(T, Value1, Value). 487 488translate_attr(number, Value0, Value) :- 489 xsd_number_string(Value, Value0). 490translate_attr(integer, Value0, Value) :- 491 xsd_number_string(Value1, Value0), 492 Value is round(Value1). 493translate_attr(float, Value0, Value) :- 494 xsd_number_string(Value1, Value0), 495 Value is float(Value1). 496translate_attr(atom, Value0, Value) :- 497 atom_string(Value, Value0). 498translate_attr(string, Value0, Value) :- 499 atom_string(Value0, Value). 500translate_attr(lower, Value0, Value) :- 501 ( atom(Value0) 502 -> downcase_atom(Value0, Value) 503 ; string_lower(Value0, Value) 504 ). 505translate_attr(upper, Value0, Value) :- 506 ( atom(Value0) 507 -> upcase_atom(Value0, Value) 508 ; string_upper(Value0, Value) 509 ). 510 511xpath_condition(Left = Right, Value) :- % = 512 !, 513 var_or_function(Left, Value, LeftValue), 514 process_equality(LeftValue, Right). 515xpath_condition(contains(Haystack, Needle), Value) :- % contains(Haystack, Needle) 516 !, 517 val_or_function(Haystack, Value, HaystackValue), 518 val_or_function(Needle, Value, NeedleValue), 519 atom(HaystackValue), atom(NeedleValue), 520 ( sub_atom(HaystackValue, _, _, _, NeedleValue) 521 -> true 522 ). 523xpath_condition(Spec, Dom) :- 524 in_dom(Spec, Dom, _).
For example the XPath expression in [1], and the equivalent Prolog expression in [2], would both match the HTML element in [3].
[1] //table[align=lower-case(center)] [2] //table(@align=lower_case(center)) [3] <table align="CENTER">
542process_equality(Left, Right) :- 543 var(Right), 544 !, 545 Left = Right. 546process_equality(Left, lower_case(Right)) :- 547 !, 548 downcase_atom(Left, Right). 549process_equality(Left, upper_case(Right)) :- 550 !, 551 upcase_atom(Left, Right). 552process_equality(Left, Right) :- 553 Left = Right, 554 !. 555process_equality(Left, Right) :- 556 atom(Left), 557 atomic(Right), 558 \+ atom(Left), 559 atom_string(Left, Right). 560 561var_or_function(Arg, _, Arg) :- 562 var(Arg), 563 !. 564var_or_function(Func, Value0, Value) :- 565 xpath_function(Func), 566 !, 567 xpath_function(Func, Value0, Value). 568var_or_function(Value, _, Value). 569 570val_or_function(Arg, _, Arg) :- 571 var(Arg), 572 !, 573 instantiation_error(Arg). 574val_or_function(Func, Value0, Value) :- % TBD 575 xpath_function(Func, Value0, Value), 576 !. 577val_or_function(Value, _, Value).
584text_of_dom(DOM, As, Text) :- 585 phrase(text_of(DOM), Tokens), 586 ( As == atom 587 -> atomic_list_concat(Tokens, Text) 588 ; As == string 589 -> atomics_to_string(Tokens, Text) 590 ; must_be(oneof([atom,string]), As) 591 ). 592 593text_of(element(_,_,Content)) --> 594 text_of_list(Content). 595text_of([]) --> 596 []. 597text_of([H|T]) --> 598 text_of(H), 599 text_of(T). 600 601 602text_of_list([]) --> 603 []. 604text_of_list([H|T]) --> 605 text_of_1(H), 606 text_of_list(T). 607 608 609text_of_1(element(_,_,Content)) --> 610 !, 611 text_of_list(Content). 612text_of_1(Data) --> 613 { assertion(atom_or_string(Data)) }, 614 [Data]. 615 616atom_or_string(Data) :- 617 ( atom(Data) 618 -> true 619 ; string(Data) 620 )
Select nodes in an XML DOM
The library
xpath.pl
provides predicates to select nodes from an XML DOM tree as produced by library(sgml) based on descriptions inspired by the XPath language.The predicate xpath/3 selects a sub-structure of the DOM non-deterministically based on an XPath-like specification. Not all selectors of XPath are implemented, but the ability to mix xpath/3 calls with arbitrary Prolog code provides a powerful tool for extracting information from XML parse-trees.