1/* Part of SWI-Prolog 2 3 Author: Jan Wielemaker 4 E-mail: J.Wielemaker@vu.nl 5 WWW: http://www.swi-prolog.org 6 Copyright (c) 2009-2023, University of Amsterdam 7 VU University Amsterdam 8 CWI, Amsterdam 9 SWI-Prolog Solutions b.v. 10 All rights reserved. 11 12 Redistribution and use in source and binary forms, with or without 13 modification, are permitted provided that the following conditions 14 are met: 15 16 1. Redistributions of source code must retain the above copyright 17 notice, this list of conditions and the following disclaimer. 18 19 2. Redistributions in binary form must reproduce the above copyright 20 notice, this list of conditions and the following disclaimer in 21 the documentation and/or other materials provided with the 22 distribution. 23 24 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 25 "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 26 LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 27 FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE 28 COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, 29 INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 30 BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 31 LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 32 CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 33 LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN 34 ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 35 POSSIBILITY OF SUCH DAMAGE. 36*/ 37 38:- module(xpath, 39 [ xpath/3, % +DOM, +Spec, -Value 40 xpath_chk/3, % +DOM, +Spec, -Value 41 42 op(400, fx, //), 43 op(400, fx, /), 44 op(200, fy, @) 45 ]). 46:- use_module(library(record),[record/1, op(_,_,record)]). 47:- use_module(library(debug),[assertion/1]). 48 49:- autoload(library(error),[instantiation_error/1,must_be/2]). 50:- autoload(library(lists),[member/2]). 51:- autoload(library(sgml),[xsd_number_string/2]). 52 53/** <module> Select nodes in an XML DOM 54 55The library xpath.pl provides predicates to select nodes from an XML DOM 56tree as produced by library(sgml) based on descriptions inspired by the 57XPath language. 58 59The predicate xpath/3 selects a sub-structure of the DOM 60non-deterministically based on an XPath-like specification. Not all 61selectors of XPath are implemented, but the ability to mix xpath/3 calls 62with arbitrary Prolog code provides a powerful tool for extracting 63information from XML parse-trees. 64 65@see http://www.w3.org/TR/xpath 66*/ 67 68:- record 69 element(name, attributes, content). 70 71%! xpath_chk(+DOM, +Spec, ?Content) is semidet. 72% 73% Semi-deterministic version of xpath/3. 74 75xpath_chk(DOM, Spec, Content) :- 76 xpath(DOM, Spec, Content), 77 !. 78 79%! xpath(+DOM, +Spec, ?Content) is nondet. 80% 81% Match an element in a DOM structure. The syntax is inspired by 82% XPath, using () rather than [] to select inside an element. 83% First we can construct paths using / and //: 84% 85% $ =|//|=Term : 86% Select any node in the DOM matching term. 87% $ =|/|=Term : 88% Match the root against Term. 89% $ Term : 90% Select the immediate children of the root matching Term. 91% 92% The Terms above are of type _callable_. The functor specifies 93% the element name. The element name '*' refers to any element. 94% The name =self= refers to the top-element itself and is often 95% used for processing matches of an earlier xpath/3 query. A term 96% NS:Term refers to an XML name in the namespace NS. Optional 97% arguments specify additional constraints and functions. The 98% arguments are processed from left to right. Defined conditional 99% argument values are: 100% 101% $ index(?Index) : 102% True if the element is the Index-th child of its parent, 103% where 1 denotes the first child. Index can be one of: 104% $ `Var` : 105% `Var` is unified with the index of the matched element. 106% $ =last= : 107% True for the last element. 108% $ =last= - `IntExpr` : 109% True for the last-minus-nth element. For example, 110% `last-1` is the element directly preceding the last one. 111% $ `IntExpr` : 112% True for the element whose index equals `IntExpr`. 113% $ Integer : 114% The N-th element with the given name, with 1 denoting the 115% first element. Same as index(Integer). 116% $ =last= : 117% The last element with the given name. Same as 118% index(last). 119% $ =last= - IntExpr : 120% The IntExpr-th element before the last. 121% Same as index(last-IntExpr). 122% 123% Defined function argument values are: 124% 125% $ =self= : 126% Evaluate to the entire element 127% $ =content= : 128% Evaluate to the content of the element (a list) 129% $ =text= : 130% Evaluates to all text from the sub-tree as an atom 131% $ `text(As)` : 132% Evaluates to all text from the sub-tree according to 133% `As`, which is either `atom` or `string`. 134% $ =normalize_space= : 135% As =text=, but uses normalize_space/2 to normalise 136% white-space in the output 137% $ =number= : 138% Extract an integer or float from the value. Ignores 139% leading and trailing white-space 140% $ =|@|=Attribute : 141% Evaluates to the value of the given attribute. Attribute 142% can be a compound term. In this case the functor name 143% denotes the attribute and arguments perform transformations 144% on the attribute value. Defined transformations are: 145% 146% - number 147% Translate the value into a number using 148% xsd_number_string/2 from library(sgml). 149% - integer 150% As `number`, but subsequently transform the value 151% into an integer using the round/1 function. 152% - float 153% As `number`, but subsequently transform the value 154% into a float using the float/1 function. 155% - atom 156% Translate the value into a Prolog atom. Note that 157% an atom is normally the default, so ``@href`` and 158% ``@href(atom)`` are equivalent. The SGML parser 159% can return attributes as strings using the 160% attribute_value(string) option. 161% - string 162% Translate the value into a Prolog string. 163% - lower 164% Translate the value to lower case, preserving 165% the type. 166% - upper 167% Translate the value to upper case, preserving 168% the type. 169% 170% In addition, the argument-list can be _conditions_: 171% 172% $ Left = Right : 173% Succeeds if the left-hand unifies with the right-hand. 174% If the left-hand side is a function, this is evaluated. 175% The right-hand side is _never_ evaluated, and thus the 176% condition `content = content` defines that the content 177% of the element is the atom `content`. 178% The functions `lower_case` and `upper_case` can be applied 179% to Right (see example below). 180% $ contains(Haystack, Needle) : 181% Succeeds if Needle is a sub-string of Haystack. 182% $ XPath : 183% Succeeds if XPath matches in the currently selected 184% sub-DOM. For example, the following expression finds 185% an =h3= element inside a =div= element, where the =div= 186% element itself contains an =h2= child with a =strong= 187% child. 188% 189% == 190% //div(h2/strong)/h3 191% == 192% 193% This is equivalent to the conjunction of XPath goals below. 194% 195% == 196% ..., 197% xpath(DOM, //(div), Div), 198% xpath(Div, h2/strong, _), 199% xpath(Div, h3, Result) 200% == 201% 202% **Examples**: 203% 204% Match each table-row in DOM: 205% 206% == 207% xpath(DOM, //tr, TR) 208% == 209% 210% Match the last cell of each tablerow in DOM. This example 211% illustrates that a result can be the input of subsequent xpath/3 212% queries. Using multiple queries on the intermediate TR term 213% guarantee that all results come from the same table-row: 214% 215% == 216% xpath(DOM, //tr, TR), 217% xpath(TR, /td(last), TD) 218% == 219% 220% Match each =href= attribute in an <a> element 221% 222% == 223% xpath(DOM, //a(@href), HREF) 224% == 225% 226% Suppose we have a table containing rows where each first column 227% is the name of a product with a link to details and the second 228% is the price (a number). The following predicate matches the 229% name, URL and price: 230% 231% == 232% product(DOM, Name, URL, Price) :- 233% xpath(DOM, //tr, TR), 234% xpath(TR, td(1), C1), 235% xpath(C1, /self(normalize_space), Name), 236% xpath(C1, a(@href), URL), 237% xpath(TR, td(2, number), Price). 238% == 239% 240% Suppose we want to select books with genre="thriller" from a 241% tree containing elements =|<book genre=...>|= 242% 243% == 244% thriller(DOM, Book) :- 245% xpath(DOM, //book(@genre=thiller), Book). 246% == 247% 248% Match the elements =|<table align="center">|= _and_ =|<table 249% align="CENTER">|=: 250% 251% ```prolog 252% //table(@align(lower) = center) 253% ``` 254% 255% Get the `width` and `height` of a `div` element as a number, 256% and the `div` node itself: 257% 258% == 259% xpath(DOM, //div(@width(number)=W, @height(number)=H), Div) 260% == 261% 262% Note that `div` is an infix operator, so parentheses must be 263% used in cases like the following: 264% 265% == 266% xpath(DOM, //(div), Div) 267% == 268 269xpath(DOM, Spec, Content) :- 270 in_dom(Spec, DOM, Content). 271 272in_dom(//Spec, DOM, Value) :- 273 !, 274 element_spec(Spec, Name, Modifiers), 275 sub_dom(I, Len, Name, E, DOM), 276 modifiers(Modifiers, I, Len, E, Value). 277in_dom(/Spec, E, Value) :- 278 !, 279 element_spec(Spec, Name, Modifiers), 280 ( Name == self 281 -> true 282 ; element_name(E, Name) 283 ), 284 modifiers(Modifiers, 1, 1, E, Value). 285in_dom(A/B, DOM, Value) :- 286 !, 287 in_dom(A, DOM, Value0), 288 in_dom(B, Value0, Value). 289in_dom(A//B, DOM, Value) :- 290 !, 291 in_dom(A, DOM, Value0), 292 in_dom(//B, Value0, Value). 293in_dom(Spec, element(_, _, Content), Value) :- 294 element_spec(Spec, Name, Modifiers), 295 count_named_elements(Content, Name, CLen), 296 CLen > 0, 297 nth_element(N, Name, E, Content), 298 modifiers(Modifiers, N, CLen, E, Value). 299 300element_spec(Var, _, _) :- 301 var(Var), 302 !, 303 instantiation_error(Var). 304element_spec(NS:Term, NS:Name, Modifiers) :- 305 !, 306 callable_name_arguments(Term, Name0, Modifiers), 307 star(Name0, Name). 308element_spec(Term, Name, Modifiers) :- 309 !, 310 callable_name_arguments(Term, Name0, Modifiers), 311 star(Name0, Name). 312 313callable_name_arguments(Atom, Name, Arguments) :- 314 atom(Atom), 315 !, 316 Name = Atom, Arguments = []. 317callable_name_arguments(Compound, Name, Arguments) :- 318 compound_name_arguments(Compound, Name, Arguments). 319 320 321star(*, _) :- !. 322star(Name, Name). 323 324 325%! sub_dom(-Index, -Count, +Name, -Sub, +DOM) is nondet. 326% 327% Sub is a node in DOM with Name. 328% 329% @param Count is the total number of nodes in the content 330% list Sub appears that have the same name. 331% @param Index is the 1-based index of Sub of nodes with 332% Name. 333 334sub_dom(1, 1, Name, DOM, DOM) :- 335 element_name(DOM, Name0), 336 \+ Name \= Name0. 337sub_dom(N, Len, Name, E, element(_,_,Content)) :- 338 !, 339 sub_dom_2(N, Len, Name, E, Content). 340sub_dom(N, Len, Name, E, Content) :- 341 is_list(Content), 342 sub_dom_2(N, Len, Name, E, Content). 343 344sub_dom_2(N, Len, Name, Element, Content) :- 345 ( count_named_elements(Content, Name, Len), 346 nth_element(N, Name, Element, Content) 347 ; member(element(_,_,C2), Content), 348 sub_dom_2(N, Len, Name, Element, C2) 349 ). 350 351 352%! count_named_elements(+Content, +Name, -Count) is det. 353% 354% Count is the number of nodes with Name in Content. 355 356count_named_elements(Content, Name, Count) :- 357 count_named_elements(Content, Name, 0, Count). 358 359count_named_elements([], _, Count, Count). 360count_named_elements([element(Name,_,_)|T], Name0, C0, C) :- 361 \+ Name \= Name0, 362 !, 363 C1 is C0+1, 364 count_named_elements(T, Name0, C1, C). 365count_named_elements([_|T], Name, C0, C) :- 366 count_named_elements(T, Name, C0, C). 367 368 369%! nth_element(?N, +Name, -Element, +Content:list) is nondet. 370% 371% True if Element is the N-th element with name in Content. 372 373nth_element(N, Name, Element, Content) :- 374 nth_element_(1, N, Name, Element, Content). 375 376nth_element_(I, N, Name, E, [H|T]) :- 377 element_name(H, Name0), 378 \+ Name \= Name0, 379 !, 380 ( N = I, 381 E = H 382 ; I2 is I + 1, 383 ( nonvar(N), I2 > N 384 -> !, fail 385 ; true 386 ), 387 nth_element_(I2, N, Name, E, T) 388 ). 389nth_element_(I, N, Name, E, [_|T]) :- 390 nth_element_(I, N, Name, E, T). 391 392 393%! modifiers(+Modifiers, +I, +Clen, +DOM, -Value) 394% 395% 396 397modifiers([], _, _, Value, Value). 398modifiers([H|T], I, L, Value0, Value) :- 399 modifier(H, I, L, Value0, Value1), 400 modifiers(T, I, L, Value1, Value). 401 402modifier(M, _, _, _, _) :- 403 var(M), 404 !, 405 instantiation_error(M). 406modifier(Index, I, L, Value0, Value) :- 407 implicit_index_modifier(Index), 408 !, 409 Value = Value0, 410 index_modifier(Index, I, L). 411modifier(index(Index), I, L, Value, Value) :- 412 !, 413 index_modifier(Index, I, L). 414modifier(Function, _, _, In, Out) :- 415 xpath_function(Function), 416 !, 417 xpath_function(Function, In, Out). 418modifier(Function, _, _, In, Out) :- 419 xpath_condition(Function, In), 420 Out = In. 421 422implicit_index_modifier(I) :- 423 integer(I), 424 !. 425implicit_index_modifier(last). 426implicit_index_modifier(last-_Expr). 427 428index_modifier(Var, I, _L) :- 429 var(Var), 430 !, 431 Var = I. 432index_modifier(last, I, L) :- 433 !, 434 I =:= L. 435index_modifier(last-Expr, I, L) :- 436 !, 437 I =:= L-Expr. 438index_modifier(N, I, _) :- 439 N =:= I. 440 441xpath_function(self, DOM, Value) :- % self 442 !, 443 Value = DOM. 444xpath_function(content, Element, Value) :- % content 445 !, 446 element_content(Element, Value). 447xpath_function(text, DOM, Text) :- % text 448 !, 449 text_of_dom(DOM, atom, Text). 450xpath_function(text(As), DOM, Text) :- % text(As) 451 !, 452 text_of_dom(DOM, As, Text). 453xpath_function(normalize_space, DOM, Text) :- % normalize_space 454 !, 455 text_of_dom(DOM, string, Text0), 456 normalize_space(atom(Text), Text0). 457xpath_function(number, DOM, Number) :- % number 458 !, 459 text_of_dom(DOM, string, Text0), 460 normalize_space(string(Text), Text0), 461 catch(xsd_number_string(Number, Text), _, fail). 462xpath_function(@Name, element(_, Attrs, _), Value) :- % @Name 463 !, 464 ( atom(Name) 465 -> memberchk(Name=Value, Attrs) 466 ; compound(Name) 467 -> compound_name_arguments(Name, AName, AOps), 468 memberchk(AName=Value0, Attrs), 469 translate_attribute(AOps, Value0, Value) 470 ; member(Name=Value, Attrs) 471 ). 472xpath_function(quote(Value), _, Value). % quote(Value) 473 474xpath_function(self). 475xpath_function(content). 476xpath_function(text). 477xpath_function(text(_)). 478xpath_function(normalize_space). 479xpath_function(number). 480xpath_function(@_). 481xpath_function(quote(_)). 482 483translate_attribute([], Value, Value). 484translate_attribute([H|T], Value0, Value) :- 485 translate_attr(H, Value0, Value1), 486 translate_attribute(T, Value1, Value). 487 488translate_attr(number, Value0, Value) :- 489 xsd_number_string(Value, Value0). 490translate_attr(integer, Value0, Value) :- 491 xsd_number_string(Value1, Value0), 492 Value is round(Value1). 493translate_attr(float, Value0, Value) :- 494 xsd_number_string(Value1, Value0), 495 Value is float(Value1). 496translate_attr(atom, Value0, Value) :- 497 atom_string(Value, Value0). 498translate_attr(string, Value0, Value) :- 499 atom_string(Value0, Value). 500translate_attr(lower, Value0, Value) :- 501 ( atom(Value0) 502 -> downcase_atom(Value0, Value) 503 ; string_lower(Value0, Value) 504 ). 505translate_attr(upper, Value0, Value) :- 506 ( atom(Value0) 507 -> upcase_atom(Value0, Value) 508 ; string_upper(Value0, Value) 509 ). 510 511xpath_condition(Left = Right, Value) :- % = 512 !, 513 var_or_function(Left, Value, LeftValue), 514 process_equality(LeftValue, Right). 515xpath_condition(contains(Haystack, Needle), Value) :- % contains(Haystack, Needle) 516 !, 517 val_or_function(Haystack, Value, HaystackValue), 518 val_or_function(Needle, Value, NeedleValue), 519 atom(HaystackValue), atom(NeedleValue), 520 ( sub_atom(HaystackValue, _, _, _, NeedleValue) 521 -> true 522 ). 523xpath_condition(Spec, Dom) :- 524 in_dom(Spec, Dom, _). 525 526 527%! process_equality(+Left, +Right) is semidet. 528% 529% Provides (very) partial support for XSLT functions that can be 530% applied according to the XPath 2 specification. 531% 532% For example the XPath expression in [1], and the equivalent 533% Prolog expression in [2], would both match the HTML element in 534% [3]. 535% 536% == 537% [1] //table[align=lower-case(center)] 538% [2] //table(@align=lower_case(center)) 539% [3] <table align="CENTER"> 540% == 541 542process_equality(Left, Right) :- 543 var(Right), 544 !, 545 Left = Right. 546process_equality(Left, lower_case(Right)) :- 547 !, 548 downcase_atom(Left, Right). 549process_equality(Left, upper_case(Right)) :- 550 !, 551 upcase_atom(Left, Right). 552process_equality(Left, Right) :- 553 Left = Right, 554 !. 555process_equality(Left, Right) :- 556 atom(Left), 557 atomic(Right), 558 \+ atom(Left), 559 atom_string(Left, Right). 560 561var_or_function(Arg, _, Arg) :- 562 var(Arg), 563 !. 564var_or_function(Func, Value0, Value) :- 565 xpath_function(Func), 566 !, 567 xpath_function(Func, Value0, Value). 568var_or_function(Value, _, Value). 569 570val_or_function(Arg, _, Arg) :- 571 var(Arg), 572 !, 573 instantiation_error(Arg). 574val_or_function(Func, Value0, Value) :- % TBD 575 xpath_function(Func, Value0, Value), 576 !. 577val_or_function(Value, _, Value). 578 579 580%! text_of_dom(+DOM, +As, -Text:atom) is det. 581% 582% Text is the joined textual content of DOM. 583 584text_of_dom(DOM, As, Text) :- 585 phrase(text_of(DOM), Tokens), 586 ( As == atom 587 -> atomic_list_concat(Tokens, Text) 588 ; As == string 589 -> atomics_to_string(Tokens, Text) 590 ; must_be(oneof([atom,string]), As) 591 ). 592 593text_of(element(_,_,Content)) --> 594 text_of_list(Content). 595text_of([]) --> 596 []. 597text_of([H|T]) --> 598 text_of(H), 599 text_of(T). 600 601 602text_of_list([]) --> 603 []. 604text_of_list([H|T]) --> 605 text_of_1(H), 606 text_of_list(T). 607 608 609text_of_1(element(_,_,Content)) --> 610 !, 611 text_of_list(Content). 612text_of_1(Data) --> 613 { assertion(atom_or_string(Data)) }, 614 [Data]. 615 616atom_or_string(Data) :- 617 ( atom(Data) 618 -> true 619 ; string(Data) 620 )