================================================ Basics ================================================

The most common usage is to extract an expression from a webpage like:

  xidel http://www.example.org --extract //title

Instead of one or more urls, you can also pass file names or the XML data itself (xidel ".." ...).
The --extract option can be abbreviated as -e, and there are different kinds of extract expressions:

  1 ) XPath 3.0 and XQuery 3.0 expressions, with some extensions and additional functions.

  2 ) XPath 2 and XQuery 1 expressions for legacy scripts

  3 ) CSS 3 selectors

  4 ) Templates, a simplified version of the page which is pattern matched against the input

  5 ) Multipage templates, i.e. a file that contains templates for several pages

All of these kinds except multipage templates are usually detected automatically, but a certain kind can be forced with the --extract-kind option, or with the shorter --xpath "..", --xquery "..", --css ".." options.
XQuery and template expressions in particular are easily confused by the auto detector. (Xidel assumes a template if the expression starts with a "<".)
If no XPath/XQuery version is given, Xidel uses the 3.0 mode.

See the sections below for a more detailed description of each kind of expression.

The next important option is --follow (abbreviated as -f) to follow links on a page. E.g.:

  xidel http://www.example.org --follow //a --extract //title

This will print the titles of all pages that are linked from http://www.example.org.

--follow supports the same expressions as --extract. If it selects an element, it will go to the data referenced by that element (e.g. in a @href or @src attribute). Non-standard elements are converted to a string, which is interpreted as a url.
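The rule for choosing a follow target can be pictured with a small Python sketch. It is only an illustration, not Xidel's implementation: the helper name follow_target is hypothetical, only @href and @src are checked, and resolving relative urls against the page url is an assumption made for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def follow_target(element, base_url):
    """Return the url a selected element points to (hypothetical helper)."""
    # Elements that reference other data, e.g. <a href> or <img src>.
    for attribute in ("href", "src"):
        if attribute in element.attrib:
            return urljoin(base_url, element.attrib[attribute])
    # Any other element: use its string value as the url.
    text = "".join(element.itertext()).strip()
    return urljoin(base_url, text)


page = ET.fromstring(
    '<body><a href="/next">Next</a><img src="logo.png"/>'
    "<span>http://www.example.org/other</span></body>"
)
targets = [follow_target(child, "http://www.example.org/dir/") for child in page]
print(targets)
```

Here the link and the image are followed through their attributes, while the span is followed through its text content.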
--follow will also follow forms, and the form function can be used to fill in some values in the form, e.g.:

  xidel http://www.example.org -f "form((//form)[1], {'username': 'foobar', 'password': 'XXX'})" -e //title

to submit the first form on a page to log in somewhere.


============================== Recursion / Argument order and grouping ===============================

You can specify multiple --extract (-e) and --follow (-f) arguments to extract values from a page, e.g. follow the links to the next pages and extract values from there as well ...

The order in which the arguments are given is important, e.g. it determines whether Xidel extracts before following or the other way around. You can usually read the arguments left-to-right like an English sentence: extracting from the current page, or following to a new one, which then becomes the current page.
For example:

a) xidel http://site1 -e "select content 1" http://site2 -e "select content 2"

   This will extract "content 1" from site 1 and "content 2" from site 2

b) xidel http://site1 http://site2 -e "select content 1" -e "select content 2"

   This will extract "content 1" and "content 2" from site 1 as well as from site 2

c) xidel http://site1 -e "select content 1" -f "//a (: i.e. select links:)" -e "select content 2"

   This will extract "content 1" from site1, and "content 2" from all sites the first site has links to.

d) xidel http://site1 -f "//a" -e "select content 1" -e "select content 2"

   This will extract "content 1" and "content 2" from all sites the first site links to, and will not extract anything from site1.

e) xidel http://site1 -e "select content 1" -e "select content 2" -f "//a"

   This is a special case. Since -f is the last option, it will repeat the previous operation, i.e. it will extract content 1 and 2 from site1 and ALL sites that can be reached by a selected link on site1 or any other of the processed sites.
   Only if there were another -e after -f would it extract that from the first set of followed links and stop.

In some kinds of extract expressions you can create new variables. If you assign values to a variable called "_follow", that value will be included in the next follow expression.
If you assign an object to _follow, its properties will override the command line parameters with the same name.

Generally an option modifier (like --extract-kind) affects all succeeding options; if there are none, it affects the immediately preceding option.

You can always override the argument order by using [ and ] to group the options.
For example:

f) xidel http://site1 [ -f "//a (:i.e. select links:)" -e "select content 1" ] -e "select content 2"

   This will extract content 1 from all sites linked by site1 and content 2 from site1 itself.
   I.e. the extract of content 2 is not affected by the follow-option within the [..] brackets.

g) xidel http://site1 [ -f //a[@type1] --download type1/ ]
                      [ -f //a[@type2] --download type2/ ]
                      [ -f //a[@type3] --download type3/ ]

   This will download all links of type 1 in a directory type1, all links of type2 in directory type2... (if written on one line)

[ and ] must be surrounded by a space.

The environment variable XIDEL_OPTIONS can be used to set Xidel's default options, for example
XIDEL_OPTIONS="--silent --color=never" to disable some output and coloring.


=========================== XPath 2.0 / XPath 3.0 / XQuery 1.0 / XQuery 3.0 ============================

XPath expressions provide an easy and Turing-complete way to extract calculated values from X/HTML.
XQuery is a superset language that can e.g. create new XML/HTML elements and documents.
See http://en.wikipedia.org/wiki/XPath_3.0 and https://en.wikipedia.org/wiki/XQuery for a quick summary, or https://www.w3.org/TR/xpath-30/ , https://www.w3.org/TR/xquery-30/ and https://www.w3.org/TR/xpath-functions-30/ for all the details.

Xidel also supports JSONiq and some custom extensions.
If the query begins with a version declaration like xquery version "3.0"; all extensions are disabled (then it can pass 99.6% of the test cases in the QT3 XQuery Test Suite).
If you use version codes like "3.0-xidel" or "3.0-jsoniq", all or some extensions are enabled. Without any version declaration the extensions are enabled, unless disabled by command-line parameters.

Important extensions are:

  Variable assignment:  $var := value

    Adds $var to a set of global variables, which can be created and accessed everywhere.
    (Xidel prints the value of all variables to stdout, unless you use the --extract-exclude option)

  JSONiq literals:  true, false, null

    true and false are evaluated as true() and false(); null becomes jn:null()

  JSONiq arrays:  [a,b,c]

    Arrays store a list of values and can be nested within each other and within sequences.
    jn:members converts an array to a sequence.

  JSONiq objects:  {"name": value, ...}

    Objects store a set of values as an associative map. The values can be accessed similarly to a function call, e.g.: {"name": value, ...}("name").
    Xidel also has {"name": value, ..}.name and {"name": value, ..}/name as an additional, proprietary syntax to access properties.
    jn:keys or $object() returns a sequence of all property names, libjn:values a sequence of values.
    Used with global variables, you can copy an object with obj2 := obj (objects are immutable, but properties can be changed with obj2.foo := 12, which will create a new object with the changed property)

  Extended strings:  x"..{..}.."

    If a string is prefixed by an "x", all expressions inside {} braces are evaluated, like in the value of a direct attribute constructor.
    E.g. x"There are {1+2+3} elements" prints "There are 6 elements".

  Special string comparison:

    All string comparisons are case insensitive, and "clever", e.g.: '9xy' = '9XY' < '10XY' < 'xy'
    This is more useful for HTML (think of @class = 'foobar'), but can be disabled by passing collation urls to the string functions.
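The ordering '9xy' = '9XY' < '10XY' < 'xy' from the special string comparison can be imitated with a short Python sketch. It rests on assumptions and is not Xidel's actual collation: the name clever_key is hypothetical, and the sketch assumes that case is ignored, that runs of digits compare by numeric value, and that digit runs sort before other text.

```python
import re


def clever_key(text):
    """Sort key for a case-insensitive, number-aware comparison (sketch)."""
    parts = re.findall(r"\d+|\D+", text.casefold())
    # Digit runs compare by numeric value and sort before other text.
    return [(0, int(p), "") if p.isdecimal() else (1, 0, p) for p in parts]


assert clever_key("9xy") == clever_key("9XY")
assert clever_key("9XY") < clever_key("10XY") < clever_key("xy")
print(sorted(["xy", "10XY", "9xy"], key=clever_key))  # ['9xy', '10XY', 'xy']
```

A plain byte-wise comparison would instead put '10XY' before '9xy', which is the difference the "clever" comparison is meant to avoid.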
  Dynamic typing:

    Strings are automatically converted to untypedAtomic, so 'false' = false() is true, and 1+"2" is 3.

  Local namespace prefix resolving:

    Unknown namespace prefixes are resolved with the namespace bindings of the input data.
    Therefore //a always finds all links, independent of any xmlns-attributes.

  Certain additional functions:

    jn:*, libjn:*
        The standard JSONiq and JSONlib functions
    file:*
        The functions of the EXPath file module
    json("str.")
        Parses a string as json, or downloads json from a url. (only use with trusted input)
    serialize-json(value)
        Serializes a value as JSON, i.e. converts the value to JSON and converts that to a string
    extract("string","regex"[,<match>,[<flags>]])
        This applies the regex "regex" to "string" and returns only the matching part.
        If the <match> argument is used, only the <match>-th submatch will be returned.
        <match> can be a sequence of numbers to select multiple matches.
    css("sel")
        This returns the nodes below the context node matched by the specified css 3 selector.
        You can use this to combine css and XPath, like in 'css("a.aclass")/@href'.
    join(sequence) or join(sequence, separator)
        This is basically the same as string-join, but can be used for non-string values.
        E.g. join((1,2,3)) returns the string "1 2 3".
    eval("xpath")
        This will evaluate the string "xpath" as an XPath/XQuery expression
    system("..")
        Runs a certain program and returns its stdout result as string
    read()
        Reads a line from stdin
    inner-html()
        This is the HTML content of node ., like innerHTML in javascript.
    outer-html()
        This is the same as inner-html, but includes the node itself
    inner-xml()
        This is the XML content of node, similar to inner-html()
    outer-xml()
        Like outer-html(), but XML-serialized
    form(form, [overridden parameters = ()])
        Converts an HTML form to an http request, by url encoding all input descendants of the given form node.
        You can give a sequence of parameters to override.
        e.g.
        form(//form[1], "foo=bar&xyz=123") returns a request for the first form, with the foo and xyz parameters overridden by bar and 123.
        You can also use a JSON object to set the override parameters, e.g. {"foo": "bar", "xyz": 123}; in this case they are automatically url encoded.
        It returns an object with .url, .method and .post properties.
    request-combine(request, [overridden parameters = ()])
        This function can be used to modify the object returned by form. (preliminary)
        The second parameter behaves like that parameter of form.
    match(