Web Scraping Service
This document defines a Web service and a simple XML vocabulary for scraping resources from the Web and returning them as XML documents representing the information content of each resource, or a portion of it. This is a critical component of a Mashup Server, as it provides a bridge between Web services and the wealth of information already present on the Web.
Status: This spec has no official status at this time, and is likely to evolve rapidly.
Web Scraping Language
The primary format for controlling the extraction of data from a web page and returning it as XML is the web scraping language.
<scrape>
[ steps ]*
</scrape>
The scrape element is a container for a sequence of scraping steps. Each step is executed in a context that includes:
- a primary input, which is an XML node list.
- a set of named variables, which are XML node lists.
The output of a sequence of scraping steps is an XML node list. A node list can be serialized as children (or attributes) of an XML element wrapper.
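The step context and the wrapper serialization described above could be modelled roughly as follows. This is only a sketch: the Context class, the serialize function and the "result" wrapper name are hypothetical and not defined by this spec, and only element nodes are handled (serializing nodes as attributes of the wrapper is left out).

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Context:
    """Hypothetical model of the context a scraping step runs in."""
    primary_input: list = field(default_factory=list)  # an XML node list
    variables: dict = field(default_factory=dict)      # name -> XML node list


def serialize(nodes, wrapper="result"):
    """Serialize a node list as children of a wrapper element.

    The wrapper name is an assumption; the spec does not fix one.
    """
    root = ET.Element(wrapper)
    for node in nodes:
        root.append(node)
    return ET.tostring(root, encoding="unicode")
```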
Most attribute values can be parameterized using an attribute value template of the form:
attributeValueTemplate ::= CharData? (( openBrace | closeBrace | variableName ) CharData?)*
CharData ::= [^{}]*
openBrace ::= '{{'
closeBrace ::= '}}'
variableName ::= '{$' NCName '}'
The effective value of an attribute value template is obtained by replacing '{{' with '{' and '}}' with '}', and replacing each '{$variable}' with the value of the named variable. To compute that value, the string-value of each element, text, attribute, and namespace node in the variable's node list is calculated in order, and the string-values are joined with a space separator when the node list contains more than one node. Note that comments and processing instructions are not serialized in this form.
Many element values can be parameterized using an element value template. An element value template consists of XML in which both attribute values and text nodes are treated as described above for attribute value templates.
It is an error for the attribute or element value template to be improperly escaped, for the variable name to be undefined, or for the resulting value of the attribute or element value template to be illegal in the context of that attribute or element value.
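The expansion and error rules for attribute value templates can be sketched as below. The names expand_avt and TemplateError are hypothetical, the NCName pattern is simplified, and each variable is assumed to be supplied already reduced to a list of string-values, one per node. Checking that the result is legal for its attribute is not covered.

```python
import re

# One token per match: an escaped brace, a variable reference, or a stray brace.
# Assumption: NCName is approximated by a simple name pattern in this sketch.
_TOKEN = re.compile(r"\{\{|\}\}|\{\$([A-Za-z_][\w.\-]*)\}|[{}]")


class TemplateError(ValueError):
    """Raised for improper escaping or an undefined variable."""


def expand_avt(template, variables):
    """Expand an attribute value template.

    `variables` maps a variable name to a list of string-values.
    """
    def replace(match):
        token = match.group(0)
        if token == "{{":
            return "{"
        if token == "}}":
            return "}"
        name = match.group(1)
        if name is None:
            # A lone '{' or '}' that is not part of any valid token.
            raise TemplateError(f"unescaped brace at offset {match.start()}")
        if name not in variables:
            raise TemplateError(f"undefined variable: {name}")
        # Multiple nodes are joined with a single space.
        return " ".join(variables[name])

    return _TOKEN.sub(replace, template)
```

For instance, with a variable named path holding one node whose string-value is "a", the template "/{$path}?q={{x}}" expands to "/a?q={x}".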
Scraping steps
<scrape> step
<scrape as-variable="xs:NCName"?>
[ steps ]*
</scrape> *
Besides its use as an outer container, the scrape element can also be nested. If the primary output of the inner scrape is used by following steps, nesting scrape elements is essentially a no-op (but facilitates XInclude). However, the result of a nested scrape can be placed in a named variable using the "as-variable" attribute.
- @as-variable OPTIONAL xs:NCName: instead of passing the resource to the next step as the primary input, the resource can be assigned to a variable, and referenced in following steps by that variable name. The next step, if it requires an input, will be passed the primary input unchanged.
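The routing implied by "as-variable" can be sketched as follows. Steps are modelled as plain functions taking the primary input and the variables; run_steps and scrape are hypothetical names. The sketch also assumes that variables assigned inside a nested scrape are not visible outside it, which this spec does not state.

```python
def run_steps(steps, primary_input=None, variables=None):
    """Run (step, as_variable) pairs in order.

    Returns the final primary output and the variables that were set.
    """
    current = list(primary_input or [])
    variables = dict(variables or {})
    for step, as_variable in steps:
        result = step(current, variables)
        if as_variable is None:
            current = result                 # becomes the next primary input
        else:
            variables[as_variable] = result  # primary input passes on unchanged
    return current, variables


def scrape(steps):
    """Wrap a step sequence so it can be used as a single nested step."""
    def step(primary_input, variables):
        # Assumption: variables set inside the nested scrape stay local to it.
        output, _ = run_steps(steps, primary_input, variables)
        return output
    return step
```

Without "as-variable", the nested scrape simply hands its output to the next step, which matches the no-op behaviour described above.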
<http> step
<http url="xs:anyURI" method="get | post | ..."? username="xs:string"? password="xs:string"? as-variable="xs:NCName"?>
<param name="xs:token">xs:string</param>*
</http>
The http element fetches a resource from the supplied URL, constructing the request message, using the following attributes and child elements:
- @url REQUIRED xs:anyURI (avt): the resource to fetch
- @method OPTIONAL xs:token (avt): the method to use for the request. Defaults to "get". Not case-sensitive. (Limit to get|post?)
- @username OPTIONAL xs:string (avt): the username if the url requires basic authentication.
- @password OPTIONAL xs:string (avt): the password if the url requires basic authentication.
- @as-variable OPTIONAL xs:NCName: instead of passing the resource to the next step as the primary input, the resource can be assigned to a variable, and referenced in following steps by that variable name. The next step, if it requires an input, will be passed the primary input unchanged.
- param OPTIONAL xs:string (avt): An HTTP parameter to be sent with the request. The element content is the parameter value.
- param/@name REQUIRED xs:token (avt) (is that the right type?): The name of the parameter.
The http step's primary input is ignored. The http step's primary output is an XML node list consisting of a single node representing the resource as XML. For XML-based media types, the node list is the top-level nodes in the XML document. For HTML, the node list contains a tidied HTML element.
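One way the request message might be constructed from these attributes is sketched below, using only the Python standard library. The function name build_request is hypothetical. How parameters are transmitted is not defined by this spec: the sketch assumes a form-encoded body for "post" and a query string for every other method. Nothing is sent here, and tidying HTML responses is out of scope.

```python
import base64
import urllib.parse
import urllib.request


def build_request(url, method="get", params=None, username=None, password=None):
    """Build (but do not send) the request for an <http> step."""
    method = method.upper()  # @method is not case-sensitive
    encoded = urllib.parse.urlencode(params or [])
    data = None
    if encoded and method == "POST":
        # Assumption: POST parameters travel as a form-encoded body.
        data = encoded.encode("ascii")
    elif encoded:
        # Assumption: otherwise parameters are appended to the query string.
        separator = "&" if "?" in url else "?"
        url = url + separator + encoded
    request = urllib.request.Request(url, data=data, method=method)
    if username is not None:
        # Basic authentication header built from @username and @password.
        credentials = f"{username}:{password or ''}".encode("utf-8")
        token = base64.b64encode(credentials).decode("ascii")
        request.add_header("Authorization", "Basic " + token)
    return request
```

Passing the returned object to urllib.request.urlopen would perform the fetch.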
[faults tbd]
<xml> step
<xml as-variable="xs:NCName"?>*
...
</xml>
The xml step defines the primary input of the next step, or a variable, as a well-formed XML fragment.
- @as-variable OPTIONAL xs:NCName: instead of passing the resource to the next step as the primary input, the resource can be assigned to a variable, and referenced in following steps by that variable name. The next step, if it requires an input, will be passed the primary input unchanged.
[Note: uses reserved word.]
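Turning the content of an xml step into a node list could look like the sketch below. The function name fragment_to_node_list and the temporary "wrapper" element are hypothetical. ElementTree keeps only element nodes as list items, so top-level text in the fragment is not returned as separate nodes here.

```python
import xml.etree.ElementTree as ET


def fragment_to_node_list(fragment):
    """Parse a well-formed XML fragment into a list of top-level elements."""
    # A fragment may have several top-level elements, so it is wrapped in a
    # temporary root before parsing. Malformed input raises ET.ParseError.
    wrapper = ET.fromstring("<wrapper>" + fragment + "</wrapper>")
    return list(wrapper)
```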
<xpath> step
<xpath expression="xs:string" trim="xs:string"? as-variable="xs:NCName"? />*
The xpath step evaluates an XPath expression against the primary input and returns an XML node list, obtained by concatenating the results of evaluating the expression once for each node in the input node list, with the following XPath evaluation context:
- The context node is the node in the node list currently being evaluated.
- The context position is the position of the node in the node list.
- The context size is the number of nodes in the node list.
- The set of variable bindings is the set of variables declared in previous steps using the "as-variable" attribute.
- The function library is empty.
- The set of namespace declarations is the set of declarations in scope on the <xpath> element.
If a "trim" expression is specified, each node list produced by the evaluation above is further trimmed by removing the nodes that the "trim" expression identifies. The "trim" expression is evaluated using the context above, except that:
- The context node is the result node subject to trimming.
- The context position is the position of the node in the node list resulting from the above evaluation.
- The context size is the number of nodes in the node list resulting from the above evaluation.
The following attributes are defined for the xpath element:
- @expression REQUIRED xs:string (avt?): XPath 1.0 expression to be evaluated.
- @trim OPTIONAL xs:string (avt?): XPath 1.0 expression, evaluated relative to each node selected by the primary expression, identifying nodes that should be trimmed from the resulting tree.
- @as-variable OPTIONAL xs:NCName: instead of passing the resource to the next step as the primary input, the resource can be assigned to a variable, and referenced in following steps by that variable name. The next step, if it requires an input, will be passed the primary input unchanged.
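The evaluate, concatenate and trim behaviour can be approximated as in the sketch below. It uses ElementTree, which supports only a subset of XPath 1.0 and has no variable bindings or context position and size, so it illustrates the data flow and does not conform to the context rules above. The function name xpath_step is hypothetical, and only element nodes are handled.

```python
import copy
import xml.etree.ElementTree as ET


def xpath_step(nodes, expression, trim=None):
    """Evaluate `expression` on each input node and concatenate the results.

    If `trim` is given, the elements it selects relative to each result node
    are removed from a copy of that result node.
    """
    results = []
    for node in nodes:
        for match in node.findall(expression):
            match = copy.deepcopy(match)  # leave the input tree untouched
            if trim is not None:
                # ElementTree has no parent pointers, so build a lookup first.
                parents = {child: parent
                           for parent in match.iter() for child in parent}
                for unwanted in match.findall(trim):
                    parent = parents.get(unwanted)
                    if parent is not None:  # the result node itself is kept
                        parent.remove(unwanted)
            results.append(match)
    return results
```

A fuller XPath 1.0 engine would be needed to honour variable bindings and the namespace declarations in scope on the xpath element.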
[TBD: XSLT step? Some built-in control over caching?]
Web Scraping service
The basic idea is to send the service a control structure in the format described above (a scrape element) and receive a block of XML in return.
[TBD]
Tooling support
Simple mockup.