[Wsf-general] Project idea for a new approach to schemas and data binding

Sanjiva Weerawarana sanjiva at wso2.com
Thu Apr 5 12:25:53 PDT 2007


Very nice writeup! +1 for posting it .. I will link to it highlighting the 
comment "any damn fool could produce a better data format than XML." ;-)

Sanjiva.

James Clark wrote:
> This is a write-up of an idea I've been thinking about for quite a
> while.  I did an off-the-cuff presentation of this to Sanjiva when he
> was in Bangkok a few days ago.  This message is an attempt to
> communicate this to everybody else. There's rather a lot of discursive,
> motivating material at the beginning: the meat of the message is towards
> the end. This is because I am planning to use this message as the basis of
> my first blog entry (I've been thinking about starting a blog for some
> time, but in trying to write the first blog entry, I am beginning to
> understand why novelists have such a hard time writing the first
> sentence of a novel), unless of course you all tell me the idea is
> useless and/or incomprehensible.  So please don't be shy about
> expressing your opinions on this idea.  I want to make sure my first
> blog entry is worth reading.
> 
> I see the real pain-point for distributed computing at the moment as not
> the messaging framework but the handling of the payload.  A successful
> distributed computing platform needs
> 
> - a payload format
> - a way to express a contract that a payload must meet
> - a way to process a payload that may conform to one or more contracts
> 
> that is
> 
> - suitable for average, relatively low-skill programmers
> - allows for loose coupling (version evolution, extensibility,
> suitability for a wide variety of implementation technologies)
> 
> For the payload format, XML has to be the mainstay, not because it's
> technically wonderful, but because of the extraordinary breadth of
> adoption that it has succeeded in achieving.  This is where the JSON (or
> YAML) folks are really missing the point by proudly pointing to the
> technical advantages of their format: any damn fool could produce a
> better data format than XML.
> 
> We also have to live in a world where XSD is currently dominant as the
> wire-format for the contract (thank you, W3C, Microsoft and IBM).
> 
> But I think it's fairly obvious that current XML/XSD databinding
> technologies have major weaknesses when considered as a solution to the
> problem of payload processing for a distributed computing platform. The
> two basic databinding techniques I see today are:
> 
> - Generating XSD from an implementation in a statically typed language
> which includes optional annotations; this provides a great developer
> experience, but from a coupling perspective doesn't seem much of an
> improvement beyond CORBA or DCOM.  The other problem is that it's tough
> to do this in a dynamically typed language (absent sophisticated type
> inference or mandatory annotations).
> 
> - Generating programming language stubs from an XSD which includes
> optional annotations.  This is problematic from the developer experience
> point of view: there's a mismatch between XML's fundamental structures,
> attributes and elements, which are optimized for imposing structure on
> text, and the terms in which developers naturally think of data
> structures.  Beyond this inherent problem, it's hard to author schemas
> using XSD and even harder to author schemas that have the right
> loose-coupling properties.  And the tooling often introduces additional
> coupling problems.
> 
> This pain is experienced most sharply at the moment in the SOAP world,
> because the big commercial players have made a serious investment in
> trying to produce tools that work for the average developer.  But I
> believe the REST world has basically the same problem: it's not really
> feeling the pain at the moment because REST solutions are mostly created
> by relatively elite developers who are comfortable dealing with XML
> directly.
> 
> The REST world also takes a less XML-centric view of the world, but for
> non-XML payload formats (JSON, or property-value pairs) their only
> solution to the contract problem is a MIME type, which I think is
> totally insufficient as a contract mechanism for enterprise-quality
> distributed computing.  For example, it's not enough to say "accessing
> this URI will give you JSON"; there needs to be a description of the
> structure of the JSON, and that description needs to be machine
> readable.
> 
> Some people propose solving the XML-processing problem by adopting an
> XML-centric processing model, for which the leading technologies are
> XQuery and XSLT2. The fundamental problem here is the XQuery/XPath data
> model. I'm not criticizing the WGs' efforts: they've done about as good
> a job as could be done given the constraints they were working under.
> But there is no way it can overcome the constraint that a data model
> based around XML and XSD is just not a very good data model for
> general-purpose computing. The structures of XML (attributes, elements
> and text) are those of SGML and these come from the world of markup.
> Considered as general purpose data structures, they suck pretty badly.
> There's a fundamental lack of composability. Why do we need both
> elements and attributes?  Why can't attributes contain elements?  Why is
> the type of thing that can occur as the content of an element not the
> same as the type of thing that can occur as a document? Why do we still
> have cruft like processing instructions and DTDs? XSD makes a (misguided
> in my view) attempt to add an OO/programming language veneer on top.  But
> it can't solve the basic problems, and, in my view, this veneer ends up
> making things worse not better.
> 
> I think there's some real progress being made in the programming
> language world.  In particular I would single out Microsoft's LINQ work.
> My doubts on this are with its emphasis on static typing.  While I think
> static typing is invaluable within a single, controlled system, I
> think for a distributed system the costs in terms of tight coupling
> often outweigh the benefits.  I believe this is less the case if the
> typing is structural rather than named.  But although LINQ (or at least
> newer versions of C#) has introduced some welcome structural typing
> features, named typing is still thoroughly dominant.
> 
> In the Java world, there's been a depressing lack of innovation at the
> language level from Sun; outside of Sun, I would single out Scala from
> EPFL (which can run on a JVM).  This adds some nice functional features
> which are smoothly integrated with Java-ish OO features.  XML is
> fundamentally not OO: XML is all about separating data from processing,
> whereas OO is all about combining data and processing.  Functional
> programming is a much better fit for XML: the problem is making it
> usable by the average programmer, for whom the functional programming
> mindset is very foreign.
> 
> This brings me to the main point I want to make.  There seems to me to
> be another approach for improving things in this area, which I haven't
> seen being proposed (maybe I just haven't looked in the right places).
> The basic idea is to have a schema language that operates at a different
> semantic level.  In the following description I'll call this
> yet-to-be-designed language TEDI (Type Expressions for Data Interchange,
> pronounced "Teddy").
> 
> If you look at the major scripting languages today, I think it's
> striking that at a very high level, their data structures are pretty
> similar and are composed from:
> 
> - arrays
> - maps
> - scalars/primitives or whatever you want to call them
> 
> This goes for Perl, Python, Ruby, Javascript, AWK. (PHP's array
> datastructure is a little idiosyncratic.) The SOAP data model is also
> not dissimilar.
> 
> When you drill down into the details, there are of course a lot of
> differences:
> 
> - some languages have fixed-length tuples as well as variable-length
> arrays
> 
> - most languages distinguish between a struct that has a fixed set of
> identifiers as keys and a map that can have an unlimited set of keys
> (though there are often restrictions on the types of keys, for example,
> to prohibit mutable types)
> 
> - there's a wide variety of primitives: almost all languages have
> strings (though they differ in whether they are mutable) and numbers;
> beyond that, many languages have booleans, a null value, some sort of
> date-time support
> 
> TEDI would be defined in terms of a generic data model that makes a
> tasteful restricted choice from these programming languages' data
> structures: not limiting the choice to the lowest common denominator,
> but leaving out frills and focusing on the basics and on things that can be
> naturally mapped into each language.  At least initially, I think I
> would restrict TEDI to trees rather than handle general graphs. Although
> graphs are important, I think the success of JSON shows that trees are
> good enough as a programmer-friendly data interchange mechanism.
> 
> I would envisage both an XML and a non-XML syntax for TEDI.  The non-XML
> syntax might have a JSON flavour.  For example, a schema might look like
> this:
> 
>   { url: String, width: Integer?, height: Integer?, title: String? }
> 
> This would specify a struct with 4 keys: the value of the "url" key is a
> string; the value of the "width" key is an integer or null.  You can thus
> think of the schema as being a type expression for a generic scripting
> language data structure.
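
A minimal Python sketch of what "schema as a type expression" could mean for the example above. TEDI does not exist yet, so everything here is an assumption made for illustration: the names PICTURE_SCHEMA and conforms, the dict form of the schema, and the choice to treat a missing key the same as a null value.

```python
# Hypothetical in-memory form of the example schema: key -> (type, nullable).
PICTURE_SCHEMA = {
    "url": (str, False),
    "width": (int, True),
    "height": (int, True),
    "title": (str, True),
}


def conforms(value, schema):
    """Return True if `value` is a dict that matches the struct `schema`."""
    if not isinstance(value, dict):
        return False
    # A struct has a fixed set of keys, so unknown keys are rejected.
    if any(key not in schema for key in value):
        return False
    for key, (expected_type, nullable) in schema.items():
        item = value.get(key)  # assumption: a missing key counts as null
        if item is None:
            if not nullable:
                return False
        # bool is a subclass of int in Python, so it is excluded explicitly.
        elif isinstance(item, bool) or not isinstance(item, expected_type):
            return False
    return True


picture = {"url": "http://www.example.com/pic.jpg", "title": "A fine picture"}
print(conforms(picture, PICTURE_SCHEMA))
print(conforms({"url": "x", "width": "640"}, PICTURE_SCHEMA))
```

The first call accepts the structure because only "url" is required; the second rejects it because "width" holds a string rather than an integer.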
> 
> The key design goal for TEDI would be to make it easy and
> natural for a scripting-language programmer to work with.
> 
> There's one other big piece that's needed to make TEDI work:
> annotations.  Each component of a TEDI schema can have multiple,
> independent annotations, which may be inline or externally attached in
> some way. Each annotation has a prefix that identifies a binding.  A
> TEDI binding specification has to be developed for each programming
> language and each serialization that will be used with TEDI.
> 
> The most important TEDI binding specification would be the one for XML. 
> This specifies for a combination of
> 
> - a TEDI schema, 
> - XML binding annotations for the TEDI schema, and
> - an instance of the generic TEDI data model conforming to the schema
> 
> which XML infosets are considered correct representations of the
> instance, and also identifies one of these infosets as the canonical
> representation. The XML binding annotations should always be optional:
> there should be a default XML serialization of any TEDI instance.
> 
> For example, an instance of the example schema above might get
> serialized as
> 
> <root>
> <url>http://www.example.com/pic.jpg</url>
> <title>A fine picture</title>
> </root>
> 
> But with an annotation
> 
>   @xml.element(name="picture")
>   { url: String, width: Integer?, height: Integer?, title: String? }
> 
> it might get serialized as
> 
> <picture>
> <url>http://www.example.com/pic.jpg</url>
> <title>A fine picture</title>
> </picture>
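
The two serializations above can be sketched in Python with the standard library. This is only an illustration under stated assumptions: the function name to_xml is made up, its element_name parameter stands in for the effect of the @xml.element(name=...) annotation, and omitting null-valued keys is one possible default, not something the proposal fixes.

```python
import xml.etree.ElementTree as ET


def to_xml(instance, element_name="root"):
    """Serialize a flat struct; element_name mimics @xml.element(name=...)."""
    root = ET.Element(element_name)
    for key, value in instance.items():
        if value is None:
            continue  # assumption: null-valued keys are simply omitted
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


picture = {
    "url": "http://www.example.com/pic.jpg",
    "width": None,
    "height": None,
    "title": "A fine picture",
}
print(to_xml(picture))                          # default serialization
print(to_xml(picture, element_name="picture"))  # with the annotation applied
```

The output is the same content as the examples above, written on a single line without indentation.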
> 
> Let's try and make this more concrete by imagining what it would look
> like for a particular scripting language, say Python.  First of all
> people in the Python community would need to get together to create a
> TEDI binding for Python.  This would work in an analogous way to the XML
> binding. It would specify for a combination of
> 
> - a TEDI schema, 
> - Python binding annotations for the TEDI schema, and
> - an instance of the generic TEDI data model conforming to the schema
> 
> which Python data structures are considered representations of the
> instance, and also identify one of these data structures as the
> canonical representation.
> 
> The API would be very simple.  You would have a TEDI module that
> provided functions to create schema objects in various ways.  The
> simplest way would be to create it from a string containing the non-XML
> representation of the TEDI schema complete with any inline annotations.
> Any XML and Python annotations would be used; annotations from other
> bindings would be ignored.  The schema object would provide two
> fundamental operations:
> 
> - loadXML: this takes XML and returns a Python structure, throwing an
> exception if the XML is not valid according to the TEDI schema
> 
> - saveXML: this takes a Python structure and returns/outputs XML,
> throwing an exception if the Python structure is not valid according to
> the schema
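
A rough sketch of how a schema object with these two operations might behave in Python. The method names loadXML and saveXML come from the description above; everything else is hypothetical, including the StructSchema and ValidationError names and the constructor that takes a field table instead of parsing the non-XML schema syntax. It handles only a flat struct of strings and integers.

```python
import xml.etree.ElementTree as ET


class ValidationError(Exception):
    """Raised when XML or a Python structure does not match the schema."""


class StructSchema:
    """Hypothetical schema object for a flat struct of strings and integers."""

    def __init__(self, element_name, fields):
        self.element_name = element_name
        self.fields = fields  # key -> (python_type, nullable)

    def loadXML(self, text):
        try:
            root = ET.fromstring(text)
        except ET.ParseError as exc:
            raise ValidationError(f"not well-formed: {exc}") from exc
        if root.tag != self.element_name:
            raise ValidationError(f"unexpected element <{root.tag}>")
        result = {key: None for key in self.fields}
        for child in root:
            if child.tag not in self.fields:
                raise ValidationError(f"unexpected element <{child.tag}>")
            python_type, _nullable = self.fields[child.tag]
            try:
                result[child.tag] = python_type(child.text or "")
            except ValueError as exc:
                raise ValidationError(f"bad value for {child.tag}") from exc
        self._check(result)
        return result

    def saveXML(self, value):
        self._check(value)
        root = ET.Element(self.element_name)
        for key in self.fields:
            if value.get(key) is not None:
                ET.SubElement(root, key).text = str(value[key])
        return ET.tostring(root, encoding="unicode")

    def _check(self, value):
        if not isinstance(value, dict) or set(value) - set(self.fields):
            raise ValidationError("not a struct with the expected keys")
        for key, (python_type, nullable) in self.fields.items():
            item = value.get(key)
            if item is None and not nullable:
                raise ValidationError(f"missing required key {key!r}")
            if item is not None and not isinstance(item, python_type):
                raise ValidationError(f"wrong type for key {key!r}")


schema = StructSchema("picture", {
    "url": (str, False), "width": (int, True),
    "height": (int, True), "title": (str, True),
})
picture = schema.loadXML(
    "<picture><url>http://www.example.com/pic.jpg</url><width>640</width></picture>"
)
print(picture["width"] + 1)     # width arrives as an int, not a string
print(schema.saveXML(picture))
```

The point of the sketch is that validation and data binding happen in the same step: the caller either gets a plain dict with correctly typed values or an exception.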
> 
> XML is not the only possible serialization.  The JSON community could
> develop a JSON binding. If you implemented that, then your API would
> have loadJSON and saveJSON methods as well.
> 
> One complication that must be handled in order to make this
> industrial-strength is streaming.  A good first step would be to be able to
> handle the pattern where the document element contains zero or more
> header elements, and then a possibly very large number of entry
> elements, each of which is not large; the streaming solution you want in
> this case is for the API to deliver the entries as an iterator rather
> than an array.
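
The iterator pattern described above can be approximated today with the standard library's incremental parser. This is a sketch, not a TEDI design: the function name iter_entries and the element names feed, header, entry and id are made up for the example, header elements are simply skipped, and each entry is delivered as a plain dict of child tag to text.

```python
import io
import xml.etree.ElementTree as ET


def iter_entries(source, entry_tag="entry"):
    """Yield each entry element as a dict, one at a time, while parsing."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == entry_tag:
            yield {child.tag: child.text for child in elem}
            # Release the entry's children once it has been delivered.
            elem.clear()


feed = io.BytesIO(
    b"<feed><header>example</header>"
    b"<entry><id>1</id></entry><entry><id>2</id></entry></feed>"
)
for entry in iter_entries(feed):
    print(entry)
```

Because iter_entries is a generator, the caller sees the first entry before the rest of the document has been read.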
> 
> Another challenge in designing the TEDI XML binding is handling
> extensibility.  I think the key here is for one of the TEDI *primitives*
> to be an XmlElement (or maybe XmlContent).  (This might also be useful
> in dealing with XML mixed content.)  With different TEDI schemas you
> should be able to get quite different representations out of the same
> XML document.  For a SOAP message, you might have a very generic TEDI
> schema that represents it as an array of headers and a payload (all
> being XmlElements); or you might have a TEDI schema for a specific type
> of message that represented the payload as a particular kind of
> structure. 
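
As an illustration of the "very generic" end of that range, here is a sketch that maps a SOAP 1.1 message to an array of headers and a payload, all left as raw XML elements. The function name load_generic_soap and the shape of the returned dict are assumptions; ElementTree's Element plays the role of the proposed XmlElement primitive.

```python
import xml.etree.ElementTree as ET

SOAP = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 namespace


def load_generic_soap(text):
    """Return {'headers': [Element, ...], 'payload': Element or None}."""
    envelope = ET.fromstring(text)
    header = envelope.find(f"{{{SOAP}}}Header")
    body = envelope.find(f"{{{SOAP}}}Body")
    if body is None:
        raise ValueError("SOAP envelope has no Body")
    return {
        "headers": list(header) if header is not None else [],
        "payload": body[0] if len(body) else None,
    }


message = (
    '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
    "<soap:Header><trace/></soap:Header>"
    "<soap:Body><picture><url>http://www.example.com/pic.jpg</url></picture>"
    "</soap:Body></soap:Envelope>"
)
parsed = load_generic_soap(message)
print(len(parsed["headers"]), parsed["payload"].tag)
```

A schema for a specific message type would instead bind the payload element to a typed structure, as in the picture example.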
> 
> This shows how you could fit TEDI into a world where XML is the dominant
> wire format, but still leverage other more suitable wire formats when
> appropriate.
> 
> But how do you interop with a world that uses XSD as the wire format for
> contracts?  The minimum is to create a tool that can take a TEDI schema
> with XML annotations and generate an XSD.  There'll be limits because of
> the limited power of XSD (and these will need to be taken into
> consideration in designing the TEDI XML binding): some of the
> constraints of the TEDI schema might not be captured by the XSD.  But
> that's a normal situation: there are often complex constraints on an XML
> document being interchanged that cannot be expressed in XSD.
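
A sketch of what the simplest form of such a generator could look like for the flat struct used earlier. The function name struct_to_xsd, the field table format and the type mapping are all hypothetical. Two mapping choices are assumptions rather than part of the proposal: nullable keys become elements with minOccurs="0", and xs:all is used so that the XSD does not impose an element order the struct does not have.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
XSD_TYPES = {"String": "xs:string", "Integer": "xs:integer"}


def struct_to_xsd(element_name, fields):
    """Build an XSD for a flat struct; fields maps key -> (tedi_type, nullable)."""
    ET.register_namespace("xs", XS)
    schema = ET.Element(f"{{{XS}}}schema")
    element = ET.SubElement(schema, f"{{{XS}}}element", name=element_name)
    complex_type = ET.SubElement(element, f"{{{XS}}}complexType")
    group = ET.SubElement(complex_type, f"{{{XS}}}all")
    for key, (tedi_type, nullable) in fields.items():
        attrs = {"name": key, "type": XSD_TYPES[tedi_type]}
        if nullable:
            attrs["minOccurs"] = "0"  # assumption: null maps to an absent element
        ET.SubElement(group, f"{{{XS}}}element", attrs)
    return ET.tostring(schema, encoding="unicode")


print(struct_to_xsd("picture", {
    "url": ("String", False),
    "width": ("Integer", True),
    "height": ("Integer", True),
    "title": ("String", True),
}))
```

Anything the TEDI schema says that has no XSD counterpart would simply be dropped at this step.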
> 
> A more difficult task is to take an XSD and generate a TEDI schema together
> with XML binding annotations.  This would be one of the main things that
> would drive adding complexity to the TEDI XML binding annotations.  I
> expect that the work of the XML Schema Patterns for Databinding WG would
> be valuable input on what was really needed.
> 
> In the future, there's still hope that the wire-format for the contract
> need not always be XSD: WSDL 2.0 makes a significant effort not to
> restrict itself to XSD; so you could potentially publish a WSDL with
> both the XSD and the TEDI for a web service.
> 
> The closest thing I've seen to TEDI is Paul Prescod's XBind language
> (http://www.prescod.net/xml/xbind/), but it has a rather different
> philosophy in that it separates validation from data binding, whereas
> TEDI integrates them.  Another difference is that Paul has written some
> code, whereas TEDI is completely vaporware at this point.
> 
> The first step in implementing TEDI would be to pick a scripting
> language (probably Ruby or Python), and do the implementation in and for
> that language.  Eventually it would be desirable to have a
> high-performance modular C engine, that could be integrated into each
> scripting language that is implemented in C, so that serialization and
> deserialization performance via TEDI would be more competitive with the
> language's native facilities (it would be interesting to see how big a
> hit TEDI would be). Similarly you would want a Java implementation to
> integrate with dynamic languages that are implemented in Java (Rhino,
> Groovy, JRuby).
> 
> James

-- 
Sanjiva Weerawarana, Ph.D.
Founder, Chairman & CEO; WSO2, Inc.; http://www.wso2.com/
email: sanjiva at wso2.com; cell: +94 77 787 6880; fax: +1 509 691 2000

"Oxygenating the Web Service Platform."



