[Wsf-general] Project idea for a new approach to schemas and data binding

Davanum Srinivas dims at wso2.com
Thu Apr 5 18:34:11 PDT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Excellent. Could you please blog it? :)

Sanjiva Weerawarana wrote:
> Very nice writeup! +1 for posting it .. I will link to it highlighting
> the comment "any damn fool could produce a better data format than XML."
> ;-)
> 
> Sanjiva.
> 
> James Clark wrote:
>> This is a write-up of an idea I've been thinking about for quite a
>> while.  I did an off-the-cuff presentation of this to Sanjiva when he
>> was in Bangkok a few days ago.  This message is an attempt to
>> communicate this to everybody else. There's rather a lot of discursive,
>> motivating material at the beginning: the meat of the message is towards
>> the end. This is because I am planning to use this message as the basis of
>> my first blog entry (I've been thinking about starting a blog for some
>> time, but in trying to write the first blog entry, I am beginning to
>> understand why novelists have such a hard time writing the first
>> sentence of a novel), unless of course you all tell me the idea is
>> useless and/or incomprehensible.  So please don't be shy about
>> expressing your opinions on this idea.  I want to make sure my first
>> blog entry is worth reading.
>>
>> I see the real pain-point for distributed computing at the moment as not
>> the messaging framework but the handling of the payload.  A successful
>> distributed computing platform needs
>>
>> - a payload format
>> - a way to express a contract that a payload must meet
>> - a way to process a payload that may conform to one or more contracts
>>
>> that is
>>
>> - suitable for average, relatively low-skill programmers
>> - allows for loose coupling (version evolution, extensibility,
>> suitability for a wide variety of implementation technologies)
>>
>> For the payload format, XML has to be the mainstay, not because it's
>> technically wonderful, but because of the extraordinary breadth of
>> adoption that it has succeeded in achieving.  This is where the JSON (or
>> YAML) folks are really missing the point by proudly pointing to the
>> technical advantages of their format: any damn fool could produce a
>> better data format than XML.
>>
>> We also have to live in a world where XSD is currently dominant as the
>> wire-format for the contract (thank you, W3C, Microsoft and IBM).
>>
>> But I think it's fairly obvious that current XML/XSD databinding
>> technologies have major weaknesses when considered as a solution to the
>> problem of payload processing for a distributed computing platform. The
>> two basic databinding techniques I see today are:
>>
>> - Generating XSD from an implementation in a statically typed language
>> which includes optional annotations; this provides a great developer
>> experience, but from a coupling perspective doesn't seem much of an
>> improvement beyond CORBA or DCOM.  The other problem is that it's tough
>> to do this in a dynamically typed language (absent sophisticated type
>> inference or mandatory annotations).
>>
>> - Generating programming language stubs from an XSD which includes
>> optional annotations.  This is problematic from the developer experience
>> point of view: there's a mismatch between XML's fundamental structures,
>> attributes and elements, which are optimized for imposing structure on
>> text, and the terms in which developers naturally think of data
>> structures.  Beyond this inherent problem, it's hard to author schemas
>> using XSD and even harder to author schemas that have the right
>> loose-coupling properties.  And the tooling often introduces additional
>> coupling problems.
>>
>> This pain is experienced most sharply at the moment in the SOAP world,
>> because the big commercial players have made a serious investment in
>> trying to produce tools that work for the average developer.  But I
>> believe the REST world has basically the same problem: it's not really
>> feeling the pain at the moment because REST solutions are mostly created
>> by relatively elite developers who are comfortable dealing with XML
>> directly.
>>
>> The REST world also takes a less XML-centric view of the world, but for
>> non-XML payload formats (JSON, or property-value pairs) their only
>> solution to the contract problem is a MIME type, which I think is
>> totally insufficient as a contract mechanism for enterprise-quality
>> distributed computing.  For example, it's not enough to say "accessing
>> this URI will give you JSON"; there needs to be a description of the
>> structure of the JSON, and that description needs to be machine
>> readable.
>>
>> Some people propose solving the XML-processing problem by adopting an
>> XML-centric processing model, for which the leading technologies are
>> XQuery and XSLT2. The fundamental problem here is the XQuery/XPath data
>> model. I'm not criticizing the WGs' efforts: they've done about as good
>> a job as could be done given the constraints they were working under.
>> But there is no way it can overcome the constraint that a data model
>> based around XML and XSD is just not a very good data model for
>> general-purpose computing. The structures of XML (attributes, elements
>> and text) are those of SGML and these come from the world of markup.
>> Considered as general purpose data structures, they suck pretty badly.
>> There's a fundamental lack of composability. Why do we need both
>> elements and attributes?  Why can't attributes contain elements?  Why is
>> the type of thing that can occur as the content of an element not the
>> same as the type of thing that can occur as a document? Why do we still
>> have cruft like processing instructions and DTDs? XSD makes a (misguided
>> in my view) attempt to add an OO/programming language veneer on top.  But
>> it can't solve the basic problems, and, in my view, this veneer ends up
>> making things worse not better.
>>
>> I think there's some real progress being made in the programming
>> language world.  In particular I would single out Microsoft's LINQ work.
>> My doubts on this are with its emphasis on static typing.  While I think
>> static typing is invaluable within a single, controlled system, I
>> think for a distributed system the costs in terms of tight coupling
>> often outweigh the benefits.  I believe this is less the case if the
>> typing is structural rather than named.  But although LINQ (or at least
>> newer versions of C#) have introduced some welcome structural typing
>> features, named typing is still thoroughly dominant.
>>
>> In the Java world, there's been a depressing lack of innovation at the
>> language level from Sun; outside of Sun, I would single out Scala from
>> EPFL (which can run on a JVM).  This adds some nice functional features
>> which are smoothly integrated with Java-ish OO features.  XML is
>> fundamentally not OO: XML is all about separating data from processing,
>> whereas OO is all about combining data and processing.  Functional
>> programming is a much better fit for XML: the problem is making it
>> usable by the average programmer, for whom the functional programming
>> mindset is very foreign.
>>
>> This brings me to the main point I want to make.  There seems to me to
>> be another approach for improving things in this area, which I haven't
>> seen being proposed (maybe I just haven't looked in the right places).
>> The basic idea is to have a schema language that operates at a different
>> semantic level.  In the following description I'll call this
>> yet-to-be-designed language TEDI (Type Expressions for Data Interchange,
>> pronounced "Teddy").
>>
>> If you look at the major scripting languages today, I think it's
>> striking that at a very high level, their data structures are pretty
>> similar and are composed from:
>>
>> - arrays
>> - maps
>> - scalars/primitives or whatever you want to call them
>>
>> This goes for Perl, Python, Ruby, Javascript, AWK. (PHP's array
>> datastructure is a little idiosyncratic.) The SOAP data model is also
>> not dissimilar.
>>
>> When you drill down into the details, there are of course a lot of
>> differences:
>>
>> - some languages have fixed-length tuples as well as variable-length
>> arrays
>>
>> - most languages distinguish between a struct that has a fixed set of
>> identifiers as keys and a map that can have an unlimited set of keys
>> (though there are often restrictions on the types of keys, for example,
>> to prohibit mutable types)
>>
>> - there's a wide variety of primitives: almost all languages have
>> strings (though they differ in whether they are mutable) and numbers;
>> beyond that, many languages have booleans, a null value, some sort of
>> date-time support
>>
>> TEDI would be defined in terms of a generic data model that makes a
>> tasteful restricted choice from these programming languages' data
>> structures: not limiting the choice to the lowest common denominator,
>> but leaving out frills and focusing on the basics and on things that can be
>> naturally mapped into each language.  At least initially, I think I
>> would restrict TEDI to trees rather than handle general graphs. Although
>> graphs are important, I think the success of JSON shows that trees are
>> good enough as a programmer-friendly data interchange mechanism.
>>
>> I would envisage both an XML and a non-XML syntax for TEDI.  The non-XML
>> syntax might have a JSON flavour.  For example, a schema might look like
>> this:
>>
>>   { url: String, width: Integer?, height: Integer?, title: String? }
>>
>> This would specify a struct with 4 keys: the value of the "url" key is a
>> string; the value of the "width" key is an integer or null.  You can thus
>> think of the schema as being a type expression for a generic scripting
>> language data structure.
>>
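
A minimal sketch of what "a type expression for a generic scripting language data structure" could mean in Python. Nothing here is part of any existing TEDI implementation: PICTURE_SCHEMA, conforms and the (type, nullable) tuples are made-up names, and treating a missing key the same as a null value is an assumption.

```python
# Hypothetical in-memory form of the schema
#   { url: String, width: Integer?, height: Integer?, title: String? }
# Each key maps to (python_type, nullable). Names are illustrative only.
PICTURE_SCHEMA = {
    "url": (str, False),
    "width": (int, True),
    "height": (int, True),
    "title": (str, True),
}


def conforms(value, schema):
    """Return True if `value` is a dict matching the struct schema."""
    if not isinstance(value, dict):
        return False
    # A struct has a fixed set of keys: reject anything not declared.
    if any(key not in schema for key in value):
        return False
    for key, (expected_type, nullable) in schema.items():
        item = value.get(key)  # assumption: a missing key counts as null
        if item is None:
            if not nullable:
                return False
        # bool is a subclass of int in Python; this sketch has no Boolean
        # primitive, so booleans are rejected outright.
        elif isinstance(item, bool) or not isinstance(item, expected_type):
            return False
    return True


picture = {"url": "http://www.example.com/pic.jpg", "title": "A fine picture"}
print(conforms(picture, PICTURE_SCHEMA))
```

The check is purely structural: any dict with the right keys and value types conforms, whatever code produced it.
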
>> The key design goal for TEDI would be to make it easy and
>> natural for a scripting-language programmer to work with.
>>
>> There's one other big piece that's needed to make TEDI work:
>> annotations.  Each component of a TEDI schema can have multiple,
>> independent annotations, which may be inline or externally attached in
>> some way. Each annotation has a prefix that identifies a binding.  A
>> TEDI binding specification has to be developed for each programming
>> language and each serialization that will be used with TEDI.
>>
>> The most important TEDI binding specification would be the one for
>> XML. This specifies, for a combination of
>>
>> - a TEDI schema,
>> - XML binding annotations for the TEDI schema, and
>> - an instance of the generic TEDI data model conforming to the schema
>>
>> which XML infosets are considered correct representations of the
>> instance, and also identifies one of these infosets as the canonical
>> representation. The XML binding annotations should always be optional:
>> there should be a default XML serialization of any TEDI instance.
>>
>> For example, an instance of the example schema above might get
>> serialized as
>>
>> <root>
>> <url>http://www.example.com/pic.jpg</url>
>> <title>A fine picture</title>
>> </root>
>>
>> But with an annotation
>>
>>   @xml.element(name="picture")
>>   { url: String, width: Integer?, height: Integer?, title: String? }
>>
>> it might get serialized as
>>
>> <picture>
>> <url>http://www.example.com/pic.jpg</url>
>> <title>A fine picture</title>
>> </picture>
>>
>> Let's try and make this more concrete by imagining what it would look
>> like for a particular scripting language, say Python.  First of all
>> people in the Python community would need to get together to create a
>> TEDI binding for Python.  This would work in an analogous way to the XML
>> binding. It would specify, for a combination of
>>
>> - a TEDI schema,
>> - Python binding annotations for the TEDI schema, and
>> - an instance of the generic TEDI data model conforming to the schema
>>
>> which Python data structures are considered representations of the
>> instance, and also identify one of these data structures as the
>> canonical representation.
>>
>> The API would be very simple.  You would have a TEDI module that
>> provided functions to create schema objects in various ways.  The
>> simplest way would be to create it from a string containing the non-XML
>> representation of the TEDI schema complete with any inline annotations.
>> Any XML and Python annotations would be used; annotations from other
>> bindings would be ignored.  The schema object would provide two
>> fundamental operations:
>>
>> - loadXML: this takes XML and returns a Python structure, throwing an
>> exception if the XML is not valid according to the TEDI schema
>>
>> - saveXML: this takes a Python structure and returns/outputs XML,
>> throwing an exception if the Python structure is not valid according to
>> the schema
>>
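
To make the loadXML/saveXML idea concrete, here is a rough sketch of such a schema object. It is only an illustration under stated assumptions: Schema, ValidationError, load_xml, save_xml and the element_name argument (standing in for an @xml.element(name=...) annotation) are hypothetical names, and the parser handles nothing beyond a flat struct of String and Integer fields.

```python
import re
import xml.etree.ElementTree as ET


class ValidationError(Exception):
    """Raised when XML or a Python structure does not match the schema."""


_PRIMITIVES = {"String": str, "Integer": int}  # only the types in the example
_FIELD = re.compile(r"\s*(\w+)\s*:\s*(\w+)(\?)?\s*$")


class Schema:
    def __init__(self, text, element_name="root"):
        # Parses only a flat struct such as "{ url: String, width: Integer? }".
        body = text.strip()
        if not (body.startswith("{") and body.endswith("}")):
            raise ValueError("expected a struct type expression")
        self.element_name = element_name
        self.fields = {}
        for part in body[1:-1].split(","):
            match = _FIELD.match(part)
            if match is None or match.group(2) not in _PRIMITIVES:
                raise ValueError("cannot parse field: %r" % part)
            name, type_name, optional = match.groups()
            self.fields[name] = (_PRIMITIVES[type_name], optional is not None)

    def load_xml(self, xml_text):
        """Parse XML into a dict, raising ValidationError if it is invalid."""
        try:
            element = ET.fromstring(xml_text)
        except ET.ParseError as exc:
            raise ValidationError("not well-formed: %s" % exc)
        if element.tag != self.element_name:
            raise ValidationError("unexpected element %r" % element.tag)
        result = dict.fromkeys(self.fields)  # every key starts as None
        for child in element:
            if child.tag not in self.fields:
                raise ValidationError("unexpected element %r" % child.tag)
            py_type, _ = self.fields[child.tag]
            try:
                result[child.tag] = py_type(child.text or "")
            except ValueError:
                raise ValidationError("bad value for %r" % child.tag)
        self._check(result)
        return result

    def save_xml(self, instance):
        """Serialize a dict to XML, raising ValidationError if it is invalid."""
        self._check(instance)
        element = ET.Element(self.element_name)
        for name in self.fields:
            if instance.get(name) is not None:
                ET.SubElement(element, name).text = str(instance[name])
        return ET.tostring(element, encoding="unicode")

    def _check(self, instance):
        if not isinstance(instance, dict) or set(instance) - set(self.fields):
            raise ValidationError("not a struct with the declared keys")
        for name, (py_type, optional) in self.fields.items():
            value = instance.get(name)
            if value is None:
                if not optional:
                    raise ValidationError("missing required key %r" % name)
            elif isinstance(value, bool) or not isinstance(value, py_type):
                raise ValidationError("wrong type for key %r" % name)


schema = Schema(
    "{ url: String, width: Integer?, height: Integer?, title: String? }",
    element_name="picture",
)
data = schema.load_xml(
    "<picture><url>http://www.example.com/pic.jpg</url></picture>"
)
print(data["url"], schema.save_xml(data))
```

Both directions run through the same check, so validation and data binding are integrated rather than being separate steps.
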
>> XML is not the only possible serialization.  The JSON community could
>> develop a JSON binding. If you implemented that, then your API would
>> have loadJSON and saveJSON methods as well.
>>
>> One complication that must be handled in order to make this
>> industrial-strength is streaming.  A good first step would be to be able to
>> handle the pattern where the document element contains zero or more
>> header elements, and then a possibly very large number of entry
>> elements, each of which is not large; the streaming solution you want in
>> this case is for the API to deliver the entries as an iterator rather
>> than an array.
>>
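
The iterator pattern for streaming can be sketched with the standard library's incremental parser. This is an illustration only: iter_entries and the "entry" element name are assumptions, entries are assumed to be direct children of the document element, header elements are simply skipped, and no schema validation is attempted.

```python
import io
import xml.etree.ElementTree as ET


def iter_entries(source, entry_tag="entry"):
    """Yield each entry element's children as a dict, one entry at a time.

    `source` is a file name or a binary file object; `entry_tag` is a
    hypothetical element name chosen for this illustration.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the document element
    for event, element in context:
        if event == "end" and element.tag == entry_tag:
            yield {child.tag: child.text for child in element}
            # Drop children already delivered so memory use stays bounded.
            root.clear()


document = (
    b"<feed><header>h</header>"
    b"<entry><id>1</id></entry><entry><id>2</id></entry></feed>"
)
for entry in iter_entries(io.BytesIO(document)):
    print(entry)
```

The caller sees an ordinary Python iterator and never holds the whole document in memory at once.
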
>> Another challenge in designing the TEDI XML binding is handling
>> extensibility.  I think the key here is for one of the TEDI *primitives*
>> to be an XmlElement (or maybe XmlContent).  (This might also be useful
>> in dealing with XML mixed content.)  With different TEDI schemas you
>> should be able to get quite different representations out of the same
>> XML document.  For a SOAP message, you might have a very generic TEDI
>> schema that represents it as an array of headers and a payload (all
>> being XmlElements); or you might have a TEDI schema for a specific type
>> of message that represented the payload as a particular kind of
>> structure.
>>
>> This shows how you could fit TEDI into a world where XML is the dominant
>> wire format, but still leverage other more suitable wire formats when
>> appropriate.
>>
>> But how do you interop with a world that uses XSD as the wire format for
>> contracts?  The minimum is to create a tool that can take a TEDI schema
>> with XML annotations and generate an XSD.  There'll be limits because of
>> the limited power of XSD (and these will need to be taken into
>> consideration in designing the TEDI XML binding): some of the
>> constraints of the TEDI schema might not be captured by the XSD.  But
>> that's a normal situation: there are often complex constraints on an XML
>> document being interchanged that cannot be expressed in XSD.
>>
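
A rough sketch of the TEDI-to-XSD direction for the flat struct example. The function name struct_to_xsd, the (type name, nullable) field representation and the mapping of String/Integer to xs:string/xs:integer are all assumptions made for illustration; choosing xs:all (children in any order) with minOccurs="0" for nullable keys is just one possible encoding.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# Assumed mapping from the two TEDI primitives in the example to XSD types.
XSD_TYPES = {"String": "xs:string", "Integer": "xs:integer"}


def struct_to_xsd(element_name, fields):
    """Build an XSD document (as a string) for a flat struct.

    `fields` maps key -> (TEDI type name, nullable), a hypothetical
    in-memory form of the schema; nullable keys become minOccurs="0".
    """
    schema = ET.Element("{%s}schema" % XS)
    element = ET.SubElement(schema, "{%s}element" % XS, name=element_name)
    complex_type = ET.SubElement(element, "{%s}complexType" % XS)
    group = ET.SubElement(complex_type, "{%s}all" % XS)
    for key, (type_name, nullable) in fields.items():
        attrs = {"name": key, "type": XSD_TYPES[type_name]}
        if nullable:
            attrs["minOccurs"] = "0"
        ET.SubElement(group, "{%s}element" % XS, attrs)
    return ET.tostring(schema, encoding="unicode")


print(struct_to_xsd("picture", {
    "url": ("String", False),
    "width": ("Integer", True),
    "height": ("Integer", True),
    "title": ("String", True),
}))
```
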
>> A more difficult task is to take an XSD and generate a TEDI together
>> with XML binding annotations.  This would be one of the main things that
>> would drive adding complexity to the TEDI XML binding annotations.  I
>> expect that the work of the XML Schema Patterns for Databinding WG would
>> be valuable input on what was really needed.
>>
>> In the future, there's still hope that the wire-format for the contract
>> need not always be XSD: WSDL 2.0 makes a significant effort not to
>> restrict itself to XSD; so you could potentially publish a WSDL with
>> both the XSD and the TEDI for a web service.
>>
>> The closest thing I've seen to TEDI is Paul Prescod's XBind language
>> (http://www.prescod.net/xml/xbind/), but it has a rather different
>> philosophy in that it separates validation from data binding, whereas
>> TEDI integrates them.  Another difference is that Paul has written some
>> code, whereas TEDI is completely vaporware at this point.
>>
>> The first step in implementing TEDI would be to pick a scripting
>> language (probably Ruby or Python), and do the implementation in and for
>> that language.  Eventually it would be desirable to have a
>> high-performance modular C engine, that could be integrated into each
>> scripting language that is implemented in C, so that serialization and
>> deserialization performance via TEDI would be more competitive with the
>> language's native facilities (it would be interesting to see how big a
>> hit TEDI would be). Similarly you would want a Java implementation to
>> integrate with dynamic languages that are implemented in Java (Rhino,
>> Groovy, JRuby).
>>
>> James
> 


- --
Davanum Srinivas (dims at wso2.com)
Co-Founder & Director, Customer Engagements, WSO2 (http://wso2.com)
Yahoo IM: dims Cell/Mobile: +1 (508) 415 7509
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Cygwin)

iD8DBQFGFaOTgNg6eWEDv1kRArFcAKDHUsCA2q2Khs7uuOQcnsLoWlK8hgCfS0SY
kxDOMzImPbNdE0PTqlq2D/s=
=1FHX
-----END PGP SIGNATURE-----



