Voici un extrait de l'article de Shelley Powers "XML expectations". Il n'est plus disponible à l'adresse à laquelle il était initiallement publié http://www.ne-dev.com/ned-01-1998/ned-01-xml.html.

Crédit à l'éditeur Credit given to


XML expectations

by Shelley Powers

XML technology could trigger a basic shift in Web development

Extensible Markup Language is a language that defines other languages. It also has the potential to give structure and meaning to the information contained in HTML documents or any other data form -- making such information naturally as searchable and structured as the information locked into a database. Such capabilities mean that XML can turn our current view of data upside down -- instead of a static, impenetrable lump of information, a file that uses XML suddenly has a logical structure that can be manipulated, queried, and changed without delving down into the data itself. The potential of such a meta-language is huge -- if it's implemented as an open standard. Right now, XML remains such a standard, and if it continues to evolve along open lines, it could drastically improve Web-based development.

XML is similar to SQL in a number of ways; SQL is also an example of a multi-purpose language used to define data structures and query those same data structures without concern as to how the information is displayed or used. The only guarantees are that the information is defined in structures, the structures follow certain rules, and the information contained within the structures can be accessed automatically or manually. Both SQL and XML define structures for information in the form of elements, element attributes, and element content. The main difference is that instead of defining data that is stored in a physical storage medium usually only accessible by a database engine, XML describes data that is stored and accessed from within documents.

XML's parent: SGML
XML is a subset of SGML (Standardized General Markup Language), a generalized markup language that was passed as an ISO standard in the 1980s. Rather than specifying a language's elements directly, SGML is used to define the rules that constrain the elements of a specific language.

SGML grew out of the need to define a document's structure and to define rules used to determine whether the document is valid and well formed. The document's structure is defined through the use of markup tags, which delimit the elements, and Document Type Definition (DTD) files that define each element's structure and content, providing a sort of grammar for the document.

For example, using SGML, a customer element within a document could have the following structure:

<CUSTOMER name="Shelley Powers" id="CUST011A1">
<PO id="PO23349008">
<POITEM id="POI1">
<ITEM id="14453">
Item ID: 14453
Item Desc: some description
</ITEM>
</POITEM>
</PO>
</CUSTOMER>

To validate the markups used to define the structure of the document, an associated DTD would have to be created, with statements similar to the following:

<!ELEMENT customer - - (POITEM)+><!ATTLIST customer   name CDATA   id CDATA>

This extremely simplified and abbreviated DTD uses an Extended Backus-Naur Form (EBNF) syntactic notation to create the grammar.

Using a standardized meta-language to define entities within a document allows SGML parsers to pull out the individual entities (such as the customer entity just described) and any associated attributes and content. An application can then use that information for a number of purposes, including the following possibilities:

  • To define information in a database-neutral format for transport between unlike databases.
  • To provide a search engine that allows a person to query on the entity type as well as the data.
  • For report generation, or even an online hypertext order processing form that allows the reader to drill down within the document to find the desired information.
  • To define a standard language for a specific industry or science, such as the petroleum industry or chemistry, which includes special notational conventions.

The concept of SGML is very attractive: Define a language that in turn defines a document structure used for a specific group of documents and which can be extended without impacting the underlying language generation mechanism. Unfortunately, the downside to SGML is that it is far from trivial to define the DTD for a language. SGML is a complex standard that is difficult to implement.

HTML, a derivative of SGML
SGML did, however, provide the roots of the first Web document specification, HTML. HTML was derived from SGML, except that it predefined a group of elements that controlled the delivery of a Web page's content. In addition, the original HTML elements were expanded to include suggested presentation elements that controlled the appearance of the Web page. SGML does not control the presentation of elements, only the element structure and semantics.

The following code defines an HTML unnumbered list element, which is defined by the DTD associated with the HTML 4.0 specification as having a start and end tag and containing at least one list item:

<!ELEMENT UL - - (LI)+>

According to the EBNF associated with SGML, this DTD states that UL is an element, the double dashes assert that the element requires a start and end tag, and the element consists of at least one, and possibly more than one, list item (LI). When a user agent such as a browser parses an HTML element, it knows to look for both beginning and ending UL tags and at least one LI element contained within those tags.

Associated with the DTD for HTML 4.0 is an implied visual presentation of an unnumbered list, which is that each list item has a specified list graphic, each list item is on a separate line, and each list item lines up beneath the previous list item. However, not all user agents (such as browsers) are visual, so presentation can only be a suggestion, not a mandate.

After the first releases of the HTML specification, new elements crept into the language to provide control over page presentation. One such element is the FONT element, which controls the size, color, type, and font family of any text the element contains. The problem with using a specific tag like FONT, however, is that non-standardized tags can lead to different Web page presentations depending on the user agents.

To help differentiate an element's structure and its presentation in HTML, the W3C issued a recommendation for CSS1, or Cascading Style Sheet Level 1, a specification that provides presentation information for HTML elements.

The real advantage of HTML was that it was relatively easy to code and display, even in different browsers. The ease of HTML was directly responsible for the massive growth of the Web. If Web document access had begun with XML, chances are you wouldn't be reading this article right now and Web access would probably be limited to the scientific community. We initially needed a simple mechanism to create Web documents, and HTML was it. The very lack of flexibility was the language's strength.

Now that the Web and Internet-based technologies have matured to a certain extent, enterprise developers are increasingly demanding a way to build flexibility into documents like Web pages in order to increase their effectiveness and ease of access.

Enter XML
XML arose from a need to create more generalized markup languages without having to follow the large and complex SGML standard. The XML standard still demands that a markup language be defined as well-formed, but it makes the validation step optional, which means that an associated DTD is not required (though one can be included). Additionally, XML uses only a subset of the rules for SGML, letting developers understand the principles and implementation of the technology more quickly.

Like SGML, XML is a meta-language that provides rules to define a set of tags that can be used within a document. These tags are then used to delimit an XML entity, its attributes, and its contents, and to define the elements' syntax. These tags are read by a XML processor, which in turn provides an application with access to the entities. The application can then perform one or more actions on the XML entities.

XML processors can either be validating, which means that they make use of an associated DTD in order to ensure valid structures, or non-validating. Regardless of whether or not they are validated, XML documents can be considered to be well-formed as long as they match the XML syntax overall and as long as each entity within the document meets the syntax for a well-formed XML entity.

The main requirements for a well-formed Extensible Markup Language include the following:

  • The language may begin with a valid XML declarative statement or prolog.
  • There is one element that acts as the root element and which acts as parent to all other elements.
  • Elements are either not empty or, if they are empty, they have a "hint" encoded within the element that defines this information to the XML parser.
  • Non-empty elements must have start and ending tags.
  • All elements except the root element are contained within some element, referred to as the element's parent; all contained elements are referred to as the parent element's children.
  • Elements can contain character data, other elements, CData sections, processing instructions or comments.
  • Each parsed element within the document is well-formed.
  • Character data that may be processed as XML is enclosed within CData sections.
  • Documents can include comments, white space, and processing instructions.

Consider that a valid and well-formed XML document consists of the following EBNF format non-terminating symbols (non-terminating meaning that the symbols are themselves expanded elsewhere):

document::= prolog element Misc*

A complying document could be as simple as:

<?XML VERSION="1.0" ENCODING="UTF-8"?> <ARTICLE name="XML" author="Shelley Powers"/>

This document consists of the prolog section which includes the XML declaration ("<?XML") and includes the version number of the XML definition, as well as the encoding declaration. It also contains one element, ARTICLE, which has two attributes, NAME and AUTHOR. Since the element is an empty element, it ends with a backslash to signal to the processor that the element contains no other content. This is necessary for a non-DTD (non-validating) document. Otherwise the XML processor would not know when to look ahead in the parsing for required element content. This is one of the key features of XML: Forward processing information is embedded directly within the document, negating the necessity of creating an associated DTD.

The example just provided is a well-formed document, but not a valid one, since no DTD is provided for validation. The example also demonstrates the simplicity of XML. An even simpler version of the language would be:

<ARTICLE name="XML" author="Shelley Powers"/>

To make the document a valid one, I could have added a DTD for the ARTICLE element directly into the document, or linked to a DTD external file:

<?XML VERSION="1.0" ENCODING="UTF-8"?> <!DOCTYPE article SYSTEM "article.dtd">
<ARTICLE name="XML" author="Shelley Powers"/>

XML in action
Though the standard is relatively new, there are several XML parsers that validate whether an XML document and its associated DTD fit the rules for a valid XML document. In addition, these same parsers may return the elements within a document exposed in their tree-like form -- a form that can be used by applications.

XML is being used in the real world already. For example, Microsoft has defined an XML application it terms "Channel Definition Format," or CDF. CDF files contain entities that describe the contents of an active channel. Following the accepted technique for XML, CDF files do not contain reference to a DTD file and instead use clues embedded within the tags and tag definitions to provide forward-looking information for the XML parser.

CDF's purpose is to provide a document that defines the use of push technology at a specific Web site, including which pages are to be displayed as channels, what icons to display, what the update schedules are, etc. With this information, the XML processor provides the key elements that a channels-based application can use to control channel access on the Web site.

The following code shows the CDF file I have defined for use at my personal Web site. The root element for the file is the CHANNEL element. It is the parent element for several other elements, such as an ICON element, an ITEM element, and an ABSTRACT element. Each of the elements within the document may or may not have attributes, and a child element may in turn be the parent for another element:

<?XML VERSION="1.0" ENCODING="UTF-8"?>
<CHANNEL HREF="http://www.yasd.com/plus/index.htm"
BASE="http://www.yasd.com/plus/">
<TITLE>YASD+</TITLE>
<ABSTRACT>YASD+ pages, using the newest technologies</ABSTRACT>
<LOGO HREF="http://www.yasd.com/mm/wide_logo.gif" STYLE="IMAGE-WIDE"/>
<LOGO HREF="http://www.yasd.com/mm/logo.gif" STYLE="IMAGE"/>
<LOGO HREF="http://www.yasd.com/mm/icon.gif" STYLE="ICON"/>
<SCHEDULE>
    <INTERVALTIME DAY="1"/>
    <EARLIESTTIME HOUR="0"/>
    <LATESTTIME HOUR="12"/>
</SCHEDULE>
<ITEM HREF="http://www.yasd.com/samples/bytes/daily.htm">
    <LOGO HREF="http://www.yasd.com/mm/icon.gif" STYLE="ICON"/>
    <ABSTRACT>YASD Code Byte</ABSTRACT>
&    lt;/ITEM>
<ITEM HREF="http://www.yasd.com/samples/bytes/cheap.htm">
    <LOGO HREF="http://www.yasd.com/mm/icon.gif" STYLE="ICON"/>
    <ABSTRACT>Cheap Page Tricks</ABSTRACT>
</ITEM>
</CHANNEL>

Notice that the first line contains the XML declaration element, a version number, and an encoding declaration. The main entity within the document is the CHANNEL entity, enclosing other elements such as TITLE, ITEM, ABSTRACT, and LOGO. Each of these elements falls within the allowable XML definition for elements:

element ::= EmptyElemTag | STag content ETag
EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'
STag :: = '<' Name (S Attribute)* S? '>'
ETag::= '</' Name S? '>'
content ::= (element | CharData | Reference | CDSect | PI | Comment )*

Without continuing to resolve the non-terminating references, what the syntax just shown states is that each element is either an empty element, in which case it ends with a backslash/angle bracket combination ('/>'), or it has start and end tags which enclose content. A "well-formed" constraint is placed on the start and end tags in that the NAME used in both is the same. The enclosed content can include other elements, comments, processing instructions, or other well-formed XML entities. Both empty and non-empty elements can have zero or more attributes, as the following demonstrates:

<CHANNEL HREF="http://www.yasd.com/plus/index.htm"
BASE="http://www.yasd.com/plus/">
...
</CHANNEL>

or

<INTERVALTIME DAY="1"/>

Internet Explorer 4.0 has an associated XML parser that pulls the element information out of the document. IE 4.0 uses this parsed element information to create the channel for the Web site, including the two sub-channel items, as shown below:

This site has a main Web page channel, denoted by the top-level graphic, and two sub-channels, with the second sub-channel loaded into the browser.

Accessing the CDF file directly with IE 4.0 opens a dialog box asking the individual how they would like to subscribe to the site's channel, and allowing the reader to determine how and when the channel contents are downloaded to their client machine.

Other uses of XML
In addition to CDF, Microsoft and Marimba, Inc. have also proposed XML-based technology called the Open Software Description (OSD) format, which can be used to control software downloads and installations over a corporate network. A major IS expense for larger corporations, especially those that are geographically distributed, is installing and maintaining software upgrades on employees' desktops. One small upgrade to a popular piece of software can take days of planning and weeks of actual implementation (i.e., walking around to each person's desk and installing the upgrade). During the upgrade rollout, employees will have different versions of the same software, which can create problems. With OSD, software upgrades can be handled automatically using push technology, reducing both IS staff hours and logistical problems.

SGML and XML have both been used to create a Chemical Markup Language (CML) for the chemistry community. With the CML vocabulary, molecular structures can be defined within a document and the information can be either posted or transmitted. XML processors can pull out the CML elements and pass these to applications that perform actions like preparing a print-out of the information, either textually or graphically, or creating an online three-dimensional model of the information using VRML or some other 3D technology.

Netscape, Apple, and others have proposed a Meta Content Framework (MCF) created in XML that can expose a Web site structure for navigation or online exploration. MCF can be used to do such things as generating a three-dimensional site map which can be used for Web site publication and administration. The technology is currently used by Apple's ProjectX/HotSauce browser, and "Xspace"-compatible content can also be viewed using a plug-in available from Apple.

XML can also be used to define a relational database meta-language, which can then be used to describe documents containing relational database information. These same documents can be easily generated from the relational database dictionaries, which are repositories of information about the information stored in the database. The extensible markup language can then be used to create context-centered documents like "all information pertaining to any purchases, week of January 16 through January 23," rather than using the context-neutral database format. In addition, supporting information that is not part of the data in the database, like images or reference material, can be pulled into the document.

An XML processor can process this context-based data document and use the information therein to present reports, perform online research and queries, or even to create interactive three-dimensional models of the data. Instead of issuing a SQL statement such as:

select customer_name, customer_address, city, state, zip_code from customer, purchase_orderwhere purchase_order.order_id = 32245 and customer.customer_id = purchase_order.customer_id;

I could enter a three-dimensional VRML world at a purchase_order portal and scan a virtual filing cabinet for my purchase order number. Once I find it, I can open the door into another room with doors labeled "Purchase Order items" and "Customer" and open the Customer door into another room containing the information I am looking for. Best of all, the documents containing the context-based data could be generated automatically, processed automatically, and presented automatically. This means a change in the database table could be handled automatically.

Besides three-dimensional database applications, defining data in an XML document could be used as a method to convert database data in one format, such as relational data, into another format, such as object- based database records. The resources section at the end of this article has a reference to a preliminary XML representation of a relational database.

In addition, with XML processors (or XML parsers, if you prefer), the most difficult aspect of XML has already been implemented: pulling the entities out of the document.

Returning to the CDF example, not only can the XML document be used by Internet Explorer 4.0 to provide information about the structure of a Web site's channels, I can also access the XML entities using JavaScript, C++, or Java and use the information for other purposes. For example, the following JavaScript functions open a CDF file, pull out information about the elements contained within the CDF file, and print out this information in a newly opened window...

...