Creating Your Own XML Vocabularies

XML can represent virtually any kind of structured information. A coherent set of elements and attributes that addresses a particular application need is called an XML vocabulary. The elements and attributes are the "words" of the vocabulary that enable communication of information on a certain subject. An XML vocabulary can be as simple as a single element (for example, <Task>) or can contain as many elements and attributes as needed. An example document that uses the <Task> vocabulary looks like this:

<Task Name="JDeveloper 3.1">
   <Task Name="Improved XML Support">
     <Task Name="Syntax-Check XML/XSL" Dev="Steve"/>
     <Task Name="Color-Coded Editing"  Dev="Yoshi"/>
     <Task Name="Run XSQL Pages"       Dev="Bret"/>
   </Task>
   <Task Name="Improved Debugging Support">
     <Task Name="Remote Debugging">
       <Task Name="JServer Debugging"      Dev="Jimmy"/>
       <Task Name="Apache JServ Debugging" Dev="Liz"/>
     </Task>
   </Task>
</Task>
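Because the vocabulary nests <Task> elements inside one another, a program can walk the hierarchy with very little code. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name leaf_assignments and the " > " path separator are illustrative choices, not part of the vocabulary.

```python
import xml.etree.ElementTree as ET

# A well-formed copy of the <Task> example, with every nested task closed.
TASKS_XML = """\
<Task Name="JDeveloper 3.1">
  <Task Name="Improved XML Support">
    <Task Name="Syntax-Check XML/XSL" Dev="Steve"/>
    <Task Name="Color-Coded Editing" Dev="Yoshi"/>
    <Task Name="Run XSQL Pages" Dev="Bret"/>
  </Task>
  <Task Name="Improved Debugging Support">
    <Task Name="Remote Debugging">
      <Task Name="JServer Debugging" Dev="Jimmy"/>
      <Task Name="Apache JServ Debugging" Dev="Liz"/>
    </Task>
  </Task>
</Task>
"""


def leaf_assignments(task, path=()):
    """Yield (path, developer) for every task that has no subtasks."""
    here = path + (task.get("Name"),)
    subtasks = task.findall("Task")  # direct <Task> children only
    if not subtasks:
        yield " > ".join(here), task.get("Dev")
    for sub in subtasks:
        yield from leaf_assignments(sub, here)


root = ET.fromstring(TASKS_XML)
assignments = list(leaf_assignments(root))
for task_path, dev in assignments:
    print(f"{dev}: {task_path}")
```

Only the leaf tasks carry a Dev attribute in the example, so the sketch reports a developer for each of the five leaves and uses the enclosing tasks purely as a path.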

One of the big attractions of working with XML is its low cost of admission. The specification is free for anyone to use, and you only need a text editor to get started. One way to begin creating an XML vocabulary is to simply start typing tags in a text file as they come to mind. For example, if you've been assigned the task of managing a "Frequently Asked Questions" (FAQ) list, you might open up vi or Emacs and start typing the example shown in Figure 1.

<?xml version="1.0"?>
<Frequent-Question Submitter="">
    <Question>Is it easy to get started with XML?</Question>
</Frequent-Question>

Figure 1: Creating a new XML document using Emacs

It's very useful to prototype your vocabulary of tags by working directly on an example document, because it makes the process easy to think about. As ideas pop into your head (for example, "I'm going to need to keep track of who submitted each question"), just type the necessary element or attribute in your file. It doesn't have to be right the first time; just get it down. Corrections can be made later. If you decide you like the look of a <FAQ> element more than <Frequent-Question>, go right ahead and change it! You're the boss. Just do a global search and replace in your editor, and you're done.

In honor of the eminently pragmatic William Strunk, Jr., and E. B. White (authors of the classic writing handbook, The Elements of Style), presented below are the XML elements of style, outlining the rules you must follow as you create your own documents:

  1. Begin each document with an XML declaration.
  2. Use only one top-level document element.
  3. Match opening and closing tags properly.
  4. Add comments between <!-- and --> characters.
  5. Start element and attribute names with a letter.
  6. Put attributes in the opening tag.
  7. Enclose attribute values in matching quotes.
  8. Use only simple text as attribute values.
  9. Use &lt; and &amp; instead of < and & for literal less-than and ampersand characters.
  10. Write empty elements as <ElementName/>.

If the XML document follows these ten basic rules, it is called a well-formed XML document.
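An XML parser will tell you quickly whether a document breaks these rules. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper check_well_formed and the sample documents are made up for illustration. Note that the parser applies the well-formedness rules of the XML specification itself, so it does not insist on rule 1 (the XML declaration), and its error messages are its own wording rather than the wording of the list above.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return None if text is well-formed XML, else the parser's message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None


# Hypothetical sample documents; each bad one breaks a single rule.
SAMPLES = {
    "follows the rules": (
        '<FAQ Submitter="a reader"><!-- prototype -->'
        "<Question>Is 1 &lt; 2 &amp; 2 &lt; 3?</Question><Answer/></FAQ>"
    ),
    "rule 2: two top-level elements": "<FAQ/><FAQ/>",
    "rule 3: mismatched tags": "<FAQ><Question>Hi</FAQ></Question>",
    "rule 5: name starts with a digit": "<1FAQ/>",
    "rule 7: unquoted attribute value": "<FAQ Submitter=reader/>",
    "rule 9: literal ampersand": "<Question>Q & A</Question>",
}

results = {label: check_well_formed(xml) for label, xml in SAMPLES.items()}
for label, error in results.items():
    print(f"{label}: {'well-formed' if error is None else error}")
```

Running a check like this against your prototype document after each round of edits is a cheap way to catch a missing closing tag or a stray ampersand early.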

Unicode Character Encoding

One level below the characters seen in an XML document lies their numerical representation. The XML specification defines XML documents as sequences of characters, as defined by the Unicode standard.

Quoting the Unicode Consortium's Web site, "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language." Given the unique number (called a code point) that Unicode assigns to a character, there are different approaches for representing that number physically as bytes on the disk. These different approaches are called character encoding schemes.

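One encoding scheme, named UTF-16, is the most straightforward for the characters in common use. It uses two bytes (16 bits) to represent each of them, and a pair of two-byte units (four bytes in all) for the less common characters whose code points do not fit in 16 bits. This can be inefficient, however, if the document consists largely or entirely of ASCII characters, whose values fall in the range 0 to 127 and fit in a single byte.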

Another, more clever scheme, named UTF-8, takes a different approach, using a single byte to represent ASCII characters and a sequence of two to four bytes to represent other characters. UTF-8 is the default encoding scheme for an XML document if one is not specified. If the default UTF-8 character encoding is not appropriate, include an encoding attribute in the XML declaration that says what encoding the document is using. For example, an XML document containing Japanese data might use:

<?xml version="1.0" encoding="Shift_JIS"?>
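To make the size trade-off between encoding schemes concrete, the following minimal sketch counts the bytes that the same text occupies in each scheme, using Python's built-in codecs. The sample strings and the helper byte_lengths are illustrative; "utf-16-le" is used so that the count does not include a byte order mark.

```python
# Encodings to compare; "utf-16-le" omits the byte order mark.
ENCODINGS = ("utf-8", "utf-16-le", "shift_jis")

# Illustrative sample text: four ASCII letters and three Japanese characters.
SAMPLES = {
    "ASCII": "Task",
    "Japanese": "日本語",
}


def byte_lengths(text):
    """Map each encoding name to the number of bytes text needs in it."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


for label, text in SAMPLES.items():
    print(label, byte_lengths(text))
```

The four ASCII letters take 4 bytes in UTF-8 but 8 in UTF-16, while the three Japanese characters take 9 bytes in UTF-8 and 6 in both UTF-16 and Shift_JIS.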

If the default UTF-8 encoding is what you want to use, you can even legally leave off the XML declaration entirely, although it is good practice to always include one.
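The effect of the default can be seen by handing a parser the same text as raw bytes, with and without a declaration. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the <Question> documents and the helper question_text are made up for illustration, and ISO-8859-1 stands in for any non-default encoding.

```python
import xml.etree.ElementTree as ET

TEXT = "café"  # contains one non-ASCII character


def question_text(document_bytes):
    """Parse a tiny <Question> document; return its text, or None on error."""
    try:
        return ET.fromstring(document_bytes).text
    except ET.ParseError:
        return None


# No declaration: the parser reads the bytes as UTF-8.
no_declaration = b"<Question>" + TEXT.encode("utf-8") + b"</Question>"

# A different encoding, announced in the XML declaration.
declared_latin1 = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<Question>" + TEXT.encode("iso-8859-1") + b"</Question>"
)

# The same non-UTF-8 bytes without a declaration: the parser rejects them.
undeclared_latin1 = b"<Question>" + TEXT.encode("iso-8859-1") + b"</Question>"

for name, doc in [
    ("no declaration, UTF-8 bytes", no_declaration),
    ("declared ISO-8859-1", declared_latin1),
    ("undeclared ISO-8859-1", undeclared_latin1),
]:
    print(f"{name}: {question_text(doc)!r}")
```

The third case is the reason it is good practice to always include the declaration: once a document is saved in something other than UTF-8, only the encoding attribute tells the parser how to read it.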