CSE691/891 - Internet Programming Summer 2001

XML Syntax Rules

The body of an XML document is a tree of element nodes with a single root. Each element node is a tagged structure of Unicode characters.

Element syntax is:

<tagName *[attributeName="value"]> element body </tagName>
TagNames are user defined (with two exceptions we will discuss later), as are attributes.
Unlike HTML, tagNames and attributes are case sensitive.
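As a rough illustration of the element syntax and the case-sensitivity rule, the sketch below parses one element with Python's standard xml.etree.ElementTree module. The element name book, its attributes, and the helper is_well_formed are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical element: one tagName, two attributes, and a text body.
element = ET.fromstring('<book isbn="0-123" lang="en">XML Basics</book>')

tag = element.tag             # the tagName
attributes = element.attrib   # mapping of attributeName -> value
body = element.text           # the element body (character data)


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Names are case sensitive, so <Book> is not closed by </book>.
mismatched_case = is_well_formed('<Book>XML Basics</book>')
matching_case = is_well_formed('<Book>XML Basics</Book>')
```

Here mismatched_case is False because the parser treats Book and book as two different names.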
XML names are composed of Unicode characters.
- TagNames must begin with a letter or underscore.
- The remaining characters of a tagName may be letters, underscores, digits, hyphens, and periods.
- Names may not start with the letters xml, in any combination of upper and lower case. These names are reserved for use by W3C, as in the document header at the top of this page's source file.
- Attributes are properties of the element, as opposed to the data it contains, which is carried in the element body. Attribute names follow the same rules as tagNames and must also be unique within the tag in which they appear. That is, an attribute name may not appear more than once in a tag.
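The naming rules above can be approximated with a short check. This is a simplified sketch, not the full XML name grammar: it assumes ASCII-only names, and the helpers is_valid_name and parses are hypothetical. The duplicate-attribute rule is left to the parser itself.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, ASCII-only reading of the naming rules: a letter or
# underscore first, then letters, digits, underscores, hyphens, periods.
NAME_PATTERN = re.compile(r'[A-Za-z_][A-Za-z0-9_.\-]*')


def is_valid_name(name):
    """Hypothetical helper: check a tag or attribute name against the rules."""
    if name.lower().startswith('xml'):
        return False  # reserved, in any mix of upper and lower case
    return NAME_PATTERN.fullmatch(name) is not None


def parses(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good_name = is_valid_name('_item-1.a')
leading_digit = is_valid_name('1item')
reserved_prefix = is_valid_name('XmlData')

# An attribute name may appear only once within a tag.
unique_attributes = parses('<item id="1" kind="a"/>')
repeated_attribute = parses('<item id="1" id="2"/>')
```

Only good_name and unique_attributes come out True; the other three cases each break one of the rules listed above.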
Element bodies contain character data.
- Character data is any data that is not markup. Markup is, e.g., the stuff inside tags.
- The characters & and < are markup delimiters and may not appear literally in character data. The characters >, ', and " should also be escaped wherever they could be read as markup, for example quotes inside an attribute value. All five may be represented by the escape sequences defined for XML: "&amp;", "&lt;", "&gt;", "&apos;", and "&quot;", respectively.
- CDATA sections are a way to pass data that contains markup delimiters without using escape sequences. The XML parser will not interpret the characters in a CDATA section, but simply passes them along to the application. The syntax for a CDATA section is:
  
  <![CDATA[...]]>
  Note that CDATA sections will not work for passing binary data to an application, because the binary data may contain a byte sequence that the parser interprets as "]]>". This would terminate the CDATA section before the binary data was completely consumed.
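The two ways of carrying markup delimiters in an element body, escape sequences and a CDATA section, can be compared with a small sketch. It uses xml.sax.saxutils.escape from the Python standard library; the code element and the sample string raw are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical character data containing all five delimiter characters.
raw = 'if (a < b && b > c) print "it\'s ok"'

# escape() replaces &, < and > by default; the extra mapping adds the
# two quote characters so all five escape sequences are used.
escaped = escape(raw, {"'": "&apos;", '"': "&quot;"})

# Option 1: element body written with escape sequences.
from_escaped = ET.fromstring('<code>' + escaped + '</code>').text

# Option 2: element body wrapped in a CDATA section, left unescaped.
from_cdata = ET.fromstring('<code><![CDATA[' + raw + ']]></code>').text
```

Both from_escaped and from_cdata hand the application the same original string, which is the point of the two mechanisms.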
To pass binary data, you must convert it to a character representation. Converting to hexadecimal representation is easy, but doubles the memory required to hold the data. The Simple Object Access Protocol (SOAP) parser uses a conversion scheme that expands binary data by about 30%.
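The size trade-off can be checked with a few lines. This sketch assumes base64 as the more compact scheme; the notes above do not name the encoding, so treat that choice, the data element, and the sample payload as assumptions made for illustration.

```python
import base64
import binascii
import xml.etree.ElementTree as ET

# Hypothetical payload: 768 bytes covering every possible byte value.
payload = bytes(range(256)) * 3

hex_text = binascii.hexlify(payload).decode('ascii')
b64_text = base64.b64encode(payload).decode('ascii')

# Size of each text form relative to the original binary data.
hex_ratio = len(hex_text) / len(payload)
b64_ratio = len(b64_text) / len(payload)

# Round trip: carry the encoded text in an element body and decode it.
element = ET.Element('data')
element.text = b64_text
parsed = ET.fromstring(ET.tostring(element))
recovered = base64.b64decode(parsed.text)
```

With this payload, hex_ratio is exactly 2 and b64_ratio is about 1.33, so base64 costs roughly a third extra rather than doubling the size.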
A well-formed XML document has:
- An optional prolog:
  - Starts with the line:
    
    <?xml version="1.0"?>
    This identifies the file contents as belonging to an XML document.
  - Processing instructions (more on this later).
  - A reference to a DTD or schema, used for validation.
  Stand-alone documents should include the first item. The other two are strictly optional. XML data islands, used as part of a larger document (perhaps HTML), omit the first item as well.
- A body with a single root node. The body of this root node may contain one or more elements, as may the bodies of any descendant node.
- An optional epilogue consisting of comments and processing instructions.
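A minimal sketch of a document with all three parts, prolog, body, and epilogue, parsed with Python's xml.dom.minidom. The catalog document, the stylesheet reference, and the audit processing instruction are hypothetical.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical document: prolog, single-root body, then an epilogue.
document_text = (
    '<?xml version="1.0"?>\n'
    '<?xml-stylesheet type="text/xsl" href="catalog.xsl"?>\n'
    '<catalog>\n'
    '  <book>XML Basics</book>\n'
    '</catalog>\n'
    '<!-- epilogue comment -->\n'
    '<?audit checked="yes"?>\n'
)

document = minidom.parseString(document_text)
root = document.documentElement

# Names of the nodes that sit at the top level of the document.
top_level = [node.nodeName for node in document.childNodes]

# A second root element breaks well-formedness and is rejected.
try:
    minidom.parseString('<catalog/><catalog/>')
    second_root_accepted = True
except ExpatError:
    second_root_accepted = False
```

The processing instructions and the comment appear in top_level alongside the single root element, while the two-root document is refused by the parser.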