Google XML Document Format Style Guide {#google-xml-document-format-style-guide style="text-align: center;"}
======================================
Version 1.0
Copyright Google 2008
This document provides a set of guidelines for general use when designing new XML document formats (and to some extent XML documents as well; see Section 11). Document formats usually include both formal parts (DTDs, schemas) and parts expressed in normative English prose.
These guidelines apply to new designs, and are not intended to force retroactive changes in existing designs. When participating in the creation of public or private document format designs, the guidelines may be helpful but should not control the group consensus.
This guide is meant for the design of XML that is to be generated and consumed by machines rather than human beings. Its rules are not applicable to formats such as XHTML (which should be formatted as much like HTML as possible) or ODF, which are meant to express rich text. A document that includes embedded content in XHTML or some other rich-text format, but also contains purely machine-interpretable portions, SHOULD follow this style guide for the machine-interpretable portions. This guide also does not affect XML document formats that are created by translation from protocol buffers or from some other type of format.
Brief rationales have been added to most of the guidelines. They are maintained in the same document in hopes that they won't get out of date, but they are not considered normative.
The terms MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY are used in this document in the sense of RFC 2119.
Attempt to reuse existing XML formats whenever possible, especially those which allow extensions. Creating an entirely new format should be done only with care and consideration; read Tim Bray's warnings first. Try to get wide review of your format, from outside your organization as well, if possible. [Rationale: New document formats have a cost: they must be reviewed, documented, and learned by users.]
If you are reusing or extending an existing format, make sensible use of the prescribed elements and attributes, especially any that are required. Don't completely repurpose them, but do try to see how they might be used in creative ways if the vanilla semantics aren't suitable. As a last resort when an element or attribute is required by the format but is not appropriate for your use case, use some fixed string as its value. [Rationale: Markup reuse is good, markup abuse is bad.]
When extending formats, use the implicit style of the existing format, even if it contradicts this guide. [Rationale: Consistency.]
Document formats SHOULD be expressed using a schema language. [Rationale: Clarity and machine-checkability.]
The schema language SHOULD be RELAX NG [compact syntax](http://www.relaxng.org/compact-tutorial-20030326.html "compact syntax"){#ulci}. Embedded [Schematron](http://www.schematron.com/ "Schematron"){#ymh-} rules MAY be added to the schema for additional fine control. [Rationale: RELAX NG is the most flexible schema language, with very few arbitrary restrictions on designs. The compact syntax is quite easy to read and learn, and can be converted one-to-one to and from the XML syntax when necessary. Schematron handles arbitrary cross-element and cross-attribute constraints nicely.]
Schemas SHOULD use the "Salami Slice" style (one rule per element). Schemas MAY use the "Russian Doll" style (schema resembles document) if they are short and simple. The "Venetian Blind" style (one rule per element type) is unsuited to RELAX NG and SHOULD NOT be used.
Regular expressions SHOULD be provided to assist in validating complex values.
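As a minimal sketch of what a published regular expression buys consumers, the following Python fragment checks a value against a pattern before accepting it. The part-number syntax and the names `PART_NUMBER` and `is_valid_part_number` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical value syntax: two uppercase letters, a hyphen, four digits.
# A real format would publish its own pattern alongside the schema.
PART_NUMBER = re.compile(r"[A-Z]{2}-[0-9]{4}")


def is_valid_part_number(value: str) -> bool:
    """Return True only if the entire value matches the published pattern."""
    return PART_NUMBER.fullmatch(value) is not None
```

Using `fullmatch` rather than `search` matters here: a value with stray leading or trailing characters is rejected instead of being silently accepted.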
DTDs and/or W3C XML Schemas MAY be provided for compatibility with existing products, tools, or users. [Rationale: We can't change the world all at once.]
Note: "Names" refers to the names of elements, attributes, and enumerated values.
Numeric values SHOULD be 32-bit signed integers, 64-bit signed integers, or 64-bit IEEE doubles, all expressed in base 10. These correspond to the XML Schema types xsd:int, xsd:long, and xsd:double respectively. If required in particular cases, xsd:integer (unlimited-precision integer) values MAY also be used. [Rationale: There are far too many numeric types in XML Schema: these provide a reasonable subset.]
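The integer ranges named above can be checked mechanically. The sketch below, which covers only the integer types and not xsd:double, reports the narrowest of the three integer types that can hold a base-10 value. The function name `narrowest_integer_type` is an assumption made for this example.

```python
import re

# Lexical form of a base-10 integer: optional sign, then ASCII digits only.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")

INT_MIN, INT_MAX = -2**31, 2**31 - 1      # 32-bit signed range (xsd:int)
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1    # 64-bit signed range (xsd:long)


def narrowest_integer_type(text: str) -> str:
    """Return the narrowest recommended integer type that can hold text."""
    if INTEGER_LEXICAL.fullmatch(text) is None:
        raise ValueError(f"not a base-10 integer: {text!r}")
    value = int(text, 10)
    if INT_MIN <= value <= INT_MAX:
        return "xsd:int"
    if LONG_MIN <= value <= LONG_MAX:
        return "xsd:long"
    return "xsd:integer"  # unlimited precision, for particular cases only
```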
Boolean values SHOULD NOT be used (use enumerations instead). If they must be used, they MUST be expressed as true or false, corresponding to a subset of the XML Schema type xsd:boolean. The alternative xsd:boolean values 1 and 0 MUST NOT be used. [Rationale: Boolean arguments are not extensible. The additional flexibility of allowing numeric values is not abstracted away by any parser.]
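Where booleans cannot be avoided, a consumer can enforce the restricted lexical space itself. This is a small sketch under that assumption; `parse_strict_boolean` is a hypothetical helper, not part of any XML library.

```python
def parse_strict_boolean(text: str) -> bool:
    """Accept only the literals true and false; reject 1, 0, and anything else."""
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"expected 'true' or 'false', got {text!r}")
```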
Dates should be represented using RFC 3339 format, a subset of both ISO 8601 format and XML Schema xsd:dateTime format. UTC times SHOULD be used rather than local times. [Rationale: There are far too many date formats and time zones, although it is recognized that sometimes local time preserves important information.]
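One way a producer might follow this guideline is sketched below using only the Python standard library. The helper name `to_rfc3339_utc` and the sample timestamp are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_rfc3339_utc(moment: datetime) -> str:
    """Convert an aware datetime to UTC and format it as an RFC 3339 timestamp."""
    if moment.tzinfo is None:
        # A naive datetime carries no offset, so it cannot be converted safely.
        raise ValueError("naive datetime: time zone is unknown")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# A local time two hours ahead of UTC, normalized before serialization.
local = datetime(2008, 7, 1, 12, 30, 0, tzinfo=timezone(timedelta(hours=2)))
stamp = to_rfc3339_utc(local)
```

Rejecting naive datetimes is a deliberate choice in this sketch: guessing the offset would reintroduce exactly the ambiguity the guideline is trying to remove.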
Embedded syntax in character content and attribute values SHOULD NOT be used. Syntax in values means XML tools are largely useless. Syntaxes such as dates, URIs, and XPath expressions are exceptions. [Rationale: Users should be able to process XML documents using only an XML parser without requiring additional special-purpose parsers, which are easy to get wrong.]
Be careful with whitespace in values. XML parsers don't strip whitespace in elements, but do convert newlines to spaces in attributes. However, application frameworks may do more aggressive whitespace stripping. Your document format SHOULD give rules for whitespace stripping.
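The difference between element and attribute whitespace handling can be observed directly with a standard parser. The sketch below uses Python's `xml.etree.ElementTree`; the `note` element and its `title` attribute are made up for the example, and the final collapsing step is just one possible rule a format might specify.

```python
import xml.etree.ElementTree as ET

# The same two-line text appears in an attribute and in element content.
doc = ET.fromstring('<note title="line one\nline two">  line one\nline two  </note>')

attribute_value = doc.get("title")  # the newline has been normalized to a space
element_text = doc.text             # newline and surrounding spaces are preserved

# One possible application-level rule: trim and collapse runs of whitespace.
collapsed = " ".join(element_text.split())
```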
Simple key-value pairs SHOULD be represented with an empty element whose name represents the key, with the value attribute containing the value. Elements that have a value attribute MAY also have a unit attribute to specify the unit of a measured value. For physical measurements, the SI system SHOULD be used. [Rationale: Simplicity and design consistency. Keeping the value in an attribute hides it from the user, since displaying just the value without the key is not useful.]
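A producer following this convention might build such elements as in the sketch below. The helper `key_value_element` and the `mass` key are hypothetical; only the value and unit attribute names come from the guideline.

```python
import xml.etree.ElementTree as ET


def key_value_element(key, value, unit=None):
    """Build an empty element named after the key, with value and optional unit."""
    element = ET.Element(key)
    element.set("value", value)
    if unit is not None:
        element.set("unit", unit)
    return element


# A physical measurement expressed in an SI unit.
mass = key_value_element("mass", "72.5", unit="kg")
serialized = ET.tostring(mass, encoding="unicode")
```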
If the number of possible keys is very large or unbounded, key-value pairs MAY be represented by a single generic element with key, value, and optional unit and scheme attributes (which serve to discriminate keys from different domains). In that case, also provide (not necessarily in the same document) a list of keys with human-readable explanations.
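Reading the generic form could look like the following sketch. The element name `property`, the sample keys, and the scheme URI are assumptions for illustration; the guideline fixes only the attribute names.

```python
import xml.etree.ElementTree as ET

SOURCE = """
<properties>
  <property key="color" value="red" scheme="http://example.com/paint"/>
  <property key="height" value="1.8" unit="m"/>
</properties>
"""


def read_properties(xml_text):
    """Map (scheme, key) to (value, unit); scheme and unit are None when absent."""
    root = ET.fromstring(xml_text)
    return {
        (item.get("scheme"), item.get("key")): (item.get("value"), item.get("unit"))
        for item in root.iter("property")
    }


properties = read_properties(SOURCE)
```

Indexing by the pair of scheme and key, rather than by key alone, keeps identically named keys from different domains apart.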
Note: There are no hard and fast rules about whether binary data should be included as part of an XML document or not. If it's too large, it's probably better to link to it.
Binary data MUST NOT be included directly as-is in XML documents, but MUST be encoded using Base64 encoding. [Rationale: XML does not allow arbitrary binary bytes.]
The line breaks required by Base64 MAY be omitted. [Rationale: The line breaks are meant to keep plain text lines short, but XML is not really plain text.]
An attribute named xsi:type with value xs:base64Binary MAY be attached to this element to signal that the Base64 format is in use. [Rationale: Opaque blobs should have decoding instructions attached.]
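The three Base64 guidelines above can be sketched together as follows. The helper names are hypothetical, and the sketch assumes the document declares the xs prefix elsewhere so that the xsi:type value resolves; that declaration is not shown.

```python
import base64
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"


def embed_binary(name, payload):
    """Wrap raw bytes in an element as Base64 text, without line breaks."""
    element = ET.Element(name)
    # Assumption: the xs prefix is bound to the XML Schema namespace elsewhere.
    element.set(f"{{{XSI}}}type", "xs:base64Binary")
    element.text = base64.b64encode(payload).decode("ascii")
    return element


def extract_binary(element):
    """Decode the Base64 text content of an element back to bytes."""
    return base64.b64decode(element.text)


blob = embed_binary("thumbnail", bytes(range(256)))
```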
New processing instructions MUST NOT be created except in order to specify purely local processing conventions, and SHOULD be avoided altogether. Existing standardized processing instructions MAY be used. [Rationale: Processing instructions fit awkwardly into XML data models and can always be replaced by elements; they exist primarily to avoid breaking backward compatibility.]
Note: These points are only guidelines, as the format of program-created instances will often be outside the programmer's control (for example, when an XML serialization library is being used). In no case should XML parsers rely on these guidelines being followed. Use standard XML parsers, not hand-rolled hacks.
Note: There are no hard and fast rules for deciding when to use attributes and when to use elements. Here are some of the considerations that designers should take into account; no rationales are given.
Attributes are more restrictive than elements, and all designs have some elements, so an all-element design is simplest -- which is not the same as best.
In a tree-style data model, elements are typically represented internally as nodes, which use more memory than the strings used to represent attributes. Sometimes the nodes are of different application-specific classes, which in many languages also takes up memory to represent the classes.
When streaming, elements are processed one at a time (possibly even piece by piece, depending on the XML parser you are using), whereas all the attributes of an element and their values are reported at once, which costs memory, particularly if some attribute values are very long.
Both element content and attribute values need to be escaped appropriately, so escaping should not be a consideration in the design.
In some programming languages and libraries, processing elements is easier; in others, processing attributes is easier. Beware of using ease of processing as a criterion. In particular, XSLT can handle either with equal facility.
If a piece of data should usually be shown to the user, consider using an element; if not, consider using an attribute. (This rule is often violated for one reason or another.)
If you are extending an existing schema, do things by analogy to how things are done in that schema.
Sensible schema languages, meaning RELAX NG and Schematron, treat elements and attributes symmetrically. Older and cruder schema languages, such as DTDs and XML Schema, tend to have better support for elements.
If something might appear more than once in a data model, use an element rather than introducing attributes with names like foo1, foo2, foo3, …
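To illustrate why repeated elements are easier to consume than numbered attributes, here is a short sketch; the `contact` and `phone` names are invented for the example.

```python
import xml.etree.ElementTree as ET

contact = ET.fromstring(
    "<contact>"
    "<phone>555-0100</phone>"
    "<phone>555-0101</phone>"
    "<phone>555-0102</phone>"
    "</contact>"
)

# One query returns every occurrence, in document order, however many there are.
phones = [phone.text for phone in contact.findall("phone")]
```

With attributes named phone1, phone2, and so on, the consumer would instead have to probe for names until one is missing, and the schema would have to fix an upper bound.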
Use elements to represent a piece of information that can be considered an independent object and when the information is related via a parent/child relationship to another piece of information.
Use elements when data incorporates strict typing or relationship rules.
If order matters between two pieces of data, use elements for them: attributes are inherently unordered.
If a piece of data has, or might have, its own substructure, put it in an element: getting substructure into an attribute is always messy. Similarly, if the data is a constituent part of some larger piece of data, put it in an element.
An exception to the previous rule: multiple whitespace-separated tokens can safely be put in an attribute. In principle, the separator can be anything, but schema-language validators are currently only able to handle whitespace, so it's best to stick with that.
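Whitespace-separated tokens are also cheap to consume, as this sketch suggests. The `item` element and `classes` attribute are hypothetical.

```python
import xml.etree.ElementTree as ET

# Irregular spacing and a line break inside the attribute value.
element = ET.fromstring('<item classes="  urgent   billing\n  external "/>')

# str.split() with no argument splits on any run of whitespace
# and ignores leading and trailing whitespace.
tokens = element.get("classes", "").split()
```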
If a piece of data extends across multiple lines, use an element: XML parsers will change newlines in attribute values into spaces.
If a piece of data is very large, use an element so that its content can be streamed.
If a piece of data is in a natural language, put it in an element so you can use the xml:lang attribute to label the language being used. Some kinds of natural-language text, like Japanese, often make use of [annotations](https://www.w3.org/TR/2001/REC-ruby-20010531 "annotations"){#pa2f} that are conventionally represented using child elements; right-to-left languages like Hebrew and Arabic may similarly require child elements to manage [bidirectionality](https://www.w3.org/TR/2001/REC-ruby-20010531 "bidirectionality"){#ehyv} properly.
If the data is a code from an enumeration, code list, or controlled vocabulary, put it in an attribute if possible. For example, language tags, currency codes, medical diagnostic codes, etc. are best handled as attributes.
If a piece of data is really metadata on some other piece of data (for example, representing a class or role that the main data serves, or specifying a method of processing it), put it in an attribute if possible.
In particular, if a piece of data is an ID for some other piece of data, or a reference to such an ID, put the identifying piece in an attribute. When it's an ID, use the name xml:id for the attribute.
Hypertext references are conventionally put in href attributes.
If a piece of data is applicable to an element and any descendant elements unless it is overridden in some of them, it is conventional to put it in an attribute. Well-known examples are xml:lang, xml:space, xml:base, and namespace declarations.
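The inheritance behaviour described here can be sketched for xml:lang as follows. The sample document and the function `effective_languages` are assumptions for illustration; ElementTree exposes xml:lang under the XML namespace URI shown in the constant.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

doc = ET.fromstring(
    '<book xml:lang="en"><title>Style</title>'
    '<quote xml:lang="ja"><line>text</line></quote></book>'
)


def effective_languages(element, inherited=None):
    """Yield (element, language), inheriting xml:lang unless it is overridden."""
    language = element.get(XML_LANG, inherited)
    yield element, language
    for child in element:
        yield from effective_languages(child, language)


languages = {element.tag: lang for element, lang in effective_languages(doc)}
```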
If terseness is really the most important thing, use attributes, but consider gzip compression instead -- it works very well on documents with highly repetitive structures.
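The effect of gzip on repetitive markup is easy to measure. The sketch below builds a synthetic, highly repetitive document (its element names are invented) and compares sizes; the ratio on real documents will vary.

```python
import gzip

# A synthetic log with one thousand structurally identical records.
row = '<measurement><sensor value="a7"/><reading value="294.6" unit="K"/></measurement>'
document = ("<log>" + row * 1000 + "</log>").encode("utf-8")

compressed = gzip.compress(document)
ratio = len(compressed) / len(document)  # fraction of the original size
```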
Use common sense and BE CONSISTENT. Design for extensibility. You are gonna need it. [Rationale: Long and painful experience.]
When designing XML formats, take a few minutes to look at other formats and determine their style. The point of having style guidelines is so that people can concentrate on what you are saying, rather than on how you are saying it.
Break ANY OR ALL of these rules (yes, even the ones that say MUST) rather than create a crude, arbitrary, disgusting mess of a design if that's what following them slavishly would give you. In particular, random mixtures of attributes and child elements are hard to follow and hard to use, though it often makes good sense to use both when the data clearly fall into two different groups such as simple/complex or metadata/data.
Newbies always ask:
"Elements or attributes?
Which will serve me best?"
Those who know roar like lions;
Wise hackers smile like tigers.
--a tanka, or extended haiku
[TODO: if a registry of schemas is set up, add a link to it]