Join David Gassner for an in-depth discussion in this video Reviewing XML terminology, part of XML Integration with Java.
XML developers use a common vocabulary to refer to the different parts of an XML file. It doesn't matter which programming language you're using to work with XML. You might be using Java, C#, or PHP. It's all XML. If you already have a deep understanding of XML, you might want to skip this movie. But if you're new to XML or want a review, here are some of the most common terms that you'll hear.
The term XML document is sometimes used to refer to an actual file, but it's also used to refer to the entire XML body.
XML is a markup language. And as with all such languages, it depends on the use of tags to describe logical parts of the XML structure. The XML architecture supports three types of tags. A begin tag starts with an opening angle bracket and ends with a closing angle bracket, with the name of the element in the middle. An end tag has the same structure, but has a forward slash right after the opening bracket. If you have a begin tag, then you must have an end tag. That's an absolute rule of XML. As markup languages go, that makes it different from HTML, which can be more forgiving. An empty tag has a forward slash at the end, just before the closing bracket, and that means that this is an element without any content. An empty tag does not have an end tag. It's essentially a begin tag and an end tag put together in one. It's up to the XML APIs to parse these tags for you and return information about the logical parts of the XML: the elements, the attributes, and so on.
At the top of an XML file, you'll frequently see a bit of code that starts with an angle bracket and a question mark, then xml, then a version and optionally an encoding value, and then ends with a question mark and the closing bracket. This is called the XML declaration. It's optional, and you can have a well-formed XML packet without the declaration. But you'll frequently see it appear.
An XML element is the most common logical component of an XML document. An XML element is divided into markup and content. For example, take an XML file with an XML declaration at the top, followed by a customers element. This is known as the root element. It contains, or is the parent of, all the other elements in the XML file. The next element is a child of the root element. That's the customer element. And then the customer element has child elements as well. In this example, those are elements named name and phone.
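The tag and element terms above can be sketched in Java. This is a minimal sketch, assuming the JDK's built-in DOM parser (`javax.xml.parsers`); the sample document, its values, and the class and method names are hypothetical and are not taken from the course files.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlTagsSketch {
    // Hypothetical sample: XML declaration, root element, child elements, one empty tag.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<customers>"
        + "<customer><name>Ann</name><phone/></customer>"
        + "</customers>";

    // Parses a string with the JDK's DOM parser; returns null if parsing fails.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            return null;
        }
    }

    // Collects the names of the direct child elements of a parent element.
    static List<String> childElementNames(Element parent) {
        List<String> names = new ArrayList<>();
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            if (kids.item(i).getNodeType() == Node.ELEMENT_NODE) {
                names.add(kids.item(i).getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Element root = parse(SAMPLE).getDocumentElement();
        // The sample has no whitespace, so the first child is the customer element.
        Element customer = (Element) root.getFirstChild();
        System.out.println("root element: " + root.getTagName());
        System.out.println("children of customer: " + childElementNames(customer));
        // The empty tag <phone/> yields an element with no content.
        System.out.println("phone has content: " + customer.getLastChild().hasChildNodes());
        // A begin tag without a matching end tag is rejected by the parser.
        String unclosed = "<customers><customer></customers>";
        System.out.println("unclosed tag parses: " + (parse(unclosed) != null));
    }
}
```

The parser may also print its own error message to standard error when it rejects the unclosed tag; the sketch only checks whether a document came back.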
To represent data, XML typically uses text nodes or CDATA sections. Elements are said to have child elements or child content. In this example, the customer has two child elements called name and phone, and each of those has a text node as a child. These values are sometimes referred to as the child text node, sometimes as the element's content, and still other times as the text value. It depends on which API you're using. Different APIs use different terminology and see that text in different ways.
You'll also frequently see text represented with CDATA sections. A CDATA section is typically used where you're dealing with longer text, or where the text values might have special characters, such as ampersands, single quotes, and double quotes. These types of characters can cause problems for XML, and when they're included in a text node, they have to be written out as values that are known as entities. I'll describe those in a bit. But when you wrap text in a CDATA section, the text can use any characters. A CDATA section starts with an opening angle bracket and an exclamation mark, then a square bracket, CDATA, and another square bracket, then the text value, then a couple of closing square brackets and a closing angle bracket. When you're using XML APIs, you don't need to type this XML text yourself. It's all handled by the API. But you as a developer need to know how to recognize CDATA sections when you see them.
Values can be stored in CDATA sections or in text nodes, but they can also appear in attributes. An attribute belongs to an element, and it's always placed in the begin tag of the element. The XML specification says that attributes can appear in any order. And so when you're using XML APIs, you'll typically refer to attributes by their name, and not by their position within the begin tag. An attribute has a name and a value. In XML, attribute values must always be wrapped in quotes.
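Text nodes, CDATA sections, and attributes can be told apart through an API. The following is a minimal sketch, again assuming the JDK's DOM parser in its default configuration; the sample values and the class and helper names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class CDataAttributeSketch {
    // Hypothetical sample: two attributes, a plain text node, and a CDATA section.
    static final String SAMPLE =
        "<customer id=\"42\" status=\"active\">"
        + "<name>Ann</name>"
        + "<notes><![CDATA[Asked about R&D <pricing>]]></notes>"
        + "</customer>";

    // Parses the string and returns the root element.
    static Element root(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Returns the first descendant element with the given name (assumes it exists).
    static Element child(Element parent, String name) {
        return (Element) parent.getElementsByTagName(name).item(0);
    }

    public static void main(String[] args) {
        Element customer = root(SAMPLE);
        // Attributes are looked up by name, not by position in the begin tag.
        System.out.println("id = " + customer.getAttribute("id"));
        System.out.println("status = " + customer.getAttribute("status"));

        Node nameValue = child(customer, "name").getFirstChild();
        Node notesValue = child(customer, "notes").getFirstChild();
        System.out.println("name holds a text node: "
            + (nameValue.getNodeType() == Node.TEXT_NODE));
        System.out.println("notes holds a CDATA section: "
            + (notesValue.getNodeType() == Node.CDATA_SECTION_NODE));
        // Either way, the API hands back the plain text value.
        System.out.println("notes = " + child(customer, "notes").getTextContent());
    }
}
```

Inside the CDATA section the ampersand and angle brackets are written literally. Note that both attribute values in the sample are wrapped in quotes.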
That quoting requirement distinguishes XML from HTML, where you'll frequently see values entered without quotes, especially numeric values.
XML documents can be validated against specifications that describe what elements can be used, what their data types might be, and what the relationship of different elements might be to each other. There are two major architectures for describing this information. The older architecture is known as a document type definition, or DTD. You'll find DTDs typically on older XML vocabularies. The DTD can either appear in the XML document, or more commonly can be linked to the XML document. I won't be dealing with DTDs at all in this course. But again, you should know how to recognize them when you see them.
The more recent architecture for validating XML is called the Schema architecture. Schemas are defined by the World Wide Web Consortium, or the W3C. But they are implemented by the tools you use to work with XML, such as Java APIs. You associate a schema with an XML document by pointing to a namespace string, and then optionally using prefixes to refer to those namespaces. Whenever you see an XML document that has something like xmlns, it's referring to a namespace. The string you see might look like a web address, but it doesn't necessarily point to a webpage. It's simply an arbitrary string. And it's up to the XML processor to decide whether that's meaningful or not. In this course, I won't be dealing with XML validation. I'm just going to focus on reading and creating XML files, but I will touch occasionally on how to deal with XML files that have namespaces and prefixes.
Here are some other important XML terms you'll hear. Encoding refers to the text format, or the Unicode format, of the XML document. The most common format is UTF-8. But you'll also see XML documents that use UTF-16. Your XML processor, in this case the Java API that you're working with, must be able to deal with the encoding of a particular XML document.
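Namespaces and prefixes, as described above, can be read back through an API. This is a minimal sketch assuming the JDK's DOM parser with namespace awareness switched on; the namespace string, the prefix `c`, and the class name are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class NamespaceSketch {
    // Hypothetical namespace string: it looks like a URL but is only an identifier.
    static final String NS = "http://example.com/customers";

    // The xmlns:c attribute binds the prefix c to the namespace string.
    static final String SAMPLE =
        "<c:customers xmlns:c=\"" + NS + "\">"
        + "<c:customer><c:name>Ann</c:name></c:customer>"
        + "</c:customers>";

    // Parses with namespace awareness, which the JDK's DOM factory leaves off by default.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        Element root = doc.getDocumentElement();
        System.out.println("namespace: " + root.getNamespaceURI());
        System.out.println("prefix: " + root.getPrefix());
        System.out.println("local name: " + root.getLocalName());
        // Look elements up by namespace and local name rather than by prefix.
        System.out.println("name elements: "
            + doc.getElementsByTagNameNS(NS, "name").getLength());
    }
}
```

Looking elements up by namespace string and local name keeps the code working even if a document chooses a different prefix for the same namespace.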
Comments are nodes in an XML document that contain text that can be ignored. Comments are typically only for human eyes. And they look like HTML comments, starting with the opening angle bracket, then an exclamation mark and two dashes, and ending with two dashes and the closing bracket.
An entity is a string that replaces a reserved character. There are five reserved characters in XML, and they're not all reserved everywhere. But one very good example of a character that's always reserved is the ampersand. The ampersand character is an illegal character unless it's wrapped in a CDATA section. So to represent it, say in a text node, you'll see it written out as an ampersand, then amp, then a semicolon. In both XML and HTML, all entities start with an ampersand and end with a semicolon.
A processing instruction is an instruction to an XML processor such as an XML API. One example is a style sheet instruction that might be used by a browser that opens a set of XML content and then applies a style sheet.
And finally, the term whitespace refers to the spaces, tabs, and line feeds that separate elements. You'll see a lot of whitespace in most XML files, because one of the goals of XML is to make it human readable. And when XML is all compacted together, it's a lot harder for the human eye to comprehend. That whitespace, however, is meaningless when you're trying to interpret XML as structured data. You'll see in the Java APIs for XML that many of them let you ignore the whitespace automatically. But other APIs, such as the older Simple API for XML (SAX), will report all of that text to you unless you explicitly turn it off.
So that's a review of the common XML terms that I'll use throughout this course. Again, the most common things that I'll be dealing with are elements, attributes, and CDATA sections. But the more you know about XML structure and terminology, the more effective you can be as a developer using XML in your Java applications.
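Comments, entities, processing instructions, and whitespace all show up as distinct things when a document is parsed. The following is a minimal sketch assuming the JDK's DOM parser in its default configuration; the sample document, the style sheet file name, and the class and helper names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

public class EntityWhitespaceSketch {
    // Hypothetical sample: a processing instruction, a comment, an entity, and indentation.
    static final String SAMPLE =
        "<?xml-stylesheet type=\"text/xsl\" href=\"customers.xsl\"?>\n"
        + "<customers>\n"
        + "  <!-- for human readers only -->\n"
        + "  <customer>\n"
        + "    <name>Smith &amp; Sons</name>\n"
        + "  </customer>\n"
        + "</customers>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Counts the direct children of a node that have the given DOM node type.
    static int countChildren(Node parent, short nodeType) {
        int count = 0;
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            if (kids.item(i).getNodeType() == nodeType) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        Node root = doc.getDocumentElement();

        // The processing instruction sits before the root element.
        ProcessingInstruction pi = (ProcessingInstruction) doc.getFirstChild();
        System.out.println("instruction target: " + pi.getTarget());

        // The entity &amp; comes back from the API as a plain ampersand.
        System.out.println("name = "
            + doc.getElementsByTagName("name").item(0).getTextContent());

        // Indentation between elements is reported as whitespace-only text nodes.
        System.out.println("elements under root: " + countChildren(root, Node.ELEMENT_NODE));
        System.out.println("comments under root: " + countChildren(root, Node.COMMENT_NODE));
        System.out.println("text nodes under root: " + countChildren(root, Node.TEXT_NODE));
    }
}
```

The text-node count reflects only the line breaks and indentation in the sample, which is why code that walks child nodes usually filters by node type first.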
- Choosing a Java-based XML API
- Reading XML as a string
- Comparing streaming and tree-based APIs
- Parsing XML with SAX
- Creating and reading XML with DOM
- Adding data to an XML document with JDOM
- Reading and writing XML with StAX
- Working with JAXB and annotated classes
- Comparing Simple XML Serialization to JAXB