Creating XML Documents: LinqXml10
The tasks in the LinqXml group are dedicated to the LINQ to XML interface,
designed for processing XML documents.
In addition to its own extension methods, the LINQ to XML interface includes
a set of classes that represent the various components of an XML document.
This set forms the XML Document Object Model (XML DOM), which we will call
X-DOM for brevity. The main classes of X-DOM, as well as the concepts used
when working with XML documents, are briefly described in the preamble to the
LinqXml group. Many X-DOM classes have properties that return sequences, and
some of their methods (in particular, constructors) accept sequences as
parameters. Wherever sequences are processed, the basic LINQ queries of the
LINQ to Objects interface can be used; we studied these queries in the tasks
of the LinqBegin and LinqObj groups.
It is natural to begin exploring LINQ to XML with the creation of XML
documents. The first subgroup of the LinqXml group is devoted to this topic.
Let's consider the last task of this subgroup.
LinqXml10°. Given the names of an existing text file and an XML document to be created. Create an XML document with a root element root, first-level elements line, and processing instructions (processing instructions are child nodes of the root element). If a line of the text file starts with the text "data:", then it (without the text "data:") is added to the XML document as the data for the next processing instruction named instr; otherwise, the line is added as a child text node to the next line element.
After the PT4Load software module creates the project template for this task,
the Visual Studio environment starts automatically, loads the created project,
and displays the LinqXml10.cs file. Here are the contents of this file:
// File: "LinqXml10"
using PT4;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml.Linq;
namespace PT4Tasks
{
    public class MyTask : PT
    {
        // When solving tasks of the LinqXml group, the following
        // additional methods defined in the taskbook are available:
        // (*) Show() and Show(cmt) (extension methods) - debug output
        //     of a sequence, cmt - string comment;
        // (*) Show(e => r) and Show(cmt, e => r) (extension methods) -
        //     debug output of r values, obtained from elements e
        //     of a sequence, cmt - string comment.
        public static void Solve()
        {
            Task("LinqXml10");
        }
    }
}
Like the templates considered earlier for the tasks of the LinqBegin and
LinqObj groups, this file contains a set of using directives and the
definition of the MyTask class with the Solve method, in which the solution
to the task should be written. The comment preceding the Solve method briefly
describes the variants of the Show method, intended for debug output of
sequences.
Note that the list of using directives includes a directive that imports the
System.Xml.Linq namespace, to which all classes of the X-DOM model belong.
The created template has no additional methods for the input and output of
sequences (such as the GetEnumerableInt and GetEnumerableString methods and
the Put extension method used in the tasks of the LinqBegin group). This is
because most tasks require processing an XML document stored in an external
file and writing the result to the same file. The XDocument class provides
special methods for reading and writing XML documents in files, and these
methods should be used when performing the tasks. The exceptions are the tasks
on creating XML documents, whose initial data is contained in "ordinary" text
files. In these tasks, as in the tasks of the LinqObj group, the
File.ReadLines method should be used; it reads the lines of the source text
file as a string sequence IEnumerable<string>. File names and additional
string data should, as usual, be input with the taskbook function GetString;
data of basic types should be output with the function Put, also defined in
the taskbook (such output is required only in the subgroup devoted to
analyzing the contents of an XML document).
After running the created program template, we will see on the screen
the taskbook window containing the task description, as well as an example of
initial data and correct results:
Although the file data in the provided figure is displayed in an
abbreviated form, the indicated fragments demonstrate all the features of
this data. The source text file contains lines
that are sets of words, and some lines
start with the text "data:" (in the figure, these are the first and fourth
lines); the resulting file contains an XML document in the
"us-ascii" encoding, including the root element root, whose child nodes
(i.e., nodes of the first level, since the root element
is considered a level zero node) are the line elements and the processing
instructions instr.
An element of an XML document always has a name; it can be represented either
as a pair of tags of the form <name>...</name> (where the ellipsis stands for
the child nodes of the element) or as a single self-closing tag <name />.
An element can also contain attributes.
A processing instruction is a special component of an XML document that is
enclosed in brackets of the form <? ?> (note that the declaration with which
an XML document begins is formatted in a similar way). The first word inside
these brackets is the name of the processing instruction (more precisely, the
name of its target application), and the text that follows is the data
associated with the instruction. Processing instructions owe their name to the
fact that they define not the content of the XML document but additional data
intended for the programs that process this document. Of course, the
processing instructions in the sample XML documents generated by the taskbook
are not associated with any programs and have only the formal features of
"real" instructions.
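The textual forms described above can be checked directly, since the ToString method of X-DOM nodes returns their XML representation. The following minimal sketch is not part of the taskbook; the class name NodeFormsDemo and the sample strings are chosen only for illustration.

```csharp
using System;
using System.Xml.Linq;

public static class NodeFormsDemo
{
    // Returns the XML text of an empty element, an element with
    // a child text node, and a processing instruction.
    public static string[] Forms()
    {
        return new[]
        {
            new XElement("line").ToString(),
            new XElement("line", "some text").ToString(),
            new XProcessingInstruction("instr", "some data").ToString()
        };
    }

    public static void Main()
    {
        foreach (string s in Forms())
            Console.WriteLine(s);
        // prints:
        // <line />
        // <line>some text</line>
        // <?instr some data?>
    }
}
```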
Data for the instr instructions should be extracted from lines of the source
file starting with the text "data:"; other lines of the file should
be used as the content of the line elements. Thus,
when performing the task, we will need to form, based on the
string sequence obtained from the source file, a
sequence of child nodes of the root element root for the
created XML document.
As already noted, the initial string sequence
is easiest to obtain using the File.ReadLines method. Starting from version 1.3 of the PT for LINQ taskbook, all
text files associated with tasks contain only characters from the ASCII set, so
when calling the ReadLines method, there is no need to specify an
additional parameter defining the character encoding (the characters will be correctly processed
using the UTF-8 encoding, which is selected by default):
var a = File.ReadLines(GetString());
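The remark about encodings can be illustrated outside the taskbook. The sketch below assumes a temporary file in place of the taskbook's source file; the class and method names are hypothetical. It writes ASCII-only lines and reads them back twice, with the default encoding and with an explicitly specified ASCII encoding.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class ReadLinesDemo
{
    // Writes the lines to a temporary ASCII file and checks that both
    // overloads of File.ReadLines return the original lines.
    public static bool SameWithBothEncodings(string[] lines)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllLines(path, lines, Encoding.ASCII);
            // ToList() forces the lazy sequence to be read completely,
            // so the file is closed before it is deleted.
            var byDefault = File.ReadLines(path).ToList();
            var asAscii = File.ReadLines(path, Encoding.ASCII).ToList();
            return byDefault.SequenceEqual(lines)
                && asAscii.SequenceEqual(lines);
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        var lines = new[] { "data:alpha beta", "gamma delta" };
        Console.WriteLine(SameWithBothEncodings(lines));
    }
}
```

Note that File.ReadLines returns a lazy sequence: the file is actually read only when the sequence is enumerated.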
To create both the XML document itself and its components,
constructors of the corresponding classes should be used. When
performing this task, we will need the following classes,
included in X-DOM: XDocument (XML document), XDeclaration
(document declaration), XElement (element), and XProcessingInstruction
(processing instruction).
An important feature of the constructors of the XDocument and XElement classes
is that they allow the content of the created document (or element) to be
specified immediately as a list of parameters. Some of these parameters may be
sequences; in this case, all elements of the sequences are added to the
created document or element. Some other methods of X-DOM classes have the same
feature. Thanks to it, an XML document can be built together with its content
in a single expression, and sequences obtained with LINQ methods can be used
when forming the content. This way of forming an XML document is called
functional construction. Its advantages are the brevity and clarity of the
resulting code, whose appearance makes it easy to see the structure of the
XML document or any of its fragments.
As an example of functional construction of an XML document,
here is a fragment that, based on a sequence of strings a
(obtained earlier from the source text file), forms a document with a
root element root and first-level line elements with textual
values taken from the sequence a in the same order:
XDocument d = new XDocument(
    new XDeclaration(null, "us-ascii", null),
    new XElement("root",
        a.Select(e => new XElement("line", e))));
In this fragment, the constructor of the XDocument class is called first. Its
parameter list contains a constructor that creates the document declaration
(only its second parameter, which defines the document encoding, needs to be
specified explicitly) and a constructor that creates the level zero element
root. In the constructor of the root element, its name is specified first,
and the subsequent parameters define its content. In this case, the content is
a sequence of elements (i.e., objects of type XElement) obtained from the
initial string sequence by means of the projection method Select. The lambda
expression of the Select method calls the XElement constructor with the string
constant "line" as the name and the string e as the content.
This fragment does not treat specially those lines of the initial sequence
that start with the text "data:" (they should be converted into processing
instructions, not into elements). We will correct this shortcoming later.
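For comparison, the same document can also be built step by step, by creating an empty root element and calling its Add method in a loop. The sketch below is only an illustration: the class ConstructionDemo and the sample strings are not part of the taskbook, and the string array stands in for the sequence read from the file. It builds the document in both ways and compares the results with the static method XNode.DeepEquals.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ConstructionDemo
{
    // Functional construction: the whole document is one expression.
    public static XDocument Functional(IEnumerable<string> a)
    {
        return new XDocument(
            new XDeclaration(null, "us-ascii", null),
            new XElement("root",
                a.Select(e => new XElement("line", e))));
    }

    // Step-by-step construction of the same document.
    public static XDocument StepByStep(IEnumerable<string> a)
    {
        var root = new XElement("root");
        foreach (string e in a)
            root.Add(new XElement("line", e));
        var d = new XDocument();
        d.Declaration = new XDeclaration(null, "us-ascii", null);
        d.Add(root);
        return d;
    }

    public static void Main()
    {
        var a = new[] { "first line", "second line" };
        // Compares the node trees of the two documents.
        Console.WriteLine(XNode.DeepEquals(Functional(a), StepByStep(a)));
        Console.WriteLine(Functional(a));
    }
}
```

In the step-by-step variant, the structure of the document has to be reconstructed from the order of the calls, whereas in the functional variant it is visible from the nesting of the constructors.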
It remains to save the obtained XML document under the required
name. For this, it is sufficient to use the Save method of the XDocument class,
specifying the name as a parameter:
d.Save(GetString());
When calling this method, it is not required to additionally specify the
encoding used, since it was previously specified in the declaration of the
XML document. Also note that the Save method by default formats
the XML document when saving it to a file, outputting each node on a
separate line (or on several lines, if the node is an
element containing a set of child elements) and adding at the beginning of
each line the necessary indents.
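The effect of the default formatting can be seen without writing a file, because the ToString method of X-DOM nodes accepts the same SaveOptions value as the Save method. The following sketch uses a small hand-made document; the class FormattingDemo and its contents are chosen only for illustration. Note that ToString, unlike Save, does not output the document declaration.

```csharp
using System;
using System.Xml.Linq;

public static class FormattingDemo
{
    // A small document with one element and one processing instruction.
    public static XDocument Sample()
    {
        return new XDocument(
            new XDeclaration(null, "us-ascii", null),
            new XElement("root",
                new XElement("line", "a b"),
                new XProcessingInstruction("instr", "c d")));
    }

    public static void Main()
    {
        XDocument d = Sample();
        // Default mode: each child node on its own indented line.
        Console.WriteLine(d.ToString());
        // Formatting disabled: the whole document on a single line.
        Console.WriteLine(d.ToString(SaveOptions.DisableFormatting));
        // prints as the last line:
        // <root><line>a b</line><?instr c d?></root>
    }
}
```

If an unformatted file were ever needed, the overload d.Save(name, SaveOptions.DisableFormatting) could be used; in the tasks of the LinqXml group the default formatting is sufficient.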
Combining the above statements, we get the first (not yet entirely correct)
variant of the solution:
public static void Solve()
{
    Task("LinqXml10");
    var a = File.ReadLines(GetString());
    XDocument d = new XDocument(
        new XDeclaration(null, "us-ascii", null),
        new XElement("root",
            a.Select(e => new XElement("line", e))));
    d.Save(GetString());
}
When running the program, a message about an
erroneous solution will be displayed in the window (to reduce the window size, the
section with the task description is hidden in it):
To obtain an XML document that meets the conditions of the task, the lines of
the source file that start with the text "data:" must be processed specially
when the document d is created: XProcessingInstruction objects, not XElement
objects, must be created for them. Such processing can be implemented by
using the ternary conditional operator in the lambda expression:
e => e.StartsWith("data:") ?
    new XProcessingInstruction("instr", e.Substring(5)) :
    new XElement("line", e)
The data of the created processing instruction is the substring of the string
e without the initial text "data:" (the number 5 is the starting index of the
substring, which equals the length of the text "data:"; recall that character
indexing starts from 0).
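The removal of the prefix can be checked separately. In the sketch below, the helper StripPrefix and the constant Prefix are hypothetical names introduced for illustration. As a variation on the solution, the magic number 5 is replaced by Prefix.Length, and an ordinal comparison is requested explicitly in StartsWith.

```csharp
using System;

public static class PrefixDemo
{
    const string Prefix = "data:";

    // Returns the line without the leading "data:" text,
    // or the line itself if it does not start with this text.
    public static string StripPrefix(string e)
    {
        return e.StartsWith(Prefix, StringComparison.Ordinal)
            ? e.Substring(Prefix.Length)   // same as e.Substring(5)
            : e;
    }

    public static void Main()
    {
        Console.WriteLine(StripPrefix("data:abc def"));  // prints "abc def"
        Console.WriteLine(StripPrefix("abc data:def"));  // prints "abc data:def"
    }
}
```

A line that consists only of the text "data:" yields an empty string, because Substring accepts a starting index equal to the length of the string.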
However, an attempt to compile the resulting program produces a compilation
error. The reason is that the compiler cannot determine the element type of
the returned sequence from this lambda expression (some elements have the
type XProcessingInstruction, and others the type XElement). To correct the
error, the ternary expression should be supplemented with a cast to some
common type, which will then be taken as the element type of the returned
sequence. Any type that is an ancestor of both XProcessingInstruction and
XElement can serve as such a common type. The closest common ancestor of
these classes is the XNode class, an abstract base class for all node
components of an XML document (the XDocument class is also a descendant of
XNode). Using the as cast operator, we obtain the following variant of the
lambda expression (the added fragment is as XNode at the end of the last
line):
e => e.StartsWith("data:") ?
    new XProcessingInstruction("instr", e.Substring(5)) :
    new XElement("line", e) as XNode
Instead of the XNode type, we could use other common ancestors of the
XProcessingInstruction and XElement classes, for example, XObject (the common
abstract base class for nodes and attributes of an XML document) or object
(the common ancestor of all .NET classes).
We emphasize that the cast does not change the actual type of the sequence
elements; it only ensures that all these elements are treated uniformly as
XML nodes, which allows the compiler to determine the type of the resulting
sequence (in our case IEnumerable<XNode>).
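That the actual types of the elements are preserved can be verified by printing the run-time type of each node. The following sketch applies the lambda expression with the cast to a small in-memory array instead of a file; the class CastDemo and the method ToNodes are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class CastDemo
{
    // The cast applies only to the last operand; the other operand is
    // then converted implicitly, so the lambda returns XNode.
    public static IEnumerable<XNode> ToNodes(IEnumerable<string> a)
    {
        return a.Select(e => e.StartsWith("data:") ?
            new XProcessingInstruction("instr", e.Substring(5)) :
            new XElement("line", e) as XNode);
    }

    public static void Main()
    {
        var a = new[] { "data:x y", "some words" };
        foreach (XNode n in ToNodes(a))
            Console.WriteLine(n.GetType().Name + ": " + n);
        // prints:
        // XProcessingInstruction: <?instr x y?>
        // XElement: <line>some words</line>
    }
}
```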
Let's present the final variant of the solution:
public static void Solve()
{
    Task("LinqXml10");
    var a = File.ReadLines(GetString());
    XDocument d = new XDocument(
        new XDeclaration(null, "us-ascii", null),
        new XElement("root",
            a.Select(e => e.StartsWith("data:") ?
                new XProcessingInstruction("instr", e.Substring(5)) :
                new XElement("line", e) as XNode)));
    d.Save(GetString());
}
After five test runs of the program, we will get a message that
the task is completed.
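To experiment with this construction outside the taskbook, the same code can be wrapped in a method that receives the file names explicitly. The sketch below is only an approximation of the task environment: the class StandaloneLinqXml10 and the methods ConvertFile and RoundTrip are hypothetical, temporary files stand in for the files prepared by the taskbook, and GetString is replaced by ordinary parameters.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class StandaloneLinqXml10
{
    // The same construction as in Solve, with explicit file names.
    public static void ConvertFile(string textPath, string xmlPath)
    {
        var a = File.ReadLines(textPath);
        XDocument d = new XDocument(
            new XDeclaration(null, "us-ascii", null),
            new XElement("root",
                a.Select(e => e.StartsWith("data:") ?
                    new XProcessingInstruction("instr", e.Substring(5)) :
                    new XElement("line", e) as XNode)));
        d.Save(xmlPath);
    }

    // Converts the lines through temporary files, loads the saved
    // document, and describes the child nodes of its root element.
    public static List<string> RoundTrip(string[] lines)
    {
        string textPath = Path.GetTempFileName();
        string xmlPath = Path.GetTempFileName();
        try
        {
            File.WriteAllLines(textPath, lines);
            ConvertFile(textPath, xmlPath);
            return XDocument.Load(xmlPath).Root.Nodes()
                .Select(n => n.GetType().Name + ": " + n)
                .ToList();
        }
        finally
        {
            File.Delete(textPath);
            File.Delete(xmlPath);
        }
    }

    public static void Main()
    {
        var lines = new[] { "data:x y", "some words", "data:z" };
        foreach (string s in RoundTrip(lines))
            Console.WriteLine(s);
    }
}
```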