Programming Taskbook

Electronic problem book on programming

Creating XML Documents: LinqXml10

The tasks in the LinqXml group are dedicated to the LINQ to XML interface, designed for processing XML documents.

In addition to extension methods, the LINQ to XML interface includes a set of classes that correspond to the various components of an XML document. This set forms the XML Document Object Model (XML DOM), which we will call X-DOM for brevity. The main X-DOM classes, as well as the concepts used when working with XML documents, are briefly described in the preamble to the LinqXml group. Many X-DOM classes have properties that return sequences, and some of their methods (in particular, constructors) accept sequences as parameters. Wherever sequences are processed, the basic LINQ queries of the LINQ to Objects interface can be used; we studied them when performing tasks from the LinqBegin and LinqObj groups.

Familiarization with LINQ to XML naturally begins with creating XML documents. The first subgroup of the LinqXml group is devoted to this topic. Let's consider the last task in this subgroup.

LinqXml10°. Given the names of an existing text file and an XML document to be created. Create an XML document with a root element root, first-level elements line, and processing instructions (processing instructions are child nodes of the root element). If a line of the text file starts with the text "data:", then it (without the text "data:") is added to the XML document as the data for the next processing instruction named instr; otherwise, the line is added as a child text node to the next line element.

After the project template for this task has been created with the PT4Load software module, Visual Studio starts automatically and loads the created project, and the LinqXml10.cs file appears on the screen. Here are the contents of this file:

// File: "LinqXml10"
using PT4;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml.Linq;

namespace PT4Tasks
{
    public class MyTask : PT
    {
        // When solving tasks of the LinqXml group, the following
        // additional methods defined in the taskbook are available:
        // (*) Show() and Show(cmt) (extension methods) - debug output
        //       of a sequence, cmt - string comment;
        // (*) Show(e => r) and Show(cmt, e => r) (extension methods) -
        //       debug output of r values, obtained from elements e
        //       of a sequence, cmt - string comment.

        public static void Solve()
        {
            Task("LinqXml10");

        }
    }
}

As with the previously considered templates for tasks in the LinqBegin and LinqObj groups, this file contains a set of using directives and a definition of the MyTask class with the Solve function, into which the solution must be entered. The comment preceding the Solve function briefly describes the variants of the Show method, which is intended for debug output of sequences.

Note that the list of using directives includes a directive for the System.Xml.Linq namespace, which contains all classes of the X-DOM model.

The created template has no additional methods for input and output of sequences (such as the GetEnumerableInt and GetEnumerableString methods and the Put extension method used in the LinqBegin group). The reason is that most tasks require processing an XML document stored in an external file and writing the result to the same file. The XDocument class provides special methods for reading and writing XML documents, and these methods should be used when performing tasks. The exceptions are tasks on creating XML documents, whose initial data is contained in "ordinary" text files. In such tasks, as in tasks of the LinqObj group, the File.ReadLines method should be used; it reads all lines of the source text file into a string sequence of type IEnumerable<string>. As usual, file names and additional string data are input with the taskbook function GetString. Data of basic types is output with the function Put, which is also defined in the taskbook (such output is required only in the subgroup devoted to analyzing the contents of an XML document).

After running the created program template, we will see on the screen the taskbook window containing the task description, as well as an example of initial data and correct results:

Although the figure shows the file data in abbreviated form, the displayed fragments demonstrate all the features of this data. The source text file contains lines that are sets of words, and some lines start with the text "data:" (in the figure, these are the first and fourth lines). The resulting file contains an XML document in the "us-ascii" encoding with the root element root, whose child nodes are the line elements and the instr processing instructions. These child nodes are first-level nodes, since the root element is considered a level zero node.

An element of an XML document always has a name. It can be represented by paired tags of the form <name>...</name> (child nodes of the element can appear in place of the ellipsis) or by a combined tag <name />. An element can also contain attributes.
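The three forms of an element mentioned above can be observed directly with the XElement class. The following is only an illustrative sketch: the class name ElementForms and the attribute name id are hypothetical and do not come from the taskbook.

```csharp
using System;
using System.Xml.Linq;

public static class ElementForms
{
    // Element without content: serialized as a combined tag.
    public static string Empty()
    {
        return new XElement("line").ToString();
    }

    // Element with a text child: serialized as paired tags.
    public static string WithText()
    {
        return new XElement("line", "some words").ToString();
    }

    // Element with an attribute; the attribute name "id" is hypothetical.
    public static string WithAttribute()
    {
        return new XElement("line", new XAttribute("id", 1)).ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Empty());          // <line />
        Console.WriteLine(WithText());       // <line>some words</line>
        Console.WriteLine(WithAttribute());  // <line id="1" />
    }
}
```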

A processing instruction is a special component of an XML document, enclosed in brackets of the form <? ?> (note that the XML declaration, with which the document begins, is formatted in a similar way). The first word in these brackets is the name of the processing instruction (more precisely, the name of its target application), and the subsequent text is the data associated with the instruction. Processing instructions are so called because they define not the content of the XML document but additional data intended for programs that process the document. Of course, the processing instructions in the sample XML documents generated by the taskbook are not associated with any programs and have only the formal features of "real" instructions.
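In X-DOM, the name and the data of a processing instruction are available through the Target and Data properties of the XProcessingInstruction class. A minimal sketch follows; the class name PiDemo and the sample data string are hypothetical.

```csharp
using System;
using System.Xml.Linq;

public static class PiDemo
{
    // Creates a processing instruction with the name "instr" used in the task.
    public static XProcessingInstruction Make(string data)
    {
        return new XProcessingInstruction("instr", data);
    }

    public static void Main()
    {
        XProcessingInstruction pi = Make("sample data");
        Console.WriteLine(pi.Target);  // instr
        Console.WriteLine(pi.Data);    // sample data
        Console.WriteLine(pi);         // <?instr sample data?>
    }
}
```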

Data for the instr instructions should be extracted from the lines of the source file that start with the text "data:"; the other lines should be used as the content of the line elements. Thus, to perform the task, we need to transform the string sequence obtained from the source file into a sequence of child nodes of the root element of the created XML document.

As already noted, the easiest way to obtain the initial string sequence is the File.ReadLines method. Starting from version 1.3 of the PT for LINQ taskbook, all text files associated with tasks contain only characters from the ASCII set, so there is no need to pass the ReadLines method an additional parameter that defines the character encoding (the characters are processed correctly in the UTF-8 encoding, which is selected by default):

var a = File.ReadLines(GetString());
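To try File.ReadLines outside the taskbook, a temporary file can stand in for the source file of the task. The sketch below makes that assumption; the class name ReadLinesDemo and the sample lines are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ReadLinesDemo
{
    // Reads all lines of a text file. No encoding argument is passed,
    // so UTF-8 is used by default; ASCII text is read correctly this way.
    public static List<string> ReadAll(string path)
    {
        // ReadLines is lazy; ToList forces the whole file to be read here.
        return File.ReadLines(path).ToList();
    }

    public static void Main()
    {
        // Hypothetical stand-in for the file whose name the taskbook supplies.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllLines(path, new[] { "data:first second", "plain words" });
            foreach (string s in ReadAll(path))
                Console.WriteLine(s);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Because ReadLines returns a lazy IEnumerable<string>, the file is actually read only when the sequence is enumerated, for example by a LINQ query or by the constructor of an X-DOM object.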

To create both the XML document itself and its components, constructors of the corresponding classes should be used. When performing this task, we will need the following classes, included in X-DOM: XDocument (XML document), XDeclaration (document declaration), XElement (element), and XProcessingInstruction (processing instruction).

An important feature of the constructors of the XDocument and XElement classes is that they accept the content of the created document (or element) as a parameter list. Some of these parameters can be sequences, in which case all elements of the sequences are added to the created document or element. Some other methods of X-DOM classes have the same feature. Thanks to it, an XML document can be built together with its content in a single expression, and sequences obtained with LINQ methods can be used to form the content. This way of forming an XML document is called functional construction. Its advantages are the brevity and clarity of the resulting code, whose layout makes it easy to see the structure of the XML document or any of its fragments.
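The way single objects and sequences can be mixed in one parameter list may be sketched as follows. The element names first and item and the class name ConstructionDemo are hypothetical and serve only as an illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ConstructionDemo
{
    // Builds <root> from one fixed child element and a sequence of elements.
    public static XElement Build(IEnumerable<int> numbers)
    {
        return new XElement("root",
            new XElement("first", "fixed"),                   // single object
            numbers.Select(n => new XElement("item", n)));    // sequence: each element is added
    }

    public static void Main()
    {
        XElement root = Build(new[] { 1, 2 });
        // Prints <root><first>fixed</first><item>1</item><item>2</item></root>
        Console.WriteLine(root.ToString(SaveOptions.DisableFormatting));
    }
}
```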

As an example of functional construction of an XML document, here is a fragment that, based on a sequence of strings a (obtained earlier from the source text file), forms a document with a root element root and first-level line elements with textual values taken from the sequence a in the same order:

XDocument d = new XDocument(
  new XDeclaration(null, "us-ascii", null),
  new XElement("root",
    a.Select(e => new XElement("line", e))));

In this fragment, the constructor of the XDocument class is called first. Its parameter list contains a constructor that creates the document declaration (it is sufficient to define explicitly only its second parameter, which specifies the document encoding) and a constructor that creates the level zero element root. In the constructor of the root element, the name of the element is specified first, and the subsequent parameters define its content. In this case, the content is a sequence of elements (that is, objects of type XElement) obtained from the initial string sequence with the projection method Select. The lambda expression of the Select method calls the XElement constructor, passing the string constant "line" as the name and the string e as the content.

Remark. When content is defined in the constructor of the XElement class (and in other X-DOM methods that support functional construction), strings are converted into the simplest type of XML node, a text node, which corresponds to the XText class in the X-DOM model. We emphasize that during functional construction strings are not interpreted as character sequences (unlike any other sequence parameters) but are treated as single objects.
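The treatment of a string as a single text node can be checked with a short sketch (the class name TextNodeDemo is hypothetical). It also shows that an array of strings, unlike a single string, is processed as a sequence.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class TextNodeDemo
{
    // A single string becomes the content of the element as a whole.
    public static XElement FromString(string s)
    {
        return new XElement("line", s);
    }

    // A string array is a sequence: each of its strings is added to the content.
    public static XElement FromStrings(string[] parts)
    {
        return new XElement("line", parts);
    }

    public static void Main()
    {
        XElement e = FromString("abc");
        Console.WriteLine(e.Nodes().Count());       // 1: one node, not three characters
        Console.WriteLine(e.FirstNode is XText);    // True
        Console.WriteLine(FromStrings(new[] { "a", "b" }).Value);  // ab
    }
}
```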

This fragment gives no special treatment to the lines of the initial sequence that start with the text "data:" (they should be converted not into elements but into processing instructions). We will correct this shortcoming later.

It remains to save the obtained XML document under the required name. For this, it is sufficient to use the Save method of the XDocument class, specifying the name as a parameter:

d.Save(GetString());

When this method is called, there is no need to specify the encoding again, since it has already been specified in the declaration of the XML document. Also note that by default the Save method formats the XML document when saving it to a file: each node is output on a separate line (or on several lines, if the node is an element with a set of child elements), and the necessary indents are added at the beginning of each line.

Remark. When performing tasks, automatic formatting of the XML document is very useful, as it simplifies the analysis of the resulting document and its comparison with the "correct" sample. If the program generates an XML document that is not intended to be read by a person, formatting can be disabled by passing the optional second parameter SaveOptions.DisableFormatting to the Save method; this reduces the size of the file that contains the document.

A textual representation of an XML document can be obtained without saving it to a file. It is sufficient to call the ToString method of the XDocument object; this method returns a string and can also take the optional parameter SaveOptions.DisableFormatting. Note that the returned string does not include the XML declaration.

The Save method is also available for objects of type XElement; it saves the textual representation of a separate XML element to a file. The ToString method of an XElement object returns a string with the textual representation of the XML element.
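The difference between the formatted and the unformatted textual representation can be seen in the following sketch. It uses the ToString method of an XElement object; the class name FormatDemo and the sample content are hypothetical.

```csharp
using System;
using System.Xml.Linq;

public static class FormatDemo
{
    // Builds a small element with two child elements.
    public static XElement Build()
    {
        return new XElement("root",
            new XElement("line", "a"),
            new XElement("line", "b"));
    }

    public static void Main()
    {
        XElement root = Build();
        // Formatted output: each child element on its own indented line.
        Console.WriteLine(root.ToString());
        // Unformatted output on one line:
        // <root><line>a</line><line>b</line></root>
        Console.WriteLine(root.ToString(SaveOptions.DisableFormatting));
    }
}
```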

Combining the above statements, we get the first (not yet entirely correct) variant of the solution:

public static void Solve()
{
  Task("LinqXml10");
  var a = File.ReadLines(GetString());
  XDocument d = new XDocument(
    new XDeclaration(null, "us-ascii", null),
    new XElement("root",
      a.Select(e => new XElement("line", e))));
  d.Save(GetString());
}

When running the program, a message about an erroneous solution will be displayed in the window (to reduce the window size, the section with the task description is hidden in it):

To obtain an XML document that meets the conditions of the task, the lines of the source file that start with the text "data:" need special processing when the document d is created: XProcessingInstruction objects must be created for them instead of XElement objects. Such processing can be implemented by including a ternary operation in the lambda expression:

e => e.StartsWith("data:") ?
  new XProcessingInstruction("instr", e.Substring(5)) :
  new XElement("line", e)

When the processing instruction is created, its data is the substring of the string e without the initial text "data:" (the number 5 is the starting index of the substring; recall that character indexing starts from 0, and the text "data:" occupies the indices from 0 to 4).
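The index arithmetic can be checked with a small sketch. The helper name StripPrefix and the class name PrefixDemo are hypothetical; the sketch also passes StringComparison.Ordinal to StartsWith, which is an optional variation that makes the comparison independent of the current culture and is not part of the solution in the text.

```csharp
using System;

public static class PrefixDemo
{
    private const string Prefix = "data:";   // 5 characters, indices 0..4

    // Returns the text after the "data:" prefix, or null if the prefix is absent.
    public static string StripPrefix(string line)
    {
        if (!line.StartsWith(Prefix, StringComparison.Ordinal))
            return null;
        // Prefix.Length equals 5, the starting index used in the text.
        return line.Substring(Prefix.Length);
    }

    public static void Main()
    {
        Console.WriteLine(Prefix.Length);                       // 5
        Console.WriteLine(StripPrefix("data:first second"));    // first second
        Console.WriteLine(StripPrefix("plain words") == null);  // True
    }
}
```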

However, an attempt to compile the resulting program produces a compilation error. The reason is that the type of the elements of the returned sequence cannot be determined from this lambda expression: some elements would have the type XProcessingInstruction and others the type XElement. To correct the error, a cast to some common type should be added to the ternary expression; this type will then be taken as the element type of the returned sequence. Any type that is an ancestor of both XProcessingInstruction and XElement can serve as the common type. The closest common ancestor of these classes is XNode, the abstract base class for all node components of an XML document (the XDocument class is also a descendant of XNode). Using the cast operation as, we obtain the following variant of the lambda expression (the added fragment is as XNode at the end):

e => e.StartsWith("data:") ?
  new XProcessingInstruction("instr", e.Substring(5)) :
  new XElement("line", e) as XNode

Remark. The as operation has a higher priority than the ternary operation, so it applies only to the expression after the colon. However, this is sufficient for successful compilation: when the compiler analyzes the types of the two branches (now XProcessingInstruction and XNode), it finds that XProcessingInstruction can be implicitly converted to XNode and performs this conversion automatically, which results in a sequence of elements of type XNode. We would achieve the same result by writing as XNode before the colon instead. In contrast, applying as XNode to the entire ternary expression enclosed in parentheses is not a dependable alternative, since the branches inside the parentheses would still have different types.

Instead of the XNode type, we could use other common ancestor classes of the XProcessingInstruction and XElement classes, for example, XObject (the common abstract base class for nodes and attributes of an XML document) or object (the common ancestor of all .NET classes).

We emphasize that the type cast does not change the actual type of the sequence elements. It only ensures that all these elements are uniformly interpreted as XML nodes, which allows the compiler to determine the type of the resulting sequence (in our case, IEnumerable<XNode>).
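That the cast affects only the static type of the sequence, and not the objects themselves, can be verified outside the taskbook. In the sketch below the class name CastDemo, the method name Convert, and the sample lines are hypothetical; the lambda expression itself follows the variant discussed above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class CastDemo
{
    // Converts lines into nodes; the static element type is XNode.
    public static List<XNode> Convert(IEnumerable<string> lines)
    {
        return lines.Select(e => e.StartsWith("data:") ?
            new XProcessingInstruction("instr", e.Substring(5)) :
            new XElement("line", e) as XNode).ToList();
    }

    public static void Main()
    {
        List<XNode> nodes = Convert(new[] { "data:first", "plain words" });
        // The actual (run-time) types of the objects are preserved.
        foreach (XNode n in nodes)
            Console.WriteLine(n.GetType().Name + ": " + n);
        // Expected output:
        // XProcessingInstruction: <?instr first?>
        // XElement: <line>plain words</line>
    }
}
```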

Here is the final variant of the solution:

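public static void Solve()
{
  Task("LinqXml10");
  // Read the lines of the source text file (its name is input first).
  var a = File.ReadLines(GetString());
  XDocument d = new XDocument(
    // Only the encoding needs to be specified in the declaration.
    new XDeclaration(null, "us-ascii", null),
    new XElement("root",
      // Lines starting with "data:" become instr processing instructions;
      // all other lines become line elements.
      a.Select(e => e.StartsWith("data:") ?
        new XProcessingInstruction("instr", e.Substring(5)) :
        new XElement("line", e) as XNode)));
  // Save the document under the name that is input second.
  d.Save(GetString());
}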

After five test runs of the program, we will get a message that the task is completed.

Last revised:
01.01.2026