Implement data access
- 10/11/2018
- Skill 4.1: Perform I/O operations
- Skill 4.2: Consume data
- Skill 4.3: Query and manipulate data and objects by using LINQ
- Skill 4.4: Serialize and deserialize data by using binary serialization, custom serialization, XML Serializer, JSON Serializer, and Data Contract Serializer
- Skill 4.5: Store data in and retrieve data from collections
- Thought experiments
- Thought experiment answers
- Chapter summary
Thought experiment answers
This section provides the solutions for the tasks included in the thought experiments.
1 Perform I/O operations
It is not possible for a program to create an instance of the Stream type, because the Stream class is defined as abstract and intended to be used as the base class for child classes that contain implementations of the behaviors described by Stream.
The file pointer value is managed by a stream and specifies the position at which the next input/output transaction will be performed. When reading a file, the file pointer starts at the beginning of the file. If a program wishes to “skip” some locations in the file, the file pointer can be updated to the new location. It is faster to update the file pointer than it is to move down a file by reading from it. When writing into a file, the file pointer is updated as the file is written. When a file is opened in the “append” mode, the file pointer is moved to the end of the file after the file has been opened.
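Moving the file pointer can be sketched as follows. This is a minimal illustration, not code from the chapter: the class and method names are hypothetical, and the method accepts any seekable Stream so that it can be tried with a MemoryStream as well as a FileStream.

```csharp
using System.IO;
using System.Text;

public static class FilePointerSketch
{
    // Reads up to `count` bytes starting at byte `offset`. The file pointer
    // is moved directly to the offset instead of reading and discarding
    // the bytes that come before it.
    public static string ReadAt(Stream stream, long offset, int count)
    {
        stream.Seek(offset, SeekOrigin.Begin);      // update the file pointer
        byte[] buffer = new byte[count];
        int read = stream.Read(buffer, 0, count);   // advances the pointer by `read`
        return Encoding.ASCII.GetString(buffer, 0, read);
    }
}
```

For a MemoryStream holding the ASCII text "hello world", calling FilePointerSketch.ReadAt(stream, 6, 5) returns "world" and leaves the Position property of the stream at 11.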
Unicode is a mapping between values and text characters. As an example, the Greek letter π has the Unicode value 960, and the mathematical italic form of the same letter, 𝜋, has the Unicode value 120587. The UTF8 standard maps each Unicode character onto one or more 8-bit storage locations. The UTF32 standard maps each Unicode character onto a single 32-bit storage location.
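The storage cost of a character in each encoding can be checked with the Encoding class. The helper names below are hypothetical and exist only for this sketch.

```csharp
using System.Text;

public static class EncodingSketch
{
    // Number of bytes needed to store the text as UTF8.
    public static int Utf8ByteCount(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // Number of bytes needed to store the text as UTF32.
    public static int Utf32ByteCount(string text)
    {
        return Encoding.UTF32.GetByteCount(text);
    }
}
```

The letter "A" needs 1 byte in UTF8 and 4 bytes in UTF32. The Greek letter π ("\u03C0") needs 2 bytes in UTF8 and 4 bytes in UTF32.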
The file system on a computer does not make a distinction between a “text” file and a “binary” file. This distinction is made by the programs using the file system to store files. In the case of C#, a text file is one that contains values that represent text. The values in the text file will be encoded using a standard that maps numeric values onto characters. The Unicode standard is frequently used for this.
If a stream is opened for ReadWrite access, the program can both read from and write into the file. The file will usually hold fixed length records, so that a single record in the middle of a file can be changed without affecting any of the other records in the file.
The TextWriter class is an abstract class that specifies operations that can be performed to write text into a stream. The StreamWriter class is a child of the TextWriter class that can be instantiated and used to write text into a stream.
A program can construct a stream from a stream to allow two streams to be “chained” together. The output of one stream can then be sent to the input of the next. The example used was that we could use a compression stream in conjunction with a file stream to create a stream that would compress data as it was written into a file.
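A minimal sketch of stream chaining is shown here. It assumes a MemoryStream as the final destination so that the example has no dependency on the file system; a FileStream could be substituted in the same position. The class and method names are hypothetical.

```csharp
using System.IO;
using System.IO.Compression;

public static class ChainedStreamSketch
{
    // writer -> gzip -> target: text written to the writer is compressed
    // by the GZipStream before it reaches the underlying stream.
    public static byte[] Compress(string text)
    {
        using (MemoryStream target = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(target, CompressionMode.Compress))
            using (StreamWriter writer = new StreamWriter(gzip))
            {
                writer.Write(text);
            }   // disposing the writer flushes and closes the whole chain
            return target.ToArray();
        }
    }

    // reader <- gzip <- source: the same chain in the other direction.
    public static string Decompress(byte[] data)
    {
        using (MemoryStream source = new MemoryStream(data))
        using (GZipStream gzip = new GZipStream(source, CompressionMode.Decompress))
        using (StreamReader reader = new StreamReader(gzip))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Passing the result of Compress to Decompress returns the original text.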
The File class provides a set of very useful methods that can be used to create and populate files with only a small number of statements. It also provides a number of very useful file management commands.
The precise way in which a given file system works is completely hidden from the programs that are using it. There is no need to recompile a program to change to a different file system.
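If a program tries to delete a directory that is not empty, the delete action will fail with an exception. The program must delete all the files and subdirectories in a directory before the directory itself can be removed, or call the overload of Directory.Delete that accepts a recursive flag so that the contents are removed as well.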
An absolute file path will start with the drive letter on which the file is stored and contain all the directories to be traversed to reach the file. A relative path will not start with a drive letter.
Programs frequently have to create filenames by adding drives, directory names, and filenames together. The Path class provides a static Combine method that makes sure that all the path separators are correctly inserted into the path that is created. This is much less prone to error than trying to make sure that the correct number of “backslash” characters has been included in the name.
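As a small sketch of the Combine method, the hypothetical helper below builds a path from three parts. The separator that is inserted depends on the platform the program is running on.

```csharp
using System.IO;

public static class PathSketch
{
    // Joins the parts with the correct separator for the current platform.
    public static string BuildPath(string root, string folder, string fileName)
    {
        return Path.Combine(root, folder, fileName);
    }
}
```

On Windows, PathSketch.BuildPath("data", "logs", "app.txt") produces data\logs\app.txt. Note that if a later argument is itself a rooted path, Combine discards the arguments that come before it.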
HTTP stands for “Hyper Text Transfer Protocol.” It defines the way that a client can give instructions to a web server. HTML stands for “Hyper Text Markup Language.” It defines the format of documents that are sent back from the server in response to an HTTP request.
The HttpWebRequest class is extremely customizable and provides much more flexibility than the WebClient class when creating requests to be sent to web servers. Some forms of web request can only be sent using an HttpWebRequest.
The WebClient and HttpClient classes support the use of async and await for making asynchronous web requests.
It is a good idea to perform file operations asynchronously because they are frequently the slowest actions performed on a computer. Writing to a physical disk takes much longer than calculations, and disk transactions may take much longer if the system is busy.
2 Consume data
The idea behind a database server is that it can provide access to a central store of data. It is a bad idea for each user to have their own copy of the data because the copies can become updated in different ways. However, it is possible for a system to keep a local copy of a database for use when the system is not connected to a network and then apply the updates to a database when network connectivity is available.
A database server is designed to ensure that the stored data is always consistent. If two users update a particular record at the same time the updates will be performed in sequence and the server will make sure that data is not corrupted. It is possible for a user of a database to flag actions as atomic so that they are either completed or the database is “rolled back” to the state it had before the action.
If the ID of an item is being used as a “key” to uniquely identify the item, it is impossible to create two items with the same ID. The database will generate ID values for items automatically when they are added to the database.
The database connection string is used to create a connection to the database. If the database is not on the same computer as the program using it, the connection string will include the network address of the server and authentication information. Armed with this information anyone can create a connection to the database and send it SQL commands, which would be a bad thing.
SQL queries are the lowest level of communication with the database. However, you have seen that in an ASP.NET application you can perform actions on objects in the program and then update the contents of the database with the new objects.
An SQL command is a construction that contains command and data elements. An SQL injection attack is performed by “injecting” SQL commands into the data parts of the command. For example, if the user of a database is asked to enter a new name for the customer, they can add malicious SQL commands to the name text. If this text is used to build the command sent to the database, these commands are obeyed by the database. These attacks can be prevented by using parameterized SQL commands, which specifically separate the data from the command elements.
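The difference between the two approaches can be sketched as follows. The Customers table, its Name and Id columns, and the method names are all assumptions made for this illustration, and the @name parameter syntax is the one used by SQL Server. The sketch uses the provider-neutral classes in System.Data.Common, so any DbConnection could be passed in.

```csharp
using System.Data.Common;

public static class SqlParameterSketch
{
    // UNSAFE: the user's text becomes part of the command itself, so any
    // SQL it contains will be obeyed by the database.
    public static string BuildUnsafeSql(string newName, int customerId)
    {
        return "UPDATE Customers SET Name = '" + newName + "' WHERE Id = " + customerId;
    }

    // SAFER: the command text is fixed and the user's text travels
    // separately as a parameter value, so it is only ever treated as data.
    public static DbCommand BuildSafeCommand(DbConnection connection, string newName, int customerId)
    {
        DbCommand command = connection.CreateCommand();
        command.CommandText = "UPDATE Customers SET Name = @name WHERE Id = @id";

        DbParameter nameParameter = command.CreateParameter();
        nameParameter.ParameterName = "@name";
        nameParameter.Value = newName;
        command.Parameters.Add(nameParameter);

        DbParameter idParameter = command.CreateParameter();
        idParameter.ParameterName = "@id";
        idParameter.Value = customerId;
        command.Parameters.Add(idParameter);

        return command;
    }
}
```

If a user enters the name x'; DROP TABLE Customers; -- then the string built by BuildUnsafeSql contains a second, destructive command. With BuildSafeCommand the same input is stored as an odd-looking name and nothing more.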
If your application needs to be able to upload data to a server application it can use the Representational State Transfer (REST) model. This makes use of the HTTP command set to allow an application to send data between the client and the server. The data can be XML or JSON documents.
A web service provides the client application with a “proxy object” that represents a connection to the service. The client can call methods on the proxy to interact with the service. This provides a much richer set of options than simply downloading a file.
3 Query and manipulate data and objects by using LINQ
The phrase “Language Integrated Query” refers to the way that a data query can be expressed using language elements in a format called “query comprehension”. The LINQ operators (from, where, select, join, etc.) allow the programmer to express their intent directly, without having to make a chain of method calls to create a query. That said, we know that the compiler actually converts “query comprehension” notation queries into a chain of method calls, and programmers can create “method-based” queries if they prefer.
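The two notations can be compared side by side. In this sketch (the class and method names are hypothetical) both methods express the same query and produce the same result.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class QueryStyleSketch
{
    // "Query comprehension" notation.
    public static List<string> LongNamesQuery(IEnumerable<string> names)
    {
        var query = from name in names
                    where name.Length > 3
                    orderby name
                    select name.ToUpper();
        return query.ToList();
    }

    // The equivalent "method-based" notation.
    public static List<string> LongNamesMethod(IEnumerable<string> names)
    {
        return names.Where(name => name.Length > 3)
                    .OrderBy(name => name)
                    .Select(name => name.ToUpper())
                    .ToList();
    }
}
```

Given the names Rob, Bert, and Alice, both methods return ALICE followed by BERT.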
LINQ doesn’t create any new data manipulation features, but it does make it much, much easier for a programmer to express what they want to do.
Building your own SQL queries is slightly more efficient than using LINQ. The very first time a LINQ query is used it will be compiled into the method sequence that is then called to deliver the result. This is rarely a problem from a performance point of view, however, and LINQ makes programmers more efficient.
When you declare a variable using the var keyword you are asking the compiler to infer the type of the variable from the value being assigned to it. If the declaration does not assign an initial value to the variable (as in this case) the compiler will refuse to compile the statement because it has no value from which it can infer the type of the variable.
An anonymous type is a type that has no name. This is not a very helpful thing to say, but it is true. Normally a type is defined and then the new keyword is used to make new instances of that type. In the case of an anonymous type the instance is created without the associated type. The object initializer syntax (which allows a programmer to initialize public properties of a class when an instance is created) also allows objects to be created without an associated type. In the case of LINQ these objects can be returned as the results of queries. The objects contain custom result types that exactly match the data requested. Note that these objects must be referred to using the var keyword.
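A short sketch of an anonymous type used as a query result follows. The Track class is a hypothetical stand-in with only the members the query needs.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class AnonymousTypeSketch
{
    // Hypothetical track type for this example.
    public class Track
    {
        public string Title { get; set; }
        public int Length { get; set; }   // length in seconds
    }

    public static List<string> Describe(IEnumerable<Track> tracks)
    {
        // Each result is an instance of a type that has no name. It has
        // exactly two properties: Title and Minutes.
        var summaries = from track in tracks
                        select new { track.Title, Minutes = track.Length / 60 };

        List<string> lines = new List<string>();
        foreach (var summary in summaries)   // var is required here
        {
            lines.Add(summary.Title + ": " + summary.Minutes + " min");
        }
        return lines;
    }
}
```

The anonymous objects never leave the method, because there is no type name that could be written in a return type or parameter list.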
A LINQ query will return a result that describes an iteration that can then be processed using a foreach construction. If the query result contains many result values, it takes a long time (and uses up a lot of memory) to actually generate that result when the query is performed. So instead, the iteration is evaluated and each result is generated in turn when it is to be consumed.
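This deferred behavior can be observed by changing the source collection after the query has been created but before it is consumed. The sketch below uses a hypothetical method name and a small list of integers.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DeferredSketch
{
    public static int CountEvensAfterLateAdd()
    {
        List<int> numbers = new List<int> { 1, 2, 3 };

        // This only describes the iteration. Nothing is evaluated yet.
        IEnumerable<int> evens = numbers.Where(n => n % 2 == 0);

        // Added after the query was created.
        numbers.Add(4);

        // The query is evaluated now, so it sees both 2 and 4.
        return evens.Count();
    }
}
```

The method returns 2. A program that needs a fixed snapshot of the results can force evaluation at a known point by calling ToList or ToArray on the query.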
The group operator creates a new dataset grouped around the value of one item in the input dataset. You used it in the MusicTracks application to take a list of all the tracks (each track containing an ArtistID value) and create a group that contains one entry for each ArtistID value in the dataset. You can then use aggregate operations on the group to do things such as count the number of tracks and sum the total length of the tracks for each artist.
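A grouping query of this kind might look like the sketch below. It is not the MusicTracks code itself: the Track class here is a cut-down, hypothetical version, and the result is returned as a dictionary of tuples (track count, total length) purely to keep the example short.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupSketch
{
    // Hypothetical track type with only the members the query uses.
    public class Track
    {
        public int ArtistID { get; set; }
        public int Length { get; set; }   // length in seconds
    }

    // Maps each ArtistID to (number of tracks, total length of tracks).
    public static Dictionary<int, Tuple<int, int>> SummarizeByArtist(IEnumerable<Track> tracks)
    {
        var summary = from track in tracks
                      group track by track.ArtistID into artistTracks
                      select new
                      {
                          ArtistID = artistTracks.Key,
                          Count = artistTracks.Count(),
                          TotalLength = artistTracks.Sum(t => t.Length)
                      };

        return summary.ToDictionary(
            s => s.ArtistID,
            s => Tuple.Create(s.Count, s.TotalLength));
    }
}
```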
The take behavior creates a LINQ query that takes a particular number of items from the source dataset. The skip behavior skips down the dataset a given number of items before the query starts taking values. Used in combination they allow a program to work through a dataset one “page” at a time.
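Paging with Skip and Take can be sketched with a small generic helper. The method name and the zero-based page numbering are assumptions made for this example.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class PagingSketch
{
    // Returns one page of items. Page numbers start at 0.
    public static List<T> GetPage<T>(IEnumerable<T> source, int pageNumber, int pageSize)
    {
        return source.Skip(pageNumber * pageSize)   // move past the earlier pages
                     .Take(pageSize)                // take at most one page
                     .ToList();
    }
}
```

For the numbers 1 to 10 and a page size of 3, page 1 contains 4, 5, and 6, and page 3 contains only 10.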
The “query comprehension” and “method-based” LINQ query formats can be used interchangeably in programs. You can use them both in a single solution, depending on which is most convenient at any given point in the program.
The XDocument class can hold an XML document, including the elements that can be used to express a “fully formed” XML document, including metadata about the document contents. An XElement can contain an element in an XML document (which can contain a tree of other XElement objects). An XDocument contains XElement objects that contain the data in the document.
The XmlDocument class implements a Document Object Model (DOM) for XML documents. The XDocument class provides the same kind of in-memory document model, but it was designed to work with LINQ queries. The same relationship applies between XmlElement and XElement.
4 Serialize and deserialize data
Serialization is very useful if you are storing small amounts of structured data. The high score table for a game can be stored as a serialized object. Another use for a small serialized object can be the settings for an application. It would be less sensible to store a large data structure as a serialized object, particularly if the object has to be repeatedly updated.
Serialization takes a snapshot of the data elements in a particular type. During the deserialization process a new instance of the serialized type is created. This process requires the type to be deserialized to be available on the receiving machine. In the case of binary serialization, the binary file contains type information that is compared with type information in the destination class. If this information doesn’t match, the deserialization fails. Note that in the case of serialization to XML and JSON text files, this matching does not take place. You can regard these two types as being more portable, at the expense of security. The DataContract serialization process allows the serialized object to contain type information that can be checked during deserialization, but the format of the serialized file is still human readable XML.
Binary serialization can store any type that has been marked as serializable, including its private data members, whereas XML serialization stores only the public data members of a type, written out as text.
A binary serializer can store reference types as references, whereas XML, JSON, and DataContract serialization will resolve references to obtain values that are then serialized.
Serialization does not store the methods in a class, or any static elements. Binary serialization will store private data members of a type. DataContract serialization can store private data members as XML text.
Binary serialization is very useful for taking a complete snapshot of the data content of a class. It forces the serialize and deserialize process to make use of identical classes. It is very useful for transferring an object from one process to another, where the serialized data stream will not be persisted. I’m not keen on using it to persist data for long periods of time because it is vulnerable to changes in the classes used. Text serialization such as XML is very useful if you want to transfer data from one programming language or host to another. It is highly portable. It is also very useful for storing small amounts of structured data.
Binary serialization produces a stream that represents the entire contents of a class, including all private content. However, with an understanding of the content, it is possible that this can be compromised, allowing the private contents of an object to be viewed and changed. Just because you can’t view the contents of a binary serialized object with a text editor does not mean that it is immune to tampering.
A custom serializer allows the programmer to get control of the serialization process either by creating their own serialization process or by getting control during the phases of the serialization process. These customizations are only possible when using binary data serialization.
All serializers use data streams to transfer data being serialized and deserialized. In Skill 3.2, in the “Encrypting data using AES symmetric encryption” section, you saw that it is possible to send a data stream through an encrypting stream, making it possible to encrypt serialized data. It is also possible to compress serialized data in the same way.
Both XML and DataContract serializers produce XML output. In the case of XML serialization, all of the public data elements in the class will be serialized, and the programmer must mark with the XmlIgnore attribute any data members that should not be serialized. In the case of DataContract serialization, the programmer must mark the elements to be serialized. The other functional difference is that private data members can be serialized using DataContract serialization.
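The XML side of this comparison can be sketched with the XmlSerializer class. The HighScoreEntry type and its members are hypothetical. With XmlSerializer, public properties are written by default and the XmlIgnore attribute excludes a member from the output.

```csharp
using System.IO;
using System.Xml.Serialization;

// XmlSerializer needs a public type with a parameterless constructor.
public class HighScoreEntry
{
    public string Player { get; set; }
    public int Score { get; set; }

    [XmlIgnore]   // left out of the serialized XML
    public string SessionNote { get; set; }
}

public static class XmlSerializerSketch
{
    public static string ToXml(HighScoreEntry entry)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(HighScoreEntry));
        using (StringWriter writer = new StringWriter())
        {
            serializer.Serialize(writer, entry);
            return writer.ToString();
        }
    }

    public static HighScoreEntry FromXml(string xml)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(HighScoreEntry));
        using (StringReader reader = new StringReader(xml))
        {
            return (HighScoreEntry)serializer.Deserialize(reader);
        }
    }
}
```

After a round trip through ToXml and FromXml the Player and Score values are restored, while SessionNote is null because it was never written.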
5 Store data in and retrieve data from collections
A database provides storage where database queries are used to manipulate the data stored. The data in the database is moved into the program for processing. A collection is stored in the memory of the computer and can therefore be accessed much more quickly. A database is good for very large amounts of data that won’t necessarily fit in memory and have to be shared with multiple users. In-memory collections have performance many times that of data access from a database, but are limited in capacity to the memory of the computer. One major attraction of a database is the ease with which a database query can be used to extract data. However, you should remember that LINQ can be used on in-memory collections.
You can create an array of arrays. Each of the arrays in the array can have a different length, leading to the creation of what is called a “jagged” array.
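As a small sketch, the hypothetical method below builds a jagged array in which each row is one element longer than the row before it.

```csharp
public static class JaggedArraySketch
{
    // Row 0 has 1 element, row 1 has 2 elements, and so on.
    public static int[][] BuildTriangle(int rows)
    {
        int[][] triangle = new int[rows][];   // the outer array holds the rows
        for (int r = 0; r < rows; r++)
        {
            triangle[r] = new int[r + 1];     // each row is its own array
        }
        return triangle;
    }
}
```

Note the int[][] syntax. A declaration of int[,] would instead create a rectangular two-dimensional array in which every row has the same length.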
The C# compiler will not complain if you make a twenty-dimensional array. However, it might use up a lot of computer memory, and it would certainly be very hard to visualize. Don’t confuse adding array dimensions with adding properties to an object. To hold the name, birthday and address of a person you don’t need a three-dimensional array, you need a one-dimensional array of Person elements.
Remember that C# arrays are indexed starting at 0. This means that if you have an array with 4 elements they are given the indices 0, 1, 2, and 3. In other words there is no element with the index value 4. This can be counter-intuitive, and it is also not how some other languages work, where array indices start at 1.
It is impossible to change the size of an array once it has been created. The only way to “add” an element is to create a new, larger, array and then copy the existing one into it.
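The create-and-copy step can be sketched as follows, using a hypothetical Append helper. The original array is left unchanged.

```csharp
using System;

public static class ArrayGrowSketch
{
    // Returns a new array that is one element longer than the source.
    public static int[] Append(int[] source, int value)
    {
        int[] larger = new int[source.Length + 1];
        Array.Copy(source, larger, source.Length);   // copy the existing elements
        larger[source.Length] = value;               // place the new one at the end
        return larger;
    }
}
```

The Array.Resize method works in the same way: it allocates a new array, copies the elements across, and replaces the reference it was given.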
The use of Length for the length of an array and Count for the number of elements in other collection types can be confusing, but the explanation is that an array has the same size at all times, so you can just get the length of it. However, a dynamic collection class such as an ArrayList can change in size at any time, and so it has to keep a count of the number of items that it currently holds.
Dictionaries decide where to store an item by using a hashing algorithm. You saw hashing in Skill 3.2, where a hashing function reduces a large amount of data to a single, smaller value that represents that data. The Dictionary class uses a hashing algorithm to convert the key value for the item being stored into a number that will give the location of the item. When searching for the location represented by a key, the dictionary doesn’t have to search through a list of keys to find the one selected. It just has to hash the supplied key value to calculate exactly where the item is stored.
Sets are useful if you want to store item properties that may grow and change over time, such as tag metadata. A user can generate new tags as the application is used, and the set provides operators that can search for items. The difficulty with tags is that it may be difficult to store them in fixed size storage such as databases. Furthermore, LINQ operators can be used in place of set operations.
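A minimal sketch of tags held in a set follows, using the HashSet class. The class and method names and the example tags are hypothetical.

```csharp
using System.Collections.Generic;

public static class TagSetSketch
{
    // Returns the tags that appear in both collections.
    public static HashSet<string> CommonTags(IEnumerable<string> first, IEnumerable<string> second)
    {
        HashSet<string> result = new HashSet<string>(first);   // duplicates are dropped
        result.IntersectWith(second);                          // keep only shared tags
        return result;
    }
}
```

For the tags rock, live, and 2018 on one item and live and jazz on another, CommonTags returns a set that contains only live. The same result could be produced with the LINQ Intersect operator.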
The prime difference between a stack and a queue is how the order of items is changed when they are pushed and popped. A queue retains the order, so the first item added to the queue is the first one to be removed from the queue. A stack reverses the order, so the first item to be pushed onto the stack will be the last one to be removed. If you think about it, you can reverse the order of a collection by pushing all of the elements onto a stack and then popping them off.
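The ordering behavior of the two collections can be sketched with a pair of hypothetical helpers. Each one passes every item through the collection and returns the items in the order in which they come back out.

```csharp
using System.Collections.Generic;

public static class StackQueueSketch
{
    // Last in, first out: the items come back in reverse order.
    public static List<T> ThroughStack<T>(IEnumerable<T> items)
    {
        Stack<T> stack = new Stack<T>();
        foreach (T item in items)
        {
            stack.Push(item);
        }

        List<T> result = new List<T>();
        while (stack.Count > 0)
        {
            result.Add(stack.Pop());
        }
        return result;
    }

    // First in, first out: the items come back in their original order.
    public static List<T> ThroughQueue<T>(IEnumerable<T> items)
    {
        Queue<T> queue = new Queue<T>();
        foreach (T item in items)
        {
            queue.Enqueue(item);
        }

        List<T> result = new List<T>();
        while (queue.Count > 0)
        {
            result.Add(queue.Dequeue());
        }
        return result;
    }
}
```

For the input 1, 2, 3 the stack version returns 3, 2, 1 and the queue version returns 1, 2, 3.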
You can use the List type to meet all of your data storage needs, but you must put in substantial extra amounts of work to get a list to perform like a set or a dictionary.