Programming in HTML5 with JavaScript and CSS3: Access and Secure Data

  • 8/25/2014

Objective 3.2: Validate user input by using JavaScript

The new HTML controls discussed in Objective 3.1 provide useful functionality for validating user data, but that functionality has limits. This is where further validation performed in JavaScript comes in handy. JavaScript provides additional functionality that’s not readily available in the core HTML controls. Because some of the new controls aren’t yet available in all browsers, you might still need to validate user input such as dates, telephone numbers, or alphanumeric postal codes yourself. This objective demonstrates how to use regular expressions to validate the input format and how to use the JavaScript built-in functions to ensure that data is the correct data type. This objective also adds a layer of security by demonstrating how to prevent malicious code injection.

Evaluating regular expressions

You saw the use of regular expressions in Objective 3.1. In fact, the core HTML input controls support a pattern attribute that allows you to apply a regular expression to validate user input. In some cases, though, validating user input in JavaScript can be more effective than validating it inline with attributes. This section introduces the basic syntax of a regular expression and shows how to use the expression in JavaScript.

Regular expressions have a syntax of their own. They can be daunting to use but can also be very powerful. Although a full treatment of regular expressions is beyond the scope of this book, a brief introduction is provided to support the later examples.

Regular expressions are a mix of special characters and literal characters that make up the pattern that someone would want to match. Table 3-1 lists the special characters and their meaning.

TABLE 3-1 Regular expression special characters

Symbol   Description
^        The caret character denotes the beginning of a string.
$        The dollar sign denotes the end of a string.
.        The period matches any single character other than a line break.
[A-Z]    A bracketed range matches any single character in that range, in this case any uppercase letter. Ranges are case-sensitive. To match lowercase letters, use [a-z].
\d       This combination matches any single digit.
+        The plus sign denotes that the preceding character or character set must match at least once.
*        The asterisk denotes that the preceding character or character set can match zero or more times.
[^]      When it’s the first character in a character set, the caret denotes a negation. [^a] matches any single character other than ‘a’.
?        The question mark denotes that the preceding character or character set is optional.
\w       This combination matches a word character, which is any alphanumeric character or an underscore.
\        The backslash is an escape character. If a special character should be matched literally, it needs to be escaped with a \. For example, to find a backslash in a string, the pattern would include \\.
\s       This combination matches a single white-space character, such as a space or a tab. When it’s combined with +, it matches one or more white-space characters; combined with *, it matches zero or more.
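The symbols in Table 3-1 can be tried out directly with the test method of a regular expression. The following is a small sketch rather than an exhaustive reference; the sample patterns and inputs are made up for illustration.

```javascript
// Each entry is [pattern, input, expected result of pattern.test(input)].
const checks = [
  [/^cat/, "catalog", true],     // ^ anchors the match to the start
  [/^cat/, "bobcat", false],
  [/cat$/, "bobcat", true],      // $ anchors the match to the end
  [/c.t/, "cut", true],          // . matches any single character
  [/^\d+$/, "2014", true],       // \d+ requires one or more digits
  [/^\d+$/, "", false],
  [/^\d*$/, "", true],           // \d* also accepts zero digits
  [/^[^a]+$/, "hello", true],    // [^a] matches any character except "a"
  [/^[^a]+$/, "halo", false],
  [/^colou?r$/, "color", true],  // ? makes the preceding "u" optional
  [/^\w+$/, "user_42", true],    // \w covers letters, digits, and underscore
  [/^a\\b$/, "a\\b", true],      // \\ matches one literal backslash
  [/^a\s+b$/, "a   b", true],    // \s+ matches one or more white-space characters
];

// Collect every entry whose actual result differs from the expected one.
const failures = checks.filter(
  ([pattern, input, expected]) => pattern.test(input) !== expected
);

console.log(failures.length === 0 ? "all checks passed" : failures);
```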

This list covers the main symbols available when matching strings with regular expressions. Building a regular expression means combining those symbols into a mask that the regular expression engine uses to decide whether there is a match. For example, a Canadian postal code follows the format A1A 1A1: alternating alphabetic characters and numeric characters with a space in the middle. Some characters aren’t used in postal codes because the machines confuse them with other characters (for example, Z and 2). Also, the space isn’t mandatory. When you need to enforce the data format of the user input, first decide how you want the data to be captured and how flexible you want to be. Then build your regular expression to match.

Now, build the regular expression for a postal code. Start by anchoring the expression to the beginning of the string, which rejects input that has white space or other stray characters in front of the postal code:

^

The first part of the expression is the caret. The next character must be alphabetic:

^[A-Za-z]

Because postal codes aren’t case sensitive, the expression allows the first character to be either uppercase or lowercase. The next character in the postal code must be a digit:

^[A-Za-z]\d

Because the postal code accepts all digits 0-9, \d is used to specify any digit. However, [0-9] could have been used as well. And now the pattern continues, letter-number-letter number-letter-number:

^[A-Za-z]\d[A-Za-z]\d[A-Za-z]\d

As was indicated earlier, the space in the middle of the postal code, while common convention, is optional. This is where deciding how flexible the data validation should be is required. The expression as it is won’t allow for any space in the middle because the expression is set to match on consecutive alternating letter-number-letter. Perhaps, for formatting purposes, a space should be required. In this case, \s would require that a space is included:

^[A-Za-z]\d[A-Za-z]\s\d[A-Za-z]\d

Now, users would be required to enter the postal code with a space in the middle of the two sets of three characters. But maybe the website doesn’t care about the space in the middle, because it doesn’t really affect anything. In this case, the \s can be denoted with the *:

^[A-Za-z]\d[A-Za-z]\s*\d[A-Za-z]\d

Now, the expression allows for alternating letter-number-letter, and zero or more spaces can occur in the middle. The space is now optional, but a problem has been introduced. The user can now enter any number of spaces and still pass the validation, such as:

A1A      1A1

That would pass the validation because \s* accepts any number of spaces, including none. The desired outcome here is to allow only one space or no spaces. For this, a new element is added to limit the number of occurrences to just one. This is accomplished by specifying the exact number of times the preceding character or character set must match:

^[A-Za-z]\d[A-Za-z]\s{1}\d[A-Za-z]\d

The {1} says to match the previous character exactly the specified number of times, in this case one time. Now the expression is back to functionality that’s no different than just specifying the \s. What is needed next is something to make the single space optional, which is what the ? denotes. Because \s? already means zero or one space, it takes the place of \s{1}. Finally, the $ is added to denote the end of the string, so that input with extra characters after the postal code fails the validation:

^[A-Za-z]\d[A-Za-z]\s?\d[A-Za-z]\d$

Now you have a regular expression that requires the correct alphanumeric pattern for a Canadian postal code with an optional space in the middle.
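The effect of each choice for the middle space can be compared side by side. This is a minimal sketch, assuming the character set is written as [A-Za-z] and that each pattern is closed with $ so that trailing characters are rejected; the variable names and sample inputs are illustrative only.

```javascript
// Three ways to handle the space between the two halves of the postal code.
const spaceRequired = /^[A-Za-z]\d[A-Za-z]\s\d[A-Za-z]\d$/;  // exactly one space
const spaceRepeated = /^[A-Za-z]\d[A-Za-z]\s*\d[A-Za-z]\d$/; // zero or more spaces
const spaceOptional = /^[A-Za-z]\d[A-Za-z]\s?\d[A-Za-z]\d$/; // zero or one space

const inputs = ["A1A 1A1", "a1a1a1", "A1A      1A1", "A1A 1A1 extra", "1A1 A1A"];

// Print how each variant judges each sample input.
for (const input of inputs) {
  console.log(JSON.stringify(input), {
    required: spaceRequired.test(input),
    repeated: spaceRepeated.test(input),
    optional: spaceOptional.test(input),
  });
}
```

Only the last variant accepts both "A1A 1A1" and "a1a1a1" while rejecting the input with a long run of spaces.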

This simple example demonstrates the key elements of a regular expression. Although this regular expression can be placed into the pattern attribute of the <input> element, the next section discusses how to use JavaScript to perform pattern matching with regular expressions.

Evaluating regular expressions in JavaScript

Just like strings and numbers, regular expressions are available as objects in JavaScript. As such, they can be created and can provide methods to evaluate strings. Regular expression objects are created in a similar fashion to strings; however, rather than use quotation marks to enclose the expression, use forward slashes, as in /<expression>/. JavaScript knows that text surrounded by forward slashes in this way is a regular expression object. Going back to the postal code example, the following HTML is provided:

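<script type="text/javascript">
    // Assumes that the jQuery library has already been loaded on the page.
    function CheckString() {
        try {
            var s = $('#regExString').val();
            // Letter-digit-letter, an optional space, then digit-letter-digit.
            var regExpression = /^[A-Za-z]\d[A-Za-z]\s?\d[A-Za-z]\d$/;
            if (regExpression.test(s))
                alert("Valid postal code.");
            else
                alert("Invalid postal code.");
        } catch (e) {
            alert(e.message);
        }
    }
</script>
<body>
    <form>
        <input type="text" id="regExString" />
        <!-- type="button" keeps the click from submitting the form -->
        <button type="button" onclick="CheckString();">Evaluate</button>
    </form>
</body>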

This HTML provides a very basic page with a text box and a button. The button does nothing more than call a function to validate whether the entered text matches the format desired for a postal code. This page shouldn’t contain anything that you haven’t seen already, except the line in which the regular expression object is created:

var regExpression = /^[A-Za-z]\d[A-Za-z]\s?\d[A-Za-z]\d$/;

With this line, a regular expression object is created and, as a result, methods are available. The string is extracted from the text box and passed to the test method of the regular expression. The test method returns a Boolean to indicate whether the input string matches the regular expression that was created.

The regular expression object also provides a method called exec. This method returns the portion of the input string that matches the expression. The following code example illustrates this by adding another button and function to use the exec method instead of test:

function CheckStringExec() {
    var s = $('#regExString').val();
    var regExpression = /^[A-Za-z]\d[A-Za-z]\s?\d[A-Za-z]\d$/;
    // exec returns an array whose first element is the matched text, or null.
    var results = regExpression.exec(s);
    if (results != null)
        alert("Valid postal code: " + results[0]);
    else
        alert("Invalid postal code.");
}
...
<button type="button" onclick="CheckStringExec();">Evaluate with Exec</button>

With this button, the expression is evaluated just like it was with the test method, except the match is returned as a string array. That the return result is an array is important to note: the first element holds the text that matched the whole expression, and any further elements hold the text matched by parenthesized groups within the expression. In this example, the results are evaluated by checking whether the array isn’t null; if it’s not, the postal code is valid and shown back to the user. If a match isn’t made, the return value is null.
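The array returned by exec becomes more useful once the expression contains parenthesized groups. The following sketch assumes a variant of the postal code expression with one group around each half; the function name splitPostalCode and the property names are hypothetical.

```javascript
// Group 1 captures the first three characters, group 2 the last three.
const groupedPostalCode = /^([A-Za-z]\d[A-Za-z])\s?(\d[A-Za-z]\d)$/;

function splitPostalCode(input) {
  const results = groupedPostalCode.exec(input);
  if (results === null) {
    return null; // no match: exec returns null rather than an empty array
  }
  return {
    full: results[0],       // text matched by the whole expression
    firstHalf: results[1],  // text matched by the first group
    secondHalf: results[2], // text matched by the second group
  };
}

console.log(splitPostalCode("K1A 0B1"));
console.log(splitPostalCode("not a code"));
```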

The string object also provides regular expression methods, so the string can be used directly to evaluate the expression. The string provides the search and match methods. The search method returns the index of the character in the string where the first match occurred, or -1 if there is no match. The match method returns the part of the string that matches the pattern, much like the exec method. In addition to these two methods, other string methods, such as split and replace, accept a regular expression object. This provides some advanced functionality for manipulating strings in JavaScript.
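As a brief illustration of these string methods, the following sketch runs search, match, replace, and split against made-up sample text. The g flag used on one of the patterns asks for every match rather than only the first.

```javascript
const text = "Ship to K1A 0B1 or M5V 3L9";

// Unanchored so the pattern can be found inside a longer string.
const postalCode = /[A-Za-z]\d[A-Za-z]\s?\d[A-Za-z]\d/;
const allPostalCodes = /[A-Za-z]\d[A-Za-z]\s?\d[A-Za-z]\d/g;

const firstIndex = text.search(postalCode);      // index of the first match, or -1
const firstMatch = text.match(postalCode);       // array like the one exec returns, or null
const everyMatch = text.match(allPostalCodes);   // with g: array of all matched strings
const redacted = text.replace(allPostalCodes, "[redacted]");
const parts = "a, b,c ,  d".split(/\s*,\s*/);    // split on commas with optional white space

console.log(firstIndex, firstMatch[0], everyMatch, redacted, parts);
```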

Although regular expressions provide a great deal of power in evaluating strings for patterns and ensuring that the data is in the desired format, JavaScript also provides built-in functions to evaluate the type of data received.

Validating data with built-in functions

JavaScript provides built-in functions to evaluate data type. Some functions are provided directly within JavaScript; others are provided by the jQuery library.

The isNaN function provides a way to evaluate whether the value passed into it isn’t a number. If the value isn’t a number, the function returns true; if it is a number, it returns false. If the expected form of data being evaluated is numeric, this function provides a defensive way to determine this and handle it appropriately:

if (isNaN(value)) {
    //handle the non number value
}
else {
    //proceed with the number value
}

The isFinite function complements the isNaN function. It is used in the same way but returns true if the value is a finite number and false if it’s not, which includes NaN as well as positive and negative Infinity.
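Both functions convert their argument to a number before testing it, which matters when the value comes from a text box. The sketch below shows a few results and one possible way to guard against the empty-string case; parseQuantity is a hypothetical helper, not a built-in function.

```javascript
// Both functions convert the argument to a number first.
console.log(isNaN("abc"));       // true: "abc" cannot be converted to a number
console.log(isNaN("42"));        // false: "42" converts to 42
console.log(isNaN(""));          // false: an empty string converts to 0
console.log(isFinite("42"));     // true
console.log(isFinite(Infinity)); // false, although isNaN(Infinity) is also false

// Hypothetical helper: returns a finite number, or null if the text is not one.
function parseQuantity(raw) {
  const text = String(raw).trim();
  if (text === "") {
    return null; // reject empty input explicitly, because it would convert to 0
  }
  const value = Number(text);
  return isFinite(value) ? value : null;
}

console.log(parseQuantity(" 3.5 "), parseQuantity("abc"), parseQuantity(""));
```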

Being able to validate data is very important as previously outlined. Equally important to validating the data explicitly is ensuring that data-entry fields prevent users from injecting script. Code injection is a widely discussed topic in website security. The next section discusses preventing code injection.

Preventing code injection

Code injection is a technique that attackers use to inject JavaScript code into your webpage. These attacks usually take advantage of dynamically created content to have additional script run so that malicious users can try to gain some sort of control over the website. Their intentions can be many, but among those intentions might be to trick other site users into providing sensitive information. Depending on the content of the page, different measures need to be considered.

Protecting against user input

A web application accepting user input opens up a potential attack surface for malicious users. The size of the attack surface depends on what’s done with the entered data. If the website takes data and doesn’t do anything with it outside the scope of the current webpage, such as send it to another server or store it in a database, the effects are limited to the current page and browser session. Little can be accomplished except to disrupt the design of the website for this particular user. However, if the data is captured by an account creation form or survey, for example, a malicious user has much more potential to do harm, especially when that information is later rendered to the webpage dynamically. This inherently allows anyone to add script to the site, which can open up the site to behavior such as phishing. As a webpage developer, you need to ensure that all user input is scrubbed of script elements. For example, don’t allow the < and > characters to be entered into the form. Without those characters, a script block can’t be added.
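Where rejecting these characters outright isn’t practical, a common complementary measure is to encode them before user input is rendered back into a page. The following is a minimal sketch of that idea; escapeHtml is a hypothetical helper written for this example, and a real site would typically also validate and encode on the server.

```javascript
// Hypothetical helper: replaces characters that have special meaning in HTML
// with character references so the browser displays them as text.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;") // must run first so later references are not re-encoded
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

const userInput = '<script>alert("x")</script>';
console.log(escapeHtml(userInput));
```

The encoded result contains no < or > characters, so it can’t open a script block when it’s written into the page.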

Using the eval function

The eval function is used to run JavaScript dynamically. It takes a string as a parameter and runs it as JavaScript code. Never use the eval function against any data provided by an external source over which you don’t have 100 percent control.
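One situation in which eval is sometimes reached for is turning a string of data into an object. When that data is JSON, the built-in JSON.parse method does the job without running any code. The sketch below assumes the external data is JSON text; parseSettings is a hypothetical helper.

```javascript
// Hypothetical helper: parses JSON text and returns null if the text is not valid JSON.
// Unlike eval, JSON.parse only reads data; it never runs script found in the text.
function parseSettings(jsonText) {
  try {
    return JSON.parse(jsonText);
  } catch (e) {
    return null; // JSON.parse throws a SyntaxError on anything that isn't JSON
  }
}

console.log(parseSettings('{"theme":"dark","fontSize":14}'));
console.log(parseSettings('alert("injected")')); // not JSON, so nothing runs
```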

Using iFrames

iFrames open up a new opportunity to attackers. Search engines provide a plethora of results dealing with exploits regarding the use of iFrames. The sandbox attribute should always be used to restrict what the content in an iFrame is allowed to do. The attribute can be empty or can contain one or more of the values listed in Table 3-2.

TABLE 3-2 Available sandbox attribute values

Value                  Description
""                     An empty string applies all restrictions. This is the most secure.
allow-same-origin      iFrame content keeps its own origin instead of being treated as coming from a unique origin, so content from the same origin as the containing HTML document is treated as such.
allow-top-navigation   iFrame content can navigate the top-level window, that is, load content in place of the containing HTML document.
allow-forms            iFrame can submit forms.
allow-scripts          iFrame can run script.
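As an illustration of how these values combine, the following sketch builds the markup for a sandboxed iFrame as a string and accepts only the values from Table 3-2. The function name sandboxedIframe and the file name widget.html are hypothetical.

```javascript
// The values from Table 3-2 (other than the empty string).
const SANDBOX_VALUES = [
  "allow-same-origin",
  "allow-top-navigation",
  "allow-forms",
  "allow-scripts",
];

// Hypothetical helper: returns iFrame markup with the requested restrictions lifted.
// Passing no values produces sandbox="", which applies every restriction.
function sandboxedIframe(src, allowed = []) {
  const unknown = allowed.filter((value) => !SANDBOX_VALUES.includes(value));
  if (unknown.length > 0) {
    throw new Error("Unknown sandbox value: " + unknown.join(", "));
  }
  // Encode characters that would otherwise end the attribute value early.
  const safeSrc = String(src).replace(/&/g, "&amp;").replace(/"/g, "&quot;");
  return '<iframe sandbox="' + allowed.join(" ") + '" src="' + safeSrc + '"></iframe>';
}

console.log(sandboxedIframe("widget.html"));
console.log(sandboxedIframe("widget.html", ["allow-forms", "allow-scripts"]));
```

Multiple values are separated by spaces inside the attribute. Starting from the empty value and adding only what the embedded content needs keeps the restrictions as tight as possible.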

Objective summary

  • Regular expressions are strings of special characters that an interpreter understands and uses to validate text format.
  • Regular expressions are objects in JavaScript that provide methods for testing input data.
  • isNaN is a built-in function to determine whether a value isn’t a number, whereas isFinite validates whether the value is a finite number.
  • Code injection is a technique that attackers use to inject malicious code into your application.
  • iFrames and dynamic JavaScript are dangerous if not used properly in a webpage.

Objective review

  1. Which of the following regular expression characters denotes the end of the string?

    1. $
    2. %
    3. ^
    4. &
  2. Which of the following sandbox attributes allows the iFrame to load content from the containing HTML document?

    1. allow-script-execution
    2. allow-same-origin
    3. allow-forms
    4. allow-top-navigation
    5. allow-top-document
  3. Which function should never be used to run JavaScript?

    1. execute
    2. JSDynamic
    3. eval
    4. evaluate