Working with Variables and Data Types in JavaScript

  • 6/15/2013

Using the RegExp object

Regular expressions are the syntax you use to match and manipulate strings. If you’ve heard of or worked with regular expressions before, don’t be alarmed. Regular expressions have an unnecessarily bad reputation, mostly because of their looks. And, lucky for me, we shouldn’t judge things on looks alone. With that said, if you’ve had a bad experience with regular expressions, I’d ask that you read through this section with an open mind and see whether my explanation helps clear up some confusion.

The primary reason that I have confidence in your ability to understand regular expressions is that you’re a programmer, and programmers use logic to reduce problems to small and simple pieces. When writing or reading a regular expression, the key is to reduce the problem to small pieces and work through each.

Another reason to have confidence is that you’ve probably worked with something close to regular expressions before, so all you need to do is extend what you already know. If you’ve worked with a command prompt in Microsoft Windows or with the shell in Linux/Unix, you might have looked for files by trying to match all files using an asterisk, or star (*) character, as in:

dir *.*

or:

dir *.txt

If you’ve used a wildcard character such as the asterisk, you’ve used an element akin to a regular expression. In fact, the asterisk is also a character used in regular expressions.

In JavaScript, regular expressions are used with the RegExp object and some syntax called regular expression literals. These elements provide a powerful way to work with strings of text or alphanumerics. The ECMA-262 implementation of regular expressions is largely borrowed from the Perl 5 regular expression parser. Here’s a regular expression to match the word JavaScript:

var myRegex = /JavaScript/;

The regular expression shown would match the string “JavaScript” anywhere that it appeared within another string. For example, the regular expression would match in the sentence “This is a book about JavaScript,” and it would match in the string “ThisIsAJavaScriptBook,” but it would not match “This is a book about javascript,” because regular expressions are case sensitive. (You can change this, as you’ll see later in this chapter.)
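As a quick check of the three cases just described, the following sketch runs the pattern against each string. It relies on the test method, which is covered later in this section, and uses console.log in place of alert so that it also runs outside a browser; the variable names are made up for illustration.

```javascript
var myRegex = /JavaScript/;

// test() returns true when the pattern is found anywhere in the string.
var inSentence = myRegex.test("This is a book about JavaScript");
var inRunTogether = myRegex.test("ThisIsAJavaScriptBook");
var lowercase = myRegex.test("This is a book about javascript");

console.log(inSentence);    // true
console.log(inRunTogether); // true
console.log(lowercase);     // false, because the match is case sensitive
```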

With that short introduction, you’re now prepared to look at regular expressions in more detail. The knowledge you gain here will prepare you for the remainder of the book, helping you not only understand how to work with strings in JavaScript but also understand how to use regular expressions in other languages. This section provides a reference for regular expression syntax and shows a couple of simple examples.

The syntax of regular expressions

Regular expressions have a terse—and some would argue cryptic—syntax. But don’t let terse syntax scare you away from regular expressions, because in that syntax is power. This is a brief introduction to regular expressions. It’s not meant to be exhaustive. (There are entire books on regular expressions.) However, you’ll find that this gentle introduction will serve you well for the remainder of the book. Don’t worry if this material doesn’t sink in on the first read through. There are multiple tables that make it easy to use as a reference later.

The syntax of regular expressions includes several characters that have special meaning, including characters that anchor the match to the beginning or end of a string, a wildcard, and groups of characters, among others.

Table 4-6 shows several of the special characters.

Table 4-6 Common special characters in JavaScript regular expressions

Character   Description
^           Sets an anchor to the beginning of the input.
$           Sets an anchor to the end of the input.
.           Matches any single character except a line terminator.
*           Matches the previous character or group zero or more times. Think of this as a wildcard.
+           Matches the previous character or group one or more times.
?           Matches the previous character or group zero or one time.
()          Places any matching characters inside the parentheses into a group. This group can then be referenced later, such as in a replace operation.
{n,}        Matches the previous character or group at least n times.
{n,m}       Matches the previous character or group at least n but no more than m times.
[ ]         Defines a character class to match any one of the characters contained in the brackets. A character class can use a range like 0-9 to match any digit or like a-z to match any lowercase letter.
[^ ]        The use of a caret at the start of a character class negates that character class, meaning that the characters in that class cannot appear in the match.
\           Typically used as an escape character, meaning that whatever follows the backslash is treated as a literal character instead of as having its special meaning. Can also be used to define special character sets, which are shown in Table 4-7.
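To make the special characters a little more concrete, here is a small sketch that exercises several of them with the test method (covered later in this section). The patterns and sample strings are invented for illustration only.

```javascript
// Anchors: ^ and $ tie the match to the start and end of the input.
var startsWithJava = /^Java/.test("JavaScript");     // true
var endsWithJava = /Java$/.test("JavaScript");       // false

// Quantifiers apply to the item just before them.
var optionalU = /colou?r/.test("color");             // true: the u may appear zero or one time
var twoToFourDigits = /^[0-9]{2,4}$/.test("12345");  // false: five digits is too many

// Character classes: [ ] lists allowed characters, [^ ] lists disallowed ones.
var hasVowel = /[aeiou]/.test("rhythm");             // false
var hasNonDigit = /[^0-9]/.test("2013a");            // true

// Escaping: \. matches a literal dot rather than any character.
var literalDot = /^a\.b$/.test("axb");               // false
var anyChar = /^a.b$/.test("axb");                   // true
```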

In addition to the special characters, several sequences exist to match groups of characters or nonalphanumeric characters. Some of these sequences are shown in Table 4-7.

Table 4-7 Common character sequences in JavaScript regular expressions

Character   Match
\b          Word boundary.
\B          Nonword boundary.
\c          Control character when used in conjunction with another character. For example, \cA is the escape sequence for Control-A.
\d          Digit.
\D          Nondigit.
\n          Newline.
\r          Carriage return.
\s          Single whitespace character such as a space, tab, or newline.
\S          Single nonwhitespace character.
\t          Tab.
\w          Any alphanumeric character, whether number or letter, or an underscore.
\W          Any character that is not a letter, number, or underscore.
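The sequences in Table 4-7 can be tried out with a short sketch like the following. The sample string is made up, and the test and match methods used here are introduced later in this section.

```javascript
var sample = "Call 555-0100 today";

// \d+ matches a run of one or more digits.
var firstDigits = sample.match(/\d+/)[0];        // "555"

// \b anchors the pattern at a word boundary, so "day" inside "today" doesn't match.
var wholeWordDay = /\bday\b/.test(sample);       // false
var wholeWordToday = /\btoday\b/.test(sample);   // true

// \s matches a whitespace character.
var hasWhitespace = /\s/.test(sample);           // true

// \W matches anything that is not a letter, digit, or underscore.
var firstNonWord = sample.match(/\W/)[0];        // " " (the space after "Call")
```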

And finally, in addition to the characters in Table 4-7, you can use the modifiers i, g, and m. The i modifier specifies that the regular expression should be parsed in a case-insensitive manner, while the g modifier (for global) indicates that the parsing should continue after the first match so that every match can be found. The m modifier is used for multiline matching, in which ^ and $ also match at the beginning and end of each line. You’ll see an example of modifier use in an upcoming example.
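The effect of each modifier is easiest to see side by side. The following is a minimal sketch using an invented three-line string and the String match method, which is described later in this section.

```javascript
var text = "Java\njavascript\nJAVA";

// i: ignore case. Without it, /javascript/ would not match "JavaScript".
var caseInsensitive = /javascript/i.test("JavaScript");   // true

// g: find every match instead of stopping at the first one.
var allMatches = text.match(/java/gi);     // ["Java", "java", "JAVA"]

// m: let ^ and $ match at line breaks as well as at the ends of the input.
var lineStarts = text.match(/^java/gim);   // ["Java", "java", "JAVA"]
var inputStart = text.match(/^java/gi);    // ["Java"]
```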

The RegExp object has its own methods, including exec and test, the latter of which tests a regular expression against a string and returns true or false based on whether the regular expression matches that string. However, when working with regular expressions, using methods native to the String type, such as match, search, split, and replace, is just as common.
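As a brief illustration of test and split, consider the sketch below. The five-digit pattern and the sample strings are hypothetical and chosen only to keep the example short.

```javascript
// Hypothetical pattern: the whole input must be exactly five digits.
var zipPattern = /^\d{5}$/;

var validZip = zipPattern.test("54481");     // true
var invalidZip = zipPattern.test("5448a");   // false

// split accepts a regular expression as the separator:
// here, a comma or semicolon followed by optional whitespace.
var parts = "one, two;three".split(/[,;]\s*/);   // ["one", "two", "three"]
```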

The exec() method of the RegExp object is used to parse the regular expression against a string and return the result. For example, parsing a simple URL and extracting the domain might look like this:

var myString = "http://www.braingia.org";
var myRegex = /http:\/\/\w+\.(.*)/i;
var results = myRegex.exec(myString);
alert(results[1]);

The output from this code is an alert showing the domain portion of the address, as shown in Figure 4-6.

Figure 4-6 Parsing a typical web URL using a regular expression.

A breakdown of this code is helpful. First you have the string declaration:

var myString = "http://www.braingia.org";

This is followed by the regular expression declaration and then a call to the exec() method, which parses the regular expression against the string found in myString and places the results into a variable called results.

var myRegex = /http:\/\/\w+\.(.*)/i;
var results = myRegex.exec(myString);

The regular expression contains several important elements. It begins by looking for the literal string http:. The two forward slashes follow, but because a forward slash (/) marks the beginning and end of a regular expression literal, you must escape each one by using a backslash (\), making the regular expression http:\/\/ to this point.

The next part of the regular expression, \w, looks for any single alphanumeric character. The host portion of a web address is typically www, but don’t be confused into thinking that the expression is looking for three literal ws; the host in this example could be called web, host1, myhost, or www, as shown in the code you’re examining. Because \w matches only a single character, and host names are usually longer than that (www has three), the regular expression adds the special character + to indicate that it must find an alphanumeric character at least once and possibly more than once. So now the code has http:\/\/\w+, which matches the address http://www right up to the .braingia.org portion.

You need to account for the dot character between the host name (www) and the domain name (braingia.org). You accomplish this by adding a dot character (.), but because the dot is also a special character, you need to escape it with a backslash, as in \. (backslash followed by dot). You now have http:\/\/\w+\., which matches all the elements of a typical address right up to the domain name.

Finally, you need to capture the domain and use it later, so place the domain inside parentheses. Because you don’t care what the domain is or what follows it, you can use two special characters: the dot, to match any character; and the asterisk, to match zero or more of the previous character, which is any character in this example. You’re left with the final regular expression, which is used by the exec() method. The result is placed into the results variable. Also note the use of the i modifier, to indicate that the regular expression will be parsed in a case-insensitive manner.

If a match is found, the output from the exec() method is an array containing the full text that was matched as the first element (index 0), followed by one element for each captured portion of the expression. If no match is found, exec() returns null.
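To see what the array holds, you could inspect it along these lines. This sketch reuses the example’s string and pattern, checks for null before reading the array, and uses console.log in place of alert; the ftp address is an invented non-matching input.

```javascript
var myString = "http://www.braingia.org";
var myRegex = /http:\/\/\w+\.(.*)/i;
var results = myRegex.exec(myString);

if (results !== null) {
    console.log(results[0]);     // "http://www.braingia.org" (the entire match)
    console.log(results[1]);     // "braingia.org" (the first captured group)
    console.log(results.index);  // 0 (where the match begins in the string)
}

// A string that doesn't contain the pattern yields null rather than an array.
var noMatch = myRegex.exec("ftp://www.braingia.org");
console.log(noMatch);            // null
```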

In the example shown, the second element of the array (1) is sent to an alert, which produces the output shown in Figure 4-6.

alert(results[1]);

That’s a lot to digest, and I admit this regular expression could be vastly improved with the addition of other characters to anchor the match and to account for characters after the domain as well as non-alphanumerics in the host name portion. However, in the interest of keeping the example somewhat simpler, the less-strict match is shown.
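For the curious, one possible tightening of the pattern is sketched below. It is only an illustration of the improvements just mentioned, not a complete URL parser: the pattern, the variable names, and the sample addresses are all assumptions made for this example.

```javascript
// Hypothetical stricter pattern:
//   ^             anchor the match at the start of the input
//   [\w-]+        host name that may also contain hyphens
//   ([^\/:?#]+)   capture the domain up to the first /, :, ?, or #
var stricterRegex = /^http:\/\/[\w-]+\.([^\/:?#]+)/i;

var withPath = stricterRegex.exec("http://www.braingia.org/blog/index.html");
var domain = withPath !== null ? withPath[1] : null;
console.log(domain);       // "braingia.org"

// Because of the ^ anchor, an address in the middle of a string no longer matches.
var notAtStart = stricterRegex.exec("see http://www.braingia.org");
console.log(notAtStart);   // null
```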

The String type provides three methods that use regular expression pattern matching to match and work with strings: match, replace, and search. Now that you’ve learned about regular expressions, it’s time to introduce these methods.

The match method returns an array with the same information as the RegExp object’s exec() method, provided that the regular expression doesn’t use the g modifier. Here’s an example:

var emailAddr = "suehring@braingia.com";
var myRegex = /\.com/;
var checkMatch = emailAddr.match(myRegex);
alert(checkMatch[0]); //Returns .com

This can be used in a conditional to determine whether a given email address contains the string .com:

var emailAddr = "suehring@braingia.com";
var myRegex = /\.com/;
var checkMatch = emailAddr.match(myRegex);
if (checkMatch !== null) {
    alert(checkMatch[0]); //Returns .com
}

The search method works in much the same way as the match method but sends back only the index (position) of the first match, as shown here:

var emailAddr = "suehring@braingia.com";
var myRegex = /\.com/;
var searchResult = emailAddr.search(myRegex);
alert(searchResult); //Returns 17

If no match is found, the search method returns -1.
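Here is a short sketch of handling that -1 return value. The .org pattern is simply an example of something that isn’t in the address.

```javascript
var emailAddr = "suehring@braingia.com";

var found = emailAddr.search(/\.com/);     // 17
var missing = emailAddr.search(/\.org/);   // -1

// Compare against -1 explicitly; an index of 0 is a valid match position.
if (missing === -1) {
    console.log("No .org found in " + emailAddr);
}
```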

The replace method does just what its name implies: it replaces one string with another when a match is found. Assume in the email address example that you want to change any .com email address to a .net email address. You can accomplish this by using the replace method, like so:

var emailAddr = "suehring@braingia.com";
var myRegex = /\.com$/;
var replaceWith = ".net";
var result = emailAddr.replace(myRegex,replaceWith);
alert(result); //Returns suehring@braingia.net

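If the pattern doesn’t match, replace returns the original string unchanged, and that is what ends up in the result variable; if it does match, the new string is returned. Either way, the string in emailAddr itself is not modified.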
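The replace method also works with captured groups and with the g modifier. The following sketch shows both; the patterns and the replacement text are invented for illustration.

```javascript
var emailAddr = "suehring@braingia.com";

// $1 and $2 in the replacement string refer to the captured groups.
var swapped = emailAddr.replace(/^(\w+)@(.*)$/, "$2 (user: $1)");
console.log(swapped);     // "braingia.com (user: suehring)"

// Without g, only the first match is replaced; with g, every match is.
var firstOnly = "a.com, b.com".replace(/\.com/, ".net");   // "a.net, b.com"
var everyOne = "a.com, b.com".replace(/\.com/g, ".net");   // "a.net, b.net"

console.log(emailAddr);   // still "suehring@braingia.com"
```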

Later chapters show more examples of string methods related to regular expressions. Feel free to use this chapter as a reference for the special characters used in regular expressions.

References and garbage collection

Some types of variables or the values they contain are primitive, whereas others are reference types. The implications of this might not mean much to you at first glance—you might not even think you’ll ever care about this. But you’ll change your mind the first time you encounter odd behavior with a variable that you just copied.

First, some explanation: objects, arrays, and functions operate as reference types, whereas numbers, Booleans, null, and undefined are known as primitive types. The ECMA-262 specification defines other primitive types as well, such as strings, but strings aren’t relevant to this discussion.

When a number is copied, the behavior is what you’d expect: The original and the copy both get the same value. However, if you change the original, the copy is unaffected. Here’s an example:

// Set the value of myNum to 20.
var myNum = 20;
// Create a new variable, anotherNum, and copy the contents of myNum to it.
// Both anotherNum and myNum are now 20.
var anotherNum = myNum;
// Change the value of myNum to 1000.
myNum = 1000;
// Display the contents of both variables.
// Note that the contents of anotherNum haven't changed.
alert(myNum);
alert(anotherNum);

The alerts display 1000 and 20, respectively. When the variable anotherNum gets a copy of myNum’s contents, it holds on to the contents no matter what happens to the variable myNum after that. The variable does this because numbers are primitive types in JavaScript.

Contrast that example with a variable type that’s a reference type, as in this example:

// Create an array of three numbers in a variable named myNumbers.
var myNumbers = [20, 21, 22];
// Make a copy of myNumbers in a newly created variable named copyNumbers.
var copyNumbers = myNumbers;
// Change the first index value of myNumbers to the integer 1000.
myNumbers[0] = 1000;
// Alert both.
alert(myNumbers);
alert(copyNumbers);

In this case, because arrays are reference types, both alerts display 1000,21,22, even though only myNumbers was directly changed in the code. The moral of this story is to be aware that object, array, and function variable types are reference types: the original and every copy refer to the same underlying data, so a change made through one is visible through all of them.
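If you do want an independent copy of an array, one common approach is the array slice method, sketched below. Note that slice makes a shallow copy, so objects stored inside the array would still be shared; the variable names here are made up for the example.

```javascript
var myNumbers = [20, 21, 22];

var sameArray = myNumbers;             // copies the reference, not the contents
var separateCopy = myNumbers.slice();  // creates a new array with the same elements

myNumbers[0] = 1000;

console.log(sameArray[0]);                // 1000
console.log(separateCopy[0]);             // 20
console.log(sameArray === myNumbers);     // true: both names refer to one array
console.log(separateCopy === myNumbers);  // false: a different array
```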

Loosely related to this discussion of differences between primitive types and reference types is the subject of garbage collection. Garbage collection refers to the JavaScript interpreter reclaiming the memory held by variables and values that are no longer in use. When a variable is no longer used within a program, the interpreter frees up the memory for reuse. The Java Virtual Machine and the .NET Common Language Runtime do the same thing for programs that run on them.

This automatic freeing of memory in JavaScript is different from the way in which other languages, such as C++, deal with unused variables. In those languages, the programmer must release the memory manually. This is all you really need to know about garbage collection.