Querying Full-Text Data in Microsoft SQL Server 2012

  • 12/15/2012

Lesson 1: Creating Full-Text Catalogs and Indexes

Full-text search allows approximate searches in SQL Server 2012 databases. Before you start using full-text predicates and functions, you must create full-text indexes inside full-text catalogs. After you create full-text indexes over character columns in your database, you are able to search for:

  • Simple terms—that is, one or more specific words or phrases.

  • Prefix terms, which are terms the words or phrases begin with.

  • Generation terms, meaning inflectional forms of words.

  • Proximity terms, or words or phrases close to another word or phrase.

  • Thesaurus terms, or synonyms of a word.

  • Weighted terms, which are words or phrases that use values with your custom weight.

  • Statistical semantic search, or key phrases in a document.

  • Similar documents, where similarity is defined by semantic key phrases.

Full-Text Search Components

In order to start using full-text search, you have to understand full-text components. For a start, you can check whether Full-Text Search is installed by using the following query.

SELECT SERVERPROPERTY('IsFullTextInstalled');

If Full-Text Search is not installed, you must re-run the setup.

You can create full-text indexes on columns of type CHAR, VARCHAR, NCHAR, NVARCHAR, TEXT, NTEXT, IMAGE, XML, and VARBINARY(MAX). Besides using full-text indexes on SQL Server character data, you can store whole documents in binary or XML columns, and use full-text queries on those documents. Columns of data type VARBINARY(MAX), IMAGE, or XML require an additional type column in which you store the file extension (such as .docx, .pdf, or .xlsx) of the document in each row.

You need appropriate filters for documents. Filters, called ifilters in full-text terminology, extract the textual information and remove formatting from the documents. You can check which filters are installed in your instance by using the following query.

EXEC sys.sp_help_fulltext_system_components 'filter';

In addition to using the system stored procedure, you can also check which filters are installed in your instance by querying the sys.fulltext_document_types catalog view, as follows.

SELECT document_type, path
FROM sys.fulltext_document_types;

Many popular formats are supported by default. You can install additional filters, such as filters for Microsoft Office 2010 document formats. You can download Microsoft Office 2010 filter packs at http://www.microsoft.com/en-us/download/details.aspx?id=17062.

After you download the filter packs, you install them on your computer with your SQL Server instance by using the instructions provided with the filter packs. For an Office 2010 filter pack, for example, all you need to do is run the self-extracting downloaded file.

After you install the filter pack on your computer, you need to register the filters in SQL Server by using the following command.

EXEC sys.sp_fulltext_service 'load_os_resources', 1;

You might need to restart SQL Server. After you restart it, check whether the filters were successfully installed by using the sys.sp_help_fulltext_system_components system procedure again.

Word breakers and stemmers perform linguistic analysis on all full-text data. Because rules differ from language to language, word breakers and stemmers are language specific. A word breaker identifies individual words (or tokens). Tokens are inserted in a full-text index in compressed format. The stemmer generates inflectional forms of a word based on the rules of a language. You can use the following query to check which languages are supported in SQL Server.

SELECT lcid, name
FROM sys.fulltext_languages
ORDER BY name;

Stemmers are language specific. If you use a localized version of SQL Server, SQL Server Setup sets the default full-text language to the language of your instance, if the language is supported on your instance. If the language is not supported, or if you use a nonlocalized version of SQL Server, the default full-text language is English. You can specify a different language for each full-text indexed column. You can change the default language by using the sys.sp_configure system procedure.

Word breakers are language specific as well. If a word breaker does not exist for the language of your instance, a neutral word breaker is used. The neutral word breaker uses only neutral characters as spaces for breaking text into individual words.

Imagine that you have documents about SQL Server. The phrase “SQL Server” probably appears in every document. Such a phrase does not help you with searches; however, it bloats a full-text index. You can prevent indexing such noise words by creating stoplists of stopwords. You can check current stopwords and stoplists in your database by using the following queries.

SELECT stoplist_id, name
FROM sys.fulltext_stoplists;
SELECT stoplist_id, stopword, language
FROM sys.fulltext_stopwords;

Full-text queries can search not only for words you provide in a query; they can search for synonyms as well. SQL Server finds synonyms in thesaurus files. Each language has an associated XML thesaurus file. The location of the thesaurus files for a default instance is SQL_Server_install_path\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\FTDATA\. You can manually edit each thesaurus file and configure the following elements:

  • diacritics_sensitive Set the value of this element to 0 if the language is accent insensitive, or to 1 if it is accent sensitive.

  • expansion Use this element to add expansion words for a word. For example, you can add the expansion word “author” to the word “writer” in order to search for “author” as well when an end user searches for the word “writer.”

  • replacement Use this element to define replacement words or terms for a specific word or term. For example, “Windows 2008” could be a replacement for “Win 2k8.” In such an example, SQL Server would search for “Windows 2008,” even though “Win 2k8” was used in a search term.

After you edit the thesaurus file for a specific language, you must load it with the following system procedure call.

EXEC sys.sp_fulltext_load_thesaurus_file 1033;

The parameter of the procedure denotes language ID; in this case, (1033), which is the US English language.

Full-text queries can search on document properties as well. Which properties can be searched for depends on the document filter. You can create a search property list to define searchable properties for your documents. You can include properties that a specific filter can extract from a document.

Creating and Managing Full-Text Catalogs and Indexes

After you have all of the full-text infrastructure in place, you can start using it. Full-text indexes are stored in full-text catalogs. A full-text catalog is a virtual object, a container for full-text indexes. As a virtual object, it does not belong to any filegroup.

Following is the syntax for creating full-text catalogs.

CREATE FULLTEXT CATALOG catalog_name
     [ON FILEGROUP filegroup ]
     [IN PATH 'rootpath']
     [WITH <catalog_option>]
     [AS DEFAULT]
     [AUTHORIZATION owner_name ]
<catalog_option>::=
     ACCENT_SENSITIVITY = {ON|OFF}

The ON FILEGROUP and IN PATH options are for backward-compatibility for SQL Server 2008 and earlier and have no effect in SQL Server 2012; you should avoid using them. The ACCENT_SENSITIVITY option determines whether full-text indexes in this catalog are accent sensitive or not. If you change this option later, you have to rebuild all full-text indexes in the catalog.

You alter a full-text catalog by using the ALTER FULLTEXT CATALOG statement, and drop it with the DROP FULLTEXT CATALOG statement.

After you have a full-text catalog, you can create appropriate full-text indexes. The syntax for creating a full-text index is as follows.

CREATE FULLTEXT INDEX ON table_name
   [ ( { column_name
             [ TYPE COLUMN type_column_name ]
             [ LANGUAGE language_term ]
             [ STATISTICAL_SEMANTICS ]
        } [ ,...n]
      ) ]
    KEY INDEX index_name
    [ ON <catalog_filegroup_option> ]
    [ WITH [ ( ] <with_option> [ ,...n] [ ) ] ]
[;]
<catalog_filegroup_option>::=
 {
    fulltext_catalog_name
 | ( fulltext_catalog_name, FILEGROUP filegroup_name )
 | ( FILEGROUP filegroup_name, fulltext_catalog_name )
 | ( FILEGROUP filegroup_name )
 }
<with_option>::=
 {
   CHANGE_TRACKING [ = ] { MANUAL | AUTO | OFF [, NO POPULATION ] }
 | STOPLIST [ = ] { OFF | SYSTEM | stoplist_name }
 | SEARCH PROPERTY LIST [ = ] property_list_name
 }

Most of the options are self-describing. You learn about them in the practice for this lesson. The following describes some advanced options:

  • KEY INDEX index_name This is the name of the unique key index on a table. You have to use a unique, single-key, non-nullable column. Integers are recommended.

  • CHANGE_TRACKING [ = ] { MANUAL | AUTO | OFF [ , NO POPULATION ] } This option specifies whether SQL Server updates a full-text index automatically. SQL Server uses a change tracking mechanism to track changes.

  • STATISTICAL_SEMANTICS This option creates additional key phrase and document similarity indexes that are part of statistical semantic indexing.

The last option mentioned, the STATISTICAL_SEMANTICS option, deserves deeper explanation. Statistical semantic search gives you deeper insight into documents by extracting and indexing statistically relevant key phrases. Full-text search uses these key phrases to identify and index documents that are similar or related. You query these semantic indexes by using three T-SQL rowset functions to retrieve the results as structured data. You use these functions in the practices in this chapter. Semantic search extends full-text search functionality. It enables you to query the meaning of the documents. For example, you can query the index of key phrases to build the taxonomy of documents. You can query the document similarity index to identify résumés that match a job description. Semantic search gives you the possibility to create your own text-mining solution. Semantic search could be especially interesting in conjunction with text-mining components of SQL Server Integration Services (SSIS).

In order to use the Semantic Search feature, you have to have Full-Text Search installed. In addition, you need to install the Semantic Language Statistics Database. You install it in the practice for this lesson.

Practice: Creating a Full-Text Index

In this practice, you create a table, populate it with some documents and text data, and create a full-text catalog and index on this table. This practice assumes that the default language of your instance is US English.

If you encounter a problem completing an exercise, you can install the completed projects from the companion content for this chapter and lesson.

EXERCISE 1 Create a Table and Full-Text Components

In this exercise, you create a demo table, populate it with some demo text, and then create stopwords and a stoplist and search document properties.

  1. Start SSMS and connect to your SQL Server instance.

  2. Open a new query window by clicking the New Query button.

  3. Change the context to the TSQL2012 database.

  4. Check whether Full-Text Search is installed by using the following query.

    SELECT SERVERPROPERTY('IsFullTextInstalled');
  5. If Full-Text Search is not installed, run SQL Server Setup and install it. Also install the Microsoft Office 2010 filter packs.

  6. Create a table that you will use for full-text search. Create it in the dbo schema and name it Documents. Use the information from Table 6-1 for the columns of your dbo.Documents table.

    Table 6-1 Column information for the dbo.Documents table

    Column name

    Data type

    Nullability

    Remarks

    id

    INT

    NOT NULL

    IDENTITY, PRIMARY KEY

    title

    NVARCHAR(100)

    NOT NULL

    Name of the documents you are going to import

    doctype

    NCHAR(4)

    NOT NULL

    Type of the documents you are going to import

    docexcerpt

    NVARCHAR(1000)

    NOT NULL

    Excerpt of the documents you are going to import

    doccontent

    VARBINARY(MAX)

    NOT NULL

    Documents you are going to import

    Use the following code for creating the table.

    CREATE TABLE dbo.Documents
    (
      id INT IDENTITY(1,1) NOT NULL,
      title NVARCHAR(100) NOT NULL,
      doctype NCHAR(4) NOT NULL,
      docexcerpt NVARCHAR(1000) NOT NULL,
      doccontent VARBINARY(MAX) NOT NULL,
      CONSTRAINT PK_Documents
       PRIMARY KEY CLUSTERED(id)
    );
  7. Import the four documents included in the folder for this book. If the folder is C:\TK70461, then you can use the following code directly; otherwise, change the folder in the OPENROWSET functions appropriately.

    INSERT INTO dbo.Documents
    (title, doctype, docexcerpt, doccontent)
    SELECT N'Columnstore Indices and Batch Processing',
     N'docx',
     N'You should use a columnstore index on your fact tables,
       putting all columns of a fact table in a columnstore index.
       In addition to fact tables, very large dimensions could benefit
       from columnstore indices as well.
       Do not use columnstore indices for small dimensions. ',
     bulkcolumn
    FROM OPENROWSET(BULK 'C:\TK70461\ColumnstoreIndicesAndBatchProcessing.docx',
                    SINGLE_BLOB) AS doc;
    INSERT INTO dbo.Documents
    (title, doctype, docexcerpt, doccontent)
    SELECT N'Introduction to Data Mining',
     N'docx',
     N'Using Data Mining is becoming more a necessity for every company
       and not an advantage of some rare companies anymore. ',
     bulkcolumn
    FROM OPENROWSET(BULK 'C:\TK70461\IntroductionToDataMining.docx',
                    SINGLE_BLOB) AS doc;
    INSERT INTO dbo.Documents
    (title, doctype, docexcerpt, doccontent)
    SELECT N'Why Is Bleeding Edge a Different Conference',
     N'docx',
     N'During high level presentations attendees encounter many questions.
       For the third year, we are continuing with the breakfast Q&A session.
       It is very popular, and for two years now,
       we could not accommodate enough time for all questions and discussions! ',
     bulkcolumn
    FROM OPENROWSET(BULK 'C:\TK70461\WhyIsBleedingEdgeADifferentConference.docx',
                    SINGLE_BLOB) AS doc;
    INSERT INTO dbo.Documents
    (title, doctype, docexcerpt, doccontent)
    SELECT N'Additivity of Measures',
     N'docx',
     N'Additivity of measures is not exactly a data warehouse design problem.
       However, you have to realize which aggregate functions you will use
       in reports for which measure, and which aggregate functions
       you will use when aggregating over which dimension.',
     bulkcolumn
    FROM OPENROWSET(BULK 'C:\TK70461\AdditivityOfMeasures.docx',
                    SINGLE_BLOB) AS doc;
  8. Create a search property list called WordSearchPropertyList. Add the property Authors to the list. Document properties have predefined GUIDs and integer IDs. See the Books OnLine for SQL Server 2012 article “Find Property Set GUIDs and Property Integer IDs for Search Properties” at http://msdn.microsoft.com/en-us/library/ee677618.aspx for the list of some well-known ones. For the Authors property of Office documents, the GUID is F29F85E0-4FF9-1068-AB91-08002B27B3D9, and the integer ID is 4. Use the following code.

    CREATE SEARCH PROPERTY LIST WordSearchPropertyList;
    GO
    ALTER SEARCH PROPERTY LIST WordSearchPropertyList
     ADD 'Authors'
     WITH (PROPERTY_SET_GUID = 'F29F85E0-4FF9-1068-AB91-08002B27B3D9',
           PROPERTY_INT_ID = 4,
           PROPERTY_DESCRIPTION = 'System.Authors - authors of a given item.');
  9. Create a stopwords list called SQLStopList. Add the word SQL to it, using English as the language. Use the following code.

    CREATE FULLTEXT STOPLIST SQLStopList;
    GO
    ALTER FULLTEXT STOPLIST SQLStopList
     ADD 'SQL' LANGUAGE 'English';
  10. Check the stopwords list and remember the stoplist ID. Use the following query.

    SELECT w.stoplist_id,
     l.name,
     w.stopword,
     w.language
    FROM sys.fulltext_stopwords AS w
     INNER JOIN sys.fulltext_stoplists AS l
      ON w.stoplist_id = l.stoplist_id;
  11. Use the sys.dm_fts_parser dynamic management view to check how full-text search is parsing strings according to your stoplist, thesaurus info, word breaking in the selected language, and stemming in the selected language. For example, the next two queries check how a string is broken into words and what inflectional forms of a word full-text search can use. Note the parameters of the dynamic management view: The first one is the character string to analyze, the second one is the language ID (1033 for US English), the third one is the stoplist ID you got from the previous query, and the fourth one is a flag showing whether the parsing should be accent sensitive or not.

    SELECT *
    FROM sys.dm_fts_parser
    (N'"Additivity of measures is not exactly a data warehouse design problem.
       However, you have to realize which aggregate functions you will use
       in reports for which measure, and which aggregate functions
       you will use when aggregating over which dimension."', 1033, 5, 0);
    SELECT *
    FROM sys.dm_fts_parser
    ('FORMSOF(INFLECTIONAL,'+ 'function' + ')', 1033, 5, 0);

EXERCISE 2 Install a Semantic Database and Create a Full-Text Index

In this exercise, you install a semantic database and then create a full-text index.

  1. Check whether the Semantic Language Statistics Database is installed. If the following query does not return a row, you must install it.

    SELECT *
    FROM sys.fulltext_semantic_language_statistics_database;

    To install the Semantic Language Statistics Database, run the SemanticLanguageDatabase.msi package from the x64\Setup (if you are using a 64-bit instance) or x86\Setup (if your instance is 32-bit) folder from the SQL Server Setup drive.

  2. Check whether the SQL Server service account has Read and Write permissions on the folder where you installed the Semantic Language Statistics Database files. The default folder is C:\Program Files\Microsoft Semantic Language Database. If you installed the database in the default folder, then you can attach it by using the following command.

    CREATE DATABASE semanticsdb ON
     (FILENAME = 'C:\Program Files\Microsoft Semantic Language Database\semanticsdb.
    mdf'),
     (FILENAME = 'C:\Program Files\Microsoft Semantic Language Database\semanticsdb_
    log.ldf')
     FOR ATTACH;
  3. After you attach the database, register it by using the following code.

    EXEC sp_fulltext_semantic_register_language_statistics_db
     @dbname = N'semanticsdb';
  4. Check whether the Semantic Language Statistics Database was successfully installed by repeating the query from step 1. This time, the query should return one row.

  5. Finally, it is time to create a catalog. Name it DocumentsFtCatalog. Use the following code.

    CREATE FULLTEXT CATALOG DocumentsFtCatalog;
  6. Now create a full-text index. You should index the docexcerpt and doccontent columns. Set change tracking for populating the index to AUTO. Use the following code.

CREATE FULLTEXT INDEX ON dbo.Documents
(
  docexcerpt Language 1033,
  doccontent TYPE COLUMN doctype
  Language 1033
  STATISTICAL_SEMANTICS
)
KEY INDEX PK_Documents
ON DocumentsFtCatalog
WITH STOPLIST = SQLStopList,
     SEARCH PROPERTY LIST = WordSearchPropertyList,
     CHANGE_TRACKING AUTO;

Lesson Summary

  • You can create full-text catalogs and indexes by using SQL Server Full-Text Search and Semantic Search.

  • You can improve full-text searches by adding stopwords to stoplists, enhancing a thesaurus, and enabling a search over document properties.

  • You can use the sys.dm_fts_parser dynamic management object to check how Full-Text Search breaks your documents into words, creates inflectional forms of words, and more.

Lesson Review

Answer the following questions to test your knowledge of the information in this lesson. You can find the answers to these questions and explanations of why each answer choice is correct or incorrect in the Answers section at the end of this chapter.

  1. Which full-text search elements can you use to prevent indexing noisy words? (Choose all that apply.)

    1. Stopwords

    2. Thesaurus

    3. Stemmer

    4. Stoplists

  2. Which database do you have to install in order to enable the Semantic Search feature?

    1. msdb

    2. distribution

    3. semanticsdb

    4. tempdb

  3. How can you create synonyms for the words searched?

    1. You can edit the thesaurus file.

    2. You can create a thesaurus table.

    3. You can use the stopwords for synonyms as well.

    4. Full-text search does not support synonyms.