Querying Full-Text Data in Microsoft SQL Server 2012

  • 12/15/2012

Lesson 3: Using the Full-Text and Semantic Search Table-Valued Functions

In the previous lesson, you learned that terms in a full-text search can be weighted to change the rank of documents. However, you cannot see the rank by using the CONTAINS predicate. You need to get a table of documents (or document IDs) and their rank. This table is returned by the CONTAINSTABLE and FREETEXTTABLE functions. In addition, you installed the Semantic Language Statistics Database in the practice for Lesson 1. You are now going to exploit semantic search through three table-valued functions: SEMANTICKEYPHRASETABLE, SEMANTICSIMILARITYDETAILSTABLE, and SEMANTICSIMILARITYTABLE.

Using the Full-Text Search Functions

The CONTAINSTABLE and FREETEXTTABLE functions return two columns: KEY and RANK. The KEY column is the unique key from the index used in the KEY INDEX clause of the CREATE FULLTEXT INDEX statement. RANK returns an ordinal value between 0 and 1000. This is the rank value. It tells you how well a row matches your search criteria. The number is always relative to a query; it tells you only relative order of relevance for a particular rowset. A lower value means lower relevance. The actual values are not important; they might even change when you run the same query next time.

The calculation of the rank is quite complex. SQL Server takes into account term frequency—that is, frequency of a searched word in a document, number of words in a document, proximity terms (the NEAR clause), weight (the ISABOUT clause), number of indexed rows, and more. There is a different calculation for the CONTAINSTABLE function and for the FREETEXTTABLE function, because the latter does not support the majority of the parameters that the first one does, like proximity and weight terms.

The shortened syntax for the CONTAINSTABLE is as follows.

CONTAINSTABLE ( table , { column_name | ( column_list ) | * } ,
 ' <contains_search_condition> ' [ , LANGUAGE language_term]  [ , top_n_by_rank ] )

Search conditions are the same as in the CONTAINS predicate. You can use a simple term, a prefix term, a generation term, a proximity term, or a weighted term. The top_n_by_rank is an integer that specifies that only the n rows with highest rank should be returned in the rowset. This parameter could be important for performance, because your query might return huge rowsets.

The syntax for the FREETEXTTABLE is as follows.

FREETEXTTABLE (table , { column_name | (column_list) | * }
          , 'freetext_string'  [ , LANGUAGE language_term ]  [ , top_n_by_rank ] )

Because this syntax is so simple, the complete syntax is shown.

Using the Semantic Search Functions

There are three table-valued functions that enable the semantic search. The syntax for the first one, the SEMANTICKEYPHRASETABLE, is as follows.

SEMANTICKEYPHRASETABLE
    ( table, { column | (column_list) | * }  [ , source_key ] )

This function returns a table with key phrases associated with the full-text indexed column from the column_list. The source_key parameter specifies the unique key from the index used in the KEY INDEX clause of the CREATE FULLTEXT INDEX statement. If you omit it, SQL Server returns key phrases for all rows.

The syntax for the second semantic search function, SEMANTICSIMILARITYDETAILSTABLE, is as follows.

SEMANTICSIMILARITYDETAILSTABLE
    ( table, source_column, source_key, matched_column, matched_key )

This function returns a table with key phrases that are common across two documents. You define the source document with the source_key, which is again the unique key from the index used in the KEY INDEX clause of the CREATE FULLTEXT INDEX statement, and with source_column, which is the name of the full-text indexed column.

The last semantic search function is the SEMANTICSIMILARITYTABLE function, as shown here.

SEMANTICSIMILARITYTABLE
    ( table, { column | (column_list) | * }, source_key )

This function returns a table with documents scored by semantic similarity to the searched document specified with the source_key parameter. The source_key parameter specifies the unique key from the index used in the KEY INDEX clause of the CREATE FULLTEXT INDEX statement. You can use this function to find which documents are the most similar to a specified document.

Practice: Using the Full-Text and Semantic Search Functions

In this practice, you use the full-text and semantic search table-valued functions.

If you encounter a problem completing an exercise, you can install the completed projects from the companion content for this chapter and lesson.

EXERCISE 1 Use the Full-Text Search Functions

In this exercise, you query data with the CONTAINSTABLE and FREETEXTTABLE functions.

  1. If you closed SSMS, start it and connect to your SQL Server instance. Open a new query window by clicking the New Query button.

  2. Connect to your TSQL2012 database.

  3. Write a query that uses the CONTAINSTABLE function to rank the documents based on containment of the words “data” or “level” in the docexcerpt column. Use the following query.

    SELECT D.id, D.title, CT.[RANK], D.docexcerpt
    FROM CONTAINSTABLE(dbo.Documents, docexcerpt,
          N'data OR level') AS CT
     INNER JOIN dbo.Documents AS D
      ON CT.[KEY] = D.id
    ORDER BY CT.[RANK] DESC;
  4. Write a query that uses the FREETEXTTABLE function to rank the documents based on containment of the words “data” or “level” in the docexcerpt column. Compare the result with the result from the previous query. Use the following query.

    SELECT D.id, D.title, FT.[RANK], D.docexcerpt
    FROM FREETEXTTABLE (dbo.Documents, docexcerpt,
          N'data level') AS FT
     INNER JOIN dbo.Documents AS D
      ON FT.[KEY] = D.id
    ORDER BY FT.[RANK] DESC;
  5. Write a query that retrieves the rank of the documents based on containment of the words “data” or “level” in the docexcerpt column. Give the word “data” a weight of 0.8, and the word “level” a weight of 0.2. Compare the results with the results from the first CONTAINSTABLE query in this exercise. Use the following query.

    SELECT D.id, D.title, CT.[RANK], D.docexcerpt
    FROM CONTAINSTABLE
          (dbo.Documents, docexcerpt,
           N'ISABOUT(data weight(0.8), level weight(0.2))') AS CT
     INNER JOIN dbo.Documents AS D
      ON CT.[KEY] = D.id
    ORDER BY CT.[RANK] DESC;
  6. Write a query that retrieves the rank of the documents based on containment of the words “data” and “row” in the doccontent column. The words must be fewer than 30 search terms away. Use the following query.

    SELECT D.id, D.title, CT.[RANK]
    FROM CONTAINSTABLE (dbo.Documents, doccontent,
          N'NEAR((data, row), 30)') AS CT
     INNER JOIN dbo.Documents AS D
      ON CT.[KEY] = D.id
    ORDER BY CT.[RANK] DESC;
  7. Test the previous query with a different distance between search terms.

EXERCISE 2 Use the Semantic Search Functions

In this exercise, you query data by using the SEMANTICKEYPHRASETABLE, SEMANTICSIMILARITYDETAILSTABLE, and SEMANTICSIMILARITYTABLE functions.

  1. Write a query that retrieves the 20 most important semantic search phrases in the documents in the doccontent column. Use the following query.

    SELECT TOP (20)
     D.id, D.title, SKT.keyphrase, SKT.score
    FROM SEMANTICKEYPHRASETABLE
          (dbo.Documents, doccontent) AS SKT
     INNER JOIN dbo.Documents AS D
      ON SKT.document_key = D.id
    ORDER BY SKT.score DESC;
  2. Return all documents but the document with ID equal to 1, ordered by semantic similarity to the document in the doccontent column with ID equal to 1. Use the following query.

    SELECT SST.matched_document_key,
     D.title, SST.score
    FROM SEMANTICSIMILARITYTABLE
         (dbo.Documents, doccontent, 1) AS SST
     INNER JOIN dbo.Documents AS D
      ON SST.matched_document_key = D.id
    ORDER BY SST.score DESC;
  3. Return semantic search key phrases that are common across the document with ID equal to 1 and the document with ID equal to 4. Order the phrases by similarity score. Use the following query.

    SELECT SSDT.keyphrase, SSDT.score
    FROM SEMANTICSIMILARITYDETAILSTABLE
          (dbo.Documents, doccontent, 1,
           doccontent, 4) AS SSDT
    ORDER BY SSDT.score DESC;
  4. Clean up the database.

    DROP TABLE dbo.Documents;
    DROP FULLTEXT CATALOG DocumentsFtCatalog;
    DROP SEARCH PROPERTY LIST WordSearchPropertyList;
    DROP FULLTEXT STOPLIST SQLStopList;
  5. Exit SSMS.

Lesson Summary

  • Full-text functions are useful for ranking results.

  • Semantic similarity functions give you a lot of insight into the documents. You can find key phrases and compare documents.

Lesson Review

Answer the following questions to test your knowledge of the information in this lesson. You can find the answers to these questions and explanations of why each answer choice is correct or incorrect in the Answers section at the end of this chapter.

  1. Which function can be used to rank documents based on proximity of words?

    1. CONTAINSTABLE()

    2. FREETEXTTABLE()

    3. SEMANTICKEYPHRASETABLE()

    4. SEMANTICSIMILARITYTABLE()

    5. SEMANTICSIMILARITYDETAILSTABLE()

  2. Which function can be used to find the document that is most semantically similar to a specified document?

    1. CONTAINSTABLE()

    2. FREETEXTTABLE()

    3. SEMANTICKEYPHRASETABLE()

    4. SEMANTICSIMILARITYTABLE()

    5. SEMANTICSIMILARITYDETAILSTABLE()

  3. Which function returns a table with key phrases associated with the full-text indexed column?

    1. CONTAINSTABLE()

    2. FREETEXTTABLE()

    3. SEMANTICKEYPHRASETABLE()

    4. SEMANTICSIMILARITYTABLE()

    5. SEMANTICSIMILARITYDETAILSTABLE()