Systems Engineering and RDBMS

FullText Search – III: Using Thesaurus Files

Posted by decipherinfosys on December 17, 2008

In previous posts on FullText Searches, we had discussed how to create full text catalogs and indexes and how to use T-SQL predicates and functions as well as do some complex searches. As we had seen, the inflectional forms are restricted to verb conjugations and plural forms of words which is useful but limited in usage. Using Thesaurus files, one can use XML files in which we can use word replacements and expansion patterns. These files are named as tsnnn.xml and where nnn represents the language and these files reside under the …\MSSQL.1\MSSQL\FTDATA folder in the SQL Server installation on the server.

This is what the tsENU.xml (for US English) looks like on my instance:

<XML ID="Microsoft Search Thesaurus">

<!– Commented out

<thesaurus xmlns=”x-schema:tsSchema.xml”>
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>Internet Explorer</sub>
<sub>IE</sub>
<sub>IE5</sub>
</expansion>
<replacement>
<pat>NT5</pat>
<pat>W2K</pat>
<sub>Windows 2000</sub>
</replacement>
<expansion>
<sub>run</sub>
<sub>jog</sub>
</expansion>
</thesaurus>
–>
</XML>

Now, I opened it up in notepad and made these changes:

<XML ID="Microsoft Search Thesaurus">
<thesaurus xmlns="x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>test</sub>
<sub>check</sub>
<sub>defect</sub>
</expansion>
<replacement>
<pat>this</pat>
<pat>new</pat>
<sub>bug</sub>
</replacement>
</thesaurus>
</XML>

And now, let us create some more records into that table:

insert into open_tran_test (col2) values (‘check this’);
insert into open_tran_test (col2) values (‘new defect’);

And now let’s fire off this query:

select * from open_tran_test
where freetext (col2, ‘test’);

And you will see these results:

COL1        COL2
----------- ----------
1           TEST
6           T2 test
8           tested 1
10          more tests
13          1 testing
16          check this
18          new defect

Notice, that even though we searched for “test”, we also got the last two records that we had inserted.  That is because of the changes we had made to the tsENU.xml file under the expansion section.  The word being searched is expanded upon and is matched to the other words specified under the expansion section.  So, when we searched for “test”, the search was also expanded to look for “check” and “defect”.  We can have as many substitutions as we want under this logic.

And if you look under the replacement section, the words over there are used to replace the words with the words in the <sub> section.  So, if we fire off this query:

select * from open_tran_test
where freetext (col2, ‘new’);

It will look for the word “new” and then replace it with the word “bug” based on the replacement pattern that we have specified in the XML file.

The diacritics_sensitive section at the top of that XML file is used to indicate whether we want the accent marks to be replace during the expansion and replacement.  So, if the value is changed from 0 to 1, we would have apple and ápple treated as the same.  When it is kept as 0 (default), then these two will be treated as separate words.

Another thing to note is that in the case of SQL Server 2005, the thesaurus file is loaded up only by re-starting the service.  In the case of SQL Server 2008 though, there is a new system stored procedure called: “sys.sp_fulltext_load_thesaurus_file” which can be used to reload the file without having a need to restart the service.  In the next post in this series, we will take a look at the noise words elimination (SQL Server 2005) – done via stoplists in SQL Server 2008.  And will also look into how we can search documents stored within SQL Server.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: