Will replace Postgres Oracle soon

Search and find texts with SQL content:

Transcript

1 Searching for and finding texts with SQL Contents: 1. Introduction ... S The LIKE operator (Oracle, MS-SQL, DB2, MySQL, PostgreSQL) ... S How it works ... S Examples ... S Wildcards and Escape -Character ... S Differentiation between upper and lower case ... S Special extensions ... S Regular expressions according to the POSIX standard (Oracle, MySQL, PostgreSQL) ... S Basics ... S Implementation of regular expressions ... S Examples ... S Further functions with regular expressions ... S Regular expressions according to SQL-99 (PostgreSQL) ... S Full-text search with MySQL ... S Creating a full-text index ... S Performing a full-text search ... S Search with "blind query expansion" ... S The Boolean search mode ... S Weighting and linking of individual terms ... S Examples ... S Full-text search with Microsoft SQL Server ... S Creating a full-text index ... S The thesaurus. .. S Carrying out a full-text search ... S Searching for simple terms ... S Linking several terms ... S Searching for variants ... S Searching with relevance calculations on ... S Using weighted terms ... S Choosing the appropriate method ... p. 19

2 1. Introduction One of the main tasks of databases is to find records quickly. Modern DBMS provide more functions for this than is generally known. This tutorial shows which options for text search are basically available in SQL and which options specifically MySQL and PostgreSQL provide for this. In the following, "mysql>" stands for a query for MySQL, "pgsql>" for PostgreSQL, "db2>" for DB2, "mssql>" for MS-SQL Server and "oracle>" for Oracle. I used the following versions for my tests: MySQL 4.1.5gamma, PostgreSQL 8.0.0beta2, DB2 Server 8.2, Microsoft SQL Server 2005 and Oracle 10g XE. 2. The LIKE Operator 2.1. How it works With LIKE you can define simple patterns. If there are no special requirements for the search pattern, LIKE offers the easiest and fastest access to text fields. SQL syntax: text [NOT] LIKE pattern [ESCAPE 'escape-char'] The LIKE patterns can contain two different wildcards:% Any string (no or any number of characters) _ Exactly one character. that most database systems can optimize the performance (e.g. by accessing an index, if available) if the pattern does not start with a wildcard Examples a) Find all data records in whose text field the word "encryption" occurs: mysql> SELECT * FROM testtab WHERE text field LIKE '% encryption%'; b) Find all records that do not begin with the word "encryption": pgsql> SELECT * FROM testtab WHERE text field NOT LIKE 'encryption%'; c) Find all data records for business partners whose names are Meier or Maier: oracle> SELECT * FROM customers WHERE surname LIKE 'M_ier'; 2.3. Wildcards and escape characters Should% and _ not be used as wildcards, but as normal characters

3 are interpreted, they must be preceded by the escape character "\" (backslash) in MySQL. For example, if we want to find all records in which the string "mysql_query" occurs, we write: mysql> SELECT * FROM testtab WHERE text field LIKE '% mysql \ _query%'; The escape character thus becomes a special character itself. For example, if we want to find a backslash as a normal character, e.g. in the string "\ Windows", we write: mysql> SELECT * FROM testtab WHERE text field LIKE '% \ Windows%'; With the ESCAPE clause we can also define our own escape character in MySQL, e.g. '#'. With Oracle, DB2 and MS-SQL, the backslash is not automatically an escape character, which is why we have to specify one explicitly here: mysql> SELECT * FROM testtab WHERE text field LIKE '% mysql # _query%' ESCAPE '#'; db2> SELECT * FROM testtab WHERE text field LIKE '% mysql \ _query%' ESCAPE '\'; 2.4. Distinction between upper and lower case There is an important difference with regard to the treatment of upper and lower case letters: while MySQL and MS-SQL are basically case-insensitive, PostgreSQL, DB2 and Oracle take them into account. The LIKE BINARY operator is used in MySQL to force it to be case-sensitive. In the following example "encryption" should be found, but not "encryption": mysql> SELECT * FROM testtab WHERE text field LIKE BINARY '% encryption%'; PostgreSQL uses the ILIKE operator to turn off case sensitivity. The following query finds both "encryption" and "encryption": pgsql> SELECT * FROM testtab WHERE text field ILIKE '% encryption%'; DB2 and Oracle do not provide their own operator or modifier to change the behavior of LIKE. There is, however, another way of avoiding the distinction between upper and lower case. Only lower case letters are used in the search pattern and the text in the column that is being searched for is also converted into lower case letters with the LOWER function. The following query finds both "encryption" and "encryption": db2> SELECT * FROM testtab WHERE LOWER (text field) LIKE '% encryption%'; In Oracle this function is called NLS_LOWER:

4 oracle> SELECT * FROM test tab WHERE NLS_LOWER (text field) LIKE '% encryption%'; Incidentally, this also works in the same way with a comparison only between capital letters. The corresponding functions for conversion are called UPPER and NLS_UPPER. Special extensions A MySQL extension to SQL-99 is the ability to apply LIKE to numeric types. In the following example, all article numbers that begin with 2 are to be found: mysql> SELECT * FROM article WHERE article_no LIKE '2%'; In conjunction with the LIKE operator, MS-SQL allows the use of a list of permitted characters. This is limited with [], for example [a-eab] finds the lower case letters a, b, c, d, e and the upper case letters A, B. The ^ character is used for negation. [^ a-eab] matches all but the specified characters. As an example, the entries that contain a three-digit number enclosed in round brackets are to be found: mssql> SELECT * FROM testtab WHERE text field LIKE '% ([0-9] [0-9] [0-9])%'; To illustrate the function of the negative list, all entries should now be found that do not begin with the letter M (note that the search is case-sensitive, i.e. both m and M are counted as hits): mssql> SELECT * FROM testtab WHERE text field LIKE '[^ m]%'; 3. Regular expressions according to the POSIX standard 3.1. Basics Most DBMS should also be able to interpret regular expressions according to the POSIX standard. This allows much more precise search patterns to be determined than is possible with LIKE. MySQL, PostgreSQL and Oracle also adhere to this standard. For a detailed description of how regular expressions work, see other tutorials and relevant specialist literature. A list of the most important constructs with a short description should suffice here. The most important special characters of regular expressions: ^ The beginning of a character string $ The end of a character string

5. Any character [] enclose a list of permitted characters, for example [a-eab] finds the lower case letters a, b, c, d, e and the upper case letters A, B. The ^ character is used for negation. [^ a-eab] finds all but the specified characters [: characterclass:] a character class is an abbreviation for a character list. The classes depend on the locale setting (local character set). A list can be found below. a logical "or", which identifies alternatives: "pi e" finds both "pi" and "e" () combines several characters into a unit (group) in order to e.g. to be able to use together with a quantifier * the preceding unit can appear any number of times (not even once) in a row + the preceding unit can appear once or several times (but not never)? the preceding unit can occur once or never, but not more than once {n} the preceding unit occurs exactly n times {n, m} the preceding unit occurs at least n times and at most m times {n,} the preceding unit occurs at least n times A list of the most important character classes: [[: upper:]] [[: lower:]] [[: alpha:]] [[: alnum:]] [[: digit:]] [ [: space:]] All uppercase letters All lowercase letters All letters All letters and numbers All numbers Spaces PostgreSQL and Oracle offer more powerful regular expressions than MySQL. In addition to the POSIX expressions, extensions are supported that can also be found in Perl-compatible regular expressions, e.g. Greedy and non-greedy quantifiers, back references, shorthands for certain character classes, etc. Quantifiers are basically greedy in PostgreSQL as long as they don't have? is adjusted. For example, "*?" and "{3,}?" the non-greedy versions of the quantifiers "*" and "{3,}". Backreferences that refer to previous groupings (enclosed in round brackets) are also written in the usual notation: \ 1 for the first grouping, \ 2 for the second, etc. The most important shorthands are \ d (all digits), \ w (Letters, numbers and _) and \ s (space). These shorthands are negated by capitalizing them: \ D (no digits), \ W (no word characters), \ S (no spaces). Note that a backslash is a special character for PostgreSQL and therefore has to be "invalidated" by a backslash in these cases, so you have to write "\ 1" for "\ 1" and "\ d" for "\ d" . Oracle recognizes from the context what purpose the backslash fulfills, so this is not necessary here (see examples below). Farther

6, Oracle does not seem to accept the shorthands for character classes (e.g. \ d, \ w) within a list of permitted characters limited by []. In this case you have to write out the character class (eg [: digit:], [: alnum:], see example e) below) Implementation of regular expressions In connection with regular expressions, MySQL, PostgreSQL and Oracle use different operators or functions. MySQL knows the REGEXP and RLIKE operators. They are equivalent. SQL syntax: text [NOT] REGEXP [BINARY] pattern text [NOT] RLIKE [BINARY] pattern With NOT a regular expression can be negated as with LIKE. BINARY results in a search that is case-sensitive. PostgreSQL has no REGEXP and no RLIKE. Instead, in the best tradition (see the Perl syntax), the ~ operator (tilde) is used. Its application: PostgreSQL meaning Correspondence in MySQL text ~ pattern text must match pattern REGEXP BINARY text! ~ Pattern text must not match pattern NOT REGEXP BINARY text ~ * pattern text! ~ * Pattern text must match pattern, case-sensitive ignored text must not match pattern, upper and lower case letters ignored REGEXP NOT REGEXP Oracle has only recognized regular expressions since version 10g. The REGEXP_LIKE function is available for this: [NOT] REGEXP_LIKE (text, pattern [, search options]) The third parameter is optional and can contain one or more of the following options: cinm search is case-sensitive (default setting) Search is case-sensitive / Lower case not allowed Operator for any characters including line break (point) Treats text as multi-line if there is a line break 3.3. Examples a) Find the names Meier, Maier and Mayr (in capitalization):

7 mysql> SELECT * FROM customers WHERE surname REGEXP BINARY '^ M (eie aie ay) r $'; pgsql> SELECT * FROM customers WHERE surname ~ '^ M (eie aie ay) r $'; oracle> SELECT * FROM customers WHERE REGEXP_LIKE (last name, '^ M (eie aie ay) r $'); b) Find data records that contain a three- or more-digit number in round brackets: mysql> SELECT * FROM testtab WHERE text field RLIKE '\ ([[: digit:]] {3,} \)'; pgsql> SELECT * FROM testtab WHERE text field ~ * '\ (\ d {3,} \)'; oracle> SELECT * FROM testtab WHERE REGEXP_LIKE (text field, '\ (\ d {3,} \)'); c) Find records that contain "DESede" or "DES-ede": mysql> SELECT * FROM testtab WHERE text field RLIKE BINARY 'DES-? ede'; pgsql> SELECT * FROM testtab WHERE text field ~ 'DES-? ede'; oracle> SELECT * FROM testtab WHERE REGEXP_LIKE (text field, 'DES-? ede'); d) Find records that do not contain spaces: mysql> SELECT * FROM testtab WHERE text field REGEXP '^ [^ [: space:]] * $'; pgsql> SELECT * FROM testtab WHERE text field ~ * '^ \ S * $'; oracle> SELECT * FROM testtab WHERE REGEXP_LIKE (text field, '^ \ S * $'); e) Find records that contain a mit.com-domain: mysql> SELECT * FROM testtab WHERE text field REGEXP '[[: alnum:] \ ._-] pgsql> SELECT * FROM testtab WHERE text field ~ * oracle> SELECT * FROM testtab WHERE REGEXP_LIKE (text field, '[[: alnum:] ._-]' i '); f) Find records that contain italic or underlined text in HTML: mysql> SELECT * FROM testtab WHERE text field RLIKE '. * <\ / u>' OR text field RLIKE '. * <\ / i>' ; pgsql> SELECT * FROM testtab WHERE text field ~ * '<(u i)>. *? <\ / \ 1>'; oracle> SELECT * FROM testtab WHERE REGEXP_LIKE (text field, '<(u i)>. *? ', 'in'); 3.4. Further functions with regular expressions (Oracle, PostgreSQL) Oracle provides functions that use regular expressions not only to find suitable text passages, but also for output formatting. The REGEXP_INSTR function is used to display the position of a specific text passage in a text. The return value is a positive integer or 0 if no suitable text passage is found. Syntax: REGEXP_INSTR (text, pattern [, position [, occurrence [, return [, search options]]]])

8 Only the first two parameters are mandatory. The "position" parameter is a positive integer (> 0) and specifies the position (number of characters) at which the search should begin. The parameter "occurrence" also expects a positive integer or 0 and determines how many occurrences should be counted. The "return" parameter can be used to determine whether the beginning or the end of the occurrence found should be used when determining the position (0 = beginning, 1 = end). The permitted search options, however, correspond to those of REGEXP_LIKE. As an example, it should be determined at which position a text underlined or italicized by HTML begins the second time. Note that the text itself only begins after the HTML tag, so we set the "return" parameter to 1 in order to get the position exactly after the HTML tag: oracle> SELECT titel, REGEXP_INSTR (textfeld, '<(ui )> ', 1, 2, 1,' i ') AS position FROM testtab; The REGEXP_SUBSTR function is used to extract the text elements that are found on the basis of the pattern and to discard the rest. If no suitable text passage is found, the function returns NULL. Syntax: REGEXP_SUBSTR (text, pattern [, position [, occurrence [, search options]]]) The function of the parameters corresponds to the parameters of the same name of the REGEXP_INSTR function. Now a small example: the mail addresses in the customer table should be used to determine which domains the addresses are assigned to as a whole. To do this, we need to extract the second string that contains. This gives us the string after oracle> SELECT DISTINCT REGEXP_SUBSTR (, 1, 2) AS domains FROM customers; The SUBSTRING function in PostgreSQL allows the extraction of a character string not only with precise position information for the beginning and the end, but also with the help of a pattern. It corresponds roughly to the Oracle function REGEXP_SUBSTR. Syntax: SUBSTRING (text FROM pattern [FOR escape_character]) The "pattern" parameter is also a regular expression here. An alternative escape character can optionally be specified after FOR. Here, too, the domains of the customer's e-mail addresses should be returned as an example: pgsql> SELECT DISTINCT SUBSTRING (FROM AS domains FROM customers; Finally, we can use the REGEXP_REPLACE function to replace certain text passages with others. The text that does not match the pattern remains , and is not dropped, as is the case with REGEXP_SUBSTR. If no suitable text passage is found for replacement, the original text is returned. Syntax:

9 REGEXP_REPLACE (text, pattern [, replacement [, position [, occurrence [, search options]]]]) Everything runs as usual except for the third parameter. The "replacement" parameter specifies the character string by which all matching text passages are to be replaced. Backreferences are also allowed. If an empty string is passed as a replacement string or the parameter is omitted entirely, all text passages found are simply deleted. As an example, the text formatting "italic" and "underlined" should be replaced by bold in HTML documents. oracle> SELECT REGEXP_REPLACE (text field, '<(i u)> (. *?) ', ' \ 2 ', 1, 0, 'in') AS newtext FROM testtab; 4. Regular expressions according to SQL-99 (PostgreSQL) PostgreSQL implements regular expressions according to SQL-99. These are regular expressions, the characteristics of the LIKE operator, and ordinary regular expressions, but without achieving the flexibility of the latter. This type of regular expression is available with the SIMILAR TO operator.SQL syntax: text [NOT] SIMILAR TO pattern [ESCAPE 'escape-char'] The following special characters (and only these) are defined:% replaces none or any number of characters _ replaces exactly one character a logical "or", the Alternatives denotes [] enclose a list of permitted characters as in regular expressions (POSIX) () combines several characters into a unit (group) * the preceding unit can appear any number of times (not even) one after the other + the preceding unit can appear once or several times (but not even) Note that the pattern has to match the entire text or string just like with LIKE (while regular expressions naturally find an occurrence somewhere in the text). A few simple examples should suffice at this point: a) Find the names Meier, Maier and Mayr (in capitalization): pgsql> SELECT * FROM customers WHERE surname SIMILAR TO 'M (eie aie ay) r'; b) Find records that contain italic or underlined text in HTML:

10 pgsql> SELECT * FROM testtab WHERE text field SIMILAR TO '% % %' OR text field SIMILAR TO '% % %'; c) Find records that may only consist of the lower case letters a-r and digits: pgsql> SELECT * FROM testtab WHERE text field SIMILAR TO '[[: digit:] a-r] +'; 5. Full text search with MySQL 5.1. Creating a full text index MySQL offers a special type of full text search based on relevance calculation. In order to be able to use this type of search, the queried columns must be provided with a FULLTEXT index. Unfortunately, this means that the possibility of a full-text search is limited to MyISAM tables. For the examples below, let's create a "dbnews" table with a FULLTEXT index on all fields that contain texts or strings, plus some sample entries. mysql> CREATE TABLE dbnews (id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, title VARCHAR (200) NOT NULL, content TEXT NOT NULL, FULLTEXT (title, content)); mysql> INSERT INTO dbnews VALUES -> (NULL, 'MySQL 4.1 before release', 'Version of the popular DBMS MySQL will appear soon ...'), -> (NULL, 'New C-API in MySQL', 'Soon the old mysql functions belong to the scrap heap ... '), -> (NULL,' MySQL vs. PostgreSQL ',' In this comparison of two freely available databases ... '), -> (NULL,' New ODBC -Driver ',' These new drivers bring improvements in performance ... '), -> (NULL,' Tutorial Stored Procedures', 'This tutorial shows the use of stored procedures in MySQL ...'), -> ( NULL, 'Overview DBMS', 'Here we want to compare different databases, including MySQL ...'); 5.2. Carrying out a full text search The actual search is carried out with MATCH AGAINST. SQL syntax: MATCH (column1 [, column2, ...]) AGAINST (expression [IN BOOLEAN MODE WITH QUERY EXPANSION]) The statement returns a positive floating point number (type FLOAT) that expresses the relevance (where 0 means no match ). Used within a WHERE clause, this means that the hits are already sorted, with the data record that is most relevant at the beginning. Note that the listing of the columns after MATCH exactly matches the FULLTEXT columns of the queried

11 table must match. Now let's look for the (exact) word "databases" in all indexed fields: mysql> SELECT * FROM dbnews WHERE MATCH (title, content) AGAINST ('databases'); 5.3. Search with "blind query expansion" Now the problem arises that a user who searches for "databases" also wants to find entries in which a DBMS is explicitly named by name, e.g. "MySQL", "PostgreSQL" etc. To accomplish this, MySQL supports the addition WITH QUERY EXPANSION since version. This is made possible by a double internal search: if MySQL finds "databases" in the first run together with words like "MySQL" or "PostgreSQL", it searches for these very words in the second run. MySQL assigns a word that occurs in almost every data set a very low relevance, while a rare word has a higher relevance. This feature only becomes really effective with larger tables. A fundamental problem remains, however: a lot of useless items can be returned, which means that this so-called "blind query expansion" is only suitable for short search expressions. Now let's improve our query: mysql> SELECT * FROM dbnews WHERE MATCH (title, content) AGAINST ('Databases' WITH QUERY EXPANSION); 5.4. The Boolean search mode If we now search specifically for "MySQL", we will experience a surprise: mysql> SELECT * FROM dbnews WHERE MATCH (title, content) AGAINST ('MySQL'); Although "MySQL" is contained in 5 of 6 entries, we get an empty result. This is a kind of protection mechanism of MySQL: If a word occurs in more than half of all entries, MySQL does not return a result. Understandable if you imagine a table with entries. To stop this behavior, we use the IN BOOLEAN MODE extension (available since MySQL 4.0.1, cannot be used together with WITH QUERY EXPANSION): mysql> SELECT id, MATCH (title, content) AGAINST ('MySQL' IN BOOLEAN MODE) FROM dbnews WHERE MATCH (title, content) AGAINST ('MySQL' IN BOOLEAN MODE); Since the return from MATCH AGAINST is no longer a floating point number, but a Boolean value, the results can no longer be sorted. In addition, a Boolean full-text search can also be used on tables and fields that do not contain a FULLTEXT index. Then it is a bit slower. Weighting and linking of individual terms. In addition, the following operators are supported in the search expression:

12 + placed in front of a word: this word must be included - placed in front of a word: this word must not be included> placed in front of a word: this word is rated higher in relevance SELECT * FROM dbnews WHERE MATCH (title, content) AGAINST ('MySQL PostgreSQL 'IN BOOLEAN MODE); b) Find records that contain both "MySQL" and "PostgreSQL": mysql> SELECT * FROM dbnews WHERE MATCH (title, content) AGAINST (' + MySQL + PostgreSQL 'IN BOOLEAN MODE); c) Find records that contain "MySQL" but not "PostgreSQL": mysql> SELECT * FROM dbnews WHERE MATCH (title, content) AGAINST ('+ MySQL -PostgreSQL' IN BOOLEAN MODE); d) Find records that contain "MySQL". If "PostgreSQL" is also included, they should be given a higher rating (this gives the effect that one would have without a boolean search): mysql> SELECT * FROM dbnews WHERE MATCH (title, content) AGAINST ('+ MySQL PostgreSQL' IN BOOLEAN MODE ); e) Find data records that contain the exact string "free database" (but not, for example, "free commercial database"): mysql> SELECT * FROM dbnews WHERE MATCH (title, content) AGAINST ('"free database"' IN BOOLEAN MODE) ; f) Find records that contain "MySQL" and "Database" (or all words that begin with "Database") or "PostgreSQL" and "Database", whereby PostgreSQL is rated higher:

13 mysql> SELECT * FROM dbnews WHERE MATCH (title, content) AGAINST ('+ ( PostgreSQL) + database *' IN BOOLEAN MODE); 6. Full-text search with Microsoft SQL Server Microsoft has donated a full-text module to its SQL Server, which allows even more flexible full-text searches than is possible with MySQL. In addition to the linking of individual search terms with logical operators (AND, OR, NOT) and the weighting of the individual search terms, the SQL Server also finds inflected forms of a word or equivalent terms thanks to powerful, language-dependent word stem recognition and a thesaurus. Unfortunately, the free express version of SQL Server does not come with a full text module. The full-text search is therefore only available in the commercial versions. Creating a full-text index The full-text search is not restricted to a specific table format, as is the case with MySQL, but can basically be used on all CHAR, VARCHAR and TEXT fields, provided that the Table has a unique index. However, this is usually given in the form of a primary key. Before we start, we still have to check whether the full text module is already installed and the service has started. It may have to be installed later. The next step is to check whether the database we are going to work with (again "test" in my example) has been activated for full-text indexing. This can be queried with the following query: mssql> SELECT DATABASEPROPERTY ('test', 'IsFullTextEnabled'); If the full text indexing is activated, the query returns 1. Otherwise we have to activate the full text indexing manually (from now on we assume that the "test" database has been selected): mssql> sp_fulltext_database enable; The examples that follow later use a table "dbnews" and can be understood using the following example data records. I assigned an explicit name to the primary key. We shall see the reason for this later. mssql> CREATE TABLE dbnews (id INT NOT NULL, title VARCHAR (200) NOT NULL, content TEXT NOT NULL, CONSTRAINT myindex PRIMARY KEY (id)); mssql> INSERT INTO dbnews VALUES -> (1, 'Introduction to SQLJ', 'A training course on SQLJ took place at the beginning of the month'), -> (2, 'New JDBC drivers',' The new drivers for establishing a connection is now available for download via Java '), -> (3,' DB2 Express ',' The popular DBMS from IBM appears in a free Express version '), -> (4,' MySQL vs. PostgreSQL ',' Unser Volunteer compares two freely available databases'), -> (5, 'Comparison test', 'The editor-in-chief compared many different databases such as DB2 and Oracle');

14 The exact structure of the full text module can be found in the online documentation. At the moment we are only interested in the fact that every full-text index must be assigned to a full-text catalog. If a full text catalog does not yet exist, you have to create a new one with CREATE FULLTEXT CATALOG. The (shortened) SQL syntax: CREATE FULLTEXT CATALOG catalog name [AS DEFAULT] The exact syntax and other options are not relevant for us and can be looked up in the online documentation. We just remember that every full-text catalog needs a unique name. The AS DEFAULT option can also be omitted. If it is specified, this means that a full-text index that is not explicitly assigned to a full-text catalog is assigned to this catalog by default. In the following example, a full text catalog called "mycatalog" is created: mssql> CREATE FULLTEXT CATALOG mycatalog AS DEFAULT; The full text index itself is created with CREATE FULLTEXT INDEX. The (shortened) syntax for this: CREATE FULLTEXT INDEX ON table name (column1 [LANGUAGE 'language'] [, column2 [LANGUAGE 'language'], ...]) KEY INDEX index name [ON catalog name] Here, too, I'll be satisfied with the most important ones Options and otherwise refer to the documentation. In addition to the table and one or more columns (separated by commas), the name of the table's unique index must be specified. Optionally, the language can be specified for each column if it deviates from the standard setting (on German systems the standard is usually 'german'). The full-text catalog to which the full-text index is to be assigned can also be specified. If no full-text catalog is mentioned, the index is assigned to the catalog that was last created with AS DEFAULT. Now it becomes clear why I assigned an explicit name to the primary key when I created the table "dbnews". We now create a full text index over the table "dbnews" with the key "myindex" and include the columns "title" and "content" in the full text index. For demonstration purposes, we also explicitly assign the full-text index to the full-text catalog "mycatalog", although we could save ourselves that, since "mycatalog" is our default full-text catalog. mssql> CREATE FULLTEXT INDEX ON dbnews (title, content) KEY INDEX myindex ON mycatalog; The full-text index and full-text catalog can be changed with ALTER FULLTEXT INDEX or ALTER FULLTEXT CATALOG and deleted with DROP FULLTEXT INDEX or DROP FULLTEXT CATALOG. More details can be found in the documentation The Thesaurus In normal usage, "database" and "database management system" (DBMS) are often equated, although this is not entirely correct. A "database" represents the actual data stock, while as a DBMS the software for

15 database management and control of user transactions. The sample data sets I have given also contain this inaccuracy. In order to enable the user, who does not know the difference exactly or does not even know what a DBMS is, to find entries with "DBMS", if he only enters "Database", we can use the thesaurus of the SQL Server to edit. The thesaurus is used for search queries that should also take synonyms into account (see Chapter). With the thesaurus, the result is of course more controllable and more precise than a query with "blind query expansion" could ever be. The thesaurus itself is in XML format. If the SQL Server was installed in C: \ Programs, the thesaurus files can be found under C: \ Programs \ Microsoft SQL Server \ MSSQL.1 \ MSSQL \ FTData \. The file with the German thesaurus is called tsdeu.xml, the English one is tseng.xml etc. Now we insert the following section between and : database dbms < / sub> This means that the thesaurus recognizes "database" and "DBMS" as equivalent. Note that the thesaurus must not be commented out in the file. The changes take effect after restarting the server. Performing a full-text search The full-text search itself is performed with the CONTAINS predicate in the WHERE clause. Syntax: CONTAINS (column (column list) *, 'search expression' [, LANGUAGE 'language']) The column to be queried or a list of columns is passed as the first argument. In the latter case, the list must be limited by round brackets and the individual columns must be separated by commas. If an asterisk (*) is specified, all columns of the table registered for the full text search are searched. The actual search expression is expected as the second argument. It is always (!) Limited by single quotes. Optionally, the language can again be specified as the third argument. In the following, we will focus our attention on the search term Search for simple terms Simple words or phrases that are to be searched for are always enclosed in double quotes. Note that this does not mean that the single quotation marks that limit the entire search term are omitted! As an example, we search for the word "databases" in the columns "title" and "content" and again specify the language German for demonstration purposes: mssql> SELECT * FROM dbnews WHERE CONTAINS ((content, title), '"databases"' , LANGUAGE 'german');

16 If you only want to search for the beginning of a word, write an asterisk instead of the possible end of the word. This must be placed directly in front of the final apostrophe and masks any word characters. All words or phrases that begin with the text that precedes the asterisk are found. The following example searches for all data records that contain a word in the "content" column that begins with "Ver" or "ver" (case-insensitive): mssql> SELECT * FROM dbnews WHERE CONTAINS (content, ' "Ver *" '); Linking several terms Several words or phrases can be linked using the logical operators AND, OR, AND NOT. The combination with the prefix search is possible. As an example, all data records are to be found that contain "database" or "DBMS" in all indexed columns (in our case "title" and "content"): mssql> SELECT * FROM dbnews WHERE CONTAINS (*, '"database" OR " DBMS "'); In addition, two or more terms can be combined with NEAR or ~ (tilde). The associated terms must be close to one another, with the degree of correspondence increasing the closer they are. Now let's look for the word "training" near "SQLJ": mssql> SELECT * FROM dbnews WHERE CONTAINS (content, '"training" NEAR "SQLJ"'); Search for variants The particular strength of the SQL Server lies in the fact that it is also easy to find derivatives of a word or meaning equivalents. This can be achieved by using a term with the keyword FORMSOF in the search expression. Syntax: FORMSOF (INFLECTIONAL THESAURUS, "wort" [, "wort2", ...]) INFLECTIONAL is used when derivatives of a word (forms that result from the declination of a noun or the conjugation of a verb) should also be included in the search . If THESAURUS is specified, however, the internal thesaurus is used to find synonyms.In the following example the verbs "compare" and "appear" are to be found in all tenses and persons, including "appears", "compares" and "compared" etc .: mssql> SELECT * FROM dbnews WHERE CONTAINS (content, 'FORMSOF (INFLECTIONAL, "compare", "appear") '); Now let's use the thesaurus to find all entries with the word "database" or synonyms:

17 mssql> SELECT * FROM dbnews WHERE CONTAINS (content, 'FORMSOF (THESAURUS, "database")'); One possibility to find all variants in one query, i.e. both inflected forms and synonyms, is offered by the FREETEXT predicate. Syntax: FREETEXT (column (column list) *, 'words' [, LANGUAGE 'language']) The first and third arguments correspond to those of the CONTAINS predicate. There is a difference in the second argument. Only one word or several words separated by spaces are expected here in single quotation marks. The words themselves are not enclosed in double quotes, as is the case with CONTAINS. Other operators such as AND, OR etc. are also not allowed or are sorted out as so-called "noise words". The words passed as the second argument are separated using their spaces and inflected forms are created for each word and synonyms are searched for using the thesaurus. The search with FREETEXT is overall somewhat less precise than that with CONTAINS. As an example, let's find all records that contain "compare" or "database" or some variant of these two words. The specification of the language is only used here for clarification. mssql> SELECT * FROM dbnews WHERE FREETEXT (content, 'compare database', LANGUAGE 'german'); 6.4. Search with relevance calculation The CONTAINS and FREETEXT predicates return Boolean values ​​(TRUE or FALSE), can only be used in a WHERE or HAVING clause and are therefore not suitable for ranking according to relevance. If the relevance for each hit is to be calculated, you have to use two functions, CONTAINSTABLE and FREETEXTTABLE. Their syntax is very similar to that of CONTAINS or FREETEXT: CONTAINSTABLE (table, column (column list) *, 'search expression' [, LANGUAGE 'language'] [, top]) FREETEXTTABLE (table, column (column list) *, 'words' [, LANGUAGE 'language'] [, top]) The first parameter is the name of the table to be queried, and the optional fourth or third (if LANGUAGE is not specified) parameter is an integer that indicates how many hits should be returned. The hits that are most relevant are taken into account. The really special thing about these two functions is their return value: a table with the columns "key" and "rank". The column "rank" indicates the relevance with a number between 0 and 1000, where 1000 means the highest relevance. It is important that these are not absolute values, but must be seen in relation to the individual hits. The "key" column contains the key attribute of the hit. The two functions can now be used as follows: since they return a result set (i.e. a table), they must be specified in the FROM clause. Via the "key" column, it can be joined to the table that contains the actual data. An example should make this clear: we want the titles of the three articles from "dbnews"

18 which have the highest relevance when searching for the word "database". mssql> SELECT dbnews.titel, ct.rank FROM dbnews, CONTAINSTABLE (dbnews, content, '"database"', LANGUAGE 'german', 3) AS ct WHERE dbnews.id = ct. [key]; In the CONTAINSTABLE function, we specified that the "content" column in the "dbnews" table should be searched for the word "database". In addition to the German language, it is also specified that only the three most relevant titles should be output. This also creates a descending sorting according to "rank". The table returned by CONTAINSTABLE is given the alias name ct so that it can be used in the join condition without problems. The "key" column of the "ct" table is joined with the "id" column (primary key!) Of "dbnews". Because "key" is actually a reserved word, it must be put in square brackets here. Another example with the FREETEXTTABLE function: Here the articles are to be found which contain the word "database" or a variant of it. The result should be sorted in descending order of rank. The construction is analogous and does not need any further explanation: mssql> SELECT dbnews.titel, ft.rank FROM dbnews, FREETEXTTABLE (dbnews, content, 'database') AS ft WHERE dbnews.id = ft. [Key] ORDER BY ft. rank DESC; The online documentation gives the exact formulas according to which CONTAINSTABLE and FULLTEXTABLE calculate the relevance. In my work with the full-text search, the rankings, similar to the full-text search with MySQL, were unfortunately not always completely understandable. Syntax: ISABOUT ('concept' [WEIGHT (weighting)] [, 'concept2' [WEIGHT (weighting2)], ...]) A complete search expression, ie simple words, several linked words, search for prefixes, can be used as a "concept" or by variants. Any number of terms can be specified separated by commas. The individual terms are implicitly linked with OR, i.e. a match only has to be found for one of the terms and not for all. Specifying a weighting for each term is optional and is done using the WEIGHT keyword. Each term can be weighted with a floating point number between 0 and 1, with larger values ​​giving greater weight to a term. The notation of the decimal point is subject to local server settings. An English-language system uses a point as the decimal separator, whereas a comma is used on a German system. Since the comma normally separates elements of a list, double quotation marks are also necessary. If WEIGHT is omitted, the term automatically receives the highest weighting (1). Finally, a slightly more complex example. We want to find the articles that contain "Java", "Tutorial" or a synonym for them, or words that begin with "Training". "Tutorial" should be given more weight than "Training". When issuing

19 we are only interested in the top 10. mssql> SELECT dbnews.titel, ct.rank FROM dbnews, CONTAINSTABLE (dbnews, content, 'ISABOUT ("Java", FORMSOF (THESAURUS, "Tutorial") WEIGHT ("0.7"), "Training *" WEIGHT (" 0.2 ")) ', 10) AS ct WHERE dbnews.id = ct. [Key]; 7. Choosing the Appropriate Method We have now learned the most important techniques for an effective search in SQL. Ultimately, the question arises as to which one is best suited for which type of search. However, this question is very difficult to answer. In general, the simpler the procedure used, the more efficient it will be. So, if the circumstances permit, a search with LIKE is preferable to a search with REGEXP. The latter should be used when the search pattern is to be defined very precisely. It then delivers more useful and correct results, which can more than make up for the loss of performance. In the case of large databases with many millions of data records, however, there is hardly any way around the full text search. You can wait minutes for requests with LIKE or even with regular expressions. The full-text search, on the other hand, remains relatively high-performance and is easier to maintain than a manually maintained word index in a look-up table or similar jokes. It is also interesting to have the option of sorting the results according to relevance. However, the full-text search does not always provide as precise results as a query with LIKE or regular expressions. In addition, MySQL is limited to the MyISAM table format. The best method can hardly be determined ex ante. However, the database developer should be accountable for his decision. Ultimately, you will not (only) select a database system according to the search options, but according to other criteria such as integration into the existing IT, etc., and then you have to get along with what the DBMS of your choice offers. Christoph Bichlmeier chris 'at' bichlmeier 'dot' info Website: created on:, Update: