Working with UNICODE in InterBase/Firebird

Dmitri Kovalenko, LCPI

dima@lcpi.lipetsk.ru

What is UNICODE_FSS?

It is an InterBase codepage (often called UTF-8), which displays double-byte and four-byte UNICODE characters (UCS-2 e UCS-4, respectively) in character strings from 1 to 6 bytes. What is it intended for?

  • It provides transport for UNICODE texts based on regular ASCII text.
  • Data packing. The characters with codes less than 128 are, as usual, represented as one byte

Interconversion of UTF-8 and UCS-2 (UCS-4) files is based on use of the following table:

Table 1 Converting UCS-2 to UTF-8

Bits Hex Min Hex Max UTF-8 Binary Encoding
7 00000000 0000007F 0xxxxxxx
11 00000080 000007FF 110xxxxx 10xxxxxx
16 00000800 0000FFFF 1110xxxx 10xxxxxx 10xxxxxx
21 00010000 001FFFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
26 00200000 03FFFFFF 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
31 04000000 7FFFFFFF 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

UNICODE (the text format one usually deals with, when programming for Windows) uses double-byte character set (UCS-2). As evident from the table above, 1- 3 bytes would be enough for presentation of the UNICODE characters with a code from 0×0000 to 0xFFFF in UTF-8.

InterBase proceeds from the fact that UNICODE characters within the range [0×0000,0xFFFF] will be stored in UTF-8. Therefore, when specifying size for storing text data with UNICODE_FSS codepage, the number of characters is multiplied by 3 bytes. In addition, all current versions of InterBase, when working with CHAR/VARCHAR data in data presentation, controls number of bytes, not of characters. That is why, for example, one can enter up to 9 single-byte characters to the CHAR(3) column with UNICODE_FSS codepage.

Example

CREATE TABLE TESTUTF(TESTFIELD CHAR(3));
INSERT INTO TESTUTF (TESTFIELD) VALUES('123456789'); -- NO EXCEPTION!

The main disadvantage of working with UTF-8 is the impossibility to detect the number of symbols in a string without viewing it.

That is why, for better performance, it is recommended to recode text data in UTF- 8 format to a codepage with characters of fixed size (such as UCS-2).

Creation of database using UNICODE_FSS

Nothing out of the way, everything is as usual:

CREATE DATABASE ... DEFAULT CHARACTER SET UNICODE_FSS.

Connecting to a database with a UNICODE_FSS codepage

As is well known, when creating a database, one can use one codepage, and when connecting to a database, can use a different codepage can be used. It’s partly true since a server when interacting with a client, tries to recode text data to a coding the client wishes to use.

However, this mechanism has constraints and exceptions, which are described below. That is why we recommend not to experiment with it. When connecting to a database, specify the codepage, which was used during database creation.

In our case, it is UNICODE_FSS. This means that we want to get text data in UTF-8 format from the database, and the text will pass the text in the same format.

Working with database

Exchange with server uses the following data categories:

  • General text fields with CHAR and VARCHAR types
  • BLOB text fields
  • Arrays
  • SQL query text.

Text fields

When reading text fields, the server requires providing a buffer: (max character size)*(number of characters). This is the value, which will be put in XSQLVAR.sqllen after data access query text is prepared. A user does not have to worry about calculation of size needed for text field data in the clipboard. We recommend using the value specified in the definition of the field. However, do not forget that, for the VARCHAR columns (SQL_VARYING type), one should add 2 bytes to the specified value, in order to reserve some space for column length indicator. The returned value will be expressed in bytes.

TIP: By the way, one of the undocumented features of InterBase is that for text fields in XSQLVAR.subtype the number of the field codepage is specified in the lower byte, while the collate number is specified in the upper byte.

When recording text fields, you convert a text to UTF-8, and then handle it as an ordinary record.

Buffer size and line length parameters are expressed in bytes.

If the field codepage differs from the codepage of connection to database, then the server performs conversion of incoming and outgoing data.

However, for all that, buffer size will be computed according to the column codepage. Thus, if the codepage of the win1252 field and UNICODE_FSS connection is used, and no other additional operations are launched, then you will probably receive the “Cannot transliterate characters between character sets” error message.

The thing is that in win1252 all symbols are single-byte, and buffer for such column will be required reasoning from 1 byte for 1 character. When recoding to UTF-8, characters with a code more than 127 will become at least double- byte, and this may result in overflow.

So please consider it, if you want to use connections with several codepages in the context of a single database.

BLOB text fields

As is well known, the link to data is stored in the InterBase record, which contains a BLOB-field (subtype independent). The data are stored separately and handled by InterBase API functions, designed specially for working with BLOB.

That is to say, BLOB-fields data are read and written with separate calls of API. That is why the rules of working with text information in BLOB differ from the ones of working with CHAR/VARCHAR.

The basic rule is: take care of yourself. By default, the server does not interpret contents of BLOB-fields, and treats them as ordinary binary information. Thus, the client becomes responsible for data recoding.

Usually it is enough to use UNICODE_FSS coding on the client, and not use UNICODE_FSS and onebyte coding simultaneously, since conversion to one-byte coding can cause data loss.

Arrays

UNICODE support in text arrays is similar to the one for text fields.

As in the text fields case, the server, when working with text arrays, operates on the byte-level, not on the level of characters. Therefore, the number of characters in a string written in an array cell, may exceed the length specified when the text array column had been created.

In exactly the same way, the server supports recoding of input and output arrays data, taking into account the connection codepage.

SQL query text

Strange as it may seem at first sight, SQL query text is also must use a codepage for connection to a database. The point is that a SQL query text is one of the methods of parameter values explicit transfer.

There are no severe limitations, except for general length of an SQL query, which is 64K. When converting a query containing national coding characters to UTF-8, the resulting text may be larger than the source one, and thus it would exceed 64K limit.

Access components for working with UNICODE_FSS

Generally speaking, support of a specific codepage means that the access to database component is able to ensure client’s work with other codepage. To do that, it is enough to guarantee conversion of text data (at that, UCS-2, an intermediate format is used). Conversion of text columns, arrays and SQL- queries is not a problem at all, but BLOB fields conversion is quite a laborious task, especially when accessing the BLOB field data through a stream mechanism.

Theoretically, client application, of course, should not depend on the codepages’ differences. Therefore, access components must block all possible ways of text data transfer, and provide necessary information recoding.

One Response to “Working with UNICODE in InterBase/Firebird”

  1. Firebird News » All you didn’t want to know about Firebird’s Unicode Says:

    […] Dmitri Kovalenko, LCPI wrote about unicode in September issue of The Firebird Developer Magazine:Working with UNICODE in InterBase/Firebird […]

Leave a Reply

You must be logged in to post a comment.