Systems Engineering and RDBMS

Preserving multi-byte characters in a DML Script

Posted by decipherinfosys on March 15, 2007

With the growth of the global economy, and companies expanding their horizons to the far east, we are having to deal more and more multi-byte character sets. It is becoming more common that applications are being designed with more than just western languages (English, German, French) in mind. On the same note, it is becoming more common that DBA’s (western DBA’s) are having to deal with more than just single byte language characters.

Multi-bytes character languages offer up an array of challenges for those of us that are beginning deal with them. One of those challenges is transferring data from multi-byte character compatible machine to another. You can do this easily using import and export, but what if what you really want is an easy to run DML script? It seems that would be no problem, however what if this script gets saved to a machine that does not support multi-byte character sets before it reaches it’s finally destination? Data corruption will occur. The muti-byte characters will not be preserved and the data in that DML script will have become corrupted.

In this blog post, we are going to provide a solution to overcome this. The post takes the example of Oracle and in the end, we will mention the functions that can be used to do the same in SQL Server and DB2 LUW.

This solution makes use of the unistr() and ascistr() functions built into Oracle.

ascistr() – takes as its argument a string in any character set and returns an ASCII version of the string. Non-ASCII characters are converted to the form \xxxx, where xxxx represents a UTF-16 code unit.

unistr() – provides support for Unicode string literals by letting you specify the Unicode encoding value of characters in the string. It converts the UTF-16 code units (\xxxx, where the “\” escapes the unit) to the national character set.

This is useful, because it will allow us to encode our multi-byte character (Chinese for example) into UTF-16 encoded representatives. Which will look something like \76EE. In simpler terms, this is single byte character representation of a multi-byte character. It is generated when a non-ascii character is read by the ascistr function. To transform it back to it’s non-ascii value, you will need to use the unistr. However, if you use unistr on a single byte character machine, you will get strange characters, such as an upside down question mark. It is important that the language and character semantics on the source database and destination database be similar or compatible.

Below is a simple example of using these functions to generate a DML that will preserve the multi-byte data.

–*******************************************
–SET UP FOR THE EXAMPLE
–*******************************************

–CREATE A TABLE TO HOLD CHINESE DATA.
create table chinese_words
(col1 varchar2(100));

–CREATE SOME CHINESE DATA.
insert into chinese_words (col1) values(unistr(‘\76EE\7684\5730 is chinese’));

–at this point, you should be able to query from the chinese_words —–table and see the chinese data.

–====================================================================

–*****************************************************************
–example of how to extract the chinese data into a dml statement.
–*****************************************************************

–generate the insert statement for the chinese data.
select ‘insert into chinese_words ‘ || asciistr(col1) || ‘from chinese_data;’ from chinese_words;

–the result you get is the following. this is what
–you would run on the destination database. if you have multiple
–rows, then you will see more insert statements. this can also be done
–using an explicit cursor or cursor for loop for more sophisticated –
–scenarios.

insert into chinese_words (col1)
values(unistr(‘\76EE\7684\5730 is chinese));

You can achieve the same thing in SQL Server using the NCHAR() and UNICODE() functions and using the GRAPHIC and VARGRAPHIC() functions in DB2 LUW.

DB2 supports two types of hexadecimal graphic string constants:

UX’xxxx’ represents a string of graphic Unicode UTF-16 characters, where x is a hexadecimal digit. The number of digits must be a multiple of 4 and must not exceed 32704. Each group of 4 digits represents a single UTF-16 graphic character. For example, the UX constant for ‘ABC’ is UX’004100420043′.

GX’xxxx’ represents a string of graphic characters, where x is a hexadecimal digit. The number of digits must be a multiple of 4. Each group of 4 digits represents a single double-byte graphic character. The hexadecimal shift-in and shift-out (‘OE’X and ‘OF’X), which apply to EBCDIC only, are not included in the string.
If the MIXED DATA install option is set to NO, a GX constant cannot be used. Instead, a UX constant should be used. A GX constant cannot be used when the encoding scheme is UNICODE.

Sorry, the comment form is closed at this time.

 
%d bloggers like this: