Relational Design Generation: UPDATED: Unloading/Loading NLS (National Language Support) Data in Oracle

What is the problem?

Every Oracle database has 2 different character sets for "string" data types. These character sets are defined whenever a database is created.

CHARACTER SET - This is the main character set for the database. It is set for specific environment and application needs. This character set is normally used to store and handle all string data.
NATIONAL CHARACTER SET - This is the NLS (National Language Support) character set and is available for multi-lingual string data storage on all Oracle databases. It can be set to one of these UNICODE character sets: "AL16UTF16" or "UTF8".

Oracle's DB Sample Schemas use the NLS character set for multi-lingual string data. This data is loaded with SQL INSERT statements. However, we have a need to unload and load this data to/from text files.

The problem is that text files will generally have string data for the database's main character set. A different technique is needed to unload and load the NLS string data. Oracle's DB Sample Schemas uses the UNISTR format to segregate NLS string data from string data that uses the database's main character set.

We will be using the UNISTR format as well.

How to Unload NLS String Data to a Text File?

The US7ASCII character set is a common subset of almost all main character sets for an Oracle database. This is important because any NLS string data saved to a text file needs a common format with the database's main character set. Also note the US7ASCII character set only uses the 7 LSBs (least significant bits) for each character. Characters in this range can be found when Oracle's ASCII function returns decimal values between 0 and 127. Additionally, the "\" character needs to be "escaped" with another "\" character.

We used the following function to encode database NLS string data to UNISTR format. The resulting string data from this function will only contain US7ASCII characters and can be safely stored in a text file. *UPDATE:* This function has been updated to accommodate "Supplementary characters ... high-surrogates ... and ... low-surrogates" as referenced in Oracle 21c Documentation This multi-byte encoding was missing from the original post. More documentation on this implementation is provided at ODBCapture Wiki Issue Z0034

function unistr_encode
(in_nclob in nclob)
return clob
is
hxstr varchar2(20);
acode number;
len number := length(in_nclob);
pos number := 1;
oclob clob;
begin
while pos <= len
loop
acode := ascii(substr(in_nclob,pos,2)); -- Allowance for multi-byte characters
case
when acode > 127
then
hxstr := to_char(acode,'FM0000000X');
If acode > 65535
then
if acode < 16777216
then
raise_application_error(-20000, 'Unable to handle 3-byte character encoding');
else
-- Unicode Supplemental Characters
oclob := oclob || '\' || substr(hxstr,1,4) || '\' || substr(hxstr,5,4);
pos := pos + 2; -- Must advance 2 positions for Unicode Supplemental Characters
end if;
else
-- Unicode Characters
oclob := oclob || '\' || substr(hxstr,5,4);
pos := pos + 1;
end if;
when acode = 92 -- "\" Character
then
-- Escape the "\" Character
oclob := oclob || '\\';
pos := pos + 1;
else
-- US7ASCII Character
oclob := oclob || chr(acode);
pos := pos + 1;
end case;
end loop;
return oclob;
exception when OTHERS then
raise_application_error(-20000, 'UNISTR Encoding Error. len: ' || len ||
', pos: ' || pos ||
', acode: ' || acode ||
', hxstr: ' || hxstr || LF ||
SQLERRM || LF ||
dbms_utility.format_error_backtrace );
end unistr_encode;

How to Load NLS String Data from a Text File

We chose to unload our data to a CSV file. We also chose to load our data from the CSV file using SQL*Loader. Below is our control file to load the OE.PRODUCT_DESCRIPTIONS data from Oracle's DB Sample Schemas.

Our CSV file includes a Header Record with the names of the columns in the file.
We don't know why the "to_char" conversion is required for the UNISTR function. However, if it is omitted, the error "ORA-30186: '/' must be followed by four hexdecimal characters or another '/'" will be thrown.
5x storage space is allocated for the TRANSLATED_NAME and TRANSLATED_DESCRIPTION fields to allow for UNISTR encoding characters.

OPTIONS (SKIP=1)
LOAD DATA
APPEND INTO TABLE "OE"."PRODUCT_DESCRIPTIONS"
FIELDS CSV WITH EMBEDDED
TRAILING NULLCOLS
(PRODUCT_ID FLOAT EXTERNAL
,LANGUAGE_ID CHAR(3)
,TRANSLATED_NAME CHAR(250) "UNISTR(to_char(:TRANSLATED_NAME))"
,TRANSLATED_DESCRIPTION CHAR(10000) "UNISTR(to_char(:TRANSLATED_DESCRIPTION))"
)

The above example will not work with an NCLOB data type because the UNISTR function has a 16383 size limit in PL/SQL (4000 size limit in SQL). For an NCLOB column, a procedure is needed to break up the data into smaller chunks for UNISTR. In the following example, the AD_FLTEXTN column in the PM.PRINT_MEDIA table is used because it is an NCLOB data type.

declare
g_nclob nclob;
cnt pls_integer := 0;
-- Convert UNISTR encoding to National Character Set
function unistr_decode(in_nclob in nclob) return nclob
is
MAXLEN number := 16000;
l_nclob nclob;
len number := length(in_nclob);
subpos number := 1;
slashpos number;
numchars number;
begin
while subpos <= len - MAXLEN
loop
-- Search for "\" characters in the last 5 positions before MAXLEN
slashpos := instr(substr(in_nclob, subpos-1 + MAXLEN-4, 5), '\');
if slashpos > 0
then
-- Check for Escape Character before slashpos (another slash)
if slashpos = 1
AND substr(in_nclob, subpos-1 + MAXLEN-4-1, 1) = '\'
then
-- Escape Character Found Before, cut off before Escape Character
numchars := MAXLEN-4 - 1 - 1;
l_nclob := l_nclob || UNISTR(to_char(substr(in_nclob, subpos, numchars)));
-- Check for Escape Character after slashpos (another slash)
elsif substr(in_nclob, subpos-1 + MAXLEN-4 + slashpos-1 + 1, 1) = '\'
then
-- Escape Character Found After, cut off before slashpos
numchars := MAXLEN-4 + slashpos-1 - 1;
l_nclob := l_nclob || UNISTR(to_char(substr(in_nclob, subpos, numchars)));
-- Check the slashpos - 5 (+1 Character) for a "Unescaped" Unicode Supplemental Character
elsif regexp_like(substr(in_nclob, subpos-1 + MAXLEN-4 + slashpos-1 - 5, 6)
,'^[^\][\]D(8|9|A|B)[0-9][0-9]$')
-- ^ - Anchored to start of line
-- [^\] - Not a backslash character
-- [\] - A backslash character
-- D(8|9|A|B) - Either D8, D9, DA, or DB
-- [0-9][0-9] - 2 digit number
-- $ - Anchored to end of line
then
-- Unicode Supplemental Character Found, Cut off before Supplemental
numchars := MAXLEN-4 + slashpos-1 - 5 - 1;
l_nclob := l_nclob || UNISTR(to_char(substr(in_nclob, subpos, numchars)));
else
-- No Escape Character or Supplemental Character found, Cut off before slashpos
numchars := MAXLEN-4 + slashpos-1 - 1;
l_nclob := l_nclob || UNISTR(to_char(substr(in_nclob, subpos, numchars)));
end if;
else
-- Cut off at MAXLEN
numchars := MAXLEN;
l_nclob := l_nclob || UNISTR(to_char(substr(in_nclob, subpos, numchars)));
end if;
subpos := subpos + numchars;
end loop;
return l_nclob || UNISTR(to_char(substr(in_nclob, subpos, MAXLEN)));
exception when OTHERS then
raise_application_error(-20000, 'UNISTR Encoding Error. len: ' || len ||
', subpos: ' || subpos ||
', slashpos: ' || slashpos ||
', numchars: ' || numchars ||
', cnt: ' || cnt || CHR(10) ||
SQLERRM || CHR(10) ||
dbms_utility.format_error_backtrace );
end unistr_decode;
begin
dbms_output.put_line('Decode UNISTR Data');
dbms_output.put_line('------------------');
-- Decode UNISTR Data for column AD_FLTEXTN
for buff in (select ROWID RID, "AD_FLTEXTN"
from "PM"."PRINT_MEDIA"
where "AD_FLTEXTN" is not null
for update of "AD_FLTEXTN")
loop
g_nclob := unistr_decode(buff."AD_FLTEXTN");
-- This overwrites the UNISTR Encoded String with the original UNICODE data
update "PM"."PRINT_MEDIA"
set "AD_FLTEXTN" = g_nclob
where rowid = buff.rid;
cnt := cnt + 1;
end loop;
dbms_output.put_line(cnt || ' "PRINT_MEDIA" updates.');
end;
/

Relational Design Generation

Friday, October 4, 2024

UPDATED: Unloading/Loading NLS (National Language Support) Data in Oracle

What is the problem?

How to Unload NLS String Data to a Text File?

How to Load NLS String Data from a Text File

No comments:

Post a Comment

Blog Archive

Labels