What is the problem?
Every Oracle database has 2 different character sets for "string" data types. These character sets are defined whenever a database is created.
- CHARACTER SET - This is the main character set for the database. It is set for specific environment and application needs. This character set is normally used to store and handle all string data.
- NATIONAL CHARACTER SET - This is the NLS (National Language Support) character set and is available for multi-lingual string data storage on all Oracle databases. It can be set to one of these UNICODE character sets: "AL16UTF16" or "UTF8".
The problem is that text files will generally have string data for the database's main character set. A different technique is needed to unload and load the NLS string data. Oracle's DB Sample Schemas uses the UNISTR format to segregate NLS string data from string data that uses the database's main character set.
We will be using the UNISTR format as well.
How to Unload NLS String Data to a Text File?
The US7ASCII character set is a common subset of almost all main character sets for an Oracle database. This is important because any NLS string data saved to a text file needs a common format with the database's main character set. Also note the US7ASCII character set only uses the 7 LSBs (least significant bits) for each character. Characters in this range can be found when Oracle's ASCII function returns decimal values between 0 and 127. Additionally, the "\" character needs to be "escaped" with another "\" character.
We used the following function to encode database NLS string data to UNISTR format. The resulting string data from this function will only contain US7ASCII characters and can be safely stored in a text file. *UPDATE:* This function has been updated to accommodate "Supplementary characters ... high-surrogates ... and ... low-surrogates" as referenced in Oracle 21c Documentation This multi-byte encoding was missing from the original post. More documentation on this implementation is provided at ODBCapture Wiki Issue Z0034
function unistr_encode(in_nclob in nclob)return clobishxstr varchar2(20);acode number;len number := length(in_nclob);pos number := 1;oclob clob;beginwhile pos <= lenloopacode := ascii(substr(in_nclob,pos,2)); -- Allowance for multi-byte characterscasewhen acode > 127thenhxstr := to_char(acode,'FM0000000X');If acode > 65535thenif acode < 16777216thenraise_application_error(-20000, 'Unable to handle 3-byte character encoding');else-- Unicode Supplemental Charactersoclob := oclob || '\' || substr(hxstr,1,4) || '\' || substr(hxstr,5,4);pos := pos + 2; -- Must advance 2 positions for Unicode Supplemental Charactersend if;else-- Unicode Charactersoclob := oclob || '\' || substr(hxstr,5,4);pos := pos + 1;end if;when acode = 92 -- "\" Characterthen-- Escape the "\" Characteroclob := oclob || '\\';pos := pos + 1;else-- US7ASCII Characteroclob := oclob || chr(acode);pos := pos + 1;end case;end loop;return oclob;exception when OTHERS thenraise_application_error(-20000, 'UNISTR Encoding Error. len: ' || len ||', pos: ' || pos ||', acode: ' || acode ||', hxstr: ' || hxstr || LF ||SQLERRM || LF ||dbms_utility.format_error_backtrace );end unistr_encode;
How to Load NLS String Data from a Text File
We chose to unload our data to a CSV file. We also chose to load our data from the CSV file using SQL*Loader. Below is our control file to load the OE.PRODUCT_DESCRIPTIONS data from Oracle's DB Sample Schemas.
- Our CSV file includes a Header Record with the names of the columns in the file.
- We don't know why the "to_char" conversion is required for the UNISTR function. However, if it is omitted, the error "ORA-30186: '/' must be followed by four hexdecimal characters or another '/'" will be thrown.
- 5x storage space is allocated for the TRANSLATED_NAME and TRANSLATED_DESCRIPTION fields to allow for UNISTR encoding characters.
OPTIONS (SKIP=1)LOAD DATAAPPEND INTO TABLE "OE"."PRODUCT_DESCRIPTIONS"FIELDS CSV WITH EMBEDDEDTRAILING NULLCOLS(PRODUCT_ID FLOAT EXTERNAL,LANGUAGE_ID CHAR(3),TRANSLATED_NAME CHAR(250) "UNISTR(to_char(:TRANSLATED_NAME))",TRANSLATED_DESCRIPTION CHAR(10000) "UNISTR(to_char(:TRANSLATED_DESCRIPTION))")
The above example will not work with an NCLOB data type because the UNISTR function has a 16383 size limit in PL/SQL (4000 size limit in SQL). For an NCLOB column, a procedure is needed to break up the data into smaller chunks for UNISTR. In the following example, the AD_FLTEXTN column in the PM.PRINT_MEDIA table is used because it is an NCLOB data type.
declareg_nclob nclob;cnt pls_integer := 0;-- Convert UNISTR encoding to National Character Setfunction unistr_decode(in_nclob in nclob) return nclobisMAXLEN number := 16000;l_nclob nclob;len number := length(in_nclob);subpos number := 1;slashpos number;numchars number;beginwhile subpos <= len - MAXLENloop-- Search for "\" characters in the last 5 positions before MAXLENslashpos := instr(substr(in_nclob, subpos-1 + MAXLEN-4, 5), '\');if slashpos > 0then-- Check for Escape Character before slashpos (another slash)if slashpos = 1AND substr(in_nclob, subpos-1 + MAXLEN-4-1, 1) = '\'then-- Escape Character Found Before, cut off before Escape Characternumchars := MAXLEN-4 - 1 - 1;l_nclob := l_nclob || UNISTR(to_char(substr(in_nclob, subpos, numchars)));-- Check for Escape Character after slashpos (another slash)elsif substr(in_nclob, subpos-1 + MAXLEN-4 + slashpos-1 + 1, 1) = '\'then-- Escape Character Found After, cut off before slashposnumchars := MAXLEN-4 + slashpos-1 - 1;l_nclob := l_nclob || UNISTR(to_char(substr(in_nclob, subpos, numchars)));-- Check the slashpos - 5 (+1 Character) for a "Unescaped" Unicode Supplemental Characterelsif regexp_like(substr(in_nclob, subpos-1 + MAXLEN-4 + slashpos-1 - 5, 6),'^[^\][\]D(8|9|A|B)[0-9][0-9]$')-- ^ - Anchored to start of line-- [^\] - Not a backslash character-- [\] - A backslash character-- D(8|9|A|B) - Either D8, D9, DA, or DB-- [0-9][0-9] - 2 digit number-- $ - Anchored to end of linethen-- Unicode Supplemental Character Found, Cut off before Supplementalnumchars := MAXLEN-4 + slashpos-1 - 5 - 1;l_nclob := l_nclob || UNISTR(to_char(substr(in_nclob, subpos, numchars)));else-- No Escape Character or Supplemental Character found, Cut off before slashposnumchars := MAXLEN-4 + slashpos-1 - 1;l_nclob := l_nclob || UNISTR(to_char(substr(in_nclob, subpos, numchars)));end if;else-- Cut off at MAXLENnumchars := MAXLEN;l_nclob := l_nclob || UNISTR(to_char(substr(in_nclob, subpos, numchars)));end if;subpos := subpos + numchars;end loop;return l_nclob || UNISTR(to_char(substr(in_nclob, subpos, MAXLEN)));exception when OTHERS thenraise_application_error(-20000, 'UNISTR Encoding Error. len: ' || len ||', subpos: ' || subpos ||', slashpos: ' || slashpos ||', numchars: ' || numchars ||', cnt: ' || cnt || CHR(10) ||SQLERRM || CHR(10) ||dbms_utility.format_error_backtrace );end unistr_decode;begindbms_output.put_line('Decode UNISTR Data');dbms_output.put_line('------------------');-- Decode UNISTR Data for column AD_FLTEXTNfor buff in (select ROWID RID, "AD_FLTEXTN"from "PM"."PRINT_MEDIA"where "AD_FLTEXTN" is not nullfor update of "AD_FLTEXTN")loopg_nclob := unistr_decode(buff."AD_FLTEXTN");-- This overwrites the UNISTR Encoded String with the original UNICODE dataupdate "PM"."PRINT_MEDIA"set "AD_FLTEXTN" = g_nclobwhere rowid = buff.rid;cnt := cnt + 1;end loop;dbms_output.put_line(cnt || ' "PRINT_MEDIA" updates.');end;/
No comments:
Post a Comment