Friday, October 4, 2024

UPDATED: Unloading/Loading NLS (National Language Support) Data in Oracle

What is the problem?

Every Oracle database has 2 different character sets for "string" data types.  These character sets are defined whenever a database is created.

  • CHARACTER SET - This is the main character set for the database. It is set for specific environment and application needs.  This character set is normally used to store and handle all string data.
  • NATIONAL CHARACTER SET - This is the NLS (National Language Support) character set and is available for multi-lingual string data storage on all Oracle databases.  It can be set to one of these UNICODE character sets: "AL16UTF16" or "UTF8".
Oracle's DB Sample Schemas use the NLS character set for multi-lingual string data.  This data is loaded with SQL INSERT statements.  However, we have a need to unload and load this data to/from text files.

The problem is that text files will generally have string data for the database's main character set.  A different technique is needed to unload and load the NLS string data.  Oracle's DB Sample Schemas uses the UNISTR format to segregate NLS string data from string data that uses the database's main character set.

We will be using the UNISTR format as well.

How to Unload NLS String Data to a Text File?

The US7ASCII character set is a common subset of almost all main character sets for an Oracle database.  This is important because any NLS string data saved to a text file needs a common format with the database's main character set.  Also note the US7ASCII character set only uses the 7 LSBs (least significant bits) for each character.  Characters in this range can be found when Oracle's ASCII function returns decimal values between 0 and 127.  Additionally, the "\" character needs to be "escaped" with another "\" character.

We used the following function to encode database NLS string data to UNISTR format.  The resulting string data from this function will only contain US7ASCII characters and can be safely stored in a text file.  *UPDATE:* This function has been updated to accommodate "Supplementary characters ... high-surrogates ... and ... low-surrogates" as referenced in Oracle 21c Documentation  This multi-byte encoding was missing from the original post.  More documentation on this implementation is provided at ODBCapture Wiki Issue Z0034

function unistr_encode
      (in_nclob  in nclob)
   return clob
is
   hxstr  varchar2(20);
   acode  number;
   len    number := length(in_nclob);
   pos    number := 1;
   oclob  clob;
begin
   while pos <= len
   loop
      acode := ascii(substr(in_nclob,pos,2));  -- Allowance for multi-byte characters
      case
      when acode > 127
      then
         hxstr := to_char(acode,'FM0000000X');
         If acode > 65535
         then
            if acode < 16777216
            then
               raise_application_error(-20000, 'Unable to handle 3-byte character encoding');
            else
               -- Unicode Supplemental Characters
               oclob := oclob || '\' || substr(hxstr,1,4) || '\' || substr(hxstr,5,4);
               pos := pos + 2;    -- Must advance 2 positions for Unicode Supplemental Characters
            end if;
         else
            -- Unicode Characters
            oclob := oclob || '\' || substr(hxstr,5,4);
            pos := pos + 1;
         end if;
      when acode = 92 -- "\" Character
      then
         -- Escape the "\" Character
         oclob := oclob || '\\';
         pos := pos + 1;
      else
         -- US7ASCII Character
         oclob := oclob || chr(acode);
         pos := pos + 1;
      end case;
   end loop;
   return oclob;
exception when OTHERS then
   raise_application_error(-20000, 'UNISTR Encoding Error. len: ' || len   ||
                                                        ', pos: ' || pos   ||
                                                      ', acode: ' || acode ||
                                                      ', hxstr: ' || hxstr || LF ||
                                                                  SQLERRM  || LF ||
                                             dbms_utility.format_error_backtrace );
end unistr_encode;

How to Load NLS String Data from a Text File

We chose to unload our data to a CSV file.  We also chose to load our data from the CSV file using SQL*Loader.  Below is our control file to load the OE.PRODUCT_DESCRIPTIONS data from Oracle's DB Sample Schemas.
  1. Our CSV file includes a Header Record with the names of the columns in the file.
  2. We don't know why the "to_char" conversion is required for the UNISTR function.  However, if it is omitted, the error "ORA-30186: '/' must be followed by four hexdecimal characters or another '/'" will be thrown.
  3. 5x storage space is allocated for the TRANSLATED_NAME and TRANSLATED_DESCRIPTION fields to allow for UNISTR encoding characters.
OPTIONS (SKIP=1)
LOAD DATA
APPEND INTO TABLE "OE"."PRODUCT_DESCRIPTIONS"
FIELDS CSV WITH EMBEDDED
TRAILING NULLCOLS
   (PRODUCT_ID                     FLOAT EXTERNAL
   ,LANGUAGE_ID                    CHAR(3)
   ,TRANSLATED_NAME                CHAR(250) "UNISTR(to_char(:TRANSLATED_NAME))"
   ,TRANSLATED_DESCRIPTION         CHAR(10000) "UNISTR(to_char(:TRANSLATED_DESCRIPTION))"
   )

The above example will not work with an NCLOB data type because the UNISTR function has a 16383 size limit in PL/SQL (4000 size limit in SQL).  For an NCLOB column, a procedure is needed to  break up the data into smaller chunks for UNISTR.  In the following example, the AD_FLTEXTN column in the PM.PRINT_MEDIA table is used because it is an NCLOB data type.

declare
   g_nclob  nclob;
   cnt      pls_integer := 0;
   -- Convert UNISTR encoding to National Character Set
   function unistr_decode(in_nclob in nclob) return nclob
   is
      MAXLEN    number := 16000;
      l_nclob   nclob;
      len       number := length(in_nclob);
      subpos    number := 1;
      slashpos  number;
      numchars  number;
   begin
      while subpos <= len - MAXLEN
      loop
         -- Search for "\" characters in the last 5 positions before MAXLEN
         slashpos := instr(substr(in_nclob, subpos-1 + MAXLEN-4, 5), '\');
         if slashpos > 0
         then
            -- Check for Escape Character before slashpos (another slash)
            if    slashpos = 1
              AND substr(in_nclob, subpos-1 + MAXLEN-4-1, 1) = '\'
            then
               -- Escape Character Found Before, cut off before Escape Character
               numchars := MAXLEN-4 - 1 - 1;
               l_nclob := l_nclob || UNISTR(to_char(substr(in_nclob, subpos, numchars)));
            -- Check for Escape Character after slashpos (another slash)
            elsif substr(in_nclob, subpos-1 + MAXLEN-4 + slashpos-1 + 1, 1) = '\'
            then
               -- Escape Character Found After, cut off before slashpos
               numchars := MAXLEN-4 + slashpos-1 - 1;
               l_nclob := l_nclob || UNISTR(to_char(substr(in_nclob, subpos, numchars)));
            -- Check the slashpos - 5 (+1 Character) for a "Unescaped" Unicode Supplemental Character
            elsif regexp_like(substr(in_nclob, subpos-1 + MAXLEN-4 + slashpos-1 - 5, 6)
                             ,'^[^\][\]D(8|9|A|B)[0-9][0-9]$')
                             --   ^          - Anchored to start of line
                             --   [^\]       - Not a backslash character
                             --   [\]        - A backslash character
                             --   D(8|9|A|B) - Either D8, D9, DA, or DB
                             --   [0-9][0-9] - 2 digit number
                             --   $          - Anchored to end of line
            then
               -- Unicode Supplemental Character Found, Cut off before Supplemental
               numchars := MAXLEN-4 + slashpos-1 - 5 - 1;
               l_nclob := l_nclob || UNISTR(to_char(substr(in_nclob, subpos, numchars)));
            else
               -- No Escape Character or Supplemental Character found, Cut off before slashpos
               numchars := MAXLEN-4 + slashpos-1 - 1;
               l_nclob := l_nclob || UNISTR(to_char(substr(in_nclob, subpos, numchars)));
            end if;
         else
            -- Cut off at MAXLEN
            numchars := MAXLEN;
            l_nclob := l_nclob || UNISTR(to_char(substr(in_nclob, subpos, numchars)));
         end if;
         subpos := subpos + numchars;
      end loop;
      return l_nclob || UNISTR(to_char(substr(in_nclob, subpos, MAXLEN)));
   exception when OTHERS then
      raise_application_error(-20000, 'UNISTR Encoding Error. len: ' || len      ||
                                                        ', subpos: ' || subpos   ||
                                                      ', slashpos: ' || slashpos ||
                                                      ', numchars: ' || numchars ||
                                                           ', cnt: ' || cnt      || CHR(10) ||
                                                                        SQLERRM  || CHR(10) ||
                                                        dbms_utility.format_error_backtrace );
   end unistr_decode;
begin
   dbms_output.put_line('Decode UNISTR Data');
   dbms_output.put_line('------------------');
   -- Decode UNISTR Data for column AD_FLTEXTN
   for buff in (select ROWID RID, "AD_FLTEXTN"
                 from  "PM"."PRINT_MEDIA"
                 where "AD_FLTEXTN" is not null
                 for update of "AD_FLTEXTN")
   loop
      g_nclob := unistr_decode(buff."AD_FLTEXTN");
      -- This overwrites the UNISTR Encoded String with the original UNICODE data
      update "PM"."PRINT_MEDIA"
        set  "AD_FLTEXTN" = g_nclob
       where rowid = buff.rid;
      cnt := cnt + 1;
   end loop;
   dbms_output.put_line(cnt || ' "PRINT_MEDIA" updates.');
end;
/


No comments:

Post a Comment