GT.M Version 5.2-000 Technical Bulletin

GT.M Support for the Unicode Standard

Legal Notice

GT.M™ is a trademark of Fidelity National Information Services, Inc.

Unicode™ - "Unicode is a trademark of Unicode, Inc."

Unicode® - "Unicode is a registered trademark of Unicode, Inc."

GT.M and its documentation are provided pursuant to license agreements containing restrictions on their use. They are the copyrighted intellectual property of Fidelity National Information Services, Inc. and Sanchez Computer Associates, LLC (collectively "Fidelity") and are protected by U.S. copyright law. They may not be copied or distributed in any form or medium, disclosed to third parties, or used in any manner not authorized in said license agreement except with prior written authorization from Fidelity.

This document contains a description of Fidelity products and the operating instructions pertaining to the various functions that comprise the system. It should not be construed as a commitment of Fidelity. Fidelity believes the information in this publication is accurate as of its publication date; such information is subject to change without notice. Fidelity is not responsible for any inadvertent errors.

August 16, 2010

Revision History
Revision 1.3 August 20, 2010

In Limitations, replaced "UTF-16 is not supported for $PRINCIPAL device" with a new section called "$PRINCIPAL device encoding is determined at process startup".

Revision 1.2 March 22, 2010

Fixed broken links to GT.M Programmers Guide.

Revision 1.1 January 19, 2007

First Published Version

Contact Information

GT.M Group
Fidelity National Information Services, Inc.
2 West Liberty Boulevard, Suite 300
Malvern, PA 19355
United States of America

GT.M Support: +1 (610) 578-4226
Switchboard: +1 (610) 296-8877
Fax: +1 (484) 595-5101
Website: http://fis-gtm.com
Email: gtmsupport@fnis.com


Table of Contents

Introduction
Theory of Operation
Philosophy
What is a character? A glyph or a Unicode™ code-point?
ICU
M Language
M and UTF-8 mode
Pattern Match Operator (?)
Commands
I/O Commands
String Processing functions
$Z Equivalent Functions
New $Z Functions
Intrinsic Special Variables
User-defined Collation
Compiling and Linking
Environment variables
Utility Programs
GDE
MUPIP
DSE & LKE
M Utility Routines
Discussion and Best Practices
Data interchange
Limitations
Performance and Capacity
Maximums
Ten Golden Rules

Introduction

Starting with the V5.2-000 release, GT.M provides support for Unicode™ version 5.0.0. Releases of the Unicode™ Standard and releases of ISO/IEC-10646 track each other (see http://www.unicode.org/faq/unicode_iso.html for more information).

The objective of this technical bulletin is to describe the enhancements to GT.M language features and utility programs in V5.2-000 using practical examples, discussion summaries, and best practices.

An understanding of Unicode™ and GT.M is a prerequisite to using the Unicode™-related features of GT.M. For information on Unicode™, refer to:

This technical bulletin has five parts:

  • Theory of Operation: This section explains the philosophy behind support for Unicode™ on GT.M and summarizes the enhancements that support it, especially the concept that there is no change to the GT.M database engine and that the Unicode™-related functionality of GT.M is simply another way to interpret the strings of bytes stored in the database files.

  • M Language: This section covers the enhancements to M Language Commands, String Processing Functions, and explains how GT.M works with the UTF-8 character set. It describes Unicode™ strings, I/O, and so on. Together with Theory of Operation and Utility Programs, this section provides information application developers need to develop applications using Unicode™.

  • Utility Programs: This section covers changes in MUPIP, DSE, and LKE.

  • Discussion and Best Practices: This section discusses the best practices for data interchange between M character set and UTF-8, limitations and maximums of V5.2-000, and ten rules to design and develop Unicode™-based applications for deployment on GT.M.

Theory of Operation

Philosophy

When designing new GT.M functionality, the GT.M team is dedicated to upward compatibility. Unaltered existing applications deployed on the previous production release of GT.M exhibit unaltered behavior on V5.2-000.

There is no change to the GT.M database engine or to the way that data is stored and manipulated in the engine. GT.M has always allowed indexes and values of M global and local variables to be either canonical numbers or any arbitrary sequence of bytes, and this does not change in any way with support for Unicode™.

There is also no change to the character set used for M source programs. M source programs have always been in ASCII (standard ASCII - $C(0) through $C(127) - is a proper subset of the UTF-8 encoding specified by the Unicode™ standard). GT.M accepts some non-ASCII characters in comments and string literals.

Unicode™-related functionality of GT.M is an optional alternative way to input, output, and interpret as strings the arbitrary sequences of bytes in the indexes and values of global and local variables. The changes in GT.M to support Unicode™ are principally enhancements to M language features. Although conceptually simple, these changes fundamentally alter certain previously ingrained assumptions. For example:

  • The length of a string in characters is no longer the same as the length of a string in bytes. The length of a Unicode™ string in characters is always less than or equal to its length in bytes.

  • The display width of a string on a terminal is different from the length of a string in characters - for example, with Unicode™, a complex glyph may actually be composed of a series of glyphs or component symbols, each in turn a UTF-8 encoded character in a Unicode™ string.

  • As a glyph may be composed of multiple characters, a string in Unicode™ can have canonical and non-canonical forms. The forms may be conceptually equivalent, but they are different strings of characters in Unicode™.

    [Important]

    GT.M V5.2-000 treats canonical and non-canonical versions of the same string as different and unequal. FIS recommends that applications be written to ensure that, for core processing, strings always have a canonical form. Where conformance to a canonical representation of input strings cannot be assured, application logic linguistically and culturally correct for each language must convert non-canonical strings to canonical strings used as indices (global subscripts) to ensure appropriate collation.
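The two assumptions above (character length versus byte length, and canonical versus non-canonical forms) can be illustrated outside GT.M. The following is a minimal sketch in Python, not GT.M code; it assumes Unicode Normalization Form C (NFC) as the application's chosen canonical form, which this bulletin does not prescribe.

```python
import unicodedata

# The same glyph written two ways: one precomposed code-point,
# or the letter "e" followed by a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"


def char_and_byte_lengths(text):
    """Return (length in code-points, length in UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


composed_lengths = char_and_byte_lengths(composed)
decomposed_lengths = char_and_byte_lengths(decomposed)

# The strings look alike on screen but compare unequal as code-point sequences.
equal_as_stored = composed == decomposed

# Assumption: NFC is the canonical form the application settles on.
equal_after_normalizing = unicodedata.normalize("NFC", decomposed) == composed

print(composed_lengths, decomposed_lengths)
print(equal_as_stored, equal_after_normalizing)
```

An application that normalizes input once, before using strings as subscripts, avoids storing two visually identical but unequal keys.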

Applications may operate on some binary data - for example, some strings in the database may be digitized images of signatures, others may include escape sequences for laboratory instruments. Furthermore, since M applications have traditionally overloaded strings by storing different data items as pieces of the same string, the same string may contain both Unicode™ and binary data. GT.M now has functionality to allow a process to manipulate Unicode™ strings as well as binary data including strings containing both Unicode™ and binary data.
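As an illustration of a string that carries text and binary data in different pieces, the following Python sketch (not GT.M code) splits a record at the byte level and decodes only the text piece. The delimiter and the sample payload are hypothetical, and the sketch assumes the binary piece never contains the delimiter byte.

```python
DELIM = b"|"  # hypothetical piece delimiter

# Piece 1 is UTF-8 text; piece 2 is arbitrary binary that is not valid UTF-8.
text_bytes = "\u0905\u091a\u094d\u091b\u0940".encode("utf-8")
binary_bytes = bytes([0xFF, 0x00, 0xC0])
record = text_bytes + DELIM + binary_bytes


def split_pieces(raw):
    """Split on the delimiter at the byte level, before any decoding."""
    return raw.split(DELIM)


text_piece, binary_piece = split_pieces(record)
name = text_piece.decode("utf-8")  # interpreted as characters
payload = binary_piece             # left as uninterpreted bytes

# Interpreting the whole record as UTF-8 fails because of the binary piece.
try:
    record.decode("utf-8")
    whole_record_is_text = True
except UnicodeDecodeError:
    whole_record_is_text = False

print(len(name), len(payload), whole_record_is_text)
```

Splitting before decoding works here because bytes of a multi-byte UTF-8 sequence never coincide with an ASCII delimiter.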

When strings are interpreted as Unicode™, GT.M uses the UTF-8 representation internally. GT.M input / output operations can optionally automatically convert to and from UTF-16, UTF-16LE and UTF-16BE.

The GT.M design philosophy is to keep things simple, but no simpler than they need to be. There are areas of processing where the use of Unicode™ adds complexity. These typically arise where interpretations of lengths and interpretations of characters interact. For example:

  • A sequence of bytes is never illegal when considered as binary data, but can be illegal when treated as a Unicode™ string. The detection and handling of illegal Unicode™ strings adds complexity, especially when binary and Unicode™ data reside in different pieces of the same string.

  • Since binary data may not map to graphic characters in Unicode™, the ZWRite format must represent such characters differently. A sequence of bytes that is output by a process interpreting it as Unicode™ may require processing to be correctly input to a process that is interpreting that sequence as binary, and vice versa. Therefore, when performing I/O operations, including MUPIP EXTRACT and MUPIP LOAD operations in ZWR format, ensure that processes have compatible environment variables and/or logic to generate the desired output and to correctly read and process the input.

  • Application logic managing input / output that interacts with human beings or non-GT.M applications requires even closer scrutiny. For example, fixed length records in files are always defined in terms of bytes. In Unicode™-related operation, an application may output data such that a character would cross a record boundary (for example, a record may have two bytes of space left, and the next UTF-8 character may be three bytes long), in which case GT.M fills the record with one or more pad characters. When a padded record is read as UTF-8, trailing pad characters are stripped by GT.M and not provided to the application code.
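The record-boundary case in the last point can be sketched as follows. This is an illustrative Python model, not GT.M's implementation; the function names are hypothetical and the pad character is assumed to be a space.

```python
def write_fixed_records(text, record_size, pad=b" "):
    """Pack UTF-8 characters into fixed-size byte records without splitting a character."""
    records, current = [], b""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > record_size:
            raise ValueError("character does not fit in a record")
        if len(current) + len(encoded) > record_size:
            # The character would cross the record boundary: pad and start a new record.
            records.append(current + pad * (record_size - len(current)))
            current = b""
        current += encoded
    if current:
        # Pad the final, short record.
        records.append(current + pad * (record_size - len(current)))
    return records


def read_fixed_record(record, pad=b" "):
    """Strip trailing pad bytes and interpret the remainder as UTF-8."""
    return record.rstrip(pad).decode("utf-8")


# Four 1-byte characters leave two bytes free in a 6-byte record;
# the next character needs three bytes, so the record is padded.
records = write_fixed_records("abcd\u0905", 6)
restored = "".join(read_fixed_record(r) for r in records)
print(records, restored)
```

Note that a reader stripping pad characters in this way cannot distinguish padding from genuine trailing spaces in the data, which is one reason the reader and writer must agree on record size and pad character.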

For some languages (such as Chinese), the ordering of strings according to Unicode™ code-points (character values) may not be the linguistically or culturally correct ordering. Supporting applications in such languages requires development of collation modules - GT.M natively supports M collation, but does not include pre-built collation modules for any specific natural language.

What is a character? A glyph or a Unicode™ code-point?

Glyphs are the visual representation of text elements in writing systems and Unicode™ code-points are the underlying data. Internally, GT.M stores UTF-8 encoded strings as sequences of Unicode™ code-points. A Unicode™ compatible output device - terminal, printer or application - renders the characters as sequences of glyphs that depict the sequence of code-points, but frequently there is not a one-to-one correspondence between characters and glyphs.

For example, consider the following word from the Devanagari writing system.

अच्छी

On a screen or a printer, it is displayed in 4 columns. Internally GT.M stores it as a sequence of 5 Unicode™ code-points:

#   Character   Unicode™ code-point   Name
1   अ           U+0905                DEVANAGARI LETTER A
2   च           U+091A                DEVANAGARI LETTER CA
3   ्           U+094D                DEVANAGARI SIGN VIRAMA
4   छ           U+091B                DEVANAGARI LETTER CHA
5   ी           U+0940                DEVANAGARI VOWEL SIGN II

The Devanagari writing system (U+0900 to U+097F) is based on the representation of syllables as contrasted with the use of an alphabet in English. Therefore, it uses the half-form of a consonant to represent certain syllables. The above example uses the half-form of the consonant (U+091A).

Although the half-form consonant is a valid text element in the context of the Devanagari writing system, it does not map directly to a character in the Unicode™ Standard. It is obtained by combining DEVANAGARI LETTER CA with DEVANAGARI SIGN VIRAMA and DEVANAGARI LETTER CHA.

च + ् + छ = च्छ

On a screen or a printer, the terminal font detects the glyph image of the half-consonant and displays it at the next display position. Internally GT.M uses ICU's glyph-related conventions for the Devanagari writing system to calculate the number of columns needed to display it. As a result, GT.M advances $X by 1 when it encounters the combination of the 3 Unicode™ code-points that represent the half-form consonant.

To view this example at GT.M prompt, type the following command sequence:

GTM>W $ZCHSET
UTF-8
GTM>SET DS=$CHAR($$FUNC^%HD("0905"))_$CHAR($$FUNC^%HD("091A"))_$CHAR($$FUNC^%HD("094D"))_$CHAR($$FUNC^%HD("091B"))_$CHAR($$FUNC^%HD("0940"))
GTM>WRITE $ZWIDTH(DS) ; 4 columns are required to display local variable DS on the screen.
4
GTM>WRITE $LENGTH(DS) ; DS contains 5 characters or Unicode code-points.
5

Therefore, for all writing systems supported by Unicode™, a character is a code-point for string processing, network transmission, storage, and retrieval of Unicode™ data whereas a character is a glyph for displaying on the screen or printer. This holds true for many other popular programming languages. Users must keep this distinction in mind throughout the application development life-cycle.
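For readers without a GT.M prompt at hand, the code-point view of the same word can be reproduced with a short Python sketch. It is only an illustration: it lists code-points and their names and counts UTF-8 bytes, and it does not attempt the display-width calculation, which GT.M delegates to ICU.

```python
import unicodedata

# The Devanagari word from the example, written as explicit code-points.
word = "\u0905\u091a\u094d\u091b\u0940"

# One line per code-point: position, code-point, and Unicode character name.
rows = [(i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(word, start=1)]
for row in rows:
    print(*row)

code_points = len(word)                 # length in characters
utf8_bytes = len(word.encode("utf-8"))  # length in bytes
print(code_points, utf8_bytes)
```

Each of these code-points occupies three bytes in UTF-8, so the word has three different measures: bytes, code-points, and display columns.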

ICU

While GT.M provides a framework for handling characters in Unicode™, it relies on the ICU (International Components for Unicode) library for language specific information.

ICU is a widely used, de facto standard package (see http://icu.sourceforge.net and http://www.ibm.com/software/globalization/icu/ for more information) that GT.M relies on for most operations that require knowledge of the Unicode™ character sets, such as text boundary detection, character string conversion between UTF-8 and UTF-16, and calculating glyph display widths.

[Note]

Unless Unicode™ support is sought for a process (that is, unless the environment variable gtm_chset is "UTF-8"), GT.M processes do not need ICU. In other words, existing, non-Unicode™ applications continue to work on supported platforms without ICU.

An ICU version number is of the form major.minor.milli.micro where major, minor, milli and micro are integers. Two versions that have different major and/or minor version numbers can differ in functionality and API compatibility is not guaranteed. Differences in milli or micro versions are maintenance releases that preserve functionality and API compatibility. ICU reference releases are defined by major and minor version numbers, where the minor version number is even. For example, as of this writing (January, 2007), the latest ICU reference release is version 3.6. When ICU is packaged and distributed with an operating system, the operating system distribution may add its own version information. For example, as of this writing, the Debian GNU/Linux Testing version of package libicu36, which provides ICU 3.6, is 3.6-2.

An operating system's distribution generally includes an ICU library tailored to the OS and hardware; therefore, a GT.M distribution does not provide any ICU. However, in order to support Unicode™ functionality, GT.M requires an appropriate version of ICU to be installed on the system. Each version of GT.M requires a specific reference release version of ICU. GT.M V5.2-000 requires ICU 3.6. The release notes for each GT.M release identify the required reference release version number as well as the milli and micro version numbers that were used to test GT.M prior to release. In general, it should be safe to use any version of ICU with the specific ICU reference version number required and milli and micro version numbers greater than those identified in the release notes for that GT.M version.
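The version rule above can be expressed as a small check. The following Python sketch is illustrative only; the function names are hypothetical, missing milli and micro components are assumed to be 0, and a version equal to the tested one is treated as acceptable.

```python
def parse_icu_version(version):
    """Parse 'major.minor[.milli[.micro]]' into a 4-tuple of integers."""
    parts = [int(p) for p in version.split(".")]
    if not 2 <= len(parts) <= 4:
        raise ValueError(f"unexpected ICU version: {version!r}")
    return tuple(parts + [0] * (4 - len(parts)))


def is_reference_release(version):
    """Reference releases have an even minor version number."""
    return parse_icu_version(version)[1] % 2 == 0


def is_acceptable(installed, required_reference, tested):
    """Same major.minor as required, and (milli, micro) not older than tested."""
    inst = parse_icu_version(installed)
    req = parse_icu_version(required_reference)
    test = parse_icu_version(tested)
    return inst[:2] == req[:2] and inst[2:] >= test[2:]


# Hypothetical example: ICU 3.6 is required and 3.6.0.0 was the tested build.
print(is_acceptable("3.6.0.1", "3.6", "3.6.0.0"))
print(is_acceptable("3.8", "3.6", "3.6.0.0"))
```

The release notes for the GT.M version in use remain the authority for the tested milli and micro numbers; the values shown here are placeholders.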

ICU supports multiple threads within a process, and an ICU binary library can be compiled from source code to either support or not support multiple threads. In contrast, GT.M does not support multiple threads within a GT.M process. On some platforms, such as the Debian GNU/Linux Testing (Etch) release, the stock ICU library, which is usually compiled to support multiple threads, may work unaltered with GT.M. On other platforms, it may be required to rebuild ICU from its source files with support for multiple threads turned off. Refer to the release notes for each GT.M release for details about the specific configuration tested and hence formally supported. In general, the GT.M Group's preferences for the ICU binaries used with each GT.M version are, in decreasing order of preference:

  • The stock ICU binary provided with the operating system distribution.

  • A binary distribution of ICU from the download section of the ICU project page (http://icu.sourceforge.net/download/3.6.html#ICU4C).

  • A version of ICU locally compiled from source code provided by the operating system distribution with a configuration disabling multi-threading.

  • A version of ICU locally compiled from the source code from the ICU project page with a configuration disabling multi-threading.

GT.M uses the POSIX function dlopen() to dynamically link to ICU. In the event you have other applications that require ICU compiled with threads, place the different builds of ICU in different locations, and use the dlopen() search path feature (e.g., the LD_LIBRARY_PATH environment variable on Linux) to enable each application to link with its appropriate ICU.

Compiling ICU

Below are sample instructions to download ICU, configure it not to use multi-threading, and compile it for various platforms. Note that download sites, versions of compilers, and milli and micro releases of ICU may well change subsequent to the writing of these instructions, and make these instructions obsolete. Therefore, these procedures must be considered examples, not gospel.

Compiling ICU version 3.6 on x86 Linux

As of this writing (January, 2007), ICU version 3.6 can be compiled on x86 Linux with the following configuration:

Operating System   Version                               Compilers
Linux              Red Hat Enterprise Linux 4 Update 2   gcc 3.4.4, GNU make (3.77+), ANSI C compiler

Instructions:
  1. Ensure that system environment variable PATH includes the location of all the compilers mentioned above.

  2. Download the source code of ICU version 3.6 for C from http://icu.sourceforge.net/download/3.6.html#ICU4C

  3. At the shell prompt, execute the following commands:

    gunzip -d < icu4c-3_6-src.tgz | tar -xf - 
    cd icu/source/
    chmod +x runConfigureICU configure install-sh       
    runConfigureICU Linux --disable-64bit-libs --disable-threads 
    gmake
    gmake check
    gmake install
  4. Set the environment variable LD_LIBRARY_PATH to point to the location of ICU. GT.M uses the environment variable LD_LIBRARY_PATH to search for dynamically linked libraries to be loaded.

ICU is now installed in the /usr/local directory.

[Note]

By default, ICU is installed in the /usr/local directory. If you need to install ICU in a different directory, type:

  1. runConfigureICU Linux --prefix=<install_path> --disable-64bit-libs --disable-threads

  2. Then execute the gmake commands, and set the environment variable LD_LIBRARY_PATH to point to the appropriate location.

Compiling ICU version 3.6 on HP PA-RISC HP-UX

As of this writing (January, 2007), ICU version 3.6 can be compiled on PA-RISC HP-UX with the following configuration:

Operating System   Version       Compilers
HP-UX              HP-UX 11.11   aCC A.03.50, cc B.11.11.08, GNU make (3.77+)

Instructions:
  1. Ensure that system environment variable PATH includes the location of all the compilers mentioned above.

  2. Download the source code of ICU version 3.6 for C from http://icu.sourceforge.net/download/3.6.html#ICU4C

  3. Add the following line in the configuration file source/config/mh-hpux-acc to include the appropriate C++ runtime libraries:

    DEFAULT_LIBS = -lstd_v2 -lCsup_v2 -lcl
  4. At the shell prompt, execute the following commands:

    gunzip -d < icu4c-3_6-src.tgz | tar -xf - 
    cd icu/source/
    chmod +x runConfigureICU configure install-sh       
    runConfigureICU HP-UX/ACC --disable-64bit-libs --disable-threads
    gmake
    gmake check
    gmake install
  5. Set the environment variable LD_LIBRARY_PATH to point to the location of ICU. HP-UX uses the environment variable LD_LIBRARY_PATH to search for dynamically linked libraries to be loaded.

ICU is now installed in the /usr/local directory.

[Note]

By default, ICU is installed in the /usr/local directory. If you need to install ICU in a different directory, type:

  1. runConfigureICU HP-UX/ACC --prefix=<install_path> --disable-64bit-libs --disable-threads

  2. Then execute the gmake commands, and set the environment variable LD_LIBRARY_PATH to point to the appropriate location.

Compiling ICU version 3.6 on Solaris

As of this writing (January, 2007), ICU version 3.6 can be compiled on Solaris with the following configuration:

Operating System   Version                 Compilers
Solaris            Solaris 9 (SunOS 5.9)   Sun Studio 8 (Sun C++ 5.5), GNU make (3.77+), ANSI C compiler

Instructions:
  1. Ensure that system environment variable PATH includes the location of all the compilers mentioned above.

  2. Download the source code of ICU version 3.6 for C from http://icu.sourceforge.net/download/3.6.html#ICU4C

  3. Add the following line in the configuration file source/config/mh-solaris to include the appropriate C++ runtime libraries:

    DEFAULT_LIBS = -lCstd -lCrun -lm -lc
  4. At the shell prompt, execute the following commands:

    gunzip -d < icu4c-3_6-src.tgz | tar -xf -
    cd icu/source/
    chmod +x runConfigureICU configure install-sh       
    runConfigureICU Solaris --disable-64bit-libs --disable-threads
    gmake
    gmake check
    gmake install
  5. Set the environment variable LD_LIBRARY_PATH to point to the location of ICU. Solaris uses the environment variable LD_LIBRARY_PATH to search for dynamically linked libraries to be loaded.

ICU is now installed in the /usr/local directory.

[Note]

By default, ICU is installed in the /usr/local directory. If you need to install ICU in a different directory, type:

  1. runConfigureICU Solaris --prefix=<install_path> --disable-64bit-libs --disable-threads

  2. Then execute the gmake commands, and set the environment variable LD_LIBRARY_PATH to point to the appropriate location.

Compiling ICU version 3.6 on AIX

As of this writing (January, 2007), ICU version 3.6 can be compiled on AIX with the following configuration:

Operating System   Version                    Compilers
AIX                AIX 5.2 (PowerPC 64-bit)   VisualAge 6, GNU make (3.77+), ANSI C compiler

Instructions:
  1. Ensure that system environment variable PATH includes the location of all the compilers mentioned above.

  2. Download the source code of ICU version 3.6 for C from http://icu.sourceforge.net/download/3.6.html#ICU4C

  3. At the shell prompt, execute the following commands:

    gunzip -d < icu4c-3_6-src.tgz | tar -xf - 
    cd icu/source/
    chmod +x runConfigureICU configure install-sh       
    runConfigureICU AIX --disable-64bit-libs --disable-threads
    gmake
    gmake check
    gmake install
  4. Set the environment variable LIBPATH to point to the location of ICU. AIX uses the environment variable LIBPATH to search for dynamically linked libraries to be loaded.

ICU is now installed in the /usr/local directory.

[Note]

By default, ICU is installed in the /usr/local directory. If you need to install ICU in a different directory, type:

  1. runConfigureICU AIX --prefix=<install_path> --disable-64bit-libs --disable-threads

  2. Then, execute the gmake commands, and set the environment variable LIBPATH to point to the appropriate location.

[Important]

AIX includes the release (or minor version) number in the name of the ICU library which can change based on updates from IBM. Users must set the environment variable gtm_icu_minorver to the release number so that GT.M uses that number to activate and access the ICU. If gtm_icu_minorver is not defined, GT.M assumes the release number to be 0.

M Language

The string processing functions now manipulate Unicode™ strings, binary data, or both together in the same process. Input / Output operations now can perform conversion to and from the following character encodings:

  • UTF-8

  • UTF-16

  • UTF-16LE

  • UTF-16BE

[Note]

Any aspect of GT.M not described in this technical bulletin is unchanged from preceding releases.

M and UTF-8 mode

A GT.M process can start in one of two modes: M mode and UTF-8 mode. Any process with the environment variable gtm_chset set to "M" at process entry operates in M mode and exhibits the same behavior as pre-Unicode™ versions. As noted in the Theory of Operation above, unaltered existing applications deployed on the previous production release of GT.M default to "M" mode and exhibit unaltered behavior from earlier releases. The GT.M database engine functions identically in M mode and UTF-8 mode.

The changes to GT.M for the support of Unicode™ pertain to the interpretation of strings. A process starts in UTF-8 mode and interprets strings encoded in UTF-8, if at process startup:

  • the environment variable gtm_chset has a value of "UTF-8", and

  • the environment variable LC_CTYPE is set to a locale with UTF-8 support, for example, "zh_CN.utf8"

Note that support for Unicode™ is enabled for the process, not for the database. The indexes and values in the database are simply sequences of bytes and therefore it is possible for one process to interpret a global node as encoded in UTF-8 and for another to interpret the same data as a binary stream. ASCII (codes 0-127) is a subset of Unicode™, and so is available in both modes.
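The two startup conditions can be modeled as a simple predicate over the environment. This Python sketch is an illustration under stated assumptions, not GT.M's actual startup logic: the function name is hypothetical, an unset gtm_chset is treated as "M", and a locale is considered UTF-8 capable when its codeset part reads "utf8" or "UTF-8".

```python
def starts_in_utf8_mode(env):
    """Model the two startup conditions using a mapping of environment variables."""
    chset = env.get("gtm_chset", "M")  # assumption: unset behaves like "M"
    locale_name = env.get("LC_CTYPE", "")
    # Take the codeset between "." and any "@modifier", e.g. "zh_CN.utf8" -> "utf8".
    codeset = locale_name.partition(".")[2].partition("@")[0]
    has_utf8_locale = codeset.lower().replace("-", "") == "utf8"
    return chset == "UTF-8" and has_utf8_locale


print(starts_in_utf8_mode({"gtm_chset": "UTF-8", "LC_CTYPE": "zh_CN.utf8"}))
print(starts_in_utf8_mode({"gtm_chset": "M", "LC_CTYPE": "zh_CN.utf8"}))
print(starts_in_utf8_mode({"gtm_chset": "UTF-8", "LC_CTYPE": "C"}))
```

A check of this kind can be useful in a startup script to report a misconfigured environment before a process is launched; passing `os.environ` as the mapping inspects the current environment.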

Pattern Match Operator (?)

GT.M allows pattern string literals to contain characters in Unicode™. Additionally, GT.M extends the M standard pattern codes (patcodes) A, C, N, U, L, P and E to the Unicode™ character set. For characters in Unicode™, these patcodes are:

  • A: All alphabetic characters including upper case, lower case and caseless alphabetic characters.

  • C: All control characters

  • E: All characters

  • L: All lower case characters

  • U: All upper case characters

  • N: All digits as specified by the intrinsic special variable $ZPATNUMERIC. If $ZPATNUMERIC is "UTF-8", N recognizes all numeric characters as defined by Unicode™. If $ZPATNUMERIC is "M", N recognizes only ASCII digits (ASCII 48-57) as numeric characters. The default value of the intrinsic special variable $ZPATNUMERIC is "M". For a process started in UTF-8 mode, $ZPATNUMERIC takes its value from the environment variable gtm_patnumeric (see the section Environment Variables for more details on configuring this variable).

  • P: All punctuation characters

For characters in Unicode™, GT.M assigns patcodes based on the default classification of the Unicode™ character set by the ICU library. Note that the above patcodes do not cover all types of characters in the Unicode™ character set. There are several special Unicode™ character classes (such as title case characters) that do not satisfy any of the patcodes above except “E”. The patcode E can be used to match any character in Unicode™ including the characters not covered by the patcodes above as well as malformed characters (if the VIEW “NOBADCHAR” setting is enabled). If VIEW “BADCHAR” is enabled, the pattern match operator triggers the BADCHAR error if it encounters an illegal UTF-8 byte sequence in the string.
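To give a feel for how such a classification behaves, the following Python sketch approximates the patcodes using the Unicode general categories available in Python's standard library. It is an assumption-laden approximation, not ICU's or GT.M's classification: the category-to-patcode mapping and the function name are illustrative choices, and edge cases may differ from GT.M.

```python
import unicodedata


def patcodes(ch, zpatnumeric="M"):
    """Approximate the set of patcodes a single character satisfies."""
    cat = unicodedata.category(ch)
    codes = {"E"}  # E matches every character
    if cat in ("Lu", "Ll", "Lo", "Lm"):  # assumption: title case (Lt) is excluded
        codes.add("A")
    if cat == "Lu":
        codes.add("U")
    if cat == "Ll":
        codes.add("L")
    if cat == "Cc":
        codes.add("C")
    if cat.startswith("P"):
        codes.add("P")
    # N depends on the $ZPATNUMERIC setting: ASCII digits only, or all decimal digits.
    if cat == "Nd" and (zpatnumeric == "UTF-8" or ch in "0123456789"):
        codes.add("N")
    return codes


print(sorted(patcodes("a")))                  # a lower case letter
print(sorted(patcodes("\u01c5")))             # a title case letter
print(sorted(patcodes("\u096b")))             # DEVANAGARI DIGIT FIVE, "M" setting
print(sorted(patcodes("\u096b", "UTF-8")))    # same digit, "UTF-8" setting
```

The title case letter shows the gap described above: it satisfies only E in this model.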

Commands

Job

The Job command spawns a background process with the same environment as the M process doing the spawning. Therefore, if the parent process is operating in UTF-8 mode, the Job'd process also operates in UTF-8 mode. In the event that a background process must have a different mode from the parent, create a shell script to alter the environment as needed, and spawn it with a ZSYstem command, e.g., ZSYstem "/path/to/shell/script &".

View "[NO]BADCHAR"

In pre-Unicode™ releases, and in M mode, the concept of an illegal character does not exist - all 256 combinations of the 8 bits in a byte are legal characters. In UTF-8 mode, there are certain sequences of bytes that are illegal characters. For example, $ZCHAR(192) is an illegal character because it is the first byte of a 2-byte sequence whose second byte is missing (U+0000).

The [NO]BADCHAR keyword argument for the VIEW command enables or disables the triggering of an error when character-oriented functions encounter malformed byte sequences (illegal characters).

At process startup, GT.M initializes BADCHAR from the environment variable gtm_badchar. Set the environment variable gtm_badchar to a non-zero number or "YES" (or "Y") to enable VIEW "BADCHAR". Set the environment variable gtm_badchar to 0 or "NO" or "FALSE" (or "N" or "F") to enable VIEW "NOBADCHAR". By default, GT.M enables VIEW "BADCHAR".
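The mapping from gtm_badchar values to the initial setting can be summarized in a few lines. The Python sketch below is a model of the rules just listed, with two labeled assumptions that this bulletin does not confirm: values are compared without regard to case, and unrecognized values fall back to the default.

```python
def badchar_enabled(value):
    """Return True for VIEW "BADCHAR", False for VIEW "NOBADCHAR"."""
    if value is None:
        return True  # variable not set: BADCHAR is the default
    text = value.strip().upper()  # assumption: case-insensitive comparison
    if text in ("YES", "Y"):
        return True
    if text in ("NO", "N", "FALSE", "F"):
        return False
    try:
        return int(text) != 0  # any non-zero number enables BADCHAR
    except ValueError:
        return True  # assumption: unrecognized values keep the default


for setting in (None, "1", "0", "YES", "F"):
    print(setting, badchar_enabled(setting))
```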

If VIEW "BADCHAR" is enabled, functions generate the BADCHAR error when they encounter malformed byte sequences. With this setting, GT.M detects and clearly reports potential application program logic errors as soon as they appear. As an illegal UTF-8 character in the argument of a character-oriented function likely indicates a logic issue, FIS recommends the use of VIEW "BADCHAR" in production environments.

[Note]

When all strings consist of well-formed characters, the value of VIEW [NO]BADCHAR has no effect whatsoever. If VIEW "NOBADCHAR" is enabled, the same functions treat malformed byte sequences as valid characters. During the migration of an application to add support for Unicode™, illegal character errors are likely to be frequent and indicative of application code that is yet to be modified. VIEW "NOBADCHAR" suppresses these errors at times when their presence impedes development.
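The difference between the two settings can be pictured with a strict and a lenient decoder. The Python sketch below is an analogy rather than GT.M behavior: strict decoding stands in for VIEW "BADCHAR", and Python's "surrogateescape" error handler, which turns each undecodable byte into one placeholder character, stands in for VIEW "NOBADCHAR".

```python
def char_length(raw, badchar=True):
    """Length in characters of a byte string interpreted as UTF-8."""
    if badchar:
        # Strict: a malformed sequence raises UnicodeDecodeError.
        return len(raw.decode("utf-8"))
    # Lenient: each malformed byte is counted as one character.
    return len(raw.decode("utf-8", errors="surrogateescape"))


well_formed = "a\u0905b".encode("utf-8")
malformed = b"a\xc0b"  # byte 192 (0xC0) is never valid in UTF-8

print(char_length(well_formed), char_length(well_formed, badchar=False))

try:
    char_length(malformed)
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

print(strict_raised, char_length(malformed, badchar=False))
```

For the well-formed string both calls agree, which mirrors the note above that the setting has no effect when all strings are well formed.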

ZSHow

The ZSHOW command displays information about the current GT.M environment. Refer to the GT.M Programmer's Guide for a complete description and M-mode examples.

In UTF-8 mode, the ZSHOW command exhibits byte-oriented and display-oriented behavior as follows:

  • ZSHOW targeted to a device (ZSHOW "*") aligns the output according to the numbers of display columns specified by the WIDTH deviceparameter.

  • ZSHOW targeted to a local (ZSHOW "*":lcl) truncates data exceeding 2048KB at the last character that fully fits within the 2048KB limit.

  • ZSHOW targeted to a global (ZSHOW "*":^CC) truncates data exceeding the maximum record size for the target global at the last character that fully fits within that record size.
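Truncating at the last character that fully fits within a byte limit can be sketched as follows. This Python function is illustrative and its name is hypothetical; it relies on the property that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```python
def truncate_utf8(text, max_bytes):
    """Truncate text so its UTF-8 encoding fits in max_bytes without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    cut = max_bytes
    # If the first excluded byte is a continuation byte, the cut falls inside
    # a character; back up to the start of that character.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut].decode("utf-8")


sample = "ab\u0905"  # 2 one-byte characters followed by 1 three-byte character
print(repr(truncate_utf8(sample, 5)))
print(repr(truncate_utf8(sample, 4)))
```

With a 4-byte limit the three-byte character does not fit, so the result keeps only the two characters before it rather than a partial character.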

I/O Commands

As with other areas of functionality, when the environment variable gtm_chset is not set, or is set to "M", there is no change to GT.M I/O behavior. When gtm_chset is set to "UTF-8", GT.M supports Unicode™ I/O.

Even when a process internally stores and manipulates strings encoded in UTF-8, it may nevertheless need to perform I/O on a series of individual bytes, that is, 8-bit octets; a series of bytes that encode characters in UTF-16 with an explicit little endian encoding (UTF-16LE); or a series of bytes that encode characters in UTF-16 with an explicit big endian encoding (UTF-16BE). GT.M allows a process to explicitly specify the encoding by deviceparameters in the OPEN and USE commands. This encoding determines the mode (M mode or UTF-8 mode) of the device.
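The byte-level difference between these encodings can be seen with Python's built-in codecs. This sketch shows Python's behavior only; in particular, how GT.M treats a byte order mark for plain UTF-16 is not described in this section and is not implied here.

```python
import codecs

text = "A\u0905"  # one ASCII letter and one Devanagari letter

as_utf8 = text.encode("utf-8")       # 1 byte + 3 bytes
as_le = text.encode("utf-16-le")     # low-order byte of each 16-bit unit first
as_be = text.encode("utf-16-be")     # high-order byte of each 16-bit unit first
as_bom = text.encode("utf-16")       # Python prepends a byte order mark

for label, data in (("UTF-8", as_utf8), ("UTF-16LE", as_le),
                    ("UTF-16BE", as_be), ("UTF-16", as_bom)):
    print(label, data.hex(" "))

# Python's plain "utf-16" codec starts with a BOM in the machine's native order.
has_bom = as_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom)
```

All four byte sequences decode to the same two characters, which is why the encoding must be stated on the device rather than inferred from the data.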

[Note]

GT.M determines the encoding for $PRINCIPAL from gtm_chset and does not allow the process to change it.

While it is not possible for a byte to be an illegal character when performing I/O on 8-bit octets, when performing I/O on characters in Unicode™, it is certainly possible for a sequence of bytes to be an illegal character in Unicode™. GT.M READ and WRITE commands check for legal characters and raise the BADCHAR error if they detect a sequence of bytes not corresponding to a legal character. Application code must avoid illegal characters in I/O streams or specify M mode, because VIEW "NOBADCHAR" does not suppress BADCHAR error reporting in I/O.

In M mode, except when FILTER= is in use, a character always has a width of 1. Characters encoded with Unicode™, however, can have different widths according to the device to which they are applied. For example, a CJK ideograph occupies 2 display columns on the screen or printer, whereas the same character counts as 1 code-point when it is transmitted through sockets. GT.M handles these differences by defining the measurement characteristics of all deviceparameters when they are applied to certain devices.

The RECORDSIZE of a fixed length record for a GT.M sequential disk device is always specified in bytes, rather than characters. In M mode, GT.M only pads a fixed length record when the file is closed and the last record is less than the RECORDSIZE; when READing a padded fixed length record, GT.M returns the full record including any PAD characters.

In UTF-8 mode, there are three cases that cause GT.M to insert PAD characters when WRITEing. When READing, GT.M attempts to strip any PAD characters. This stripping only works properly if the RECORDSIZE and PAD are the same for the READ as when the WRITEs occurred. WRITE inserts PAD characters when:

  • The file is closed and the last record is less than the RECORDSIZE. Records are padded (for FIXED) by WRITE ! as well as when the file is closed.

  • $X exceeds WIDTH before the RECORDSIZE is full.

  • The next character won't fit in the remaining RECORDSIZE.

The additional functionality described below supports Unicode™-related operation.

Open

O[PEN][:tvexpr] expr[:[(keyword[=expr][:...])] [:numexpr]][,...]

The OPEN command creates a connection between a GT.M process and a device. Refer to the GT.M Programmer's Guide for a complete description and M-mode examples.

In UTF-8 mode, the OPEN command recognizes ICHSET, OCHSET, and CHSET as three additional deviceparameters to determine the encoding of the input / output devices. The next section describes these deviceparameters.

In M mode, the OPEN command ignores the ICHSET, OCHSET, CHSET, and PAD deviceparameters.

[Important]

If an I/O device uses a multi-byte character encoding, every READ and WRITE operation on that device checks that characters are well-formed according to the character encoding specified with ICHSET or OCHSET. If the I/O commands encounter an illegal sequence of bytes, they always trigger a run-time error; VIEW "NOBADCHAR" does not prevent such errors. Strings created by $ZCHAR() and other Z equivalent functions may contain illegal sequences. The only way to input or output such illegal sequences is to specify character set "M" with one of these deviceparameters.

Open Deviceparameters

OCHSET=expr Applies to: All devices

Establishes the character encoding of the output device. The value of the expression can be M, UTF-8, UTF-16, UTF-16LE, or UTF-16BE.

If the value for OCHSET is not specified, GT.M assumes the value of the intrinsic variable $ZCHSET as the default character set for all the input / output devices and "M" if $ZCHSET is not specified.

If expr is set to a value other than "M", "UTF-8", "UTF-16", "UTF-16LE" or "UTF-16BE", GT.M triggers a run-time error.

[Note]

UTF-16, UTF-16LE, and UTF-16BE are not supported for $PRINCIPAL and Terminal devices. Please refer to the limitations section for more details.

Example:
GTM>SET file1="mydata.out"
GTM>SET expr="UTF-16LE"
GTM>OPEN file1:(chset=expr)
GTM>USE file1 WRITE "新年好",!
GTM>CLOSE file1

This example opens a new file called mydata.out and writes the Chinese characters "新年好" in the UTF-16LE encoding.

ICHSET=expr Applies to: All devices

Establishes the character encoding of the input device. The value of the expression can be M, UTF-8, UTF-16, UTF-16LE, or UTF-16BE.

If the value for ICHSET is not specified, GT.M assumes the value of the intrinsic variable $ZCHSET as the default character set for all the input / output devices and "M" if $ZCHSET is not specified.

If expr is set to a value other than "M", "UTF-8", "UTF-16", "UTF-16LE" or "UTF-16BE", GT.M triggers a run-time error.

[Note]

UTF-16, UTF-16LE, and UTF-16BE are not supported for $PRINCIPAL and Terminal devices. Please refer to the limitations section for more details.

CHSET=expr Applies to: SD FIFO TRM and SOC

Establishes a common encoding for both input and output devices. The value of the expression can be M, UTF-8, UTF-16, UTF-16LE, or UTF-16BE. For more information, refer to ICHSET and OCHSET.

RECORDSIZE=expr Applies to: SD FIFO

RECORDSIZE overrides the default record size for a disk and specifies the maximum record size in bytes. Refer to the GT.M Programmer's Guide for a complete description and M-mode examples.

For SD in UTF-8 mode, GT.M treats RECORDSIZE as a byte limit at which to wrap or truncate output depending on [NO]WRAP. For any character set other than "M", GT.M ignores RECORDSIZE for a device which is already open if any I/O has been done.

If the character set is M or UTF-8, the default RECORDSIZE is 32K-1 bytes.

If the character set is UTF-16, UTF-16LE or UTF-16BE, the RECORDSIZE must always be a multiple of 2. For these character sets, the default RECORDSIZE is 32K-4 bytes.

[NO]FIXED Applies to: SD FIFO

Selects a fixed record length format for sequential disk files. FIXED does not specify the actual length of a record. Use RECORDSIZE to specify the record length. Refer to the GT.M Programmer's Guide for a complete description and M-mode examples.

In UTF-8 mode with FIXED format, GT.M I/O enforces a more record-oriented view of the file, treating each record as RECORDSIZE bytes long. A READ ignores any PAD bytes found at the end of a record and does not return them to the application.

A READ X gets the remainder of the current record if any characters remain, otherwise it reads an entire new record.

A READ #len returns up to len characters from the current record if any characters remain; otherwise it reads up to len characters from a new record. All characters returned are from a single record.

A READ *X returns the code-point for a single character. If there is a character in the current record, READ * returns it, otherwise it fetches a new record and returns a single character from it.

WRITE when WRAP is not enabled writes up to WIDTH - $X display columns. WRITE uses PAD bytes at the end of the record to produce an output record of RECORDSIZE bytes. Note that a Unicode™ code-point never splits across records. A combining character may end up in the subsequent record if it does not fit in the current record.

WRITE when WRAP is enabled starts new records as required with no more than WIDTH characters per record. WRITE uses PAD bytes at the end of the record to produce an output record of RECORDSIZE bytes, without writing any partial characters in Unicode™.

In both of the above WRITE cases where the command has multiple arguments, WRITE handles each argument individually except in the case of a sequence of literals, which it combines into a single argument.

WRITE ! writes WIDTH - $X spaces followed by PAD bytes as required to pad the record to RECORDSIZE bytes.

PAD=expr Applies to: SD FIFO

For FIXED format sequential files when the character set is not "M", if a multi-byte character (when CHSET is UTF-8) or a surrogate pair (when CHSET is UTF-16) does not fit into the record (either logical, as given by WIDTH, or physical, as given by RECORDSIZE), the WRITE command uses the byte specified by the PAD deviceparameter to fill out the physical record. READ ignores pad bytes found at the end of the record. The value for PAD is an integer in the range 0-127 (the ASCII characters). The default PAD byte value is $ZCHAR(32), that is, <SPACE>.

Example:
GTM>Set a="准祝新年在上海"
GTM>Set encoding="UTF-8"
GTM>Set filename="bom"_encoding_".txt" 
GTM>Open filename:(newversion:FIXED:RECORDSIZE=8:PAD=66:chset=encoding)
GTM>Use filename
GTM>Write a
GTM>Close filename
GTM>Halt
$ cat bomUTF-8.txt 
准祝BB新年BB在上BB海

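In the above example, the local variable a is set to a string of three-byte characters. PAD=66 sets the padding byte value to $CHAR(66), the letter B. Because only two three-byte characters fit in an 8-byte record, each full record ends with two PAD bytes.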
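The arithmetic behind that output can be reproduced with a small model. The Python sketch below is not GT.M code, and `pack_fixed_records` is a hypothetical helper. It packs UTF-8 characters into records of at most RECORDSIZE bytes and fills a record with the PAD byte when the next whole character does not fit. It deliberately leaves the final short record as it is and does not model what happens at CLOSE.

```python
def pack_fixed_records(text: str, recordsize: int, pad: int) -> list:
    """Split text into UTF-8 records of at most recordsize bytes.

    A record that cannot hold the next whole character is filled with the
    pad byte. The final short record is returned unpadded; padding at
    CLOSE is intentionally not modeled here.
    """
    records = []
    current = b""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(current) + len(encoded) > recordsize:
            # The next character does not fit: pad out the physical record.
            records.append(current + bytes([pad]) * (recordsize - len(current)))
            current = b""
        current += encoded
    if current:
        records.append(current)
    return records


records = pack_fixed_records("准祝新年在上海", recordsize=8, pad=66)
print(b"".join(records).decode("utf-8"))
```

Each character in the sample string takes three bytes, so two characters use six of the eight bytes and the remaining two bytes are filled with $CHAR(66).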

Read

R[EAD][:tvexpr] (glvn|*glvn|glvn#intexpr)[:numexpr]|strlit|fcc[,...]

The READ command transfers input from the current device to a global or local variable specified as a READ argument. For convenience, READ also accepts arguments that perform limited output to the current device. Refer to the GT.M Programmer's Guide for a complete description and M-mode examples.

In UTF-8 mode, the READ command uses the character set value specified on the device OPEN as the character encoding of the input device. If character set "M" or "UTF-8" is specified, the data is read with no transformation. If the character set is "UTF-16", "UTF-16LE", or "UTF-16BE", the data is read with the specified encoding and transformed to UTF-8. If the READ command encounters an illegal character or a character outside the selected representation, it triggers a run-time error. The READ command recognizes all Unicode™ line terminators for non-FIXED devices. See the "Line Terminators" section for more details.

[Note]

In M mode, characters and bytes have a one-to-one relationship and therefore READ can be used to read bit-streams of non-character data.

Read # Command

When a number sign (#) and a non-zero integer expression immediately follow the variable name, the integer expression determines the maximum number of characters accepted as the input to the READ command. In UTF-8 mode, this limit can fall in the middle of a sequence of combining code-points (some of which are typically non-spacing). When this happens, any display on the input device may not represent the characters returned by the fixed-length READ (READ #).

Read * Command

In UTF-8 mode, the READ * command accepts one Unicode™ character of input and puts the numeric code-point value for that character into the variable.

In M mode, the READ * command reads a single byte and returns the numeric byte value. If character set UTF-8 is specified, the READ * command reads one to four bytes, depending on the encoding and returns the numeric code-point value of the character. If ICHSET specifies "UTF-16", "UTF-16LE" or "UTF-16BE", the READ * command reads a byte pair or two byte pairs (if it is a surrogate pair) and returns the numeric code-point value.
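The byte counts mentioned above follow directly from the encodings. As an illustration only, the Python sketch below (not GT.M code; `encoded_length` is a hypothetical helper) shows how many bytes a single character occupies in UTF-8 and in UTF-16LE, including a character outside the Basic Multilingual Plane that needs a surrogate pair in UTF-16.

```python
def encoded_length(ch: str, encoding: str) -> int:
    """Number of bytes one character occupies in the given encoding."""
    return len(ch.encode(encoding))


bmp_char = "\u65b0"          # U+65B0, inside the Basic Multilingual Plane
astral_char = "\U0001D11E"   # U+1D11E, outside the BMP (surrogate pair in UTF-16)

for ch in ("A", bmp_char, astral_char):
    print(f"U+{ord(ch):04X}",
          encoded_length(ch, "utf-8"),
          encoded_length(ch, "utf-16-le"))
```

In each case the value a code-point oriented read would report is the same code-point, regardless of how many bytes were consumed to obtain it.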

Example:
GTM>Set filename="mydata.out" ; assume that mydata.out contains "新年好"
GTM>Open filename:(readonly:ichset="UTF-16LE")
GTM>Use filename
GTM>Read *x
GTM>Close filename
GTM>Write $char(x)

In the above example, the READ * command reads the first character of the file mydata.out according to the encoding specified by ICHSET.

Write

W[RITE][:tvexpr] expr|*intexpr|fcc[,...]

The WRITE command transfers a character stream specified by its arguments to the current device. Refer to the GT.M Programmer's Guide for a complete description and M-mode examples.

In UTF-8 mode, the WRITE command uses the character set specified on the device OPEN as the character encoding of the output device. If character set specifies "M" or "UTF-8", GT.M WRITEs the data with no transformation. If character set specifies "UTF-16", "UTF-16LE" or "UTF-16BE", the data is assumed to be encoded in UTF-8 and WRITE transforms it to the character encoding specified by character set device parameter.

Example:
GTM>Set filename="mydata.out"
GTM>Set T16LE="准备庆祝新年在上海"
GTM>Open filename:(chset="UTF-16LE")
GTM>Use filename
GTM>Write T16LE
GTM>Close filename

The above example creates a file mydata.out in UTF-16LE character set.
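The transformation that WRITE performs for a UTF-16 device can be pictured as a plain transcoding step. The following Python sketch is an illustration under that assumption, not GT.M code; `transcode_utf8_to_utf16le` is a made-up name. It decodes UTF-8 bytes into code-points and re-encodes them as UTF-16LE.

```python
def transcode_utf8_to_utf16le(data: bytes) -> bytes:
    """Decode UTF-8 bytes and re-encode the same characters as UTF-16LE."""
    text = data.decode("utf-8")
    return text.encode("utf-16-le")


utf8_bytes = "新年好".encode("utf-8")
utf16_bytes = transcode_utf8_to_utf16le(utf8_bytes)

print(len(utf8_bytes), len(utf16_bytes))   # byte counts before and after
print(utf16_bytes.hex(" "))                # low-order byte of each pair first
```

The three characters occupy nine bytes in UTF-8 and six bytes in UTF-16LE, which is why byte-based limits such as RECORDSIZE behave differently across character sets.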

[Important]

If a WRITE command encounters an illegal character, it triggers a run-time error irrespective of the setting of VIEW "BADCHAR".

In M mode, the WRITE command ignores the OCHSET deviceparameter.

Write * Command

When the argument of a WRITE command consists of a leading asterisk (*) followed by an integer expression, the WRITE command outputs the character represented by the code-point value of that integer expression.

With character set M specified at device OPEN, the WRITE * command transfers the character (byte) associated with the numeric value of the integer expression. With character set UTF-8 specified at device OPEN, the WRITE * command outputs the character associated with the numeric code-point value. If character set "UTF-16", "UTF-16LE" or "UTF-16BE" is specified, WRITE * transforms the character code to the mapping specified by that character set.

Cursor Position Variable

$X

$X is a special intrinsic variable that determines the current column position of the cursor for the current device. $X contains an integer value ranging from 0 to 65,535, specifying the horizontal position of a virtual cursor in the current output record. $X=0 represents the left-most position of a record or row. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

For UTF-8 mode and TRM and SD output, $X increases by the display-columns of a given string that is written to the current device.

Example:
GTM> Write $ZCHSET
UTF-8
GTM>Set a="准祝"
GTM>Use $Principal:WIDTH=40
GTM>Write a,$X
准祝4
GTM>

In the above example, the Use command sets the width of the $Principal device to 40 display columns. $X returns 4 because each character in local variable a occupies 2 display positions.
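A rough way to see where the value 4 comes from is to count display columns per character. The Python sketch below is only an approximation and is not how GT.M computes widths: it uses the East Asian Width property from Python's standard `unicodedata` module, and the function name `display_columns` is hypothetical.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Approximate display width of text.

    Wide and fullwidth characters count as 2 columns, combining marks
    as 0, and everything else as 1. This is a simplification.
    """
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks overlay the previous character
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


print(display_columns("准祝"))      # two wide ideographs
print(display_columns("ab"))        # two narrow letters
print(display_columns("e\u0301"))   # base letter plus combining accent
```

The same two-character string therefore advances a display-column counter by 4 but a code-point counter by 2.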

Use Deviceparameters

WIDTH=intexpr Applies to: TRM SOC NULL SD FIFO

Sets the device's logical record size and enables WRAP. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

In UTF-8 mode and TRM and SD output, the WIDTH deviceparameter specifies the display-columns and is used with $X to control truncation and WRAPping of the visual representation of the stream.

In M mode, if WIDTH is set to 0, GT.M uses the default WIDTH of the TRM and SOC devices. USE x:WIDTH=0 is equivalent to USE x:(WIDTH=<device-default>:NOWRAP). For SD and FIFO devices in M mode, setting WIDTH to 0 is not allowed.

In UTF-8 mode, WIDTH=0 disables formatting control based on comparison of WIDTH and $X but does not affect the control of WIDTH over the behavior when the output exceeds RECORDSIZE.

Example:
GTM>Set a="准备庆祝新年在上海"
GTM>Set encoding="UTF-8"
GTM>Set filename="my"_encoding_".txt" 
GTM>Open filename:(newversion:chset=encoding)
GTM>Use filename:WIDTH=4
GTM>Write a
GTM>Close filename
GTM>Halt
$ cat myUTF-8.txt 
准备
庆祝
新年
在上
海

GT.M format control characters, FILTER, and the device WIDTH and WRAP also have an effect on $X.

In UTF-8 mode and SOC output, the WIDTH deviceparameter specifies the number of characters in Unicode™.

[NO]WRAP Applies to: TRM SOC NULL SD FIFO

Enables or disables automatic record termination. When the current record size ($X) reaches the maximum WIDTH and the device has WRAP enabled, GT.M starts a new record, as if the routine had issued a WRITE ! command. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

For UTF-8 mode and SD output, WRAP or truncation occurs when WRITEs exceed either WIDTH (display-columns) or RECORDSIZE (bytes).

Line Terminators

For non-FIXED format sequential files and terminal devices for which the character set is not M, all the standard Unicode™ line terminators terminate the logical record. These are U+000A (LF), U+000D (CR), U+000D followed by U+000A (CRLF), U+0085 (NEL), U+000C (FF), U+2028 (LS) and U+2029 (PS). For these devices, LF terminates a record on output, although if FILTER=CHARACTER is enabled, all of the terminators are recognized to maintain the values of $X and $Y.
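The terminator list above can be expressed as a single pattern. The Python sketch below is not GT.M code and the names `TERMINATORS` and `split_records` are hypothetical. It splits a string into logical records on exactly those terminators, treating CR followed by LF as one terminator.

```python
import re

# CRLF is listed first so that it is consumed as one terminator
# rather than as a CR followed by a separate LF.
TERMINATORS = re.compile("\r\n|[\n\r\x0c\x85\u2028\u2029]")


def split_records(text: str) -> list:
    """Split text into logical records on the Unicode line terminators."""
    return TERMINATORS.split(text)


sample = "one\ntwo\r\nthree\u2028four\x85five"
print(split_records(sample))
```

An explicit pattern is used here instead of Python's `str.splitlines()`, because `splitlines()` also breaks on a few additional control characters that are not in the list above.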

Unicode™ Byte Order Marker (BOM)

When the ICHSET for a device is not "M" and a BOM (U+FEFF) is at the beginning of the initial input for a file or data stream, GT.M uses the BOM to determine the endianness if the ICHSET is UTF-16, and checks that the BOM agrees with the ICHSET if it is UTF-16BE or UTF-16LE.

If the character set for a device is UTF-16, the BOM must be at the beginning of the initial input for a file or data stream for GT.M to use it. If no BOM is present, GT.M assumes big-endian byte order.

If the character set of a device is UTF-8, GT.M checks for and ignores a BOM on input.

If the BOM does not match the character set specified at device OPEN, GT.M triggers an error. READ does not return BOM to the application and the BOM is not counted as part of the first record.

If the output character set for a device is UTF-16 (but not UTF-16BE or UTF-16LE), GT.M writes a BOM before the initial output. The application code does not need to explicitly write the BOM.
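The BOM rule for input opened as plain UTF-16 can be sketched as follows. This Python fragment is an illustration of the rule as described in this section, not GT.M's implementation, and `utf16_byte_order` is a hypothetical name.

```python
import codecs


def utf16_byte_order(data: bytes) -> str:
    """Pick the byte order for a stream opened as plain UTF-16."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "UTF-16LE"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "UTF-16BE"
    # No BOM: big endian is assumed, as described in this section.
    return "UTF-16BE"


print(utf16_byte_order(b"\xff\xfe\xb0\x65"))  # BOM says little endian
print(utf16_byte_order(b"\xfe\xff\x65\xb0"))  # BOM says big endian
print(utf16_byte_order(b"\x65\xb0"))          # no BOM present
```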

Deviceparameter Summary

The measurement characteristics of some deviceparameters change when they are applied to certain devices. For example, terminal WIDTH is measured in display-columns whereas socket WIDTH is measured in code-points. The following table lists the units of measurement (byte, code-point, or display-column) for $X and deviceparameters for TRM, SD, SOC, and FIFO. All deviceparameters that are not described in this section remain unchanged from preceding releases. "-" denotes that the deviceparameter has no effect for that device.

Terminal (TRM) Device

Device         $X               RECORDSIZE   WIDTH            PAD
TRM            Display-column   Byte         Display-column   -
SD             Display-column   Byte         Display-column   Code-point
SOC and FIFO   Code-point       Byte         Code-point       -

[Note]
  1. In M mode, display-columns, characters and bytes are all equivalent.

  2. In all UTF-16 I/O modes, RECORDSIZE must be even and PAD characters are two bytes.

  3. GT.M implements SD output in a fashion that supports copying files to display devices such as printers and terminals.

String Processing functions

A multi-byte character can be made up of a base character, composite character, or a pre-composed character of various letter/diacritic combinations. In UTF-8 mode, all string processing functions identify each character as a distinctive unit of writing in the context of a particular writing method. However, in M mode, GT.M unconditionally treats characters as strings of octets (8-bit bytes).

To provide additional flexibility for performing byte-oriented operations in a process started in UTF-8 mode, GT.M provides "Z equivalents" of the traditional string processing functions. These functions are:

  • $ZA[SCII](expr[,intexpr])

  • $ZC[HAR](intexpr[,…])

  • $ZE[XTRACT](expr[,intexpr1[,intexpr2]])

  • $ZF[IND](expr1,expr2[,intexpr])

  • $ZJ[USTIFY](expr,intexpr1[,intexpr2])

  • $ZL[ENGTH](expr1[,expr2])

  • $ZP[IECE](expr1,expr2[,intexpr1[,intexpr2]])

  • $ZTR[ANSLATE](expr1[,expr2[,expr3]])

These Z equivalent functions exhibit the same behavior as their traditional M counterparts operating in M mode. For example, in UTF-8 mode, the length of a string in characters is less than or equal to its length in bytes. In this mode, the $LENGTH() function considers sequences of bytes to be strings of characters encoded in UTF-8 and returns the number of characters. The new Z equivalent function $ZLENGTH() considers sequences of bytes to be simply strings of octets (8-bit bytes) and returns the number of bytes just as $LENGTH() does when operating in M mode. All Z equivalent functions are independent of the value of $ZCHSET or VIEW [NO]BADCHAR.

The Z equivalent functions come in handy when applications need to process binary data including blobs, binary byte streams, bit-masks, and so on while simultaneously operating in UTF-8 mode.
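The character count versus byte count distinction can be seen with any UTF-8 string. The Python sketch below is an analogy rather than GT.M code: `len()` on a string plays the role of $LENGTH() in UTF-8 mode, and `len()` on the encoded bytes plays the role of $ZLENGTH().

```python
text = "新年好"

char_count = len(text)                    # analogous to $LENGTH() in UTF-8 mode
byte_count = len(text.encode("utf-8"))    # analogous to $ZLENGTH()

print(char_count, byte_count)

# For pure ASCII data the two counts coincide.
ascii_text = "GT.M"
print(len(ascii_text), len(ascii_text.encode("utf-8")))
```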

In addition to the Z equivalent functions, GT.M now provides the following Z functions:

  • $ZCONVERT() function

    • Two argument form: $ZCO[NVERT](expr1,expr2)

    • Three argument form: $ZCO[NVERT](expr1,expr2,expr3)

  • $ZSUB[STR](expr,intexpr1[,intexpr2])

  • $ZW[IDTH](expr)

[Note]

Unlike the Z equivalent functions, the new Z functions do not have traditional M counterparts. They provide new functionality related to Unicode™.

The following sections describe the behavior of all the string-processing functions in UTF-8 mode and M mode.

$ASCII()

The $ASCII() function returns the integer code for a character in a string. Refer to the GT.M Programmer's Guide for a complete description and M-mode examples.

With character set UTF-8 specified, the $ASCII() function returns a decimal representation of the integer Unicode™ code-point value of a character in the given string.

In the Unicode™ Standard, the code-point is the hexadecimal integer that appears after the "U+" in the definition of each character. It may be as large as 1114109, corresponding to U+10FFFD.

[Note]

Although it seems counter-intuitive for a function called $ASCII() to return Unicode™ code-point values, the M standard has retained the name of $ASCII() for all character sets, which minimizes the application code changes needed to add support for Unicode™.

Examples of $ASCII() in UTF-8 mode
GTM>W $ZCHSET
UTF-8
GTM>W $ASCII("新") 
26032 
GTM> W $$FUNC^%DH("26032")
000065B0

In the above example, 26032 is the integer equivalent of the hexadecimal value 65B0. U+65B0 is a character in the CJK Ideograph block of the Unicode™ Standard.
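The decimal and hexadecimal values quoted above can be cross-checked independently. The short Python sketch below is only a verification aid, not GT.M code; `ord()` and `chr()` are used as rough analogues of $ASCII() and $CHAR() in UTF-8 mode.

```python
code_point = ord("\u65b0")       # the character U+65B0

print(code_point)                # decimal code-point value
print(f"{code_point:08X}")       # same value as eight hexadecimal digits
print(chr(26032) == "\u65b0")    # the conversion works in both directions
```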

$Char()

The $CHAR() function returns a string of one or more characters corresponding to integer codes specified in its argument(s). Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

With character set UTF-8 specified, the $CHAR() function returns a string composed of characters represented by the integer equivalents of the Unicode™ code-points specified in its argument(s).

With VIEW "NOBADCHAR" enabled, the $CHAR() function ignores all expressions that do not correspond to valid Unicode™ code-points, so it never returns a string with illegal or invalid characters.

With VIEW "BADCHAR" enabled, the $CHAR() function triggers a run-time error if any expression evaluates to a code-point value that is not a character in Unicode™. According to the Unicode™ Standard version 5.0, invalid code-points include the following sets:

  • The "too big" code-points (those greater than the maximum U+10FFFF).

  • The "surrogate" code-points (in the range [U+D800, U+DFFF]) which are reserved for UTF-16 encoding.

  • The "non-character" code-points that are always guaranteed to be not assigned to any valid characters. This set consists of [U+FDD0, U+FDEF] and all U+nFFFE and U+nFFFF (for each n from 0x0 to 0x10).
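The three sets listed above translate into a few range checks. The following Python sketch is one way to express them; it is not GT.M's implementation, and `is_valid_code_point` is a hypothetical name.

```python
def is_valid_code_point(cp: int) -> bool:
    """Apply the three exclusion rules listed above to an integer code-point."""
    if cp < 0 or cp > 0x10FFFF:
        return False                      # "too big" (or negative)
    if 0xD800 <= cp <= 0xDFFF:
        return False                      # surrogates reserved for UTF-16
    if 0xFDD0 <= cp <= 0xFDEF:
        return False                      # non-characters U+FDD0..U+FDEF
    if (cp & 0xFFFF) in (0xFFFE, 0xFFFF):
        return False                      # U+nFFFE and U+nFFFF in every plane
    return True


for cp in (65, 0x65B0, 0xD800, 0xFDD0, 0x1FFFF, 0x10FFFD, 0x110000):
    print(f"{cp:X}", is_valid_code_point(cp))
```

The mask `cp & 0xFFFF` isolates the low 16 bits, so one comparison covers the last two code-points of all 17 planes (n from 0x0 to 0x10).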

Example of $CHAR() in UTF-8 mode
GTM>W $ZCHSET 
UTF-8
GTM> W $CHAR(26032)
新
GTM> W $CHAR(65)
A

In the above example, the integer value 26032 corresponds to the Unicode™ character "新" in the CJK Ideograph block of Unicode™.

[Note]

The output of the $CHAR() function for values of integer expression(s) from 0 through 127 does not vary with the choice of character encoding scheme. This is because 7-bit ASCII is a proper subset of the UTF-8 character encoding scheme. The representation of characters returned by the $CHAR() function for values 128 through 255 differs for each character encoding scheme.

[Important]

When compiling a program with VIEW "BADCHAR" and a literal argument for the $CHAR() function specifies an illegal character, the GT.M compiler triggers a BADCHAR error and embeds that error in the object in case the object is ever used. When compiling a program with VIEW "NOBADCHAR" and a literal argument for $CHAR() specifies an illegal character, the GT.M compiler does not trigger the BADCHAR error nor can the GT.M run-time system detect the error. Therefore, application developers must ensure a routine is compiled and executed with appropriately chosen (usually matching) settings of VIEW "BADCHAR".

$Extract()

The $EXTRACT() function returns a substring of a given string. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

With character set UTF-8 specified, the $EXTRACT() function interprets the string arguments as UTF-8 encoded. With VIEW "BADCHAR" enabled, the $EXTRACT() function triggers a run-time error when it encounters a character in the reserved range of the Unicode™ Standard, but it does not process the characters that fall after the span specified by the arguments.

[Note]

For byte-oriented operations, use $ZEXTRACT(). In NOBADCHAR mode, $EXTRACT() interprets its string arguments as character-oriented rather than byte-oriented, and returns byte-oriented results only when every character in its arguments is encoded in a single byte.

Examples of $EXTRACT() in UTF-8 mode
Example:
GTM>FOR i=0:1:4 WRITE !,$EXTRACT("新年好",i),"<" 
<
新<
年<
好<
<
GTM>

This loop displays the result of $EXTRACT(), specifying no ending character position and a beginning character position "before", at the first, second, and third positions of, and "after" the string.

Example:
GTM>FOR i=0:1:4 WRITE !,$E("新年好",1,i),"<" 
<
新<
新年<
新年好<
新年好<
GTM>

This loop displays the result of $EXTRACT(), specifying a beginning character position of 1 and an ending character position "before", at the first, second, and third positions of, and "after" the string.

Example:
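TRIM(x)	; return x without leading and trailing spaces
	NEW i,j,nx,fx
	SET (nx,fx)=""
	; scan backward from the end to drop trailing spaces
	FOR j=$L(x):-1:0 S nx=$E(x,1,j) Q:$EXTRACT(x,j)'=" " 
	; scan forward from the start to drop leading spaces
	FOR i=1:1:j S fx=$E(nx,i,$L(x)) Q:$EXTRACT(x,i)'=" " 
	QUIT fx 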
GTM>SET str=" 新年好 "
GTM>WRITE $LENGTH(str)
5 
GTM>WRITE $LENGTH($$TRIM^trim(str))
3

This extrinsic function uses $EXTRACT() to remove extra leading and trailing spaces from its argument.

$Find()

The $FIND() function returns an integer character position that locates the occurrence of a substring within a string. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

With character set UTF-8 specified, the $FIND() function interprets the string arguments as UTF-8 encoded. With VIEW "BADCHAR" enabled, the $FIND() function triggers a run-time error when it encounters a malformed character, but it does not process the characters that fall after the span specified by the arguments.

[Note]

The $FIND() function must never be used for byte-oriented operations.

Examples of $FIND() in UTF-8 mode
Example:
GTM> WRITE $FIND("新年好","年") 
3 
GTM> 

This example uses the $FIND() function to WRITE the position of the first occurrence of the character "年". The return of 3 gives the position after the "found" substring.

Example:
GTM> WRITE $FIND("准备庆祝新年在上海:准备庆祝新年在上海","上",9) 
19 
GTM> 

This example uses $FIND() to WRITE the position after the next occurrence of the character "上" starting in character position nine.

Example:
GTM> SET t=1 FOR  SET t=$FIND("准备庆祝新年在上海:准备庆祝新年在上海","祝新",t) Q:'t  W !,t 
6 
16 
GTM> 

This example uses a loop with $FIND() to locate all occurrences of "祝新" in "准备庆祝新年在上海:准备庆祝新年在上海". The $FIND() returns 6 and 16 giving the positions after the two occurrences of "祝新".
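The "position after the match" convention is easy to get wrong when porting logic between languages. The Python sketch below models it; it is an approximation of the behavior shown in these examples rather than GT.M code, `m_find` is a hypothetical name, and edge cases such as an empty search string are not modeled.

```python
def m_find(string: str, substring: str, start: int = 1) -> int:
    """Model of $FIND(): 1-based position after the match, or 0 if none."""
    index = string.find(substring, max(start, 1) - 1)
    return 0 if index < 0 else index + len(substring) + 1


text = "准备庆祝新年在上海:准备庆祝新年在上海"

print(m_find("新年好", "年"))
print(m_find(text, "上", 9))

# Repeat the search from each returned position until nothing is found.
positions = []
t = 1
while True:
    t = m_find(text, "祝新", t)
    if not t:
        break
    positions.append(t)
print(positions)
```

Because the returned value is already the position just past the match, it can be fed back as the starting position for the next search, which is what the FOR loop in the last example does.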

$Justify()

The $JUSTIFY() function returns a formatted string. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

With character set UTF-8 specified, the $JUSTIFY() function interprets the string argument as UTF-8 encoded. With VIEW "BADCHAR" enabled, the $JUSTIFY() function triggers a run-time error when it encounters a malformed character.

Examples of $JUSTIFY() in UTF-8 mode
Example:
GTM> WRITE $JUSTIFY("新年好",10),!,$JUSTIFY("准备庆祝新年在上海",5) 
       新年好
准备庆祝新年在上海
GTM>

The above example uses $JUSTIFY() to display "新年好" in a field of 10 spaces and "准备庆祝新年在上海" in a field of 5 spaces. Because the length of "准备庆祝新年在上海" exceeds five spaces, the result overflows the specification.

Example:
GTM> WRITE "1234567890",!,$JUSTIFY(10.545,10,2) 
1234567890 
     10.55 
GTM>

This uses $JUSTIFY() to WRITE a rounded value right justified in a field of 10 spaces. Notice that the result has been rounded up.

Example:
GTM> WRITE "1234567890",!,$JUSTIFY(10.544,10,2) 
1234567890 
     10.54 
GTM> 

Again, this uses $JUSTIFY() to WRITE a rounded value right justified in a field of 10 spaces. Notice that the result has been rounded down.

Example:
GTM> WRITE "1234567890",!,$JUSTIFY(10.5,10,2) 
1234567890 
     10.50 
GTM> 

Once again, this uses $JUSTIFY() to WRITE a rounded value right justified in a field of 10 spaces. Notice that the result has been zero-filled to 2 places.

Example:
GTM> WRITE $JUSTIFY(.34,0,2)     
0.34 
GTM> 

This example uses $JUSTIFY() to ensure the fraction has a leading zero. Note the use of a second argument of zero in the case that rounding is the only function that $JUSTIFY is to perform.

$Length()

The $LENGTH() function returns the length of a string measured in characters, or in "pieces" separated by a delimiter specified by one of its arguments. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

With character set UTF-8 specified, the $LENGTH() function interprets the string argument(s) as UTF-8 encoded. With VIEW "BADCHAR" enabled, the $LENGTH() function triggers a run-time error when it encounters a malformed character.

Examples of $LENGTH() in UTF-8 mode
Example:
GTM> WRITE $LENGTH("新年好") 
3 
GTM> 

This uses $LENGTH() to WRITE the length in characters of the string "新年好".

Example:
GTM> SET x="新年好/准备庆祝新年在上海/准备庆" 
GTM> WRITE $LENGTH(x,"/") 
3 
GTM> 

This uses $LENGTH() to WRITE the number of pieces in a string, as delimited by /.

Example:
GTM> WRITE $LENGTH("/新/年好/","/") 
4 
GTM> 

This also uses $LENGTH() to WRITE the number of pieces in a string, as delimited by /. Notice that GT.M counts both the empty beginning and ending pieces in the string because they are both delimited.

$Piece()

The $PIECE() function returns a substring delimited by a specified string delimiter made up of one or more characters. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

With character set UTF-8 specified, the $PIECE() function interprets the string arguments as UTF-8 encoded. With VIEW "BADCHAR" enabled, the $PIECE() function triggers a run-time error when it encounters a malformed character, but it does not process the characters that fall after the span specified by the arguments.

Examples of $PIECE() in UTF-8 mode
Example:
GTM> FOR i=0:1:4 WRITE !,$PIECE("新 年 好"," ",i),"<" 

<
新<
年<
好<
<
GTM>

This loop displays the result of $PIECE(), specifying a space as a delimiter and a piece position "before", first, second, third, and "after" the string.

Example:
GTM> FOR i=-1:1:4 WRITE !,$PIECE("新 年 好"," ",i,i+1),"<" 
<
新<
新 年<
年 好<
好<
<
GTM>

This example is similar to the previous example except that it displays two pieces on each iteration. Notice the delimiter (a space) in the middle of the output for the third iteration, which displays both pieces.
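The piece ranges in the two loops above can be reproduced with a small model. The Python sketch below is an approximation for illustration, not GT.M code; `m_piece` is a hypothetical name, and it only covers the behavior exercised by these examples.

```python
def m_piece(string: str, delimiter: str, first: int = 1, last=None) -> str:
    """Model of $PIECE(): pieces first..last (1-based), joined by delimiter."""
    if last is None:
        last = first
    if not delimiter:
        return ""                 # an empty delimiter yields an empty result
    first = max(first, 1)         # positions before the string start at piece 1
    if last < first:
        return ""
    pieces = string.split(delimiter)
    return delimiter.join(pieces[first - 1:last])


text = "新 年 好"
print([m_piece(text, " ", i) for i in range(0, 5)])
print([m_piece(text, " ", i, i + 1) for i in range(-1, 5)])
```

Positions outside the string simply select fewer pieces, which is why the first and last iterations produce empty results.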

Example:
F p=1:1:$L(x,"/") W ?p-1*10,$piece(x,"/",p) 

This loop uses $LENGTH() and $PIECE() to display all the pieces of x in columnar format.

Example:
GTM> s $P(x,".",25)="" W x 

This SETs the 25th piece of the variable x to null, with a delimiter of a period. This produces a string of 24 periods preceding the null.

$TRanslate()

The $TRANSLATE() function returns a string that results from replacing or dropping characters in the first of its arguments as specified by the patterns of its other arguments. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

With character set UTF-8 specified, the algorithm of the $TRANSLATE() function interprets the string arguments as UTF-8 encoded. With VIEW "BADCHAR" enabled, the $TRANSLATE() function triggers a run-time error when it encounters a malformed character.

Examples of $TRANSLATE() in UTF-8 mode
Example:
GTM> WRITE $TR("新年好","好年","1") 
新1 
GTM> 

  • As "新" (the first character in the first expression) does not exist in the second expression ("好年"), it appears unchanged in the result.

  • As "年" (the second character in the first expression) holds the second position in the second expression ("好年"), and there is no second character in the third expression, $TRANSLATE() replaces occurrences of "年" with a null, effectively deleting it from the result.

  • As "好" (the third character in the first expression) holds the first position in the second expression ("好年"), $TRANSLATE() replaces occurrences of "好" with 1, which is in the first, and corresponding, position of the third expression. The translated result is "新1".

Example:
GTM>WRITE $TR("新","X新","年好") 
好

This $TRANSLATE() example finds the position of the first occurrence of the first expression in the second expression. Because the character "新" is in the second position, the output of the $TRANSLATE() function displays the character in the second position of the third expression.

Example:
GTM> WRITE $TR("新年好","好新") 
年
GTM>

As the $TRANSLATE() has only two parameters in this example, it finds the characters in the first expression that also exist in the second expression and deletes them from the result.
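The replacement rules used in these examples can be summarized in a short model. The Python sketch below is not GT.M code and `m_translate` is a hypothetical name. It assumes, as the M standard is generally understood to specify, that when a character appears more than once in the second argument, its first occurrence determines the mapping.

```python
def m_translate(string: str, search: str, replace: str = "") -> str:
    """Model of $TRANSLATE(): map or drop characters position by position."""
    table = {}
    for position, ch in enumerate(search):
        if ch in table:
            continue  # assumption: the first occurrence in search wins
        # None marks a character with no counterpart, which is dropped.
        table[ch] = replace[position] if position < len(replace) else None

    result = []
    for ch in string:
        if ch not in table:
            result.append(ch)          # not in search: unchanged
        elif table[ch] is not None:
            result.append(table[ch])   # mapped to the corresponding character
    return "".join(result)


print(m_translate("新", "X新", "年好"))
print(m_translate("新年好", "好新"))
```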

$Z Equivalent Functions

GT.M provides a number of functions that are analogous to the standard functions except that they support byte-oriented operations. In M mode, these functions are exactly equivalent to the standard functions. In UTF-8 mode, these functions provide a means to operate on arbitrary strings containing bytes that do not necessarily represent valid code-points. For code to operate properly in both modes, the $Z equivalent functions must always be used for operations that are byte-oriented rather than character-oriented.

$ZASCII()

The $ZASCII() function returns the numeric byte value (0 through 255) of a given sequence of octets (8-bit bytes).

The format for the $ZASCII() function is:

$ZA[SCII](expr[,intexpr])

  • The expression acts as the sequence of octets (8-bit bytes) from which $ZASCII() extracts the byte it decodes.

  • The optional integer expression contains the position within the expression of the byte that $ZASCII() decodes. If this argument is missing, $ZASCII() returns a result based on the first byte position. $ZASCII() numbers byte positions starting at one (1); that is, the first byte of a string is at position one (1).

  • If the explicit or implicit position is before the beginning or after the end of the expression, $ZASCII() returns a value of negative one (-1).

$ZASCII() provides a means of examining bytes in a byte sequence. Used with $ZCHAR(), $ZASCII() also provides a means to perform arithmetic operations on the byte values associated with a sequence of octets.

Example of $ZASCII()
Example:
GTM>FOR i=0:1:4 WRITE !,$ZA("新",i)

-1
230
150
176
-1

This loop displays the result of $ZASCII() for a byte position before the string, for the first, second, and third byte positions, and for a position after the end of the sequence of octets (8-bit bytes) represented by "新". In the above example, 230, 150, and 176 are the numeric values of the three bytes in the sequence of octets (8-bit bytes) represented by "新".
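The same byte values can be checked independently of GT.M. The following Python sketch is an illustrative model only; the function name zascii is hypothetical, and the model takes an already encoded byte string as input.

```python
def zascii(data, pos=1):
    """Model of $ZASCII(): value of the byte at 1-based position pos, else -1."""
    if 1 <= pos <= len(data):
        return data[pos - 1]
    return -1  # position before the beginning or after the end


octets = "新".encode("utf-8")
print([zascii(octets, i) for i in range(5)])  # [-1, 230, 150, 176, -1]
```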

$ZChar()

The $ZCHAR() function returns a byte sequence of one or more bytes corresponding to the numeric byte value(s) (0 through 255) specified in its argument(s).

The format for the $ZCHAR() function is:

$ZC[HAR](intexpr[,...])

  • The integer expression(s) specify the numeric byte value of the byte(s) $ZCHAR() returns.

GT.M limits the number of arguments to a maximum of 254. $ZCHAR() provides a means of producing byte sequences. Used with $ZASCII(), $ZCHAR() can also perform arithmetic operations on the byte values associated with a sequence of octets (8-bit bytes).

Example of $ZCHAR()
GTM>WRITE $ZCHAR(230,150,176,7)
新
GTM>

This example uses $ZCHAR() to WRITE the byte sequence represented by 新 and signal the terminal "bell."
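As a rough illustration of building byte sequences from numeric values, and of doing arithmetic on them, consider the following Python sketch. It is a model only: the function name zchar is hypothetical, and rejecting values outside 0 through 255 with an exception is an assumption of this sketch, not a statement about GT.M behavior.

```python
def zchar(*values):
    """Model of $ZCHAR(): one byte per argument.

    bytes() raises ValueError for values outside 0..255; that is an
    assumption of this sketch, not documented GT.M behavior.
    """
    return bytes(values)


xin = zchar(230, 150, 176)
print(xin.decode("utf-8"))  # 新

# Arithmetic on a byte value: adding 1 to the last byte yields the
# UTF-8 encoding of the next code-point.
shifted = zchar(230, 150, 176 + 1).decode("utf-8")
print(ord(shifted) - ord("新"))  # 1
```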

$ZExtract()

The $ZEXTRACT() function returns a byte sequence extracted from a given sequence of octets (8-bit bytes).

The format for the $ZEXTRACT() function is:

$ZE[XTRACT](expr[,intexpr1[,intexpr2]])

  • The expression specifies a sequence of octets (8-bit bytes) from which $ZEXTRACT() derives a byte sequence.

  • The first optional integer expression (second argument) specifies the starting byte position in the byte string expr of the substring result. If the starting position is beyond the end of the expression, $ZEXTRACT() returns the null string. If the starting position is zero (0) or negative, $ZEXTRACT() starts at the first byte position in the expression; if this argument is omitted, $ZEXTRACT() returns the first byte of the expression. $ZEXTRACT() numbers byte positions starting at one (1); that is, the first byte of a sequence of octets (8-bit bytes) is at position one (1).

  • The second optional integer expression (third argument) specifies the ending byte position for the result. If the ending position is beyond the end of the expression, $ZEXTRACT() stops with the last byte of the expression. If the ending position precedes the starting position, $ZEXTRACT() returns null. If this argument is omitted, $ZEXTRACT() returns one byte.

  • $ZEXTRACT() provides a tool for manipulating strings based on byte positions.

  • As $ZEXTRACT() operates on bytes, it can produce a string that is not well-formed according to the UTF-8 character set.

Examples of $ZEXTRACT()
Example:
GTM>FOR i=0:1:9 WRITE !,$ZASCII($ZEXTRACT("新年好",i)),"<"
-1<
230<
150<
176<
229<
185<
180<
229<
165<
189<

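This loop displays the numeric value of each byte in the sequence of octets ("新年好"). Position 0 lies before the first byte, so $ZEXTRACT() returns an empty string for it and the first line of output shows -1.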
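The position rules for $ZEXTRACT() can be sketched with ordinary byte slicing. The Python below is an illustrative model under the rules listed above; the function name zextract is hypothetical. It also shows why a byte-oriented extract can return a string that is not well-formed UTF-8.

```python
def zextract(data, start=1, end=None):
    """Model of $ZEXTRACT() on a byte string, using 1-based positions."""
    if end is None:
        end = start            # omitted third argument: a single byte
    start = max(start, 1)      # zero or negative start: begin at byte 1
    if end < start:
        return b""             # ending position precedes the start
    return data[start - 1:end]  # slicing stops at the last byte


octets = "新年好".encode("utf-8")
print([list(zextract(octets, i)) for i in range(4)])  # [[], [230], [150], [176]]
print(zextract(octets, 1, 3).decode("utf-8"))         # 新

# Bytes 2 through 4 straddle two characters, so they do not decode.
try:
    zextract(octets, 2, 4).decode("utf-8")
except UnicodeDecodeError:
    print("not well-formed UTF-8")
```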

$ZFind()

The $ZFIND() function returns an integer byte position that locates the occurrence of a byte sequence within a sequence of octets (8-bit bytes).

The format of the $ZFIND() function is:

$ZF[IND](expr1,expr2[,intexpr])

  • The first expression specifies the sequence of octets (8-bit bytes) in which $ZFIND() searches for the byte sequence.

  • The second expression specifies the byte sequence for which $ZFIND() searches.

  • The optional integer expression identifies the starting byte position for the $ZFIND() search. If this argument is missing, zero (0), or negative, $ZFIND() begins to search from the first position of the sequence of octets (8-bit bytes).

  • If $ZFIND() locates the byte sequence, it returns the position after its last byte. If the end of the byte sequence coincides with the end of the sequence of octets (expr1), it returns an integer equal to the byte length of expr1 plus one ($ZL(expr1)+1).

  • If $ZFIND() does not locate the byte sequence, it returns zero (0).

$ZFIND() provides a tool to locate byte sequences. The ( [ ) operator and the two-argument $ZLENGTH() are other tools that provide related functionality.

Examples of $ZFIND()
Example:
GTM> WRITE $ZFIND("新年好",$ZCHAR(150)) 
3 
GTM> 

This example uses $ZFIND() to WRITE the position of the first occurrence of the numeric byte code 150. The return of 3 gives the position after the "found" byte.

Example:
GTM> WRITE $ZFIND("新年好",$ZCHAR(229),5) 
8 
GTM> 

This example uses $ZFIND() to WRITE the position of the next occurrence of the byte code 229 starting in byte position five.

Example:
GTM> SET t=1 FOR  SET t=$ZFIND("新年好",$ZCHAR(230,150,176),t) Q:'t  W !,t

4 
GTM> 

This example uses a loop with $ZFIND() to locate all the occurrences of the byte sequence $ZCHAR(230,150,176) in the sequence of octets ("新年好"). $ZFIND() returns 4, the position after the only occurrence of the byte sequence $ZCHAR(230,150,176). The next search, starting at position 4, finds no further occurrence and returns 0, which ends the loop.
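The return values in the three examples above can be reproduced with a small model. The following Python sketch is illustrative only; the function name zfind is hypothetical, and it relies on the rule that $ZFIND() returns the position after the last byte of the match, or zero when there is no match.

```python
def zfind(data, target, start=1):
    """Model of $ZFIND(): 1-based position after the match, or 0."""
    start = max(start, 1)                # missing, zero, or negative start
    idx = data.find(target, start - 1)   # bytes.find() is 0-based
    return 0 if idx < 0 else idx + len(target) + 1


octets = "新年好".encode("utf-8")
print(zfind(octets, bytes([150])))     # 3
print(zfind(octets, bytes([229]), 5))  # 8

# Loop over all occurrences, as in the FOR loop example.
positions, t = [], 1
while True:
    t = zfind(octets, bytes([230, 150, 176]), t)
    if not t:
        break
    positions.append(t)
print(positions)                       # [4]
```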

$ZJustify()

The $ZJUSTIFY() function returns a formatted and fixed length byte sequence.

The format for the $ZJUSTIFY() function is:

$ZJ[USTIFY](expr,intexpr1[,intexpr2])

  • The expression specifies the sequence of octets formatted by $ZJUSTIFY().

  • The first integer expression (second argument) specifies the minimum size of the resulting byte sequence. If the first integer expression is larger than the length of the expression, $ZJUSTIFY() right justifies the expression to a byte sequence of the specified length by adding leading spaces. Otherwise, $ZJUSTIFY() returns the expression unmodified unless specified by the second integer argument.

  • The optional second integer expression (third argument) specifies the number of digits to follow the decimal point in the result, and forces $ZJUSTIFY() to evaluate the expression as numeric. If the numeric expression has more digits than this argument specifies, $ZJUSTIFY() rounds to obtain the result. If the expression has fewer digits than this argument specifies, $ZJUSTIFY() zero-fills to obtain the result.

  • When the second argument is specified and the first argument evaluates to a fraction between -1 and 1, $ZJUSTIFY() returns a number with a leading zero (0) before the decimal point (.).

$ZJUSTIFY() fills a sequence of octets to create a fixed length byte sequence. However, if the length of the specified expression exceeds the specified byte size, $ZJUSTIFY() does not truncate the result (although it may still round based on the third argument). When required, $ZEXTRACT() performs truncation.

$ZJUSTIFY() optionally rounds the portion of the result after the decimal point. In the absence of the third argument, $ZJUSTIFY() does not restrict the evaluation of the expression. In the presence of the third (rounding) argument, $ZJUSTIFY() evaluates the expression as a numeric value. The rounding algorithm can be understood as follows:

  • If necessary, the rounding algorithm extends the expression to the right with 0s (zeros) to have at least one more digit than specified by the rounding argument.

  • Then, it adds 5 (five) to the digit position after the digit specified by the rounding argument.

  • Finally, it truncates the result to the specified number of digits. The algorithm rounds up when excess digits specify a half or more of the last retained digit and rounds down when they specify less than a half.

Example of $ZJUSTIFY()
Example:
GTM> WRITE "123456789012345",! WRITE $ZJUSTIFY("新年好",15),!,$ZJUSTIFY("新年好",5) 
123456789012345
      新年好
新年好
GTM>

This uses $ZJUSTIFY() to display the sequence of octets represented by "新年好" in fields of 15 bytes and 5 bytes. Because the byte length of "新年好" is nine, the 15-byte field is filled with six leading spaces, while the 5-byte field is too small and the result overflows the specification.
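The padding and rounding rules can be approximated as follows. This Python sketch is a simplified model, not GT.M code: the function name zjustify is hypothetical, the three-argument form assumes the value is a non-negative number written as a decimal string, and half-up rounding from the decimal module stands in for the "add 5, then truncate" steps described above.

```python
from decimal import Decimal, ROUND_HALF_UP


def zjustify(value, width, digits=None):
    """Model of $ZJUSTIFY(): pad on the left to `width` bytes, never truncate."""
    if digits is not None:
        # Assumption: value is a non-negative decimal string such as "3.14159".
        quantum = Decimal(1).scaleb(-digits)   # digits=2 gives 0.01
        value = str(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))
    data = str(value).encode("utf-8")
    return data.rjust(width, b" ")             # width counts bytes, not characters


print(zjustify("新年好", 15).decode("utf-8"))  # six spaces, then 新年好
print(len(zjustify("新年好", 5)))              # 9: the field overflows
print(zjustify("3.14159", 8, 2))               # b'    3.14'
print(zjustify(".5", 0, 3))                    # b'0.500'
```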

$ZLength()

The $ZLENGTH() function returns the length of a sequence of octets measured in bytes, or in "pieces" separated by a delimiter specified by one of its arguments.

The format for the $ZLENGTH() function is:

$ZL[ENGTH](expr1[,expr2])

  • The first expression specifies the sequence of octets that $ZLENGTH() "measures".

  • The optional second expression specifies the delimiter that defines the measure; if this argument is missing, $ZLENGTH() returns the number of bytes in the sequence of octets.

  • If the second argument is present and not null, $ZLENGTH() returns one more than the count of the number of occurrences of the second byte sequence in the first byte sequence; if the second argument is null, the M standard specifies that $ZLENGTH() returns a zero (0).

$ZLENGTH() provides a tool for determining the lengths of a sequence of octets in two ways--bytes and pieces. The two argument $ZLENGTH() returns the number of existing pieces, while the one argument returns the number of bytes.

Examples of $ZLength()
Example:
GTM> WRITE $ZLENGTH("新年好")
9 
GTM> 

This uses $ZLENGTH() to WRITE the length in bytes of the sequence of octets "新年好".

Example:
GTM> SET x="新"_$ZCHAR(63)_"年"_$ZCHAR(63)_"好" 
GTM> WRITE $ZLENGTH(x,$ZCHAR(63))
3
GTM>

This uses $ZLENGTH() to WRITE the number of pieces in a sequence of octets, as delimited by the byte code $ZCHAR(63).

Example:
GTM>SET x=$ZCHAR(63)_"新"_$ZCHAR(63)_"年"_$ZCHAR(63)_"好"_$ZCHAR(63)
GTM>WRITE $ZLENGTH(x,$ZCHAR(63))
5
GTM>

This also uses $ZLENGTH() to WRITE the number of pieces in a sequence of octets, as delimited by byte code $ZCHAR(63). Notice that GT.M counts both the empty beginning and ending pieces in the string because they are both delimited.
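The piece counts follow directly from the rule "occurrences of the delimiter plus one". The Python sketch below is an illustrative model with a hypothetical function name, zlength, operating on encoded byte strings.

```python
def zlength(data, delim=None):
    """Model of $ZLENGTH(): byte length, or number of delimited pieces."""
    if delim is None:
        return len(data)            # one-argument form: count bytes
    if delim == b"":
        return 0                    # null delimiter
    return data.count(delim) + 1    # occurrences of the delimiter plus one


q = bytes([63])                     # $ZCHAR(63), the "?" byte
xin, nian, hao = (c.encode("utf-8") for c in "新年好")

print(zlength(xin + nian + hao))                          # 9
print(zlength(xin + q + nian + q + hao, q))               # 3
print(zlength(q + xin + q + nian + q + hao + q, q))       # 5
```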

$ZPiece()

The $ZPIECE() function returns a sequence of bytes delimited by a specified byte sequence made up of one or more bytes. In M, $ZPIECE() returns a logical field from a logical record.

The format for the $ZPIECE function is:

$ZP[IECE](expr1,expr2[,intexpr1[,intexpr2]])

  • The first expression specifies the sequence of octets from which $ZPIECE() takes its result.

  • The second expression specifies the delimiting byte sequence that determines the piece "boundaries"; if this argument is a null string, $ZPIECE() returns a null string.

  • If the second expression does not appear anywhere in the first expression, $ZPIECE() returns the entire first expression (unless forced to return null by the second integer expression).

  • The optional first integer expression (third argument) specifies the beginning piece to return; if this argument is missing, $ZPIECE() returns the first piece.

  • The optional second integer expression (fourth argument) specifies the last piece to return. If this argument is missing, $ZPIECE() returns only one piece unless the first integer expression is zero (0) or negative, in which case it returns a null string. If this argument is less than the first integer expression, $ZPIECE() returns null.

  • If the second integer expression exceeds the actual number of pieces in the first expression, $ZPIECE() returns all of the expression after the delimiter selected by the first integer expression.

  • The $ZPIECE() result never includes the "outside" delimiters; however, when the second integer argument specifies multiple pieces, the result contains the "inside" occurrences of the delimiter.

$ZPIECE() provides a tool for efficiently using values that contain multiple elements or fields, each of which may be variable in length.

Applications typically use a single byte for a $ZPIECE() delimiter (second argument) to minimize storage overhead, and increase efficiency at run-time. The delimiter must be chosen so the data values never contain the delimiter. Failure to enforce this convention with edit checks may result in unanticipated changes in the position of pieces within the data value. The caret symbol (^), backward slash (\), and asterisk (*) characters are examples of popular visible delimiters. Multiple byte delimiters may reduce the likelihood of conflict with field contents. However, they decrease storage efficiency, and are processed with less efficiency than single byte delimiters. Some applications use control characters, which reduce the chances of the delimiter appearing in the data but sacrifice the readability provided by visible delimiters.

A SET command argument can have something that has the format of a $ZPIECE() on the left-hand side of its equal sign (=). This construct permits easy maintenance of individual pieces within a sequence of octets. It also can be used to generate a byte sequence of delimiters. For more information on SET $ZPIECE(), refer to SET in the "Commands" chapter.

Examples of $ZPIECE()
Example:
GTM>FOR i=0:1:3 WRITE !,$ZPIECE("新"_$ZCHAR(64)_"年",$ZCHAR(64),i),"<" 

<
新<
年<
<
GTM>

This loop displays the result of $ZPIECE(), specifying $ZCHAR(64) as a delimiter, a piece position "before," first and second, and "after" the sequence of octets.

Example:
GTM>FOR i=-1:1:3 WRITE !,$ZPIECE("新"_$ZCHAR(64)_"年",$ZCHAR(64),i,i+1),"<"
<
新<
新@年<
年<
<
GTM>

This example is similar to the previous example except that it displays two pieces on each iteration. Notice the delimiter (@, which is $ZCHAR(64)) in the middle of the output for the third iteration, which displays both pieces.

Example:
F p=1:1:$ZL(x,"/") W ?p-1*10,$zpiece(x,"/",p) 

This loop uses $ZLENGTH() and $ZPIECE() to display all the pieces of x in columnar format.

Example:
GTM> s $ZPIECE(x,$ZCHAR(64),25)="" W x
新年好@@@@@@@@@@@@@@@@@@@@@@@@

Assuming x initially contains "新年好", this SETs the 25th piece of the variable x to null, with delimiter $ZCHAR(64). This produces a byte sequence of 24 delimiters (@) preceding the null piece.
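The piece selection rules, and the way SET $ZPIECE() pads with delimiters, can be modeled as below. This Python sketch is illustrative only: the names zpiece and set_zpiece are hypothetical, and set_zpiece assumes a non-empty delimiter and a piece number of one or more.

```python
def zpiece(data, delim, first=1, last=None):
    """Model of $ZPIECE(): pieces first..last joined by the delimiter."""
    if last is None:
        last = first
    if not delim or last < first or last < 1:
        return b""
    pieces = data.split(delim)
    first = max(first, 1)
    # Only the "inside" delimiters appear in a multi-piece result.
    return delim.join(pieces[first - 1:last])


def set_zpiece(data, delim, index, value):
    """Model of SET $ZPIECE(x,delim,index)=value (index >= 1 assumed)."""
    pieces = data.split(delim)
    pieces += [b""] * (index - len(pieces))   # create missing empty pieces
    pieces[index - 1] = value
    return delim.join(pieces)


at = bytes([64])                              # $ZCHAR(64), the "@" byte
rec = "新".encode("utf-8") + at + "年".encode("utf-8")
print([zpiece(rec, at, i).decode("utf-8") for i in range(4)])
# ['', '新', '年', '']
print(zpiece(rec, at, 1, 2).decode("utf-8"))  # 新@年
print(set_zpiece("新年好".encode("utf-8"), at, 25, b"").count(at))  # 24
```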

$ZTRanslate()

The $ZTRANSLATE() function returns a byte sequence that results from replacing or dropping bytes in the first of its arguments as specified by the patterns of its other arguments.

The format for the $ZTRANSLATE() function is:

$ZTR[ANSLATE](expr1[,expr2[,expr3]])

  • The first expression specifies the sequence of octets on which $ZTRANSLATE() operates. If the other arguments are omitted, $ZTRANSLATE() returns this expression.

  • The optional second expression specifies the bytes for $ZTRANSLATE() to replace. If a byte occurs more than once in the second expression, the first occurrence controls the translation, and $ZTRANSLATE() ignores subsequent occurrences. If this argument is omitted, $ZTRANSLATE() returns the first expression without modification.

  • The optional third expression specifies the replacement byte sequence for the second expression that corresponds by position. If this argument is omitted or shorter than the second expression, $ZTRANSLATE() drops all occurrences of the bytes in the second expression that have no replacement in the corresponding position of the third expression.

$ZTRANSLATE() provides a tool for tasks such as encryption.

The $ZTRANSLATE() algorithm can be understood as follows:

  • $ZTRANSLATE() evaluates each byte in the first expression, comparing it byte by byte to the second expression looking for a match. If there is no match in the second expression, the resulting expression contains the byte without modification.

  • When it locates a byte match, $ZTRANSLATE() uses the position of the match in the second expression to identify the appropriate replacement for the original expression. If the second expression has more bytes than the third expression, $ZTRANSLATE() replaces the original byte with a null, thereby deleting it from the result. By extension of this principle, if the third expression is missing, $ZTRANSLATE() deletes all bytes from the first expression that occur in the second expression.

Examples of $ZTRANSLATE()
Example:
GTM>set hiraganaA="あ" ; # $ZCHAR(227,129,130)
GTM>set temp1=$ZCHAR(130)
GTM>set temp2=$ZCHAR(140)
GTM>W hiraganaA
あ
GTM>W $ZTRANSLATE(hiraganaA,temp1,temp2)
が
GTM>

In the above example, $ZTRANSLATE() finds the byte $ZCHAR(130) in the first expression (あ), matches it to the first (and only) byte in the second expression, and replaces it with $ZCHAR(140), the corresponding byte in the third expression. The translated result is が.
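Because $ZTRANSLATE() works on single bytes, its algorithm maps naturally onto a 256-entry lookup table. The Python sketch below is an illustrative model, not GT.M code; the function name ztranslate is hypothetical, and it uses the standard bytes.translate() method to apply the table.

```python
def ztranslate(data, find=b"", replace=b""):
    """Byte-wise model of $ZTRANSLATE() using a 256-entry lookup table."""
    table = bytearray(range(256))   # identity mapping for every byte value
    delete = bytearray()
    seen = set()
    for pos, byte in enumerate(find):
        if byte in seen:
            continue                # the first occurrence controls
        seen.add(byte)
        if pos < len(replace):
            table[byte] = replace[pos]
        else:
            delete.append(byte)     # no replacement: drop the byte
    return data.translate(bytes(table), bytes(delete))


hiragana_a = "あ".encode("utf-8")                  # bytes 227, 129, 130
result = ztranslate(hiragana_a, bytes([130]), bytes([140]))
print(result.decode("utf-8"))                      # が

# Deleting one byte of a multi-byte character leaves an invalid sequence.
print(list(ztranslate("新".encode("utf-8"), bytes([150]))))  # [230, 176]
```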

New $Z Functions

$ZCOnvert()

The $ZCONVERT() function returns its first argument as a string converted to a different encoding. The two argument form changes the encoding for case within a character set. The three argument form changes the encoding scheme.

The format for the $ZCONVERT() function is:

$ZCO[NVERT](expr1,expr2[,expr3])

  • The first expression is the string to convert. If the expression contains a code-point value that is not in the character set, $ZCONVERT() generates a run-time error.

  • In the two argument form, the second expression specifies a code that determines the form of the result. In the three-argument form, the second expression specifies a code that controls the character set interpretation of the first argument. If the expression does not evaluate to one of the defined codes corresponding to a valid code for the number of available arguments, $ZCONVERT() generates a run-time error.

  • The optional third expression specifies a code that determines the character set of the result. If the expression does not evaluate to one of the defined codes, $ZCONVERT() generates a run-time error. The three-argument form is not supported in M mode.

The valid (case insensitive) character codes for expr2 in the two-argument form are:

  • U converts the string to UPPER-CASE. "UPPER-CASE" refers to words where all the characters are converted to their "capital letter" equivalents. Characters that are already in UPPER-CASE are retained unchanged.

  • L converts the string to lower-case. "lower-case" refers to words where all the letters are converted to their “small letter” equivalents. Characters that are already in lower-case or have no lower-case equivalent are retained unchanged.

  • T converts the string to title case. "Title case" refers to a string where the first character of each word is in the upper-case and the remaining ones in the lower-case. Characters that are already in the “Title case” are retained unchanged. “T” (title case) is not supported in M mode.

The valid (case insensitive) codes for character set encoding for expr2 and expr3 in the three-argument form are:

  • "UTF-8"-- a multi-byte variable length encoding form of Unicode™.

  • "UTF-16LE"-- a multi-byte 16-bit encoding form of Unicode™ in little-endian.

  • "UTF-16BE"-- a multi-byte 16-bit encoding form of Unicode™ in big-endian.

[Note]

When UTF-8 mode is enabled, GT.M uses the ICU Library to perform case conversion. As mentioned in the Theory of Operation section, the case conversion of strings occurs according to Unicode™ code-point values. This may not be the linguistically or culturally correct case conversion, for example, for the names in telephone directories. Therefore, application developers must ensure that the actual case conversion is linguistically and culturally correct for their specific needs. In M mode, the two-argument form of the $ZCONVERT() function does not use the ICU Library to perform case conversion.

Examples of $ZCONVERT()
Example:
GTM>W $ZCONVERT("Happy New Year","U")
HAPPY NEW YEAR
Example:
GTM>W $ZCHSET 
M
GTM> W $ZCONVERT("HAPPY NEW YEAR","T")
%GTM-E-BADCASECODE, T is not a valid case conversion code
Example:
GTM>S T8="准备庆祝新年在上海"
GTM>W $L(T8)
9
GTM>S T16=$ZCONVERT(T8,"UTF-8","UTF-16LE")
GTM>W $L(T16) 
%GTM-E-BADCHAR, $ZCHAR(198) is not a valid character in the UTF-8 encoding form
GTM>S T16=$ZCONVERT(T16,"UTF-16LE","UTF-8")
GTM>W $L(T16)
9

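In the above example, the $LENGTH() function triggers an error because T16 holds UTF-16LE encoded bytes, and in UTF-8 mode $LENGTH() accepts only UTF-8 encoded strings as its argument. Converting T16 back to UTF-8 restores the original nine-character string.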
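The byte value reported in the BADCHAR message can be verified with any UTF-16LE encoder. The following Python sketch is an illustration that uses Python's own codecs rather than ICU; the function name zconvert is hypothetical and models only the three-argument form.

```python
def zconvert(data, source, target):
    """Model of three-argument $ZCONVERT(): re-encode a byte string."""
    return data.decode(source).encode(target)


t8 = "准备庆祝新年在上海".encode("utf-8")
t16 = zconvert(t8, "utf-8", "utf-16-le")
print(len(t8), len(t16))   # 27 18  (byte lengths; both hold 9 characters)
print(t16[0])              # 198, the byte named in the BADCHAR message

# The UTF-16LE bytes are not valid UTF-8, which is why a UTF-8
# character function rejects them.
try:
    t16.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")

print(zconvert(t16, "utf-16-le", "utf-8") == t8)  # True
```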

$ZSUBstr()

The $ZSUBSTR() function returns a properly encoded string from a sequence of bytes.

The format for the $ZSUBSTR() function is:

$ZSUB[STR](expr,intexpr1[,intexpr2])

  • The first expression is an expression of the byte string from which $ZSUBSTR() function derives the character sequence.

  • The second expression is the starting byte position (counting from 1 for the first position) in the first expression from where $ZSUBSTR() begins to derive the character sequence.

  • The optional third expression specifies the number of bytes from the starting byte position specified by the second expression that contribute to the result. If the third expression is not specified, the $ZSUBSTR() function returns the sequence of characters starting from the byte position specified by the second expression up to the end of the byte string.

  • The $ZSUBSTR() function never returns a string with illegal or invalid characters. With VIEW NOBADCHAR enabled, the $ZSUBSTR() function ignores all byte sequences within the specified range that do not correspond to valid Unicode™ code-points. With VIEW BADCHAR enabled, the $ZSUBSTR() function triggers a run-time error if the specified byte sequence contains a code-point value that is not in the character set.

The $ZSUBSTR() function is a new function introduced in conjunction with Unicode™ support. Like the $ZCONVERT() function and the $ZWIDTH() function, it does not have a traditional M equivalent.

Examples of $ZSUBSTR()
Example:
GTM>W $ZCHSET
M
GTM>set char1="a" ; one byte character 
GTM>set char2="ç"; two-byte character
GTM>set char3="新"; three-byte character
GTM>set y=char1_char2_char3
GTM>W $ZSUBSTR(y,1,3)=$ZSUBSTR(y,1,5)
0

With character set M specified, the expression $ZSUBSTR(y,1,3)=$ZSUBSTR(y,1,5) evaluates to 0 or "false" because the expression $ZSUBSTR(y,1,5) returns more characters than $ZSUBSTR(y,1,3).

Example:
GTM>W $ZCHSET
UTF-8
GTM>set char1="a" ; one byte character 
GTM>set char2="ç"; two-byte character
GTM>set char3="新"; three-byte character
GTM>set y=char1_char2_char3
GTM>W $ZSUBSTR(y,1,3)=$ZSUBSTR(y,1,5)
1

With character set UTF-8 specified, the expression $ZSUBSTR(y,1,3)=$ZSUBSTR(y,1,5) evaluates to 1 or "true" because the expression $ZSUBSTR(y,1,5) returns a string made up of char1 and char2 excluding the three-byte char3 because it was not completely included in the specified byte-length.

In many ways, the $ZSUBSTR() function is similar to the $ZEXTRACT() function. For example, in M mode $ZSUBSTR(expr,intexpr1) is equivalent to $ZEXTRACT(expr,intexpr1,$ZL(expr)). This means that when using the M character set, $ZSUBSTR() behaves identically to $EXTRACT() and $ZEXTRACT().

The differences are as follows:

  • $ZSUBSTR() cannot appear on the left of the equal sign in the SET command, whereas $ZEXTRACT() can.

  • In both modes, the third expression of $ZSUBSTR() is a number of bytes, rather than a character position, within the first expression.

  • $EXTRACT() operates on characters, irrespective of byte length.

  • $ZEXTRACT() operates on bytes, irrespective of multi-byte character boundaries.

  • $ZSUBSTR() is the only way to extract valid UTF-8 encoded characters from a given byte string. It operates on characters in Unicode™ so that its result does not exceed the given byte length.
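The UTF-8 mode behavior with VIEW NOBADCHAR can be approximated by slicing bytes and then discarding whatever does not decode. The Python sketch below is a model under that assumption; the function name zsubstr is hypothetical, a starting position of one or more is assumed, and the sample string uses 新 as its three-byte character.

```python
def zsubstr(data, start, length=None):
    """Model of $ZSUBSTR() in UTF-8 mode with NOBADCHAR semantics."""
    begin = start - 1                       # start is a 1-based byte position
    chunk = data[begin:] if length is None else data[begin:begin + length]
    # errors="ignore" drops invalid and incomplete byte sequences.
    return chunk.decode("utf-8", errors="ignore")


y = ("a" + "ç" + "新").encode("utf-8")      # 1 + 2 + 3 = 6 bytes
print(zsubstr(y, 1, 3))                     # aç
print(zsubstr(y, 1, 5))                     # aç (新 is only partly included)
print(zsubstr(y, 1, 3) == zsubstr(y, 1, 5)) # True
print(zsubstr(y, 1, 6))                     # aç新
```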

$ZWidth()

The $ZWIDTH() function returns the number of columns required to display a given string on the screen or printer.

The format for the $ZWIDTH() function is:

$ZW[IDTH](expr)

  • The expression is the string which $ZWIDTH() evaluates for display length. If the expression contains a code-point value that is not a valid character in Unicode™, $ZWIDTH() generates a run-time error.

  • If the expression contains any non-graphic characters, the $ZWIDTH() function does not count those characters.

  • If the string contains any escape sequences containing graphical characters (which they typically do), $ZWIDTH() includes those characters in calculating its result, as it does not do escape processing. In such a case, the result may be larger than the actual display width.

[Important]

The $ZWIDTH() function triggers a run-time error if it encounters a malformed byte sequence irrespective of the setting of "BADCHAR".

With character set UTF-8 specified, the $ZWIDTH() function uses the ICU's glyph-related conventions to calculate the number of columns required to represent the expression.

Examples of $ZWIDTH()
Example:
GTM>S NG=$CHAR($$FUNC^%HD("200B"))
GTM>S STR=$CHAR(26032)_NG_$CHAR(26376)
GTM>W STR
新​月
GTM>W $ZWIDTH(STR)
4
GTM>

In the above example, the local variable NG contains a non-graphic character (U+200B, zero width space) that occupies no columns when displayed between the two double-width characters. $ZWIDTH() therefore returns 4.

Example:
GTM> W $ZWIDTH("Get ready to celebrate the new year in Shanghai")
47
GTM>S A="新年好"
GTM>W "123456",!,A
123456
新年好
GTM>W $ZWIDTH(A)
6

In the above example, the $ZWIDTH() function returns 6 because each of the three characters in A occupies 2 columns when displayed on the screen or printer.
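The column counts in these examples can be approximated with the Unicode character database. The Python sketch below is only an approximation and does not use ICU: the function name zwidth is hypothetical, and the choice of which general categories count as zero columns and which East Asian Width classes count as two columns is an assumption of this sketch.

```python
import unicodedata


def zwidth(text):
    """Approximate display width in columns (not ICU's actual algorithm)."""
    width = 0
    for ch in text:
        # Assumption: control, format, and combining characters take no columns.
        if unicodedata.category(ch) in ("Cc", "Cf", "Mn", "Me"):
            continue
        # Assumption: Wide and Fullwidth characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


print(zwidth(chr(26032) + "\u200b" + chr(26376)))                    # 4
print(zwidth("新年好"))                                               # 6
print(zwidth("Get ready to celebrate the new year in Shanghai"))     # 47
```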

Intrinsic Special Variables

$X

For complete description and UTF-8 mode examples, refer to the Cursor Position Variable section earlier in this document.

$ZPATN[umeric]

$ZPATN[UMERIC] is a read-only intrinsic special variable that determines how GT.M interprets the patcode “N” used in the pattern match operator. With $ZPATNUMERIC="UTF-8", the patcode “N” matches any numeric character as defined by Unicode™. With $ZPATNUMERIC="M", GT.M restricts the patcode “N” to match only ASCII digits 0-9 (that is, ASCII 48-57). When a process starts in UTF-8 mode, the intrinsic special variable $ZPATNUMERIC takes its value from the environment variable gtm_patnumeric. GT.M initializes $ZPATNUMERIC to "UTF-8" if gtm_patnumeric is defined to "UTF-8". If gtm_patnumeric is not defined, or is set to a value other than "UTF-8", GT.M initializes $ZPATNUMERIC to "M".

$ZPATNUMERIC cannot appear on the left of an equal sign in a SET command. That is, GT.M populates it at process initialization from gtm_patnumeric and does not allow the process to change the value.
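The initialization rule can be summarized as a small decision function. The Python sketch below is a model of the rule as described in this section, not GT.M code: the function name init_zpatnumeric is hypothetical, a dictionary stands in for the process environment, and exact comparison with the string "UTF-8" is an assumption of this sketch.

```python
def init_zpatnumeric(env):
    """Model of how $ZPATNUMERIC is chosen at process startup."""
    # In M mode, gtm_patnumeric is ignored.
    if env.get("gtm_chset") != "UTF-8":
        return "M"
    # In UTF-8 mode, only the value "UTF-8" selects Unicode numerics.
    return "UTF-8" if env.get("gtm_patnumeric") == "UTF-8" else "M"


print(init_zpatnumeric({"gtm_chset": "UTF-8", "gtm_patnumeric": "UTF-8"}))  # UTF-8
print(init_zpatnumeric({"gtm_chset": "UTF-8"}))                             # M
print(init_zpatnumeric({"gtm_patnumeric": "UTF-8"}))                        # M
```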

$ZCH[set]

The read-only special intrinsic variable $ZCHSET takes its value from the environment variable gtm_chset. An application can determine the character set used by a GT.M process from the value of $ZCHSET. $ZCHSET can have only two values, "M" or "UTF-8", and it cannot appear on the left of an equal sign in the SET command.

Note that behavior for 7-bit ASCII characters is the same in both "M" and "UTF-8" modes. Customers operating in M mode are expected to use various ISO-Latin character sets.

GT.M only supports Unicode™ on Unix platforms. In OpenVMS, GT.M always gives special intrinsic variable $ZCHSET the value "M" and ignores the value of the environment variable gtm_chset even if it is defined.

$ZPROMpt

$ZPROM[PT] contains a string value specifying the current Direct Mode prompt. By default, GTM> is the Direct Mode prompt. M routines can modify $ZPROMPT by means of a SET command. $ZPROMPT cannot exceed 31 bytes. If an attempt is made to assign a longer string to $ZPROMPT, GT.M takes only the first 31 bytes and truncates the rest. With character set UTF-8 specified, if the 31st byte is not the end of a valid UTF-8 character, GT.M truncates the $ZPROMPT value at the end of the last character that completely fits within the 31 byte limit.
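The truncation rule for UTF-8 mode amounts to cutting at the byte limit and then dropping any incomplete trailing character. The Python sketch below models that rule for well-formed input; the function name truncate_prompt is hypothetical.

```python
def truncate_prompt(prompt, limit=31):
    """Model of $ZPROMPT truncation in UTF-8 mode for well-formed input."""
    data = prompt.encode("utf-8")[:limit]   # keep at most `limit` bytes
    # Dropping undecodable bytes removes a character cut off at the limit.
    return data.decode("utf-8", errors="ignore")


print(truncate_prompt("GTM>"))               # GTM>
print(len(truncate_prompt("a" * 40)))        # 31
eleven = "新" * 11                           # 33 bytes in UTF-8
kept = truncate_prompt(eleven)
print(len(kept), len(kept.encode("utf-8")))  # 10 30
```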

User-defined Collation

As noted in the Theory of Operation section, applications that use characters in Unicode™ may need to implement their own collation functions. For instructions on defining a collation system, please refer to Chapter 10, "Internationalization", of the GT.M Programmer's Guide.

By default, GT.M sorts string subscripts in the default order of the Unicode™ numeric code-point ($ASCII()) values. Since this implied ordering may or may not be linguistically or culturally correct for a specific application, an implementation of an algorithm such as the Unicode™ Collation Algorithm (UCA) may be required. Note that implementation of collation in GT.M requires the implementation of two functions, f(x) and g(y). f(x) transforms each input sequence of bytes into an alternative sequence of bytes for storage. Within the GT.M database engine, M nodes are retrieved according to the byte order in which they are stored. For each y that can be generated by f(x), g(y) is an inverse function that provides the original sequence of bytes; in other words, g(f(x)) must be equal to x for all x that the application processes. For example, for the People's Republic of China, it may be appropriate to convert from UTF-8 to Guojia Biaozhun (国家标准), the GB18030 standard, using a library such as libiconv. The following requirements are important:

  • Unambiguous transformation routines: The transform and its inverse must convert each input string to a unique sequence of bytes for storage, and convert each sequence of bytes stored back to the original string.

  • Collation sequence for all expected character sequences in subscripts: GT.M does not validate the subscript strings passed to/from the collation routines. If the application design allows illegal UTF-8 character sequences to be stored in the database, the collation functions must appropriately transform, and inverse transform, these as well.

  • Handle different string lengths for before and after transformation: If the lengths of the input string and transformed string differ, and, for local variables, if the output buffer passed by GT.M is not sufficient, follow the procedure described below:

  • Global Collation Routines: The transformed key must not exceed 255 bytes, the maximum key size. GT.M allocates a temporary buffer of size 255 bytes in the output string descriptor (of type DSC_K_DTYPE_T) and passes it to the collation routine to return the transformed key.

  • Local Collation Routines: GT.M allocates a temporary buffer in the output string descriptor based on the size of the input string. Both transformation and inverse transformation must check the buffer size, and if it is not sufficient, the transformation must allocate sufficient memory, set the output descriptor value (val field of the descriptor) to point to the new memory, and return the transformed key successfully. Since GT.M copies the key from the output descriptor into its internal structures, it is important that the memory allocated remain available even after the collation routines return. Collation routines are typically called throughout the process lifetime; therefore, GT.M expects the collation libraries to define a large static buffer sufficient to hold all key sizes in the application. Alternatively, the collation transform can use a large heap buffer (allocated by the system malloc() or GT.M gtm_malloc()). Application developers must choose the method best suited to their needs.
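The round-trip requirement g(f(x)) = x can be illustrated with the UTF-8 to GB18030 conversion mentioned above. The Python sketch below is a conceptual model only: actual GT.M collation routines are compiled library routines that work with string descriptors, the names transform and inverse are hypothetical, and this sketch does not handle illegal UTF-8 input.

```python
MAX_KEY_SIZE = 255  # maximum key size stated above for global collation


def transform(key):
    """f(x): UTF-8 bytes to GB18030 bytes for storage."""
    stored = key.decode("utf-8").encode("gb18030")
    if len(stored) > MAX_KEY_SIZE:
        raise ValueError("transformed key exceeds the maximum key size")
    return stored


def inverse(stored):
    """g(y): GB18030 bytes back to the original UTF-8 bytes."""
    return stored.decode("gb18030").encode("utf-8")


keys = [s.encode("utf-8") for s in ("新年好", "准备", "Shanghai")]
print(all(inverse(transform(k)) == k for k in keys))  # True

# Nodes would be retrieved in the byte order of the transformed keys.
ordered = sorted(keys, key=transform)
```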

Compiling and Linking

To properly handle embedded literals, GT.M generates different object code from the same source code depending on whether $ZCHSET is "M" or "UTF-8". GT.M uses $ZROutines to match object code to source code. If there is no object code, GT.M automatically generates an object in the mode of the current process. If the object code exists and does not match the mode of the current process, GT.M issues an error. This means that when both M and UTF-8 processes use the same source code, the objects must be stored in separate directories or libraries, and the processes must have differing $ZROutines values that locate the appropriate object code.

Environment variables

The following table summarizes the Unicode™ related environment variables.

Unix Environment Variables

Variable Name

Description

gtm_chset

Use this environment variable to initialize the value of the special intrinsic variable $ZCHSET. To enable a process to start in UTF-8 mode, the environment variable gtm_chset must be set to "UTF-8".

If the environment variable gtm_chset is not defined, or is defined to a value other than "UTF-8", the GT.M process starts in M mode and assumes each character is encoded in a single byte. This is the default behavior. The default value of "M" for gtm_chset minimizes the changes to applications coded before Unicode™ support.

gtm_badchar

Use this environment variable to initialize the value of VIEW “BADCHAR”. If gtm_badchar is defined and evaluates to “TRUE” (or “T”), “YES” (or “Y”), or a non-zero integer, VIEW “BADCHAR” is enabled. If it is defined with any other value, VIEW “NOBADCHAR” is enabled. If it is not defined, VIEW “BADCHAR” is enabled by default. For more details, please refer to the VIEW command section.

gtm_patnumeric

Use this environment variable to initialize the value of the special intrinsic variable $ZPATNUMERIC in UTF-8 mode.

If the value of special intrinsic variable $ZCHSET is M, GT.M ignores the value of the environment variable gtm_patnumeric and initializes $ZPATNUMERIC to "M".

LC_CTYPE

ICU uses the environment variable LC_CTYPE to determine the locale behavior. In an installation using multiple Unicode™ encoded languages, all processes may have gtm_chset as UTF-8, but might have different LC_CTYPE settings. Using an LC_CTYPE setting that does not match the application assumptions, particularly previously stored data, may cause undesirable results. The process of “setting” LC_CTYPE depends on the shell in use (for example, setenv LC_CTYPE in tcsh). The action associated with the NONUTF8CHSET error explains how to start a user down a path to a successful recovery.

[Important]

If LC_CTYPE specifies a character set other than UTF-8, GT.M fails to start up and reports the NONUTF8CHSET error. Note that the LC_ALL environment variable overrides all the LC_* (locale) variables. GT.M only requires an appropriate setting for LC_CTYPE; other applications or work in the system may dictate whether LC_ALL is appropriate.
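The rules in this table can be summarized in a short Python sketch. It only mirrors the descriptions above and is not how GT.M itself is implemented. The function names are hypothetical, the case-insensitive handling of gtm_badchar is an assumption, and the fallback to LANG follows general POSIX locale precedence rather than anything stated in this bulletin.

```python
import os


def startup_chset(env):
    """Mode a process starts in: UTF-8 only when gtm_chset is "UTF-8"."""
    # Assumption: exact comparison; GT.M may be more lenient about case.
    return "UTF-8" if env.get("gtm_chset") == "UTF-8" else "M"


def badchar_enabled(env):
    """True if VIEW "BADCHAR" would be enabled at startup."""
    value = env.get("gtm_badchar")
    if value is None:
        return True                    # undefined: BADCHAR is the default
    text = value.strip().upper()       # assumption: case-insensitive
    if text in ("TRUE", "T", "YES", "Y"):
        return True
    try:
        return int(text) != 0          # any non-zero integer enables it
    except ValueError:
        return False                   # anything else: NOBADCHAR


def effective_ctype(env):
    """LC_ALL overrides LC_CTYPE; LANG is the POSIX fallback."""
    return env.get("LC_ALL") or env.get("LC_CTYPE") or env.get("LANG") or ""


def locale_is_utf8(env):
    """True if the effective locale names a UTF-8 codeset."""
    codeset = effective_ctype(env).partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


# Inspect the current environment before starting a UTF-8 mode process.
print(startup_chset(os.environ), badchar_enabled(os.environ),
      locale_is_utf8(os.environ))
```

A wrapper script could run a check like locale_is_utf8() before launching a UTF-8 mode process, so that a mismatched locale is reported before GT.M raises NONUTF8CHSET.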

Utility Programs

GDE

As noted in the previous sections, a process operating in M mode exhibits unaltered behavior. There is no change in the GDE utility in M mode. In the UTF-8 mode, the changes to the GDE objects are as follows:

GDE Objects

Allowed format

Description

File name

Unicode

GDE allows the name of a file to include characters in Unicode

Global variables, Regions, and Segments

ASCII

As there are no changes to the GT.M database format, GDE takes ASCII names for global variables, Regions, and Segments.

GDE commands and qualifiers

ASCII

As there are no changes to the GT.M database engine, GDE takes only ASCII names for all the GDE commands and Qualifiers.

GDE Logs and the output generated by the LOG command

Unicode

GDE considers a text file to be encoded in UTF-8 when it is executed via the “@” command.

The global directory file (.gld) format

ASCII

File names in a global directory containing non-ASCII characters may not be displayed properly in a non-Unicode™ environment.

MUPIP

The MUPIP utility now handles Unicode™ data. Both the ZWR and GO formats of EXTRACT use the ZWRITE format specified in the Data Interchange section. In UTF-8 mode, MUPIP EXTRACT, MUPIP JOURNAL -EXTRACT and MUPIP JOURNAL -LOSTTRANS write sequential output files in the UTF-8 character encoding form. For example, in UTF-8 mode, if ^A has the value 准备好庆祝新年在纽约。, the sequential output file of the MUPIP EXTRACT command is:

09-OCT-2006  04:27:53 ZWR
GT.M MUPIP EXTRACT UTF-8
^A="准备好庆祝新年在纽约。"

Similarly, the MUPIP LOAD command considers a sequential file as encoded in UTF-8 if the environment variable gtm_chset is set to UTF-8.

[Important]

Ensure that MUPIP EXTRACT commands and corresponding MUPIP LOAD commands execute with the same setting for the environment variable gtm_chset. The M utility programs %GO and %GI have the same requirement for mode matching.

MUPIP EXTRACT

The MUPIP EXTRACT command adds the label "UTF-8" in the header label of the file extracted in the UTF-8 mode as follows:

MUPIP Command: MUPIP EXTRACT (both ZWR and GO)
UTF-8 Mode: GT.M MUPIP EXTRACT UTF-8
M Mode: GT.M MUPIP EXTRACT

MUPIP Command: MUPIP EXTRACT (BINARY)
UTF-8 Mode: GDS BINARY EXTRACT LEVEL 42006082413413901024002560006400000UTF-8 GT.M MUPIP EXTRACT
M Mode: GDS BINARY EXTRACT LEVEL 42006082413413901024002560006400000GT.M MUPIP EXTRACT

MUPIP Command: MUPIP JOURNAL EXTRACT
UTF-8 Mode: GDSJEX03 UTF-8
M Mode: GDSJEX03

MUPIP Command: LOST TRANSACTION EXTRACT
UTF-8 Mode: GDSJEX03 ROLLBACK PRIMARY INSTANCE1 UTF-8
M Mode: GDSJEX03 ROLLBACK PRIMARY INSTANCE1

In UTF-8 mode, MUPIP LOAD triggers the LOADINVCHSET error if the header label of an extract file does not contain " UTF-8" as a suffix.
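The label check can be pictured with the following Python sketch, which applies to the text labels of ZWR and GO extracts. The function names are hypothetical, and the sketch takes the label string as input instead of parsing an extract file. How MUPIP LOAD in M mode treats a label that carries the UTF-8 suffix is not described here, so the sketch performs no check in that case.

```python
UTF8_SUFFIX = " UTF-8"


def label_is_utf8(label):
    """True if a ZWR or GO header label carries the UTF-8 suffix."""
    return label.rstrip("\r\n").endswith(UTF8_SUFFIX)


def check_before_load(label, chset):
    """Return the error name the text describes, or None if none applies.

    label -- header label line of the extract file
    chset -- mode of the loading process, "M" or "UTF-8"
    """
    if chset == "UTF-8" and not label_is_utf8(label):
        return "LOADINVCHSET"
    # Assumption: no check is modeled for M mode, because this bulletin
    # only states the rule for UTF-8 mode.
    return None


print(check_before_load("GT.M MUPIP EXTRACT", "UTF-8"))
print(check_before_load("GT.M MUPIP EXTRACT UTF-8", "UTF-8"))
```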

All MUPIP command qualifiers that require file names, keys, or data (for example, MUPIP SET -FILE, MUPIP INTEG -SUBSCRIPT, MUPIP REORG -SELECT qualifiers) accept characters in Unicode™ in UTF-8 mode. Database replication instance names must be ASCII. Although GT.M does not trigger an error if the name of a database replication instance is in Unicode™, FIS recommends the use of ASCII characters for naming all the database replication instances.

If the environment variable gtm_chset is not defined or is set to M, the MUPIP utility writes the byte-equivalent values of the globals containing characters in Unicode™ in the sequential output file. For example, if ^A has the value 准备好庆祝新年在纽约。, the sequential output file of the MUPIP EXTRACT command is:

09-OCT-2006  04:25:52 ZWR
GT.M MUPIP EXTRACT
^A=""_$C(135,134)_""_$C(135)_""_$C(134)_""_$C(157)_""_$C(150)_"?"_$C(156)_"?纽约"_$C(128,130)

In both modes, if EXTRACT encounters an illegal character, it places the $ZCH() representation of that character in the sequential output file.

[Note]

MUPIP EXTRACT produces, and MUPIP LOAD accepts, only the abbreviated forms of $CHAR() and $ZCHAR(), that is, $C() and $ZCH().

DSE & LKE

If the environment variable gtm_chset is set to UTF-8, the DSE DUMP command prints graphic characters for visualization. DSE does not write non-graphic characters and malformed characters to the interpreted output, but instead represents such characters by a dot character.

Example:
dse dump -block=9
File    /home/V52/mumps.dat
Region  DEFAULT
Block 9   Size 24   Level 0   TN 9 V5
Rec:1  Blk 9  Off 10  Size 14  Cmpc 0  Key ^DD
      10 : | 14  0  0  9 44 44  0  0 E5 A4 AA E9 98 B3 E7 9A 84 E5 B9 B4|
           |  .  .  .  .  D  D  .  .       太       阳       的       年|

However, in M mode, DSE DUMP prints dot characters for all non-ASCII characters and malformed characters.
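The difference between the two modes can be illustrated with a small Python sketch that interprets the record bytes from the example above. It does not reproduce the column alignment of real DSE output, the function name is hypothetical, and Python's str.isprintable() is used as a stand-in for GT.M's own classification of graphic characters.

```python
def interpret(raw, utf8_mode):
    """Render bytes as text, using a dot for anything not displayable."""
    if not utf8_mode:
        # M mode: only printable ASCII is shown; every other byte is a dot.
        return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)
    # UTF-8 mode: decode, turning each malformed byte into a lone surrogate
    # (not printable), so it is rendered as a dot like any non-graphic.
    text = raw.decode("utf-8", errors="surrogateescape")
    return "".join(ch if ch.isprintable() else "." for ch in text)


record = bytes.fromhex(
    "14 00 00 09 44 44 00 00 E5 A4 AA E9 98 B3 E7 9A 84 E5 B9 B4")
print(interpret(record, utf8_mode=True))
print(interpret(record, utf8_mode=False))
```

In M mode each byte of a multi-byte character becomes its own dot, which is why the interpreted line is longer than in UTF-8 mode.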

In UTF-8 mode, DSE and LKE accept characters in Unicode™ in all their command qualifiers that require file names, keys, or data (such as DSE -KEY, DSE -DATA and LKE -LOCK qualifiers).

LKE SHOW now represents canonical numeric subscripts without quotes.

Example:
GTM>l ^A(1)
GTM>zsy
$ lke
LKE> show -all
DEFAULT
^A(1) Owned by PID= 8102 which is an existing process
LKE>

M Utility Routines

The %UTF2HEX and %HEX2UTF M utility routines convert between UTF-8 encoded strings and the hexadecimal representation of their bytes. Both utilities run only in UTF-8 mode; in M mode, they both trigger a run-time error.

%UTF2HEX

The GT.M %UTF2HEX utility returns the hexadecimal notation of the internal byte encoding of a UTF-8 encoded GT.M character string. This routine has entry points for both interactive and non-interactive use.

DO ^%UTF2HEX converts the string stored in %S to the hexadecimal byte notation and stores the result in %U.

DO INT^%UTF2HEX converts the interactively entered string to the hexadecimal byte notation and stores the result in %U.

$$FUNC^%UTF2HEX(s) returns the hexadecimal byte representation of the character string s.

Example:
GTM> SET %S="AÄB"
GTM> DO ^%UTF2HEX
GTM> ZWRITE %U
%U="41C38442"
GTM> W $$FUNC^%UTF2HEX("ABC")
414243
GTM>

Note that %UTF2HEX provides functionality similar to that of the UNIX binary dump utility (od -x).

%HEX2UTF

The GT.M %HEX2UTF utility returns the GT.M encoded character string from the given bytestream in hexadecimal notation. This routine has entry points for both interactive and non-interactive use.

DO ^%HEX2UTF converts the hexadecimal byte stream stored in %U into a GT.M character string and stores the result in %S.

DO INT^%HEX2UTF converts the interactively entered hexadecimal byte stream into a GT.M character string and stores the result in %S.

$$FUNC^%HEX2UTF(s) returns the GT.M character string given the hexadecimal byte stream representation in s.

Example:
GTM> SET %U="41C3A441"
GTM> DO ^%HEX2UTF
GTM> ZWRITE %S
%S="AäA"
GTM> W $$FUNC^%HEX2UTF("414243")
ABC
GTM>
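For readers who want to cross-check results outside GT.M, the same two conversions can be sketched in Python. The function names below are hypothetical and only mirror the behavior shown in the examples above; they are not part of GT.M.

```python
def utf2hex(s):
    """Hexadecimal notation of the UTF-8 bytes of s (like %UTF2HEX)."""
    return s.encode("utf-8").hex().upper()


def hex2utf(h):
    """Character string for a UTF-8 byte stream given in hex (like %HEX2UTF)."""
    return bytes.fromhex(h).decode("utf-8")


print(utf2hex("AÄB"))
print(hex2utf("41C3A441"))
```

Note that hex2utf() raises UnicodeDecodeError when the bytes are not well-formed UTF-8; how %HEX2UTF handles that case is not covered here.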

Discussion and Best Practices

Data interchange

Because support for Unicode™ in GT.M affects only the interpretation of data in databases, and not the databases themselves, a simple way to convert a ZWR format extract from one mode to the other is to load it into a database using a process in the mode in which it was generated, and then to extract it again using a process in the other mode.

If a sequence of 8-bit octets contains bytes other than those in the ASCII range (0 through 127), an extract in ZWR format for the same sequence of bytes is different in "M" and "UTF-8" modes. In "M" mode, the $C() values in a ZWR format extract are always equal to or less than 255. In "UTF-8" mode, they can have larger values - the code-points of legal characters in Unicode™ can be far greater than 255.

Note that the characters written to the output device are subject to the OCHSET transformation of the controlling output device. If OCHSET is "M", the multi-byte characters are written in raw bytes without any transformation.

  1. Each multi-byte graphic character (as classified by $ZCHSET) is written directly to the device converted to the encoding form specified by the OCHSET of the output device.

  2. Each multi-byte non-graphic character (as classified by $ZCHSET) is written in $CHAR(nnnn) notation, where nnnn is the decimal character code (that is, code-point up to 1114111 if $ZCHSET="UTF-8" or up to 255 if $ZCHSET="M").

  3. If $ZCHSET="UTF-8" and a subscript or data contains a malformed UTF-8 byte sequence, ZWRITE treats each byte in the sequence as a separate malformed character. Each such byte is written in $ZCHAR(nn[,…]) notation, where each nn is the corresponding byte in the illegal UTF-8 byte sequence.
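The three rules above can be approximated for UTF-8 mode with the following Python sketch. It is an illustration only: the function name is hypothetical, str.isprintable() stands in for GT.M's classification of graphic characters, and the abbreviated $C() and $ZCH() forms are used. Malformed bytes are detected with Python's surrogateescape error handler, which maps each undecodable byte to its own placeholder.

```python
def zwr_value(raw):
    """Format bytes the way the rules above describe for $ZCHSET="UTF-8"."""
    text = raw.decode("utf-8", errors="surrogateescape")
    pieces = []                      # list of (kind, [items]) runs
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:   # placeholder for one malformed byte
            kind, item = "zch", str(cp - 0xDC00)
        elif ch.isprintable():       # graphic: written as a literal
            kind, item = "lit", ch.replace('"', '""')
        else:                        # non-graphic: written as a code-point
            kind, item = "chr", str(cp)
        if pieces and pieces[-1][0] == kind:
            pieces[-1][1].append(item)
        else:
            pieces.append((kind, [item]))
    out = []
    for kind, items in pieces:
        if kind == "lit":
            out.append('"' + "".join(items) + '"')
        elif kind == "chr":
            out.append("$C(" + ",".join(items) + ")")
        else:
            out.append("$ZCH(" + ",".join(items) + ")")
    return "_".join(out) or '""'


print(zwr_value("A\tB".encode("utf-8")))
print(zwr_value(b"A\xe5\xa4B"))
```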

Note that attempts to use ZWRITE output from a system as input to another system using a different character set may result in errors or not yield the same state as existed on the source system. Application developers can deal with this by defining and using one or more pattern tables that declare all non-ASCII characters (or any useful subset thereof) to be non-graphic. For more details on defining pattern tables, please refer to the "Pattern Code Definition" section of the "Internationalization" chapter in the GT.M Programmer's Guide.

Limitations

User-defined pattern codes are not supported

Although the M standard patcodes (A,C,L,U,N,P,E) are extended to work with Unicode™, application developers can neither change their default classification nor define the non-standard patcodes (B,D,F-K,M,O,Q-T,V-X) beyond the ASCII subset. This means that the pattern tables cannot contain characters with codes greater than the maximum ASCII code 127.

String Normalization

In GT.M, strings are not implicitly normalized. Unicode™ normalization is a method of computing a canonical representation of character strings. Normalization is required if the strings contain combining characters (such as accented characters consisting of a base character followed by an accent character) as well as precomposed characters. The Unicode™ standard assigns code-points to such precomposed characters for backward compatibility with legacy code sets. For applications containing both representations of the same character, Unicode™ recommends using one of the normal forms. Because GT.M does not normalize strings, application developers must implement normalization, as needed, for string matching and string collation to behave as expected. For example, edit checks can accept only a single representation when multiple representations are possible.
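The following Python sketch shows why normalization matters for matching and what an edit check might look like. It runs outside GT.M and uses Python's standard unicodedata module. The choice of NFC is an assumption made for illustration (Unicode defines several normal forms), and the function names are hypothetical.

```python
import unicodedata

FORM = "NFC"  # assumption: the application standardizes on NFC


def canonical(s):
    """Return the single representation the application agrees to store."""
    return unicodedata.normalize(FORM, s)


def is_canonical(s):
    """Edit check: accept only input that is already in the chosen form."""
    return s == canonical(s)


decomposed = "e\u0301"    # base letter e followed by combining acute accent
precomposed = "\u00e9"    # the precomposed character for the same glyph

# The two strings display identically but differ as code-point sequences,
# so they would be distinct subscripts unless the application normalizes.
print(decomposed == precomposed)
print(canonical(decomposed) == canonical(precomposed))
```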

$PRINCIPAL device encoding is determined at process startup

At process start-up, GT.M implicitly OPENs $PRINCIPAL before any application code is executed, using the encoding specified by $gtm_chset. $PRINCIPAL is never OPENed by any application code. ichset, ochset and chset device parameters are characteristics of the OPEN command rather than the USE command, since an IO device cannot conveniently switch encoding in mid-stream. Therefore, the character set of $PRINCIPAL is determined for the process, and cannot be changed.

One implication of this restriction on $PRINCIPAL (including Terminal, Sequential File and Socket devices) is that UTF-16, UTF-16LE and UTF-16BE encodings are never supported for $PRINCIPAL.

UTF-16 is not supported for Terminal Devices

Due to the uncommon usage and lack of support for UTF-16 by UNIX terminals and terminal emulators, GT.M does not support UTF-16, UTF-16LE and UTF-16BE encodings for Terminal I/O devices. Note that UNIX platforms use UTF-8 as the de facto character encoding for Unicode™. Terminal connections from remote hosts (such as Windows) must communicate with GT.M in UTF-8 encoding.

Error messages are in [American] English

GT.M has no facility for translating product error messages or on-line help into languages other than [American] English. All error message text (except message arguments, which could include Unicode™ data) is in the [American] English language.

Performance and Capacity

With the use of "UTF-8" as GT.M’s internal character encoding, the additional requirements for CPU cycles, excluding collation algorithms, should not increase significantly compared with the identical application using the "M" character set. Additional memory requirements for "UTF-8" vary depending on the application as well as the actual character set used. For example, applications based on Latin-1 (2-byte encoded) characters may require up to twice the memory and those based on Chinese/Japanese (3-byte encoded) characters may require up to three times the memory compared to an identical application using "M" characters. The additional disk-space and I/O performance trade-offs for "UTF-8" also vary based on the application and the characters used.
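The growth factors mentioned above follow directly from how UTF-8 encodes each character, as this small Python sketch shows. The function name is hypothetical, and the ratio compares UTF-8 bytes with a one-byte-per-character encoding, which is how M mode treats data.

```python
def utf8_growth(text):
    """Ratio of UTF-8 bytes to characters for a sample of application data."""
    if not text:
        return 1.0
    return len(text.encode("utf-8")) / len(text)


print(utf8_growth("abc"))        # ASCII only
print(utf8_growth("éàü"))        # accented Latin-1 letters
print(utf8_growth("太阳的年"))    # Chinese characters
```

Running such a measurement over representative application data gives a better estimate than the worst-case factors, because most real text mixes ASCII with multi-byte characters.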

Characters in arguments exchanged with external routines must be validated by the external routines

GT.M does not check for illegal characters in a string before passing it to an external routine or in a returned value before assigning it to a GT.M variable. This is because such checks add parameter-processing overhead. The application must ensure that the strings are in the encoding form expected by the respective routines. More robustly, external routines must interpret passed strings based on the value of the intrinsic variable $ZCHSET or the environment variable gtm_chset. The external routines can perform validation if needed.
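The kind of validation an external routine might perform can be sketched as follows. External routines for GT.M are normally written in C, so this Python fragment only illustrates the logic. The function name is hypothetical, and the mode is passed in as a parameter on the assumption that the caller has obtained it from $ZCHSET or gtm_chset.

```python
def is_well_formed(raw, chset):
    """True if the bytes are legal for the given mode.

    raw   -- bytes received from, or about to be returned to, GT.M
    chset -- "M" or "UTF-8"
    """
    if chset != "UTF-8":
        return True          # in M mode every byte is a character
    try:
        raw.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True


print(is_well_formed("太".encode("utf-8"), "UTF-8"))
print(is_well_formed(b"\xe5\xa4", "UTF-8"))   # truncated sequence
```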

Maximums

In the prior versions of GT.M, the restrictions on certain objects were put in place with the assumption that a character is represented by a single byte. With support for Unicode™ enabled in GT.M, the following restrictions are now in terms of bytes—not characters.

M Name Length

The maximum length of an M identifier is restricted to 31 bytes. Since identifier names are restricted to be in ASCII, programmers can define M names up to 31 characters long.

M String Length

The maximum length of an M string is restricted to 1,048,576 bytes. Therefore, depending on the characters used, the maximum number of characters could be reduced from 1,048,576 (1M) characters to as few as 262,144 (256K) characters.

M Source Line Length

The maximum length of a program or indirect source line is restricted to 2,048 bytes. Application developers must be aware of this byte limit if they consider using multi-byte source comments or string literals in a source line.

Database Key and Record Sizes

The maximum allowed size for database keys (both global and nref keys) is 255 bytes, and for database records is 32K bytes. Application developers must be aware that keys or data containing multi-byte characters in Unicode™ are limited to a smaller number of characters than the number of available bytes.
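Because all of these limits are expressed in bytes, the number of characters that fit depends on the characters used. The Python sketch below works through the arithmetic. The constant and function names are hypothetical. The key check is only an upper bound, because the 255-byte limit applies to the whole key as GT.M stores it, not to a single subscript string.

```python
MAX_NAME_BYTES = 31
MAX_STRING_BYTES = 1_048_576
MAX_SOURCE_LINE_BYTES = 2_048
MAX_KEY_BYTES = 255


def max_chars(limit_bytes, bytes_per_char):
    """Most characters that fit if each takes bytes_per_char bytes."""
    return limit_bytes // bytes_per_char


def fits(text, limit_bytes):
    """True if the UTF-8 encoding of text is within limit_bytes."""
    return len(text.encode("utf-8")) <= limit_bytes


# A string made only of 4-byte characters holds a quarter as many characters.
print(max_chars(MAX_STRING_BYTES, 4))
# A subscript made only of 3-byte characters can never exceed this length.
print(max_chars(MAX_KEY_BYTES, 3))
```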

Ten Golden Rules

Adhere to the following rules of thumb to design and develop Unicode™-based applications for deployment on GT.M.

  1. GT.M functionality related to Unicode™ becomes available only in UTF-8 mode.

  2. [At least] in UTF-8 mode, byte manipulation must use Z* equivalent functions.

  3. In M mode, standard functions are always identical to their Z equivalents.

  4. Use the same character set for all global names and subscripts in an instance.

  5. Define a collation system according to the linguistic and cultural tenets of the language used.

  6. Create the application logic to ensure strings used as keys are canonical.

  7. Specify CHSET="M" or otherwise handle illegal characters during the I/O operations.

  8. Communicate with any external routines using a compatible character encoding form.

  9. Compile and run programs in the same setting of $ZCHSET and "BADCHAR".

  10. Read the technical bulletin and the GT.M Programmer's Guide carefully. When in doubt, consult GT.M Support (gtm.support@fnf.com).