1. GLS Library String and Character Termination
The GLS library functions are intended to be used in many different
contexts. In particular, some APIs that programmers use along with
the GLS library assume that all character strings are terminated with a
null character; others assume that each string consists of a pointer
and a length that indicates the number of bytes in the string. The GLS
library is intended to be used with both.
Therefore, each GLS library function that takes a string
argument lets you pass it either a null-terminated string or a
string whose end is determined by a separate length that you pass
with it.
Multi-Byte Character String Termination
Each multi-byte character string that is passed to a GLS library function
is represented by two arguments,
..., mbs, mbs_byte_length, ...
If mbs_byte_length is the value IFX_GL_NULL, then the function
assumes that mbs is a null-terminated string; otherwise, the function
assumes that mbs_byte_length is the number of bytes in the multi-byte
character string.
The null-terminator of a multi-byte string consists of one byte whose value
is zero.
Multi-byte character strings that are not null-terminated are called
length-terminated multi-byte strings. They can contain null characters,
but these null characters do not indicate the end of the string.
If mbs_byte_length is not IFX_GL_NULL and is less than zero, then the
function gives the IFX_GL_PARAMERR error.
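The convention above can be sketched from the point of view of a function that receives such a pair of arguments. This is only an illustration under stated assumptions, not the library's implementation: SKETCH_NULL, SKETCH_PARAMERR and sketch_mbs_bytes() are hypothetical stand-ins, and the numeric values are chosen only so that the example compiles.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins: the real values of IFX_GL_NULL and
   IFX_GL_PARAMERR are not given in this text. */
#define SKETCH_NULL     (-1)
#define SKETCH_PARAMERR (-2)

/* Return the number of bytes the callee should process, or
   SKETCH_PARAMERR if the length argument is invalid. */
long sketch_mbs_bytes(const unsigned char *mbs, long mbs_byte_length)
{
    assert(mbs != NULL);
    if (mbs_byte_length == SKETCH_NULL)
        return (long) strlen((const char *) mbs); /* stop at first zero byte */
    if (mbs_byte_length < 0)
        return SKETCH_PARAMERR;                   /* not the sentinel, not >= 0 */
    return mbs_byte_length;                       /* zero bytes inside are data */
}
```

With the six bytes 'a', 'b', 0, 'c', 'd', 0, passing SKETCH_NULL yields 2, while passing 5 yields 5, because the embedded zero byte is ordinary data in a length-terminated string.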
Multi-Byte Character Termination
Many GLS library functions operate on just one multi-byte character.
Each multi-byte character that is passed to a GLS library function
is represented by two arguments,
..., mb, mb_byte_limit, ...
If mb_byte_limit is IFX_GL_NO_LIMIT then the function
will read as many bytes as necessary from mb to form a complete
character; otherwise, it will not read more than mb_byte_limit bytes
from mb when trying to form a complete character.
1. If mb is a character in a null-terminated multi-byte string,
then mb_byte_limit must be equal to IFX_GL_NO_LIMIT.
For example, if mbs points to a null-terminated string of multi-byte
characters,
    for ( mb = mbs; *mb != '\0'; mb += bytes )
    {
        if ( (bytes = ifx_gl_mblen(mb, IFX_GL_NO_LIMIT)) == -1 )
        {
            /* handle error; leave the loop, because bytes is -1 */
            break;
        }
    }
2. If mb is a character in a multi-byte string that is not null-terminated,
or a character in a buffer by itself, then mb_byte_limit must be equal to
the number of bytes between where mb points and the end
of the buffer that holds the string or character. For example, if mbs
points to a string of multi-byte characters that is not null-terminated
and mbs_bytes is the number of bytes in that string,
    for ( mb = mbs; mbs_bytes > 0; mb += bytes, mbs_bytes -= bytes )
    {
        if ( (bytes = ifx_gl_mblen(mb, mbs_bytes)) == -1 )
        {
            /* handle error; leave the loop, because bytes is -1 */
            break;
        }
    }
or if mb points to one multi-byte character and mb_bytes is the number
of bytes in the buffer that holds the character,
    if ( (bytes = ifx_gl_mblen(mb, mb_bytes)) == -1 )
    {
        /* handle error */
    }
If the function cannot determine whether mb is a valid multi-byte character
because it would need to read more than mb_byte_limit bytes from mb, or if
mb_byte_limit is less than or equal to zero, then the function gives
the IFX_GL_EINVAL error.
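How a byte limit interacts with character recognition can be illustrated with a small stand-in. The sketch below assumes a UTF-8 code set purely for illustration (the code sets that GLS locales actually use are not described here). sketch_mblen() and the SKETCH_ constants are hypothetical names, and the function returns its stand-in error codes directly, which is not necessarily how ifx_gl_mblen() reports errors.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NO_LIMIT (-1)  /* hypothetical stand-in for IFX_GL_NO_LIMIT */
#define SKETCH_EINVAL   (-2)  /* character may be cut off by the limit */
#define SKETCH_EILSEQ   (-3)  /* bytes can never form a character */

/* Length in bytes of the UTF-8 character at mb, reading at most
   mb_byte_limit bytes, or without a limit for SKETCH_NO_LIMIT. */
int sketch_mblen(const unsigned char *mb, int mb_byte_limit)
{
    int need, i;

    assert(mb != NULL);
    if (mb_byte_limit != SKETCH_NO_LIMIT && mb_byte_limit <= 0)
        return SKETCH_EINVAL;

    if (mb[0] < 0x80)                need = 1;
    else if ((mb[0] & 0xE0) == 0xC0) need = 2;
    else if ((mb[0] & 0xF0) == 0xE0) need = 3;
    else if ((mb[0] & 0xF8) == 0xF0) need = 4;
    else return SKETCH_EILSEQ;       /* not a valid lead byte */

    for (i = 1; i < need; i++) {
        if (mb_byte_limit != SKETCH_NO_LIMIT && i >= mb_byte_limit)
            return SKETCH_EINVAL;    /* would have to read past the limit */
        if ((mb[i] & 0xC0) != 0x80)
            return SKETCH_EILSEQ;    /* not a continuation byte */
    }
    return need;
}
```

For the three-byte sequence 0xE2 0x82 0xAC, a limit of 3 (or no limit) yields 3, while a limit of 2 yields SKETCH_EINVAL: the bytes seen so far are consistent with a valid character, but the limit prevents the function from confirming it.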
Wide-Character String Termination
Each wide-character string that is passed to a GLS library function
is represented by two arguments,
..., wcs, wcs_char_length, ...
If wcs_char_length is the value IFX_GL_NULL, then the function
assumes that wcs is a null-terminated string; otherwise, the function
assumes that wcs_char_length is the number of characters in the
wide-character string.
The null-terminator of a wide-character string consists of one
gl_wchar_t whose value is zero.
Wide-character strings that are not null-terminated are called
length-terminated wide-character strings. They can contain null characters,
but these null characters do not indicate the end of the string.
If wcs_char_length is not IFX_GL_NULL and is less than zero, then the
function gives the IFX_GL_PARAMERR error.
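The wide-character convention differs from the multi-byte one mainly in its unit: the length counts characters, not bytes, and the terminator is a whole element. A minimal sketch, assuming a hypothetical sketch_wchar_t in place of gl_wchar_t and hypothetical sentinel values:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int sketch_wchar_t;  /* hypothetical stand-in for gl_wchar_t */

#define SKETCH_NULL     (-1)          /* hypothetical stand-in for IFX_GL_NULL */
#define SKETCH_PARAMERR (-2)          /* hypothetical stand-in for IFX_GL_PARAMERR */

/* Return the number of wide characters the callee should process, or
   SKETCH_PARAMERR if the length argument is invalid. */
long sketch_wcs_chars(const sketch_wchar_t *wcs, long wcs_char_length)
{
    long n = 0;

    assert(wcs != NULL);
    if (wcs_char_length == SKETCH_NULL) {
        while (wcs[n] != 0)           /* terminator is one zero-valued element */
            n++;
        return n;
    }
    if (wcs_char_length < 0)
        return SKETCH_PARAMERR;
    return wcs_char_length;           /* zero-valued elements inside are data */
}
```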
2. GLS Library Memory Allocation
Memory Allocation by GLS Library Functions
No GLS library function allocates memory that remains after the
function returns. If a function allocates memory, this memory is only
for temporary purposes and is freed before the function returns.
Therefore, the caller of each function must allocate any memory needed
by the function.
Memory Allocation by GLS Library Callers
Multi-Byte Character String Allocation
Since the number of array elements in a multi-byte character string does NOT equal the number of characters in the string, the allocation of a multi-byte character string is NOT the same as the "old" single-byte method. For example, to statically allocate 20 multi-byte characters use,
gl_mchar_t mbs[20*IFX_GL_MB_MAX];
To dynamically allocate 20 multi-byte characters use,
gl_mchar_t *mbs = (gl_mchar_t *) malloc(20*IFX_GL_MB_MAX);
or to dynamically allocate a more precise estimate use,
gl_mchar_t *mbs = (gl_mchar_t *) malloc(20*ifx_gl_mb_loc_max());
To statically allocate 20 multi-byte characters plus a null-terminator use (note that the null-terminator only requires one byte),
gl_mchar_t mbs[20*IFX_GL_MB_MAX+1];
To dynamically allocate 20 multi-byte characters plus a null-terminator use,
gl_mchar_t *mbs = (gl_mchar_t *) malloc(20*IFX_GL_MB_MAX+1);
or to dynamically allocate a more precise estimate use,
gl_mchar_t *mbs = (gl_mchar_t *) malloc(20*ifx_gl_mb_loc_max()+1);
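The size arithmetic in these examples can be wrapped in a helper that also guards against overflow when the character count is not a small constant. The following is a sketch under assumptions: sketch_mchar_t stands in for gl_mchar_t (assumed here to be a one-byte type), and the caller supplies the maximum bytes per character, for example IFX_GL_MB_MAX or the result of ifx_gl_mb_loc_max().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef unsigned char sketch_mchar_t;  /* hypothetical stand-in for gl_mchar_t */

/* Bytes needed for n_chars characters of at most max_bytes_per_char bytes
   each, plus an optional one-byte null terminator. Returns 0 if the size
   is zero or would overflow size_t. */
size_t sketch_mbs_buffer_size(size_t n_chars, size_t max_bytes_per_char,
                              int null_terminated)
{
    size_t extra = null_terminated ? 1 : 0;

    assert(max_bytes_per_char > 0);
    if (n_chars > (SIZE_MAX - extra) / max_bytes_per_char)
        return 0;
    return n_chars * max_bytes_per_char + extra;
}

/* Allocate the buffer; returns NULL on overflow or allocation failure. */
sketch_mchar_t *sketch_mbs_alloc(size_t n_chars, size_t max_bytes_per_char,
                                 int null_terminated)
{
    size_t size = sketch_mbs_buffer_size(n_chars, max_bytes_per_char,
                                         null_terminated);
    return size ? (sketch_mchar_t *) malloc(size) : NULL;
}
```

With 20 characters and an assumed maximum of 4 bytes per character, the helper computes 80 bytes, or 81 with the terminator.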
Wide-Character String Allocation
Since the number of array elements in a wide-character string equals the number of characters in the string, the static allocation of a wide-character string looks the same as the "old" single-byte method. For example, to statically allocate 20 wide-characters use,
gl_wchar_t wcs[20];
To dynamically allocate 20 wide-characters use,
gl_wchar_t *wcs = (gl_wchar_t *) malloc(20*sizeof(gl_wchar_t));
To statically allocate 20 wide-characters plus a null-terminator use (note that the null-terminator requires the space allocated for an entire wide-character),
gl_wchar_t wcs[21];
To dynamically allocate 20 wide-characters plus a null-terminator use,
gl_wchar_t *wcs = (gl_wchar_t *) malloc(21*sizeof(gl_wchar_t));
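For a character count that is only known at run time, the same allocation can be sketched with calloc(), which zero-fills the buffer so that the terminator element is already in place. sketch_wchar_t and sketch_wcs_alloc() are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef unsigned int sketch_wchar_t;  /* hypothetical stand-in for gl_wchar_t */

/* Allocate room for n_chars wide characters, plus one element for the
   null terminator if requested. Returns NULL on overflow or failure. */
sketch_wchar_t *sketch_wcs_alloc(size_t n_chars, int null_terminated)
{
    size_t extra = null_terminated ? 1 : 0;

    assert(n_chars > 0 || null_terminated);  /* avoid a zero-sized request */
    if (n_chars > SIZE_MAX - extra)
        return NULL;
    /* calloc zero-fills, so element n_chars is already a terminator */
    return (sketch_wchar_t *) calloc(n_chars + extra, sizeof(sketch_wchar_t));
}
```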
3. Keeping Multi-Byte Strings Consistent
Truncating Long Multi-Byte Strings
Sometimes the caller of GLS library functions will need to truncate a
long character string so that it fits into a smaller buffer. Truncating a
string that consists of just single-byte characters is easy, because
truncating at an arbitrary byte location in the string still results in a
complete, albeit shorter, character string.
However, truncating a string that can contain even one multi-byte
character is difficult. This is because truncating at an arbitrary byte location
in the string can result in truncating a multi-byte character in its
middle such that the truncated string ends with the first 1, 2 or 3
bytes of a character without the character's remaining bytes.
If such a situation occurs, then subsequent traversal of the truncated
string could result in reading beyond the end of the buffer.
Therefore, all GLS library functions that traverse one multi-byte character
or traverse length-terminated multi-byte character strings give a special
error, IFX_GL_EINVAL, if they detect that an otherwise valid character has
been truncated.
If it is known that no truncation occurred to the string, then IFX_GL_EINVAL
can be considered the same as IFX_GL_EILSEQ. However, if it is possible that
truncation has occurred, then IFX_GL_EINVAL indicates to the caller that
the string must be truncated further, so that the last byte of the string is
the last byte of the last character in the string.
Depending upon your application, you may either end up making the
truncated string even shorter than originally intended or you may have
to replace the first 1, 2, or 3 bytes of the truncated character with a
padding character that is appropriate for your application.
Even though the GLS library functions can be used to detect this
situation after it has occurred, it is much better to use them to avoid
the situation.
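One way to avoid the situation is to walk the string character by character before cutting it, and to stop at the last complete character that still fits. The sketch below illustrates this under assumptions: sketch_mblen() is a hypothetical stand-in for a byte-limited length function such as ifx_gl_mblen(), it assumes a UTF-8 code set, and it reports every error as -1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a byte-limited character-length function;
   assumes UTF-8 and returns -1 for any error. */
int sketch_mblen(const unsigned char *mb, size_t limit)
{
    int need, i;

    if (limit == 0) return -1;
    if (mb[0] < 0x80)                need = 1;
    else if ((mb[0] & 0xE0) == 0xC0) need = 2;
    else if ((mb[0] & 0xF0) == 0xE0) need = 3;
    else if ((mb[0] & 0xF8) == 0xF0) need = 4;
    else return -1;
    if ((size_t) need > limit) return -1;      /* character is cut off */
    for (i = 1; i < need; i++)
        if ((mb[i] & 0xC0) != 0x80) return -1; /* not a continuation byte */
    return need;
}

/* Largest byte count <= max_bytes at which the length-terminated string
   (mbs, mbs_bytes) can be cut without splitting a character. */
size_t sketch_truncate_bytes(const unsigned char *mbs, size_t mbs_bytes,
                             size_t max_bytes)
{
    size_t kept = 0;

    assert(mbs != NULL);
    while (kept < mbs_bytes) {
        int bytes = sketch_mblen(mbs + kept, mbs_bytes - kept);
        if (bytes == -1 || kept + (size_t) bytes > max_bytes)
            break;            /* invalid, or next character does not fit */
        kept += (size_t) bytes;
    }
    return kept;
}
```

For the five bytes 'a', 0xE2, 0x82, 0xAC, 'b', a target of 3 bytes keeps only 1 byte, because cutting at 3 would leave the first two bytes of the three-byte character at the end of the buffer.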
Fragmenting Long Multi-Byte Strings
Sometimes the caller of GLS library functions will need to fragment a
long character string into two or more non-adjacent buffers to meet the memory
management requirements of their component. Fragmenting a string that
consists of just single-byte characters is easy, because fragmenting
at arbitrary byte locations in the string still results in fragments
that are consistent character strings.
However, fragmenting a string that can contain even one multi-byte
character is difficult. This is because fragmenting at arbitrary byte
locations in the string can result in fragmenting a multi-byte
character in its middle such that one fragment ends with the first 1, 2
or 3 bytes of a character and the next fragment starts with the remaining
bytes.
If the only thing you ever will do with these fragments is to concatenate
them back together to form one string, then no special processing needs
to be done. However, if you traverse the fragments as multi-byte
strings, this can result in reading beyond the end of one fragment or
finding an illegal character at the beginning of another.
Therefore, all GLS library functions that traverse one multi-byte character
or traverse length-terminated multi-byte character strings give a special
error, IFX_GL_EINVAL, if they detect that an otherwise valid character has
been truncated at the end of a fragment.
It is impossible to detect that the beginning of a fragment contains
the remaining bytes of the last character in the previous fragment without
looking at the previous fragment first. This is because the last 1, 2 or 3 bytes
of a multi-byte character may look exactly like a valid character.
If it is known that no fragmentation occurred to the string, then IFX_GL_EINVAL
can be considered the same as IFX_GL_EILSEQ. However, if it is possible that
fragmentation has occurred, then IFX_GL_EINVAL indicates to the caller that
the string must be fragmented again, so that the last byte of each fragment
is the last byte of the last character in the fragment and the first byte of
each fragment is the first byte of the first character in the fragment.
Depending upon your application, you may either end up making
a fragment even shorter than originally intended or you may have to replace
the first 1, 2, or 3 bytes of the fragmented character with a padding character
that is appropriate for your application and shift these bytes to the beginning
of the next fragment.
Even though the GLS library functions can be used to detect this
situation after it has occurred, it is much better to use them to avoid
the situation.
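Fragment boundaries can be chosen in the same forward-walking manner, so that every fragment begins and ends on a character boundary. The sketch below only computes fragment lengths and leaves the copying to the caller. As before, sketch_mblen() is a hypothetical stand-in that assumes a UTF-8 code set and reports every error as -1, and sketch_fragment() is an illustrative name, not part of the GLS library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a byte-limited character-length function;
   assumes UTF-8 and returns -1 for any error. */
int sketch_mblen(const unsigned char *mb, size_t limit)
{
    int need, i;

    if (limit == 0) return -1;
    if (mb[0] < 0x80)                need = 1;
    else if ((mb[0] & 0xE0) == 0xC0) need = 2;
    else if ((mb[0] & 0xF0) == 0xE0) need = 3;
    else if ((mb[0] & 0xF8) == 0xF0) need = 4;
    else return -1;
    if ((size_t) need > limit) return -1;
    for (i = 1; i < need; i++)
        if ((mb[i] & 0xC0) != 0x80) return -1;
    return need;
}

/* Split (mbs, mbs_bytes) into fragments of at most frag_max bytes, each
   made of whole characters. Stores the fragment lengths in frag_len and
   returns the number of fragments, or -1 if a character is invalid, a
   character is larger than frag_max, or more than max_frags are needed. */
int sketch_fragment(const unsigned char *mbs, size_t mbs_bytes,
                    size_t frag_max, size_t *frag_len, int max_frags)
{
    size_t pos = 0;
    int n = 0;

    assert(mbs != NULL && frag_len != NULL);
    while (pos < mbs_bytes) {
        size_t len = 0;
        while (pos + len < mbs_bytes) {
            int bytes = sketch_mblen(mbs + pos + len, mbs_bytes - pos - len);
            if (bytes == -1) return -1;
            if (len + (size_t) bytes > frag_max) break; /* start a new fragment */
            len += (size_t) bytes;
        }
        if (len == 0 || n == max_frags) return -1;
        frag_len[n++] = len;
        pos += len;
    }
    return n;
}
```

Because each fragment ends on a character boundary, the next one necessarily starts on one, which addresses the case that cannot be detected by inspecting a fragment on its own.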
ACKNOWLEDGEMENT
Portions of this description were derived from the X/Open CAE
Specification: "System Interfaces and Headers, Issue 4"; X/Open
Document Number: C202; ISBN: 1-872630-47-2; Published by X/Open Company
Ltd., U.K.