While planning CPort, M-String, and other projects, I identified a need for some even more basic portable utilities. Both CPort and M-String need an API to convert characters from one character set to another.
Front End
UTF Encodings
`cconv_is_ascii` checks the first `sz` bytes in `buf` and returns `true` if they are all ≤ 127, `false` otherwise.
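The check described above can be sketched in a few lines. This is only an illustration of the intended behavior, not the project's implementation; treating an empty buffer as ASCII is an assumption, since the text does not say what happens when `sz` is zero.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch only: true when each of the first sz bytes is <= 0x7F.
 * Assumption: an empty buffer (sz == 0) counts as ASCII. */
bool cconv_is_ascii(size_t sz, uint8_t* buf)
{
    for (size_t i = 0; i < sz; i++) {
        if (buf[i] > 0x7F) {
            return false;   /* first non-ASCII byte settles the answer */
        }
    }
    return true;
}
```

A caller might use this as a fast path: a buffer that passes is already valid UTF-8 and ISO-8859-1, so no conversion is needed.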
All `cconv_utf`*m*`_to_`*n*`_length` functions calculate how many UTF-*n* characters the given UTF-*m* characters will produce.
Incomplete codepoints won’t be counted.
`*csz`, if given, is set to the number of input characters that contributed to the count of characters that would be written.
Using these functions will aid in sizing output buffers correctly.
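As a rough illustration of the counting rule, here is one way `cconv_utf8_to_16_length` might behave. The helper `utf8_seq_len` is hypothetical, and stopping at the first invalid or truncated sequence is an assumption; the text only says that incomplete codepoints are not counted. The validation is deliberately simplified and does not reject every overlong or surrogate form.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte length of the UTF-8 sequence introduced by
 * lead byte b, or 0 if b cannot start a sequence. */
static size_t utf8_seq_len(uint8_t b)
{
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* Sketch only: counts the UTF-16 code units the input would produce.
 * Assumption: counting stops at the first invalid or truncated sequence. */
size_t cconv_utf8_to_16_length(size_t sz, uint8_t* buf, size_t *csz)
{
    size_t used = 0;    /* input bytes that contributed to the count */
    size_t units = 0;   /* UTF-16 code units that would be written */

    while (used < sz) {
        size_t len = utf8_seq_len(buf[used]);
        size_t k;

        if (len == 0 || len > sz - used) break;   /* bad lead or truncated */
        for (k = 1; k < len; k++) {
            if ((buf[used + k] & 0xC0) != 0x80) break;  /* not a continuation */
        }
        if (k < len) break;

        /* 4-byte sequences lie above U+FFFF and need a surrogate pair. */
        units += (len == 4) ? 2 : 1;
        used += len;
    }
    if (csz != NULL) *csz = used;
    return units;
}
```

The value stored through `csz` tells the caller how much input is safe to hand to the matching conversion function; any trailing bytes can be kept and retried once more input arrives.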
All `cconv_utf`*m*`_to_`*n* functions perform the actual conversion.
Incomplete codepoints won’t be converted.
They return the number of characters written to the output buffer.
UTF-16 and UTF-32 use the platform’s byte ordering, so they’re mostly for internal use.
#include <stdbool.h> /* defines `bool` */
#include <stdint.h> /* defines `intN_t` and `uintN_t` */
#include <sys/types.h> /* defines `size_t` and `ssize_t` */
bool cconv_is_ascii(size_t sz, uint8_t* buf);
size_t cconv_utf8_to_16_length(size_t sz, uint8_t* buf, size_t *csz);
size_t cconv_utf8_to_32_length(size_t sz, uint8_t* buf, size_t *csz);
size_t cconv_utf16_to_8_length(size_t sz, uint16_t* buf, size_t *csz);
size_t cconv_utf32_to_8_length(size_t sz, uint32_t* buf, size_t *csz);
size_t cconv_utf8_to_16(size_t insz, uint8_t* inbuf,
                        size_t outsz, uint16_t* outbuf);
size_t cconv_utf8_to_32(size_t insz, uint8_t* inbuf,
                        size_t outsz, uint32_t* outbuf);
size_t cconv_utf16_to_8(size_t insz, uint16_t* inbuf,
                        size_t outsz, uint8_t* outbuf);
size_t cconv_utf32_to_8(size_t insz, uint32_t* inbuf,
                        size_t outsz, uint8_t* outbuf);
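To make the conversion contract concrete, the following sketches how `cconv_utf32_to_8` could be written against the declaration above. It is a sketch under assumptions, not the actual implementation: stopping at the first code point that is invalid or does not fit in the output buffer is assumed, and the return value is taken to be the number of bytes written.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: encodes UTF-32 code points as UTF-8.
 * Assumption: conversion stops at the first code point that is invalid
 * (surrogate or above U+10FFFF) or that would overflow outbuf. */
size_t cconv_utf32_to_8(size_t insz, uint32_t* inbuf,
                        size_t outsz, uint8_t* outbuf)
{
    size_t out = 0;   /* bytes written so far */

    for (size_t i = 0; i < insz; i++) {
        uint32_t cp = inbuf[i];
        size_t need;

        if (cp < 0x80) need = 1;
        else if (cp < 0x800) need = 2;
        else if (cp < 0x10000) need = 3;
        else if (cp <= 0x10FFFF) need = 4;
        else break;                                /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) break;   /* lone surrogate */
        if (need > outsz - out) break;             /* no room left */

        switch (need) {
        case 1:
            outbuf[out++] = (uint8_t)cp;
            break;
        case 2:
            outbuf[out++] = (uint8_t)(0xC0 | (cp >> 6));
            outbuf[out++] = (uint8_t)(0x80 | (cp & 0x3F));
            break;
        case 3:
            outbuf[out++] = (uint8_t)(0xE0 | (cp >> 12));
            outbuf[out++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            outbuf[out++] = (uint8_t)(0x80 | (cp & 0x3F));
            break;
        default:
            outbuf[out++] = (uint8_t)(0xF0 | (cp >> 18));
            outbuf[out++] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            outbuf[out++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            outbuf[out++] = (uint8_t)(0x80 | (cp & 0x3F));
            break;
        }
    }
    return out;
}
```

Because a code point is never written partially, a short output buffer yields a shorter but still well-formed UTF-8 result.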
Other Encodings
By default the Converter only understands
“ASCII” (i.e. code points 0x00-0x7F),
“ISO-8859-1” (i.e. code points 0x00-0xFF in single bytes),
“UTF-8” (strict: minimum number of bytes, null is 0x00),
“UTF-16” (strict: no unpaired surrogates, null is 0x0000),
and “UTF-32”.
Other single-byte encodings may be baked into future releases.
In the general case, the following functions will require use
of a “back end” converter.
`cconv_decode_codepoint` decodes the character starting at `inptr`, reading no more than `insz` bytes, and returns either the decoded code point or -1 if the bytes couldn’t be decoded.
The remaining functions work much like the `cconv_utf`*m*`_to_`*n* functions, converting between encoded bytes and UTF-8 or UTF-32.
The return value is the number of bytes decoded.
If the bytes couldn’t be encoded or decoded,
typically because no suitable conversion function exists,
the functions will return -1.
int32_t cconv_decode_codepoint(const char* encoding,
                               size_t insz, uint8_t* inptr);
ssize_t cconv_decode_utf8(const char* encoding,
                          size_t insz, uint8_t* inbuf,
                          size_t outsz, uint8_t* outbuf);
ssize_t cconv_decode_utf32(const char* encoding,
                           size_t insz, uint8_t* inbuf,
                           size_t outsz, uint32_t* outbuf);
ssize_t cconv_encode_utf8(const char* encoding,
                          size_t insz, uint8_t* inbuf,
                          size_t outsz, uint8_t* outbuf);
ssize_t cconv_encode_utf32(const char* encoding,
                           size_t insz, uint32_t* inbuf,
                           size_t outsz, uint8_t* outbuf);
ssize_t cconv_transcode(const char* incode,
                        const char* outcode,
                        size_t insz, uint8_t* inbuf,
                        size_t outsz, uint8_t* outbuf);
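For the two built-in single-byte encodings, `cconv_decode_codepoint` needs no back end at all. The sketch below covers only “ASCII” and “ISO-8859-1”; matching encoding names with an exact, case-sensitive `strcmp` is an assumption, and every other encoding simply returns -1 where a real build would hand off to the back end.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: handles the built-in single-byte encodings.
 * Assumption: encoding names are matched exactly and case-sensitively. */
int32_t cconv_decode_codepoint(const char* encoding,
                               size_t insz, uint8_t* inptr)
{
    if (encoding == NULL || inptr == NULL || insz == 0) {
        return -1;                      /* nothing to decode */
    }
    if (strcmp(encoding, "ASCII") == 0) {
        /* Only 0x00-0x7F are valid ASCII code points. */
        return (inptr[0] <= 0x7F) ? (int32_t)inptr[0] : -1;
    }
    if (strcmp(encoding, "ISO-8859-1") == 0) {
        /* Every byte maps directly to the code point of the same value. */
        return (int32_t)inptr[0];
    }
    return -1;   /* a real build would consult the back end here */
}
```

Returning `int32_t` leaves room for -1 as an error value, because valid code points never exceed 0x10FFFF.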
Back End
Linux and many Unixes implement these functions with iconv. Systems that use another library for general conversion, such as Windows, will use preprocessor switches to link in their own equivalent libraries.
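On an iconv platform, `cconv_transcode` could be a thin wrapper along these lines. This is a sketch under assumptions rather than the planned implementation: it returns the number of bytes written to `outbuf`, treats a truncated trailing sequence (`EINVAL`) or a full output buffer (`E2BIG`) as a partial success, and does not flush shift state for stateful encodings. Some systems also need `-liconv` at link time.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Sketch only: iconv-backed transcoding.
 * Assumption: the result is the number of bytes written to outbuf. */
ssize_t cconv_transcode(const char* incode,
                        const char* outcode,
                        size_t insz, uint8_t* inbuf,
                        size_t outsz, uint8_t* outbuf)
{
    /* iconv_open takes the target encoding first, then the source. */
    iconv_t cd = iconv_open(outcode, incode);
    if (cd == (iconv_t)-1) {
        return -1;                      /* no suitable conversion exists */
    }

    char* in = (char*)inbuf;
    char* out = (char*)outbuf;
    size_t inleft = insz;
    size_t outleft = outsz;
    ssize_t result;

    size_t rc = iconv(cd, &in, &inleft, &out, &outleft);
    if (rc == (size_t)-1 && errno != EINVAL && errno != E2BIG) {
        result = -1;                    /* e.g. EILSEQ: undecodable input */
    } else {
        result = (ssize_t)(outsz - outleft);
    }
    iconv_close(cd);
    return result;
}
```

Opening and closing a descriptor on every call keeps the sketch short; a real back end would probably cache descriptors per encoding pair.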
If a platform lacks a general converter, a configure script can define “mapping functions” to convert characters in various encodings to and from Unicode code points.
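A pair of mapping functions for one single-byte encoding might look like the sketch below, using ISO-8859-15 as the example because it differs from ISO-8859-1 in only eight positions. The names `latin9_to_unicode` and `latin9_from_unicode`, the struct, and the function signatures are all hypothetical; the text does not define the mapping-function interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: a byte whose meaning differs from ISO-8859-1. */
struct latin9_diff { uint8_t byte; int32_t codepoint; };

static const struct latin9_diff latin9_diffs[] = {
    { 0xA4, 0x20AC }, { 0xA6, 0x0160 }, { 0xA8, 0x0161 }, { 0xB4, 0x017D },
    { 0xB8, 0x017E }, { 0xBC, 0x0152 }, { 0xBD, 0x0153 }, { 0xBE, 0x0178 },
};
#define LATIN9_NDIFFS (sizeof latin9_diffs / sizeof latin9_diffs[0])

/* Hypothetical mapping function: ISO-8859-15 byte to Unicode code point. */
int32_t latin9_to_unicode(uint8_t byte)
{
    for (size_t i = 0; i < LATIN9_NDIFFS; i++) {
        if (latin9_diffs[i].byte == byte) return latin9_diffs[i].codepoint;
    }
    return (int32_t)byte;   /* every other byte matches ISO-8859-1 */
}

/* Hypothetical mapping function: code point to ISO-8859-15 byte (0-255),
 * or -1 when the code point has no representation. */
int latin9_from_unicode(int32_t codepoint)
{
    for (size_t i = 0; i < LATIN9_NDIFFS; i++) {
        if (latin9_diffs[i].codepoint == codepoint) return latin9_diffs[i].byte;
    }
    if (codepoint < 0 || codepoint > 0xFF) return -1;
    /* Latin-1 code points displaced by the table have no Latin-9 byte. */
    for (size_t i = 0; i < LATIN9_NDIFFS; i++) {
        if (latin9_diffs[i].byte == codepoint) return -1;
    }
    return (int)codepoint;
}
```

With functions of this shape, a generic fallback could decode any supported single-byte encoding to code points and then reuse the UTF conversion routines for the output side.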