CConv | Frank Mitchell's Blog

While planning CPort, M-String, and other projects, I identified a need for some even more basic portable utilities. Both CPort and M-String need an API to convert characters from one character set to another.

Front End

UTF Encodings

cconv_is_ascii checks the first sz bytes in buf and returns true if they are all ≤ 127, false otherwise.
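A minimal sketch of that check, assuming nothing beyond the declared signature. The name sketch_is_ascii is hypothetical, the const qualifier is an addition, and this is not necessarily how CConv implements it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for cconv_is_ascii: same argument order as the
 * declaration in this post, but not the library's actual code. */
bool sketch_is_ascii(size_t sz, const uint8_t *buf) {
    assert(buf != NULL || sz == 0);
    for (size_t i = 0; i < sz; i++) {
        if (buf[i] > 127) {
            return false;   /* high bit set: not 7-bit ASCII */
        }
    }
    return true;            /* an empty buffer counts as ASCII here */
}
```

Treating an empty buffer as ASCII is a choice made for this sketch; the description above doesn't say what happens when sz is 0.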

All cconv_utfm_to_n_length functions calculate how many UTF-n characters the given UTF-m characters will produce. Incomplete code points won’t be counted. (*csz), if csz is given, is set to the number of input characters that contributed to that count. Use these functions to size output buffers correctly.
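To make the counting contract concrete, here is a rough sketch of a UTF-8 to UTF-16 length function. The name sketch_utf8_to_16_length and the helper lead_len are hypothetical. The sketch also assumes that counting stops at the first invalid byte, which the description above doesn't specify, and it doesn't reject overlong three- and four-byte forms or encoded surrogates:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequence length implied by a UTF-8 lead byte, or 0 if the byte
 * cannot start a sequence. */
static size_t lead_len(uint8_t b) {
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* Hypothetical stand-in for cconv_utf8_to_16_length. */
size_t sketch_utf8_to_16_length(size_t sz, const uint8_t *buf, size_t *csz) {
    size_t in = 0, out = 0;
    assert(buf != NULL || sz == 0);
    while (in < sz) {
        size_t n = lead_len(buf[in]);
        if (n == 0 || n > sz - in) break;    /* invalid lead or truncated tail */
        size_t k = 1;
        while (k < n && (buf[in + k] & 0xC0) == 0x80) k++;
        if (k < n) break;                    /* bad continuation byte */
        out += (n == 4) ? 2 : 1;             /* 4-byte sequences need a surrogate pair */
        in += n;
    }
    if (csz != NULL) *csz = in;              /* input bytes that were counted */
    return out;
}
```

Given the bytes for "a", "€", and an emoji followed by a truncated two-byte tail, this sketch counts four UTF-16 units and reports eight input bytes consumed, leaving the tail for a later call.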

All cconv_utfm_to_n functions perform the actual conversion. Incomplete code points won’t be converted. The functions return the number of characters written to the output buffer.

UTF-16 and UTF-32 use the platform’s byte ordering, so they’re mostly for internal use.

#include <stdbool.h>    /* defines `bool` */
#include <stdint.h>     /* defines `intN_t` and `uintN_t` */
#include <sys/types.h>  /* defines `size_t` and `ssize_t` */

bool cconv_is_ascii(size_t sz, uint8_t* buf);

size_t cconv_utf8_to_16_length(size_t sz, uint8_t* buf, size_t *csz);

size_t cconv_utf8_to_32_length(size_t sz, uint8_t* buf, size_t *csz);

size_t cconv_utf16_to_8_length(size_t sz, uint16_t* buf, size_t *csz);

size_t cconv_utf32_to_8_length(size_t sz, uint32_t* buf, size_t *csz);

size_t cconv_utf8_to_16(size_t insz, uint8_t* inbuf, 
                        size_t outsz, uint16_t* outbuf);

size_t cconv_utf8_to_32(size_t insz, uint8_t* inbuf, 
                        size_t outsz, uint32_t* outbuf);

size_t cconv_utf16_to_8(size_t insz, uint16_t* inbuf, 
                        size_t outsz, uint8_t* outbuf);

size_t cconv_utf32_to_8(size_t insz, uint32_t* inbuf,
                        size_t outsz, uint8_t* outbuf);
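The length and conversion functions are meant to be used together: measure first, allocate, then convert. The sketch below illustrates that pattern for UTF-32 to UTF-8. All sketch_ names and utf8_width are hypothetical stand-ins, not CConv's code, and stopping at the first unencodable code point or when the output buffer is full is an assumption:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Bytes needed to encode one code point as UTF-8, or 0 if not encodable. */
static size_t utf8_width(uint32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogates */
    if (cp < 0x10000) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

size_t sketch_utf32_to_8_length(size_t sz, const uint32_t *buf, size_t *csz) {
    size_t in = 0, out = 0;
    for (; in < sz; in++) {
        size_t w = utf8_width(buf[in]);
        if (w == 0) break;
        out += w;
    }
    if (csz != NULL) *csz = in;
    return out;
}

size_t sketch_utf32_to_8(size_t insz, const uint32_t *inbuf,
                         size_t outsz, uint8_t *outbuf) {
    size_t out = 0;
    for (size_t in = 0; in < insz; in++) {
        uint32_t cp = inbuf[in];
        size_t w = utf8_width(cp);
        if (w == 0 || w > outsz - out) break;     /* invalid, or no room left */
        switch (w) {
        case 1:
            outbuf[out] = (uint8_t)cp;
            break;
        case 2:
            outbuf[out]     = (uint8_t)(0xC0 | (cp >> 6));
            outbuf[out + 1] = (uint8_t)(0x80 | (cp & 0x3F));
            break;
        case 3:
            outbuf[out]     = (uint8_t)(0xE0 | (cp >> 12));
            outbuf[out + 1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            outbuf[out + 2] = (uint8_t)(0x80 | (cp & 0x3F));
            break;
        default:
            outbuf[out]     = (uint8_t)(0xF0 | (cp >> 18));
            outbuf[out + 1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            outbuf[out + 2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            outbuf[out + 3] = (uint8_t)(0x80 | (cp & 0x3F));
            break;
        }
        out += w;
    }
    return out;
}

/* Two-pass use: measure, allocate exactly, convert. Caller frees the result. */
uint8_t *sketch_utf32_to_8_alloc(size_t sz, const uint32_t *buf, size_t *outsz) {
    assert(outsz != NULL);
    size_t need = sketch_utf32_to_8_length(sz, buf, NULL);
    uint8_t *out = malloc(need ? need : 1);
    if (out == NULL) return NULL;
    *outsz = sketch_utf32_to_8(sz, buf, need, out);
    return out;
}
```

Because the length pass and the conversion pass apply the same rules, the allocated buffer is exactly large enough and the conversion never has to stop early for lack of space.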

Other Encodings

By default the Converter only understands “ASCII” (i.e. code points 0x00-0x7F), “ISO-8859-1” (i.e. code points 0x00-0xFF in single bytes), “UTF-8” (strict: minimum number of bytes, null is 0x00), “UTF-16” (strict: no unpaired surrogates, null is 0x0000), and “UTF-32”. Other single-byte encodings may be baked into future releases. In the general case, the following functions will require use of a “back end” converter.

cconv_decode_codepoint decodes the character at (*inptr), reading no more than insz bytes, and returns the decoded code point, or -1 if the bytes couldn’t be decoded.

The remaining functions work much like cconv_utfm_to_n, converting between encoded bytes and UTF-8 or UTF-32. The return value is the number of bytes decoded. If the bytes couldn’t be encoded or decoded, typically because no suitable conversion function exists, the functions return -1.

int32_t cconv_decode_codepoint(const char* encoding, 
                               size_t insz, uint8_t* inptr);

ssize_t cconv_decode_utf8(const char* encoding,
                          size_t insz, uint8_t* inbuf,
                          size_t outsz, uint8_t* outbuf);

ssize_t cconv_decode_utf32(const char* encoding,
                           size_t insz, uint8_t* inbuf,
                           size_t outsz, uint32_t* outbuf);

ssize_t cconv_encode_utf8(const char* encoding,
                          size_t insz, uint8_t* inbuf,
                          size_t outsz, uint8_t* outbuf);

ssize_t cconv_encode_utf32(const char* encoding,
                           size_t insz, uint32_t* inbuf,
                           size_t outsz, uint8_t* outbuf);

ssize_t cconv_transcode(const char* incode, 
                        const char* outcode,
                        size_t insz, uint8_t* inbuf,
                        size_t outsz, uint8_t* outbuf);
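As an illustration of the built-in encodings and the "strict" UTF-8 rule, here is a sketch of a single code point decoder covering ASCII, ISO-8859-1, and UTF-8. The names sketch_decode_codepoint and decode_utf8_strict are hypothetical, and returning -1 for any unrecognized encoding name is an assumption (a real build would presumably hand those to a back end):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Strictly decode one UTF-8 sequence; -1 on truncated, overlong,
 * or otherwise invalid input. */
static int32_t decode_utf8_strict(size_t insz, const uint8_t *p) {
    static const int32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n;
    int32_t cp;
    if (insz == 0) return -1;
    if (p[0] < 0x80) return p[0];
    else if ((p[0] & 0xE0) == 0xC0) { n = 2; cp = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { n = 3; cp = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { n = 4; cp = p[0] & 0x07; }
    else return -1;                          /* stray continuation or bad lead */
    if (n > insz) return -1;                 /* truncated sequence */
    for (size_t i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min_for_len[n]) return -1;      /* overlong: not the minimum length */
    if (cp > 0x10FFFF) return -1;            /* beyond the Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return -1;   /* encoded surrogate */
    return cp;
}

/* Hypothetical stand-in for cconv_decode_codepoint. */
int32_t sketch_decode_codepoint(const char *encoding,
                                size_t insz, const uint8_t *inptr) {
    assert(encoding != NULL && inptr != NULL);
    if (insz == 0) return -1;
    if (strcmp(encoding, "ASCII") == 0) return inptr[0] < 0x80 ? inptr[0] : -1;
    if (strcmp(encoding, "ISO-8859-1") == 0) return inptr[0];
    if (strcmp(encoding, "UTF-8") == 0) return decode_utf8_strict(insz, inptr);
    return -1;   /* unknown name: no built-in conversion in this sketch */
}
```

The overlong check is what makes the two-byte form C0 80 fail, so a null character can only be written as the single byte 0x00.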

Back End

Linux and many Unixes implement these functions with iconv. Systems that use another library for general conversion, such as Windows, will use preprocessor switches to link in their own equivalent libraries.
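For a sense of what an iconv-based back end involves, here is a rough sketch of a transcode function built on the POSIX iconv_open, iconv, and iconv_close calls. The name sketch_transcode is hypothetical, and treating every iconv failure (including a full output buffer or a truncated final character) as -1 is a simplification chosen for the sketch:

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical iconv-backed stand-in for cconv_transcode.
 * Returns the number of bytes written to outbuf, or -1 on any failure. */
ssize_t sketch_transcode(const char *incode, const char *outcode,
                         size_t insz, uint8_t *inbuf,
                         size_t outsz, uint8_t *outbuf) {
    assert(incode != NULL && outcode != NULL);

    /* Note the argument order: destination encoding first. */
    iconv_t cd = iconv_open(outcode, incode);
    if (cd == (iconv_t)-1) return -1;        /* no such conversion available */

    char *in = (char *)inbuf;                /* iconv's interface takes char ** */
    char *out = (char *)outbuf;
    size_t inleft = insz, outleft = outsz;

    size_t rc = iconv(cd, &in, &inleft, &out, &outleft);
    if (rc != (size_t)-1) {
        /* Flush any pending shift sequence for stateful encodings. */
        rc = iconv(cd, NULL, NULL, &out, &outleft);
    }
    iconv_close(cd);

    if (rc == (size_t)-1) return -1;
    return (ssize_t)(outsz - outleft);
}
```

Opening and closing a conversion descriptor on every call keeps the sketch short; a real back end might cache descriptors per encoding pair. On some systems iconv lives in a separate library and needs -liconv at link time.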

If a platform lacks a general converter, a configure script can define “mapping functions” to convert characters in various encodings to and from Unicode code points.
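As a sketch of what a pair of mapping functions might look like, here is ISO-8859-15, which matches ISO-8859-1 except at eight byte values. The function names and signatures are assumptions made for illustration; the post doesn't specify how mapping functions are declared or registered:

```c
#include <stdint.h>

/* The eight positions where ISO-8859-15 differs from ISO-8859-1. */
static const struct { uint8_t byte; int32_t cp; } latin9_diff[] = {
    { 0xA4, 0x20AC }, { 0xA6, 0x0160 }, { 0xA8, 0x0161 }, { 0xB4, 0x017D },
    { 0xB8, 0x017E }, { 0xBC, 0x0152 }, { 0xBD, 0x0153 }, { 0xBE, 0x0178 },
};
enum { LATIN9_DIFF_COUNT = sizeof latin9_diff / sizeof latin9_diff[0] };

/* Byte -> Unicode code point. Every byte is mapped, so this never fails. */
int32_t sketch_latin9_to_unicode(uint8_t byte) {
    for (int i = 0; i < LATIN9_DIFF_COUNT; i++) {
        if (latin9_diff[i].byte == byte) return latin9_diff[i].cp;
    }
    return byte;   /* identical to ISO-8859-1 everywhere else */
}

/* Unicode code point -> byte, or -1 if the code point has no
 * ISO-8859-15 representation. */
int32_t sketch_latin9_from_unicode(int32_t cp) {
    for (int i = 0; i < LATIN9_DIFF_COUNT; i++) {
        if (latin9_diff[i].cp == cp) return latin9_diff[i].byte;
        if (latin9_diff[i].byte == cp) return -1;   /* displaced Latin-1 character */
    }
    return (cp >= 0 && cp <= 0xFF) ? cp : -1;
}
```

With functions of this shape, a generic converter only needs to loop over the input, map each byte to a code point, and emit UTF-8 or UTF-32, so adding a new single-byte encoding means supplying one small table.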