To represent character encoding names in M-Strings I created a small API inspired by JavaScript Symbols. After some thought I realized it was only loosely related to M-Strings and could be useful in other contexts.
Symbols
Symbols are unique, interned values. They’re used for significant and often-used strings like character encoding names, HTTP method names, function and variable names, etc.
Symbols cannot be used like C strings. Instead, use C_Symbol_as_cstring(sym).
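The interning idea can be sketched with a toy table. This is only an illustration under stated assumptions: the `Toy_Symbol` type and the `toy_symbol_*` functions are hypothetical stand-ins, not the API described in this post, and the table is a fixed-size array searched linearly with no thread safety. It shows the one property that matters: equal strings map to the same symbol, so symbols can be compared by pointer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the real API: a fixed-size intern table,
 * searched linearly and NOT thread-safe. */
typedef struct toy_symbol { char *name; } *Toy_Symbol;

#define TOY_MAX_SYMBOLS 64
static struct toy_symbol toy_table[TOY_MAX_SYMBOLS];
static size_t toy_count = 0;

/* Return the single symbol for `cstr`, creating it on first use.
 * Returns NULL if the table is full or memory runs out. */
Toy_Symbol toy_symbol_for_cstring(const char *cstr) {
    for (size_t i = 0; i < toy_count; i++) {
        if (strcmp(toy_table[i].name, cstr) == 0) {
            return &toy_table[i];          /* already interned */
        }
    }
    if (toy_count == TOY_MAX_SYMBOLS) return NULL;
    size_t n = strlen(cstr) + 1;
    char *copy = malloc(n);                /* the table owns its own copy */
    if (copy == NULL) return NULL;
    memcpy(copy, cstr, n);
    toy_table[toy_count].name = copy;
    return &toy_table[toy_count++];
}

/* Return the string the symbol was created from, or NULL for a NULL symbol. */
const char *toy_symbol_as_cstring(Toy_Symbol sym) {
    return sym ? sym->name : NULL;
}

/* Exercise the interning properties with assertions. */
void toy_symbol_demo(void) {
    char buf[] = "UTF-8";                  /* different storage, same content */
    Toy_Symbol a = toy_symbol_for_cstring("UTF-8");
    Toy_Symbol b = toy_symbol_for_cstring(buf);
    Toy_Symbol c = toy_symbol_for_cstring("ISO-8859-1");
    assert(a != NULL && c != NULL);
    assert(a == b);                        /* same string, same symbol */
    assert(a != c);                        /* different string, different symbol */
    assert(strcmp(toy_symbol_as_cstring(a), "UTF-8") == 0);
}
```

Calling `toy_symbol_demo()` from a `main` function runs the checks. A real implementation would presumably use a hashtable and locking, as the API comments below describe.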
API
#include <stdbool.h>
#include <sys/types.h>
typedef unsigned char utf8_t;
/**
* A unique value sometimes tied to a string value.
* The mapping to strings, if any, resides in a thread-safe global hashtable.
*/
typedef struct _C_Symbol* C_Symbol;
/**
* Determines whether `p` is a C_Symbol.
* The implementation checks whether `p` is in the right memory range instead
* of dereferencing it, in case it points to an invalid memory location.
*/
bool is_C_Symbol(void* p);
/**
* Return a new Symbol value unique in the current instance of the host program.
* The value isn't guaranteed to be unique across all parallel or future
* instances in memory.
* The Symbol has no corresponding string value.
*/
C_Symbol C_Symbol_new(void);
/**
* Return a Symbol value unique in current memory, indexed by `cstr`.
* Since `cstr` is a C string, it cannot contain embedded nulls.
* By convention, symbols don't contain whitespace or non-printable characters,
* but this is not a hard and fast rule.
*/
C_Symbol C_Symbol_for_cstring(const char* cstr);
/**
* Return a Symbol value unique in current memory,
* indexed by a UTF-8 string of length `len` starting at `uptr`.
* The string may contain embedded nulls.
* By convention, symbols don't contain whitespace or non-printable characters,
* but this is not a hard and fast rule.
*/
C_Symbol C_Symbol_for_utf8_string(size_t len, const utf8_t* uptr);
/**
* The value of the string used to create the symbol.
* This is a copy of the string passed into `C_Symbol_for_cstring`
* or `C_Symbol_for_utf8_string` with embedded nulls replaced with '\xC0\x80',
* "" if the symbol was created by `C_Symbol_new`,
* or NULL if the symbol value is invalid (e.g. another type of object).
*/
const char* C_Symbol_as_cstring(C_Symbol sym);
/**
* Copy the first `len` bytes of the string used to create `sym` into `buf`.
* The string may contain embedded nulls.
* The return value will be the number of UTF-8 bytes written to `buf`;
* if 0 the symbol has no corresponding string value,
* and if negative `sym` is not a C_Symbol.
*/
ssize_t C_Symbol_as_utf8_string(C_Symbol sym, size_t len, utf8_t* buf);
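The comment on `C_Symbol_as_cstring` says embedded nulls are replaced with '\xC0\x80', which is the two-byte overlong form of U+0000 also used by Java's "modified UTF-8". A minimal sketch of that replacement follows. The helper name `encode_embedded_nulls` and its signature are assumptions made for illustration and are not part of the API above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the API above.
 * Copies `len` bytes from `src` into `dst` as a NUL-terminated C string,
 * writing each embedded 0x00 byte as the two bytes 0xC0 0x80.
 * Returns the encoded length excluding the terminator, or (size_t)-1
 * if `dst_cap` is too small to hold the result plus the terminator. */
size_t encode_embedded_nulls(const unsigned char *src, size_t len,
                             char *dst, size_t dst_cap) {
    unsigned char *d = (unsigned char *)dst;
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        size_t need = (src[i] == 0x00) ? 2 : 1;
        if (out + need >= dst_cap) return (size_t)-1;  /* keep room for '\0' */
        if (src[i] == 0x00) {
            d[out++] = 0xC0;
            d[out++] = 0x80;
        } else {
            d[out++] = src[i];
        }
    }
    if (out >= dst_cap) return (size_t)-1;             /* no room for '\0' */
    d[out] = '\0';
    return out;
}

/* Encode the 3-byte string 'a', 0x00, 'b' and check the 4-byte result. */
void encode_embedded_nulls_demo(void) {
    const unsigned char in[3] = { 'a', 0x00, 'b' };
    char out[8];
    size_t n = encode_embedded_nulls(in, sizeof in, out, sizeof out);
    assert(n == 4);
    assert(strlen(out) == 4);              /* strlen no longer stops at the null */
    assert(memcmp(out, "a\xC0\x80" "b", 5) == 0);
}
```

The encoded form is safe to pass to functions that expect a C string, at the cost of not being strictly valid UTF-8. That is presumably why the API also offers `C_Symbol_as_utf8_string`, which returns the bytes with their embedded nulls intact.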