M-String | Frank Mitchell's Blog

Purpose

The M-String library provides C with two features present in higher-level languages:

strings that are full-featured objects.
memory management.

Right now memory management is limited to said strings, but a future release will provide a general Memory Manager that can handle arbitrary blocks of memory.¹ The author chose strings as a test case because a garbage collector for strings is almost trivial, the lack of memory-managed strings was a particular pain point of the author’s, and if the interface between managed and un-managed code worked for ubiquitous strings it would work for anything.

Concepts

Polymorphic Strings

Notionally an M_String contains a sequence of characters, realized as Unicode code points.² Since the largest code point defined to date is 0x10FFFF, each character would take 21 bits to represent, or for efficiency an entire 32-bit word. That’s way too much.

Instead the code represents strings in multiple ways:

emptystring: The empty string is an immediate value. That is, instead of pointing to a data structure, its identity is encoded in the pointer itself.
smallstring: Each string with one character is also an immediate value represented within 32 bits.
utf8string: Most multi-character strings use UTF-8, a variable encoding that represents ASCII characters 0-127 as one byte and all currently defined characters with no more than four bytes.
encodedstring: Strings read in with a character set besides Unicode or ASCII remain in that encoding, tagged with the name of the character set. Depending on the translation, the implementation may generate a parallel UTF-8 sequence or simply convert on the fly. For example, “ISO-Latin-1” is simply the first 256 elements of Unicode stored in single bytes. Another custom library, CConv performs all character conversions.

From the outside, though, all strings look like a sequence of Unicode code points. One can only tell a string’s format from its length and its encoding attribute (“UTF-32” with length ≤ 1, “ASCII” or “UTF-8”, or other).

Immutable Strings

The author chose to implement the immutable strings of Java, Lua, Python, and various functional programming languages instead of the mutable strings of C, Eiffel, and Ruby for a few reasons:

Immutable objects can be passed between threads without synchronization. In theory, at least; see below.
Given their ubiquity, one wants strings to have the same value semantics as numbers, booleans, and other common types. Here, “value semantics” means that the data is an anonymous bit-pattern in an infinite space of bit-patterns, not an “object” with mutable state and behavior. One can change a variable’s value from three to four, but one can’t change “four” to mean 5. (Except in Fortran sometimes.)
Many times one wants a string value to behave as a constant. Mutable strings must implement a way to “freeze” the string, which disables all routines that change the string. The author believes this design pattern is sub-optimal:
- Often the API to deal efficiently with the string requires changing it in place. A programmer must therefore copy the immutable string to a mutable one before changing it.
- If called, disabled methods either throw an exception or do nothing. In a statically-typed language using dynamic typing to signal that an operation isn’t possible, either quietly or loudly, seems fundamentally wrong.
- Often there’s no way to tell whether an object is mutable save by attempting to mutate it. Even if there is an is_mutable query both mutable and immutable versions have the same class and can substitute for each other … even though they really can’t.
The author prefers the pattern for Java string-related classes: all classes – String, StringBuffer, StringBuilder, etc. – implement the CharSequence interface for their common operations, but String is immutable. All others exist to build Strings or otherwise substitute for them in specific instances.

One can implement correct if not always efficient operations through the basic operations of inspecting, slicing, and joining strings. The implementation also handles the empty string and strings of one character efficiently. (Or it will do so at some point.) Future versions will use the ropes or cords approach of representing long strings as trees of sliced and joined segments.

For more complicated transformations not provided in the library, M_Char_Buffer may satisfy some but not all needs. The author suggests the M_String_to_{byte,utf8,utf32}_array functions which copy the contents to a buffer of UTF-8, UTF-32, or internally encoded bytes. From there one can reverse them, parse them, or do whatever else needed with C-like efficiency then create a new string with the results.

Memory Management

Unlike most structures in C which require manual memory management, the point of M-Strings is that users can create them freely and depend on the M-Pool to remove them when they’re no longer needed. In order to do its work, however, an M-Pool needs cooperation from application code. This cooperation comes from reference counting and marking.

Reference Counting

Whenever the application stores an M-String in a variable, it should “retain” the string like so:

M_String s = M_String_retain(M_String_new_from_cstring(pool, "foo"));

Likewise, when the application no longer needs it, it should “release” it:

MString s = M_String_release(s):

The ...set function handles reference counts correctly.³

M_String s = M_REF_NULL;  /* 0 will also work */
M_String_set(&s, M_String_new_from_cstring(pool, "foo"));

One can check the exact reference count with M_String_references(...);

Marking

Once an M-String’s reference count goes to 0, the M-String isn’t immediately collected. If it’s held in a Scope it won’t be collected until that Scope ends. M-Pools also reserve collection until the application creates another M-String or explicitly calls M_Pool_clean(...).

In both cases, the M-Pool does a full mark-and-sweep garbage collection cycle to verify that an M-String isn’t being used somewhere.

Applications that hang onto M-Strings without reference counting can participate in garbage collection with something like the following:

struct App_Struct {
    ...
    M_String name;
    ...
}

void mark_app_name(App_Struct* app, M_Root_Set set) {
    M_Root_Set_add(set, app->name);
}

/* somewhere in the initialization code */
App_Struct app;
M_Pool p;

...

app->name = M_String_new_from_cstring(pool, "MyApplication");
M_Pool_add_marker(p, app, &mark_app_name);

...

Scopes

Sometimes the application will create a bunch of temporary strings. To keep them from being reclaimed while the application is using them,

void parse_string(M_String_Array tokens, M_String text) {
    M_Scope scope;

    M_Scope_begin(&scope, M_String_pool(text));

    /*
     * Multiple instances of M_String slicing and joining redacted
     * to avoid offending those of a delicate disposition.
     */

    M_Scope_end(&scope);
}

Only the M-Strings appended to tokens will survive garbage collection. The rest will be collected shortly after M_Scope_end(&scope);

Multithreading

The default implementation of a Memory Pool assumes it is thread-local. That is, only one thread will use a Memory Pool or its M-Strings. To move strings between Pools, one can simply copy strings. Identical copies are, for all practical purposes, the same string.

A future release will implement a fully thread-safe Memory Pool usable for all strings in a multithreaded application.

API

The M-String API will look something like this.

Types

Note that this library uses CSymbol.

/* defined in C99 */
#include <stdbool.h> /* defines bool */
#include <stddef.h>  /* defines int */
#include <stdint.h>  /* defines uint8_t, utf16_t, uintptr_t */
#include <wchar.h>   /* wchar_t */

#include "csymbol.h"

typedef uint8_t  octet_t;
typedef uint8_t  utf8_t;
typedef uint16_t utf16_t;

/*
 * Types implemented as opaque references.
 */
typedef uintptr_t M_String;

const M_String M_REF_NULL = 0;

/*
 * Types implemented as (hidden) structs.
 */
typedef struct _M_Char_Buffer*  M_Char_Buffer;
typedef struct _M_Pool*         M_Pool;
typedef struct _M_Root_Set*     M_Root_Set;
typedef struct _M_Scope*        M_Scope;
typedef struct _M_String_Array* M_String_Array;

Memory Pool

/**
 * Create a reentrant Memory Pool to manage M-String memory
 * and write the reference into `(*pptr)`.
 * The current thread should hang onto this reference until it's about
 * to exit, then call `M_Pool_close` on it.
 */
void M_Pool_new(M_Pool *pptr);

/**
 * Collect all strings not in use.
 * "In-use", here, means either with a positive reference count
 * *or* referenced in managed memory and cooperating non-managed memory.
 * Typically a thread will call this function in a main loop or other
 * place where the application is waiting for user interaction.
 */
void M_Pool_clean(MPool pool);

/**
 * Delete the Memory Pool and clear the reference in `(*pptr)`.
 * This halts any calls to M_Markers and
 * deallocates all M_Strings, M_Char_Buffers, and M_Scopes from this pool.
 */
void M_Pool_release(M_Pool *pptr);

Error

typedef enum _M_String_Error {
   STATUS_OK = 0,
   ERROR_NULL_POINTER,
   ERROR_UNKNOWN_CHARSET,
   ERROR_STRING_ENCODING,
   /* more TBD ...*/
   ERROR_UNDEFINED = 0xFFFF
} M_String_Error;

bool           M_Pool_has_error(M_Pool pool);
M_String_Error M_Pool_error_code(M_Pool pool);
const char*    M_Pool_error_detail(M_Pool pool);

String

Note that this library uses CSymbol.

Creating

All functions but one take a sz parameter with the number of elements to be converted to characters in an M-String.

...from_cstring assumes a null-terminated string.

Each method returns M_REF_NULL and sets the M-Pool’s error codes if the arguments or the character encoding is invalid. Otherwise they return a valid M-String.

M_String M_String_new_ascii(M_Pool pool, size_t sz, const char* buf);

M_String M_String_new_utf8(M_Pool pool, size_t sz, const utf8_t* buf);

M_String M_String_new_utf16(M_Pool pool, size_t sz, const utf16_t* buf);

M_String M_String_new_utf32(M_Pool pool, size_t sz, const wchar_t* buf);

M_String M_String_new_encoded(M_Pool pool,
                              C_Symbol encoding,
                              size_t sz,
                              const octet_t* buf);

/* Create from a null-terminated string */
M_String M_String_new_from_cstring(M_Pool pool, const char* cstr);

Inspecting

Provides the character at the given index, the internal character encoding of the string, and the length of the string in Unicode code points, respectively.

wchar_t     M_String_char_at(M_String s, size_t i);
C_Symbol    M_String_encoding(M_String s);
const char* M_String_encoding_name(M_String s);
size_t      M_String_length(M_String s);

Provides the Memory Pool containing the string.

M_Pool M_String_pool(M_String s);

Copying

These functions copy part or all of a string into a buffer.

...to_byte copies characters as bytes of the string’s encoding.
...to_utf8 copies characters as a sequence of “UTF-8” bytes.
...to_utf32 copies characters as 32-bit Unicode code points.

Each takes an offest into the M-String, the maximum number of bytes to copy, and a pointer to a buffer (which should accomodate len bytes).

All return the number of actual bytes copied; the functions will only copy complete characters in a multi-byte or variable length character set.

size_t M_String_to_byte(M_String s,  size_t offset, size_t max, octet_t* buf);
size_t M_String_to_utf8(M_String s,  size_t offset, size_t max, utf8_t* buf);
size_t M_String_to_utf32(M_String s, size_t offset, size_t max, wchar_t* buf);

Iterating

This calls the function iter over each character, passing in the character, its index, and the user-provided data structure.

C lacks the ability to create functions within functions and the sort of lexical scoping rules to make this API truly useful. The data pointer only partly compensates.

void M_String_each(M_String s, 
                    void* data, 
                    void (*iter)(void*, size_t, wchar_t));

Slicing

Slicing creates a substring starting at the character indicated by first and ending on the character indicated by last (inclusive). The length of the string will be last - first, assuming both are positive and last >= first. If last < first, the length of the string will be 0. If first or last are negative, they indicate positions relative to the end of the string; subtract them from the length to get their absolute positions. (-1 is the last character of the string.)

M_String M_String_slice(M_String s, int first, int last);
M_String M_String_slice_from(M_String s, int first);
M_String M_String_slice_to(M_String s, int last);

Joining

Joining strings creates a single string with the sub-strings concatenated together.

M_String M_String_join(M_String head, M_String tail);
M_String M_String_join_n(size_t n, ...);

Bookkeeping

Methods to perform reference counting. Adding/removing references may invisibly change the string’s internal state (e.g. copy-on-retain to convert an external buffer to an internal one) but shouldn’t change the string’s value.

is_live returns whether the argument is still a usable object. If false, the M-String has been deallocated and garbage collected.

references returns the strings current reference count. This may be 0 if the string is either newly created or was just released within a Scope.

retain increments the reference count and returns a possibly altered reference. The new reference may contain encoded information but it points to the same M-String value (and usually the same internal object).

release decrements the reference count and returns a possibly altered reference. Usually this is to preserve the string while it’s returned from a function.

set is a convenience function to set (*lvalue) to rvalue while managing reference counts correctly. It always returns rvalue. It’s roughly analogous to C++’s “smart pointers”, but less convenient.

bool         M_String_is_live(M_String s);
unsigned int M_String_references(M_String s);

M_String     M_String_retain(M_String s);
M_String     M_String_release(M_String s);
M_String     M_String_set(M_String *lvalue, M_String rvalue);

Scopes

When a client produces a lot of temporary M-Strings in a procedure, they may want to pause garbage collection of those strings until the procedure, or a critical block therein, completes. The client would demarcate a “scope” that pauses collection for those strings. When the scope ends, garbage collection of those strings resumes.

A variable of type M_Scope represents the scope. It’s set when the scope begins, and cleared when it ends.

Nested scopes behave as one would assume, but if one scope ends before one or more scopes that it should be nested within, all those scopes will end at once. This prevents scopes from surviving indefinitely if a programmer forgot to end a scope. Begin a scope at the top of a procedure or code block, and end it at the end of the procedure or code block, and all should be well.

void M_Scope_begin(M_Scope *scope, M_Pool pool);

void M_Scope_end(M_Scope *scope);

Char Buffer

An array of Unicode code points, or, colloquially, chars.

void M_Char_Buffer_new(M_Char_Buffer *newref);

void M_Char_Buffer_new_from_string(M_Char_Buffer *newref, M_String str);

size_t M_Char_Buffer_length(M_Char_Buffer self);

wchar_t M_Char_Buffer_get(M_Char_Buffer self, int index);

M_String M_Char_Buffer_get_slice(M_Char_Buffer self, 
                                    int first, 
                                    int last);

void M_Char_Buffer_append(M_Char_Buffer self, wchar_t c);

void M_Char_Buffer_append_string(M_Char_Buffer self, M_String s);

void M_Char_Buffer_insert(M_Char_Buffer self, int index, wchar_t c);

void M_Char_Buffer_insert_string(M_Char_Buffer self, int index, M_String s);

void M_Char_Buffer_set(M_Char_Buffer self, int index, wchar_t c);

void M_Char_Buffer_set_slice(M_Char_Buffer self,
                                int first,
                                int last, 
                                M_String str);

void M_Char_Buffer_set_string(M_Char_Buffer self, int index, M_String s);

M_String M_Char_Buffer_to_string(M_Char_Buffer self, M_Pool pool);

Unlike M-Strings, Char Buffers aren’t garbage collected, merely reference counted with the following functions. Once a String Buffer’s reference count dips to zero, it’s immediately reclaimed. It’s safer to pass it out of a function in a pointer passed into a function than return it from the function.

bool M_Char_Buffer_is_live(M_Char_Buffer b);

void M_Char_Buffer_retain(M_Char_Buffer *bptr);

void M_Char_Buffer_release(M_Char_Buffer *bptr);

void M_Char_Buffer_set(M_Char_Buffer *lvalue, M_Char_Buffer rvalue);

String Array

An array of Strings.

void M_String_Array_new(M_String_Array *newref);

size_t M_String_Array_size(M_String_Array self);

M_String M_String_Array_get(M_String_Array self, int index);

void M_String_Array_append(M_String_Array self, M_String s);

void M_String_Array_insert(M_String_Array self, int index, M_String s);

void M_String_Array_set(M_String_Array self, int index, M_String s);

Unlike M-Strings, String Arrays aren’t garbage collected, merely reference counted with the following functions. Once a String Array’s reference count dips to zero, it’s immediately reclaimed. It’s safer to pass it out of a function in a pointer passed into a function than return it from the function.

bool M_String_Array_is_live(M_String_Array a);

void M_String_Array_retain(M_String_Array *aptr);

void M_String_Array_release(M_String_Array *aptr);

void M_String_Array_set(M_String_Array *lvalue, M_String_Array rvalue);

User Marking API

When a client’s data structure has a lot of M-Strings, it may be more efficient for them to list the strings they have during collection than to maintain continouus reference counts.

In this case the object should register a “marker” function. The function is always called with a user data structure registered with the function and a M_Root_Set. The Marker should add all M-Strings it currently uses to this Root Set. The M-Strings will then escape another collection cycle.

typedef void (*M_Marker)(void* data, M_Root_Set set);

void M_Root_Set_add(M_Root_Set set, M_String s);

void M_Pool_add_marker(M_Pool pool, void* ptr, M_Marker mfcn);

This might resemble the “userdata” type in the Lua C API. ↩︎
I’ve speculated on allowing arbitrary glyphs, implemented as bitmaps or font references, but for now I’ll forego that kind of freedom. ↩︎
It’s best to initialize M-String variables to avoid false positives in case the old stack value accidentally corresponds to a live M-String. ↩︎