Purpose
The M-String library provides C with two features present in higher-level languages:
- strings that are full-featured objects.
- memory management.
Right now memory management is limited to said strings, but a future release will provide a general Memory Manager that can handle arbitrary blocks of memory.1 The author chose strings as a test case because a garbage collector for strings is almost trivial, the lack of memory-managed strings was a particular pain point of the author’s, and if the interface between managed and un-managed code worked for ubiquitous strings it would work for anything.
Concepts
Polymorphic Strings
Notionally an M_String
contains a sequence of characters, realized as
Unicode code points.2 Since the largest code point defined to date
is 0x10FFFF, each character would take 21 bits to represent, or for efficiency
an entire 32-bit word. That’s way too much.
Instead the code represents strings in multiple ways:
- emptystring: The empty string is an immediate value. That is, instead of pointing to a data structure, its identity is encoded in the pointer itself.
- smallstring: Each string with one character is also an immediate value represented within 32 bits.
- utf8string: Most multi-character strings use UTF-8, a variable encoding that represents ASCII characters 0-127 as one byte and all currently defined characters with no more than four bytes.
- encodedstring: Strings read in with a character set besides Unicode or ASCII remain in that encoding, tagged with the name of the character set. Depending on the translation, the implementation may generate a parallel UTF-8 sequence or simply convert on the fly. For example, “ISO-Latin-1” is simply the first 256 elements of Unicode stored in single bytes. Another custom library, CConv performs all character conversions.
From the outside, though, all strings look like a sequence of Unicode
code points. One can only tell a string’s format from its length and its
encoding
attribute (“UTF-32” with length ≤ 1, “ASCII” or “UTF-8”, or other).
Immutable Strings
The author chose to implement the immutable strings of Java, Lua, Python, and various functional programming languages instead of the mutable strings of C, Eiffel, and Ruby for a few reasons:
-
Immutable objects can be passed between threads without synchronization. In theory, at least; see below.
-
Given their ubiquity, one wants strings to have the same value semantics as numbers, booleans, and other common types. Here, “value semantics” means that the data is an anonymous bit-pattern in an infinite space of bit-patterns, not an “object” with mutable state and behavior. One can change a variable’s value from three to four, but one can’t change “four” to mean
5
. (Except in Fortran sometimes.) -
Many times one wants a string value to behave as a constant. Mutable strings must implement a way to “freeze” the string, which disables all routines that change the string. The author believes this design pattern is sub-optimal:
-
Often the API to deal efficiently with the string requires changing it in place. A programmer must therefore copy the immutable string to a mutable one before changing it.
-
If called, disabled methods either throw an exception or do nothing. In a statically-typed language using dynamic typing to signal that an operation isn’t possible, either quietly or loudly, seems fundamentally wrong.
-
Often there’s no way to tell whether an object is mutable save by attempting to mutate it. Even if there is an
is_mutable
query both mutable and immutable versions have the same class and can substitute for each other … even though they really can’t.
The author prefers the pattern for Java string-related classes: all classes – String, StringBuffer, StringBuilder, etc. – implement the
CharSequence
interface for their common operations, but String is immutable. All others exist to build Strings or otherwise substitute for them in specific instances. -
One can implement correct if not always efficient operations through the basic operations of inspecting, slicing, and joining strings. The implementation also handles the empty string and strings of one character efficiently. (Or it will do so at some point.) Future versions will use the ropes or cords approach of representing long strings as trees of sliced and joined segments.
For more complicated transformations not provided in the library,
M_Char_Buffer
may satisfy some but not all needs.
The author suggests the M_String_to_{byte,utf8,utf32}_array
functions
which copy the contents to a buffer of UTF-8, UTF-32, or internally encoded
bytes. From there one can reverse them, parse them, or do whatever else
needed with C-like efficiency then create a new string with the results.
Memory Management
Unlike most structures in C which require manual memory management, the point of M-Strings is that users can create them freely and depend on the M-Pool to remove them when they’re no longer needed. In order to do its work, however, an M-Pool needs cooperation from application code. This cooperation comes from reference counting and marking.
Reference Counting
Whenever the application stores an M-String in a variable, it should “retain” the string like so:
M_String s = M_String_retain(M_String_new_from_cstring(pool, "foo"));
Likewise, when the application no longer needs it, it should “release” it:
MString s = M_String_release(s):
The ...set
function handles reference counts correctly.3
M_String s = M_REF_NULL; /* 0 will also work */
M_String_set(&s, M_String_new_from_cstring(pool, "foo"));
One can check the exact reference count with M_String_references(...)
;
Marking
Once an M-String’s reference count goes to 0, the M-String isn’t immediately
collected. If it’s held in a Scope it won’t be collected until
that Scope ends. M-Pools also reserve collection until the application
creates another M-String or explicitly calls M_Pool_clean(...)
.
In both cases, the M-Pool does a full mark-and-sweep garbage collection cycle to verify that an M-String isn’t being used somewhere.
Applications that hang onto M-Strings without reference counting can participate in garbage collection with something like the following:
struct App_Struct {
...
M_String name;
...
}
void mark_app_name(App_Struct* app, M_Root_Set set) {
M_Root_Set_add(set, app->name);
}
/* somewhere in the initialization code */
App_Struct app;
M_Pool p;
...
app->name = M_String_new_from_cstring(pool, "MyApplication");
M_Pool_add_marker(p, app, &mark_app_name);
...
Scopes
Sometimes the application will create a bunch of temporary strings. To keep them from being reclaimed while the application is using them,
void parse_string(M_String_Array tokens, M_String text) {
M_Scope scope;
M_Scope_begin(&scope, M_String_pool(text));
/*
* Multiple instances of M_String slicing and joining redacted
* to avoid offending those of a delicate disposition.
*/
M_Scope_end(&scope);
}
Only the M-Strings appended to tokens
will survive garbage collection.
The rest will be collected shortly after M_Scope_end(&scope)
;
Multithreading
The default implementation of a Memory Pool assumes it is thread-local. That is, only one thread will use a Memory Pool or its M-Strings. To move strings between Pools, one can simply copy strings. Identical copies are, for all practical purposes, the same string.
A future release will implement a fully thread-safe Memory Pool usable for all strings in a multithreaded application.
API
The M-String API will look something like this.
Types
Note that this library uses CSymbol.
/* defined in C99 */
#include <stdbool.h> /* defines bool */
#include <stddef.h> /* defines int */
#include <stdint.h> /* defines uint8_t, utf16_t, uintptr_t */
#include <wchar.h> /* wchar_t */
#include "csymbol.h"
typedef uint8_t octet_t;
typedef uint8_t utf8_t;
typedef uint16_t utf16_t;
/*
* Types implemented as opaque references.
*/
typedef uintptr_t M_String;
const M_String M_REF_NULL = 0;
/*
* Types implemented as (hidden) structs.
*/
typedef struct _M_Char_Buffer* M_Char_Buffer;
typedef struct _M_Pool* M_Pool;
typedef struct _M_Root_Set* M_Root_Set;
typedef struct _M_Scope* M_Scope;
typedef struct _M_String_Array* M_String_Array;
Memory Pool
/**
* Create a reentrant Memory Pool to manage M-String memory
* and write the reference into `(*pptr)`.
* The current thread should hang onto this reference until it's about
* to exit, then call `M_Pool_close` on it.
*/
void M_Pool_new(M_Pool *pptr);
/**
* Collect all strings not in use.
* "In-use", here, means either with a positive reference count
* *or* referenced in managed memory and cooperating non-managed memory.
* Typically a thread will call this function in a main loop or other
* place where the application is waiting for user interaction.
*/
void M_Pool_clean(MPool pool);
/**
* Delete the Memory Pool and clear the reference in `(*pptr)`.
* This halts any calls to M_Markers and
* deallocates all M_Strings, M_Char_Buffers, and M_Scopes from this pool.
*/
void M_Pool_release(M_Pool *pptr);
Error
typedef enum _M_String_Error {
STATUS_OK = 0,
ERROR_NULL_POINTER,
ERROR_UNKNOWN_CHARSET,
ERROR_STRING_ENCODING,
/* more TBD ...*/
ERROR_UNDEFINED = 0xFFFF
} M_String_Error;
bool M_Pool_has_error(M_Pool pool);
M_String_Error M_Pool_error_code(M_Pool pool);
const char* M_Pool_error_detail(M_Pool pool);
String
Note that this library uses CSymbol.
Creating
All functions but one take a sz
parameter with the number of elements
to be converted to characters in an M-String.
...from_cstring
assumes a null-terminated string.
Each method returns M_REF_NULL
and sets the M-Pool’s error codes
if the arguments or the character encoding is invalid.
Otherwise they return a valid M-String.
M_String M_String_new_ascii(M_Pool pool, size_t sz, const char* buf);
M_String M_String_new_utf8(M_Pool pool, size_t sz, const utf8_t* buf);
M_String M_String_new_utf16(M_Pool pool, size_t sz, const utf16_t* buf);
M_String M_String_new_utf32(M_Pool pool, size_t sz, const wchar_t* buf);
M_String M_String_new_encoded(M_Pool pool,
C_Symbol encoding,
size_t sz,
const octet_t* buf);
/* Create from a null-terminated string */
M_String M_String_new_from_cstring(M_Pool pool, const char* cstr);
Inspecting
Provides the character at the given index, the internal character encoding of the string, and the length of the string in Unicode code points, respectively.
wchar_t M_String_char_at(M_String s, size_t i);
C_Symbol M_String_encoding(M_String s);
const char* M_String_encoding_name(M_String s);
size_t M_String_length(M_String s);
Provides the Memory Pool containing the string.
M_Pool M_String_pool(M_String s);
Copying
These functions copy part or all of a string into a buffer.
...to_byte
copies characters as bytes of the string’sencoding
....to_utf8
copies characters as a sequence of “UTF-8” bytes....to_utf32
copies characters as 32-bit Unicode code points.
Each takes an offest into the M-String,
the maximum number of bytes to copy,
and a pointer to a buffer (which should accomodate len
bytes).
All return the number of actual bytes copied; the functions will only copy complete characters in a multi-byte or variable length character set.
size_t M_String_to_byte(M_String s, size_t offset, size_t max, octet_t* buf);
size_t M_String_to_utf8(M_String s, size_t offset, size_t max, utf8_t* buf);
size_t M_String_to_utf32(M_String s, size_t offset, size_t max, wchar_t* buf);
Iterating
This calls the function iter
over each character, passing in the character,
its index, and the user-provided data
structure.
C lacks the ability to create functions within functions and the sort of
lexical scoping rules to make this API truly useful.
The data
pointer only partly compensates.
void M_String_each(M_String s,
void* data,
void (*iter)(void*, size_t, wchar_t));
Slicing
Slicing creates a substring starting at the character indicated by first
and ending on the character indicated by last
(inclusive).
The length of the string will be last
- first
,
assuming both are positive and last
>= first
.
If last
< first
, the length of the string will be 0.
If first
or last
are negative, they indicate positions relative
to the end of the string; subtract them from the length to get
their absolute positions. (-1 is the last character of the string.)
M_String M_String_slice(M_String s, int first, int last);
M_String M_String_slice_from(M_String s, int first);
M_String M_String_slice_to(M_String s, int last);
Joining
Joining strings creates a single string with the sub-strings concatenated together.
M_String M_String_join(M_String head, M_String tail);
M_String M_String_join_n(size_t n, ...);
Bookkeeping
Methods to perform reference counting. Adding/removing references may invisibly change the string’s internal state (e.g. copy-on-retain to convert an external buffer to an internal one) but shouldn’t change the string’s value.
is_live
returns whether the argument is still a usable object.
If false, the M-String has been deallocated and garbage collected.
references
returns the strings current reference count.
This may be 0 if the string is either newly created or was
just released within a Scope.
retain
increments the reference count and returns a possibly altered
reference.
The new reference may contain encoded information but it points to
the same M-String value (and usually the same internal object).
release
decrements the reference count and returns a possibly altered
reference.
Usually this is to preserve the string while it’s returned from a function.
set
is a convenience function to set (*lvalue)
to rvalue
while
managing reference counts correctly. It always returns rvalue
.
It’s roughly analogous to C++’s “smart pointers”, but less convenient.
bool M_String_is_live(M_String s);
unsigned int M_String_references(M_String s);
M_String M_String_retain(M_String s);
M_String M_String_release(M_String s);
M_String M_String_set(M_String *lvalue, M_String rvalue);
Scopes
When a client produces a lot of temporary M-Strings in a procedure, they may want to pause garbage collection of those strings until the procedure, or a critical block therein, completes. The client would demarcate a “scope” that pauses collection for those strings. When the scope ends, garbage collection of those strings resumes.
A variable of type M_Scope
represents the scope. It’s set when the
scope begins, and cleared when it ends.
Nested scopes behave as one would assume, but if one scope ends before one or more scopes that it should be nested within, all those scopes will end at once. This prevents scopes from surviving indefinitely if a programmer forgot to end a scope. Begin a scope at the top of a procedure or code block, and end it at the end of the procedure or code block, and all should be well.
void M_Scope_begin(M_Scope *scope, M_Pool pool);
void M_Scope_end(M_Scope *scope);
Char Buffer
An array of Unicode code points, or, colloquially, char
s.
void M_Char_Buffer_new(M_Char_Buffer *newref);
void M_Char_Buffer_new_from_string(M_Char_Buffer *newref, M_String str);
size_t M_Char_Buffer_length(M_Char_Buffer self);
wchar_t M_Char_Buffer_get(M_Char_Buffer self, int index);
M_String M_Char_Buffer_get_slice(M_Char_Buffer self,
int first,
int last);
void M_Char_Buffer_append(M_Char_Buffer self, wchar_t c);
void M_Char_Buffer_append_string(M_Char_Buffer self, M_String s);
void M_Char_Buffer_insert(M_Char_Buffer self, int index, wchar_t c);
void M_Char_Buffer_insert_string(M_Char_Buffer self, int index, M_String s);
void M_Char_Buffer_set(M_Char_Buffer self, int index, wchar_t c);
void M_Char_Buffer_set_slice(M_Char_Buffer self,
int first,
int last,
M_String str);
void M_Char_Buffer_set_string(M_Char_Buffer self, int index, M_String s);
M_String M_Char_Buffer_to_string(M_Char_Buffer self, M_Pool pool);
Unlike M-Strings, Char Buffers aren’t garbage collected, merely reference counted with the following functions. Once a String Buffer’s reference count dips to zero, it’s immediately reclaimed. It’s safer to pass it out of a function in a pointer passed into a function than return it from the function.
bool M_Char_Buffer_is_live(M_Char_Buffer b);
void M_Char_Buffer_retain(M_Char_Buffer *bptr);
void M_Char_Buffer_release(M_Char_Buffer *bptr);
void M_Char_Buffer_set(M_Char_Buffer *lvalue, M_Char_Buffer rvalue);
String Array
An array of Strings.
void M_String_Array_new(M_String_Array *newref);
size_t M_String_Array_size(M_String_Array self);
M_String M_String_Array_get(M_String_Array self, int index);
void M_String_Array_append(M_String_Array self, M_String s);
void M_String_Array_insert(M_String_Array self, int index, M_String s);
void M_String_Array_set(M_String_Array self, int index, M_String s);
Unlike M-Strings, String Arrays aren’t garbage collected, merely reference counted with the following functions. Once a String Array’s reference count dips to zero, it’s immediately reclaimed. It’s safer to pass it out of a function in a pointer passed into a function than return it from the function.
bool M_String_Array_is_live(M_String_Array a);
void M_String_Array_retain(M_String_Array *aptr);
void M_String_Array_release(M_String_Array *aptr);
void M_String_Array_set(M_String_Array *lvalue, M_String_Array rvalue);
User Marking API
When a client’s data structure has a lot of M-Strings, it may be more efficient for them to list the strings they have during collection than to maintain continouus reference counts.
In this case the object should register a “marker” function.
The function is always called with a user data structure registered with
the function and a M_Root_Set
. The Marker should add all M-Strings
it currently uses to this Root Set. The M-Strings will then escape
another collection cycle.
typedef void (*M_Marker)(void* data, M_Root_Set set);
void M_Root_Set_add(M_Root_Set set, M_String s);
void M_Pool_add_marker(M_Pool pool, void* ptr, M_Marker mfcn);
-
This might resemble the “userdata” type in the Lua C API. ↩︎
-
I’ve speculated on allowing arbitrary glyphs, implemented as bitmaps or font references, but for now I’ll forego that kind of freedom. ↩︎
-
It’s best to initialize M-String variables to avoid false positives in case the old stack value accidentally corresponds to a live M-String. ↩︎