Teufel Notes

Frank Mitchell

Posted: 2024-11-24
Last Modified: 2024-11-24
Word Count: 3481
Tags: c-programming programming vaporware

Table of Contents

Related Projects:

These are random notes on my pie-in-the-sky language project, Teufel, including excerpts from other blog posts.

Someday, I hope, I’ll look back at this and laugh.


Excerpted from Unwanted Software Thoughts Syndrome:

Teufel

Well over a decade ago, I read Bertrand Meyer’s Object Oriented Software Construction, and thought it the bee’s knees. Since then, I’ve had experience with functional languages, dynamically-typed languages, and distributed computing. Meyer’s approach, and the Eiffel language, is one way to construct systems. As a whole, though, it’s a slow, brittle, and sometimes over-complicated way. Eiffel’s design also reflects not only Meyer’s personal preferences (unsurprisingly) but an earlier era where memory and processor cycles were much more scarce. Scripting languages like Lua, Ruby, and (ugh) JavaScript provide a complementary construction technique: less efficient but more flexible glue between Eiffel style “reliable components”. The nature of the Web and distributed computing also argues for a more function-oriented approach. A lot of Web programming is transforming GET and POST requests into HTML (or, lately, JSON) responses, with a few side effects.

Also, all the free Eiffel compilers are broken, and Meyer’s company offers only crippled “evaluation” versions.

So in a fit of hubris I decided to write my own language, Teufel, with the following features:

That’s the grand plan, anyway. However, it’s probably better I do this in baby steps.

M-Strings

A hand-written library for memory managed strings in C. Yes, C++ has ref-counted strings, but a) I dislike C++ and b) I’d like some hands-on experience with automatic memory management. One of the major pain points in working with straight C is memory management in general and strings in particular. I might as well start with something useful.

Teufel Interface Definition Language (TIDL)

An IDL that is essentially the datatype and interface parts of the full Teufel language, with an accompanying code generator. It will generate stub and skeleton code in C much like CORBA. An extension will generate skeleton code for modules in Ruby, Lua, Python, and other scripting languages, not unlike SWIG.

M-Lib/T-Lib

A library of other useful data structures, probably developed as a side effect of M-Strings and using similar memory managed techniques.

Teufel Compiler, Mark 1

Combining TIDL with a language for implementing functions and objects, we now have a Turing-complete language for standalone programs. The first version will probably be an executable JAR file, both for portability and to take advantage of ANTLR. Finding a workable compromise between high-level scripting operations and low-level bit-crunching may require numerous iterations, and being able to change the parser easily will be a great help. Later I’ll worry about writing a Teufel compiler in Teufel.

Teufel Compiler, Mark 2

Once I’ve got a workable language, the next issue is the mapping to C. Eiffel’s official mechanism for incorporating C libraries is kind of a mess, and I’m not sure the SmartEiffel/LibreEiffel implementation is much better. I prefer the Lua model: a reentrant C API into the Teufel “runtime”. Rather than Lua’s stack discipline, though, I’d rather generate “object oriented” C methods. E.g. something like this:

r = obj.some_routine(a, b, c)

turns into this:

int errcode = Interface_Name_some_routine(obj, a, b, c, &r);

obj is an opaque reference value, which may or may not encode a pointer. Similar methods would wrap data types, primarily so programmers can’t accidentally or intentionally change their value.

Teufel Compiler, Mark 3+

Then it’s time to deliver on other features, such as autogenerated server code, multithreaded server configuration files, output in Java or scripting languages (perhaps for proxies), output as native modules in Java or scripting languages, and control of all options through a single ACT file. As syntax and semantics stabilize, I’ll retire the ANTLR-based compiler in favor of one written entirely in Teufel.

Schedule

Expect the first version of Teufel … well, don’t expect any version. There’s the software you write, and then there’s the software you talk about.


Excerpted from Teufel Again, with modifications.

The Teufel Type System

The Teufel type system would include these not quite disjoint types.

Any

Any is the supertype of all runtime entities below. If a type is declared Any, programs would need to narrow the type before they could use the value.

Depending on the level of runtime checking, all runtime entities have a Type_Def object that details the types of input parameters and output parameters for Datatypes, Routines, and Object messages, including concrete bindings for all type variables.

Any type and subtypes

None

None is the type with no values. It’s equivalent to void in C or None in Python. If the syntax requires a type but a Routine returns nothing, it’s implicitly or explicitly None.

The implementation may use a placeholder value equivalent to an empty Tuple or empty List in some cases. It will not have a single explicit null, nil, None, or undefined value; that will be left to Datatypes.

Datatype

Datatypes represent immutable data structures with public members. Syntactically a Datatype can only consist of references to Objects or other datatypes. Some may be “polymorphic” in the sense that a C union is polymorphic; all alternate forms are defined in the same spot.

Built-in Datatypes might include:

Except for Tuple, all of these (hopefully) can be defined within the language itself.

Because they’re immutable, threads can copy or pass datatypes, both their values and their metadata, freely among themselves.

Datatype type

Atomic Datatypes

Important subtypes of Datatype includes built-in “atomic types”, including:

The Teufel syntax might allow derived types that restrict an Atomic Datatype to certain ranges of type as in Pascal, most notably numbers. At runtime, though, operations on any atomic types produces a value general enough to accomodate the result.

Atomic Datatypes

Object

Object types have an identity and contain mutable data. An Object’s full type is defined by exactly one Class and one or more Protocols. Built-in Object types (Classes) include:

Because they have identity and mutable state, all object instances stay local to only one thread to reduce synchronization. Migrating between threads requires them to serialize their state, reinstantiate themselves on a new thread, and which most applications would not need or want.

Object type

Routine

Routine types have a signature of input types and output types. (Plural). Routines may be further categorized into:

  1. Pure Functions, which manipulate only immutable types and must have a return value.
  2. Predicates, which return only a boolean value.
  3. Functions which return at least one value.
  4. Procedures, which return no values.

(Note these categories are non-exclusive: A Routine may be a Pure Function, a Predicate, and a Function.)

Each Routine may also have:

Why preconditions? Multiple routines might have the same signature, yet take very different data values. Design by Contract, while somewhat redundant with software unit testing, attaches assertions to the code itself rather than to external code. The code to test assertions can be disabled at compile-time or runtime.

All threads share the code and metadata of Routines, or at least as much as they need to.

Routine type

Protocol

A Protocol consists of a name, zero or more inherited Protocols, zero or more Constants, and a list of Message definitions. Each Message definition, in turn, consists of:

Each Protocol also includes zero or more invariants, which are preconditions and postconditions on an implementation’s entire observable state.

Semantically, any Routine used to implement a Message must conform to the preconditions and postconditions. They may widen the preconditions and narrow the postconditions, but not the other way around.

Preconditions, postconditions, and invariants must be stored in the Type_Def as a parse tree of a boolean expression, so that clients using only a compiled library can insert those assertions into their own code.

All threads share the metadata of Protocols, although keeping multiple distributed processes in sync can be a challenge.

Protocol type

Class

A Class, as stated previously, consists of one or more Protocols, zero or more generic type parameters (bound or filled), a set of Routines to create Class instances, and (effectively) a table that maps a Message to a “Feature”3, which is either a Routine or an instance variable

Note that classes do not inherit from each other. At all. This forces programmers to choose composition over inheritance … a bit draconian, but it makes implementation much easier.4

All Classes have at least one Protocol. If they do not explicitly inherit from one, the syntax will allow the source code to designate certain features “public”, creating a Protocol with the same name as the Class. Otherwise, all class features are “private”, accessed directly or indirectly through Protocol messages.5

All threads keep the exact implementation of Classes private, although in practice threads will share Class metadata and code.

Class type


The Plan

Rereading my earlier blathering, I still think the following path is the best way to sneak up on this beast, if that’s what I want:

  1. Design a CORBA-lite/Objective-C-esque object model for Object, Protocol, and Class portions of Teufel.

  2. Write a compiler that generates Protocol code for said object model. Protocols intially will be implemented in another language, using glue code and C header files.

  3. Write a runtime for the “atomic types”: String, Number, Timestamp, etc. Also implement some sort of garbage collection, which might be as simple as linking in the BDW collector.

  4. Extend the compiler to express Datatypes and declare Routines implemented in another language.

  5. Add assertions to Protocols and Routine declarations.

  6. Since I’ll have to add expressions in order to implement assertions, I can hopefully extend the language to write functions and Routines that operate on Datatypes, Object references, and Routines.

  7. At that point I can worry about how to implement Classes completely in Teufel, as well as a clean mechanism to call out of Teufel into C (and maybe Rust).

  8. Close the holes in my type system using generics and a mini-language for expressing types, much like TypeScript and Java do. (Minus the type erasure: data structures and especially objects will treat type parameters as actual parameters, and by default check types every time a Datatype is created or an Object is mutated.)

  9. With all that, I can reimplement a compiler in Teufel, which is apparently a milestone in Turing completeness or some such.

  10. At that point I can revisit the threading / distributed execution implementation, implement a packaging and module system, and do all the other bits that would make Teufel not just another language. (Maybe I will have done Lua-CSP byt that time, so I can just replace the Lua interpreters with Teufel nodes.)

If I pursue this course, it will probably take me, at my current rate … a decade? A decade and a half? For something nobody asked for?


The Runtime Model

Teufel will compile to machine code, probably by first rendering the program into C. (Or sometimes Java. Maybe Rust?) Unlike C++ and other low-level object-oriented languages, the runtime will include enough structures to allow for higher-order functions, type-safe generic datatypes, and dynamic dispatch on objects.

Runtime Any Tags

Each datatype instance will carry a “tag” – an index to a metadata table – identifying what it is. For type safety, each routine manipulating a datatype can check that tag to determine how to handle it, if the routine should be handling it at all. All Routines will have a similar tag to indicate the runtime should check its metadata for arguments and results. All Objects will have tags denoting the specific Object_Def metadata that resolves that instance’s generic type parameters for all messages it supports.

Tags and type-checking can be turned off with a compiler flag to optimize the code, at some cost to safety.

Runtime Datatypes

Datatypes will be implemented internally as C-style structs or primitive types in the underlying implementation language.

Runtime Numbers

Since the full numeric stack, not to mention Bit_Set and Timestamp, may require overloading some operators, the language and runtime will have to support one of the following solutions:

Runtime Routines

Routines will push tagged datatypes onto a call stack (not necessarily the C stack), manipulate them with compiled code, and push the results onto the same stack. (One could also use a virtual machine here, but I’d rather compile to real machine code.)

Routines will also include closures, which capture the definitions of variables lexically defined outside them but used in the procedures. The runtime will also support what Python calls “partials”, that is derived functions based on an existing function with arguments filled in.

In Teufel, Routines will be first class values.

Runtime Objects, Classes, and Protocols

To a certain extent objects in Teufel will resemble objects in Objective-C, Python, Ruby, and many other dynamic object-oriented languages. Each Object instance has a pointer to its Class. When a message comes in, the Class locates the correct implementation of the message, marshals arguments onto the call stack and calls the appropriate Routine or derefrences an instance variable. Once the call completes, it places the result(s) on the call stack, and continues execution of the caller.

In later versions of the compiler, it will detect that certain classes or instances are unreachable from outside a module and optimize this hashtable lookup into pure machine code using simple conditionals, as if it were a Datatype.

Runtime Processes and Threads

Every thread boundary, process boundary, or foreign runtime (e.g. C to Rust or C to Python) forms a Domain on either side of the boundary. All messages, datatypes, routine references, and object references have to be marshalled across the boundary through an interthread, interprocess, or runtime-based Channel. (Routines themselves cannot be marshalled unless the code is already present on both sides of the channel.)

To at least detect Protocol version mismatches, if not adapt to them, all messages will contain metadata, rather than a solid block of bytes. When serialized over a Channel messages, datatypes, and object/routine references will be translated into JSON, BSON, MessagePack or a similar interoperable format, translating structures as necessary. For example, JSON “Arrays” will replace tuples and lists, and JSON “Objects” will replace the rest of the struct according to the data structure’s metadata.

Since only Objects may reside on different threads, the Message Call syntax may be augmented to specify that a message is oneway or that the sender will only wait only a specified time for a response. The type checker will need to be aware whether an object reference is a remote object or not, maybe through a special keyword. Worst case, we may have to use two different syntaxes for intra-Domain and inter-Domain communication, e.g. JavaScript’s async/await.


  1. After Eiffel’s ACE (Assembling Classes in Eiffel) file. ↩︎

  2. A value that can be encoded in a pointer, rather than having a pointer to heap (or stack?) memory. (A trick from Ruby, and undoubtedly many Scheme and Smalltalk interpreters.) ↩︎ ↩︎ ↩︎ ↩︎

  3. A term borrowed from Eiffel, Teufel’s sort of namesake. ↩︎

  4. One reason C++ puts member variables in its header files is that anything that inherits from a C++ class has to reserve room for the superclass variables. (Objective-C had the same problem.) With no implementation inheritance, that problem goes away. Teufel datatypes would compile into C as structs, protocols and classes into runtime data structures with associated code, and object instances into mutable structs with a pointer to the class dispatch table. ↩︎

  5. Java classes, for example, also have protected and package private members to define access to class members from subclasses and from members of the same package. I’m disallowing subclasses, so the first is not needed. As for the second, I’d prefer to restrict visibility through an external “module” system analogous to Java 9 modules over something like Java packages or Eiffel’s potentially chaotic visibility through naming specific classes in the exporting code itself. ↩︎