Related Projects:
These are random notes on my pie-in-the-sky language project, Teufel, including excerpts from other blog posts.
Someday, I hope, I’ll look back at this and laugh.
Excerpted from Unwanted Software Thoughts Syndrome:
Teufel
Well over a decade ago, I read Bertrand Meyer’s Object Oriented Software Construction, and thought it the bee’s knees. Since then, I’ve had experience with functional languages, dynamically-typed languages, and distributed computing. Meyer’s approach, and the Eiffel language, is one way to construct systems. As a whole, though, it’s a slow, brittle, and sometimes over-complicated way. Eiffel’s design also reflects not only Meyer’s personal preferences (unsurprisingly) but an earlier era where memory and processor cycles were much more scarce. Scripting languages like Lua, Ruby, and (ugh) JavaScript provide a complementary construction technique: less efficient but more flexible glue between Eiffel style “reliable components”. The nature of the Web and distributed computing also argues for a more function-oriented approach. A lot of Web programming is transforming GET and POST requests into HTML (or, lately, JSON) responses, with a few side effects.
Also, all the free Eiffel compilers are broken, and Meyer’s company offers only crippled “evaluation” versions.
So in a fit of hubris I decided to write my own language, Teufel, with the following features:
-
Static, explicit typing … mostly.
-
Pre- and post-conditions like Eiffel’s “Design by Contract”, with some extensions.
-
A strict line between “data” types with value semantics and “object” types with reference semantics, mutable state, and identity.
-
Standalone functions as first-order types.
-
Interface inheritance without implementation inheritance. To reuse implementations, wrap them in functions and use them in class routines.
-
An “interface” sub-language combining the aforementioned conditions with ideas from IDL and functional programming. Notably, SML-style “datatypes” would be part of the interface.
-
Conventions to make writing implementation classes easier. For example, if an interface specifies
foo
as an immutable list of type X, and an instance variablefoo
is a mutable collection of type X, then that satisfies the interface requirement, as collections automatically convert into lists of their contents. -
Partial compilation, which the free Eiffel compilers I’ve seen couldn’t do. A Teufel module or package consists of a JSON file detailing the package’s public types, interfaces, and functions, plus C header files, compiled libraries, Java JAR files, scripts, or whatever else a target language requires to build executable code.
-
Instead of standard Make, Teufel primarily uses an ACT file1 which details:
- whether the final product is a standalone program or a package
- if a standalone program, what its “main” function is, or if a server what remote objects it exports and with what protocol(s).
- the target language(s) of generated code (default: C).
- options for mapping Teufel types to target language constructs.
- other compile-time options, e.g. library paths, external libraries, and compiler flags for C.
That’s the grand plan, anyway. However, it’s probably better I do this in baby steps.
M-Strings
A hand-written library for memory managed strings in C. Yes, C++ has ref-counted strings, but a) I dislike C++ and b) I’d like some hands-on experience with automatic memory management. One of the major pain points in working with straight C is memory management in general and strings in particular. I might as well start with something useful.
Teufel Interface Definition Language (TIDL)
An IDL that is essentially the datatype and interface parts of the full Teufel language, with an accompanying code generator. It will generate stub and skeleton code in C much like CORBA. An extension will generate skeleton code for modules in Ruby, Lua, Python, and other scripting languages, not unlike SWIG.
M-Lib/T-Lib
A library of other useful data structures, probably developed as a side effect of M-Strings and using similar memory managed techniques.
Teufel Compiler, Mark 1
Combining TIDL with a language for implementing functions and objects, we now have a Turing-complete language for standalone programs. The first version will probably be an executable JAR file, both for portability and to take advantage of ANTLR. Finding a workable compromise between high-level scripting operations and low-level bit-crunching may require numerous iterations, and being able to change the parser easily will be a great help. Later I’ll worry about writing a Teufel compiler in Teufel.
Teufel Compiler, Mark 2
Once I’ve got a workable language, the next issue is the mapping to C. Eiffel’s official mechanism for incorporating C libraries is kind of a mess, and I’m not sure the SmartEiffel/LibreEiffel implementation is much better. I prefer the Lua model: a reentrant C API into the Teufel “runtime”. Rather than Lua’s stack discipline, though, I’d rather generate “object oriented” C methods. E.g. something like this:
r = obj.some_routine(a, b, c)
turns into this:
int errcode = Interface_Name_some_routine(obj, a, b, c, &r);
obj
is an opaque reference value, which may or may not encode a pointer.
Similar methods would wrap data types, primarily so programmers can’t
accidentally or intentionally change their value.
Teufel Compiler, Mark 3+
Then it’s time to deliver on other features, such as autogenerated server code, multithreaded server configuration files, output in Java or scripting languages (perhaps for proxies), output as native modules in Java or scripting languages, and control of all options through a single ACT file. As syntax and semantics stabilize, I’ll retire the ANTLR-based compiler in favor of one written entirely in Teufel.
Schedule
Expect the first version of Teufel … well, don’t expect any version. There’s the software you write, and then there’s the software you talk about.
Excerpted from Teufel Again, with modifications.
The Teufel Type System
The Teufel type system would include these not quite disjoint types.
Any
Any is the supertype of all runtime entities below.
If a type is declared Any
, programs would need to narrow the type
before they could use the value.
Depending on the level of runtime checking, all runtime entities have a Type_Def object that details the types of input parameters and output parameters for Datatypes, Routines, and Object messages, including concrete bindings for all type variables.
None
None is the type with no values. It’s equivalent to void
in C
or None
in Python. If the syntax requires a type but a Routine returns
nothing, it’s implicitly or explicitly None
.
The implementation may use a placeholder value equivalent to an empty Tuple
or empty List in some cases. It will not have a single explicit null
,
nil
, None
, or undefined
value; that will be left to Datatypes.
Datatype
Datatypes represent immutable data structures with public members.
Syntactically a Datatype can only consist of references to Objects
or other datatypes. Some may be “polymorphic” in the sense that a C union
is polymorphic; all alternate forms are defined in the same spot.
Built-in Datatypes might include:
- List[T], which has the subtypes List.Empty and List.Cons(T, List[T]). It represents a (linked) sequence of immutable data.
- Option[T], which has the subtypes Option.None and Option.Some[T]. It represents an immutable value which may be “null”.
- Result[T, E], which has the subtypes Result.Ok[T] and Result.Err[string, E]. It represents a return value which may result in an error.
- Tuple[T1, T2, …], which has innumerable subtypes. It represents a fixed-sized sequence of immutable data.
Except for Tuple, all of these (hopefully) can be defined within the language itself.
Because they’re immutable, threads can copy or pass datatypes, both their values and their metadata, freely among themselves.
Atomic Datatypes
Important subtypes of Datatype includes built-in “atomic types”, including:
-
Bit_Set: a potentially unlimited set of “on” bits.
-
Boolean: either
true
orfalse
, as an immediate2 value. -
Number: numeric values, variously represented internally as:
- Complex: a complex number composed of two Real values (one of the subtypes below)
- Integer: an integral number, further subtyped as:
- Big_Integer: an infinite precision integer using a popular library.
- Fixed_Integer: a fixed width integer, implemented as an immediate2 value.
- Decimal: a decimal value implemented as two Integer values.
- Float: a floating point number.
- Rational: a ratio of two Integer values.
-
String: Immutable Unicode strings, with the basic operations of concatenation, indexing, and Python-like slicing. Notionally, as in Python, each multi-character String is made up of one-character Strings. Internally the implementation will use the following subtypes:
- Empty_String: an empty string, implemented as an immediate2 value.
- Character_String: a single unicode character, implemented as an immediate2 value.
- UTF_8_String: a sequence of Unicode characters encoded in UTF-8.
- UTF_16_String: a sequence of Unicode characters encoded in UTF-16.
- UTF_32_String: a sequence of Unicode characters encoded as code points.
-
Timestamp: A distinct UNIX-like timestamp type, because a) every language acquires a “date”, “time”, or “datetime” type sooner or later, but b) dates and calendars are hard. Timestamps can only be subtracted from each other, yielding a Number of seconds, or added to a Number yielding another Timetamp.
The Teufel syntax might allow derived types that restrict an Atomic Datatype to certain ranges of type as in Pascal, most notably numbers. At runtime, though, operations on any atomic types produces a value general enough to accomodate the result.
Object
Object types have an identity and contain mutable data. An Object’s full type is defined by exactly one Class and one or more Protocols. Built-in Object types (Classes) include:
- Ref[T], a mutable pointer that dereferences to Option[T].
- Array[T], a mutable, resizable sequence of elements of type T.
- Table[K,V], a mutable, resizable mapping from a key of type K to a value of type V. If K is an Object type, the Table hashes the identity of the object, not its value.
Because they have identity and mutable state, all object instances stay local to only one thread to reduce synchronization. Migrating between threads requires them to serialize their state, reinstantiate themselves on a new thread, and which most applications would not need or want.
Routine
Routine types have a signature of input types and output types. (Plural). Routines may be further categorized into:
- Pure Functions, which manipulate only immutable types and must have a return value.
- Predicates, which return only a boolean value.
- Functions which return at least one value.
- Procedures, which return no values.
(Note these categories are non-exclusive: A Routine may be a Pure Function, a Predicate, and a Function.)
Each Routine may also have:
- zero or more preconditions, assertions on the routine’s arguments that must all must pass for the Routine to produce sensible results.
- zero or more postconditions, assertions on the routine’s arguments and results that all must pass to verify the Routine produced the expected results.
Why preconditions? Multiple routines might have the same signature, yet take very different data values. Design by Contract, while somewhat redundant with software unit testing, attaches assertions to the code itself rather than to external code. The code to test assertions can be disabled at compile-time or runtime.
All threads share the code and metadata of Routines, or at least as much as they need to.
Protocol
A Protocol consists of a name, zero or more inherited Protocols, zero or more Constants, and a list of Message definitions. Each Message definition, in turn, consists of:
- a name, implemented as a unique string
- a signature, as defined above.
- zero or more preconditions, as defined above.
- zero or more postconditions, as defined above.
Each Protocol also includes zero or more invariants, which are preconditions and postconditions on an implementation’s entire observable state.
Semantically, any Routine used to implement a Message must conform to the preconditions and postconditions. They may widen the preconditions and narrow the postconditions, but not the other way around.
Preconditions, postconditions, and invariants must be stored in the Type_Def as a parse tree of a boolean expression, so that clients using only a compiled library can insert those assertions into their own code.
All threads share the metadata of Protocols, although keeping multiple distributed processes in sync can be a challenge.
Class
A Class, as stated previously, consists of one or more Protocols, zero or more generic type parameters (bound or filled), a set of Routines to create Class instances, and (effectively) a table that maps a Message to a “Feature”3, which is either a Routine or an instance variable
Note that classes do not inherit from each other. At all. This forces programmers to choose composition over inheritance … a bit draconian, but it makes implementation much easier.4
All Classes have at least one Protocol. If they do not explicitly inherit from one, the syntax will allow the source code to designate certain features “public”, creating a Protocol with the same name as the Class. Otherwise, all class features are “private”, accessed directly or indirectly through Protocol messages.5
All threads keep the exact implementation of Classes private, although in practice threads will share Class metadata and code.
The Plan
Rereading my earlier blathering, I still think the following path is the best way to sneak up on this beast, if that’s what I want:
-
Design a CORBA-lite/Objective-C-esque object model for Object, Protocol, and Class portions of Teufel.
-
Write a compiler that generates Protocol code for said object model. Protocols intially will be implemented in another language, using glue code and C header files.
-
Write a runtime for the “atomic types”: String, Number, Timestamp, etc. Also implement some sort of garbage collection, which might be as simple as linking in the BDW collector.
-
Extend the compiler to express Datatypes and declare Routines implemented in another language.
-
Add assertions to Protocols and Routine declarations.
-
Since I’ll have to add expressions in order to implement assertions, I can hopefully extend the language to write functions and Routines that operate on Datatypes, Object references, and Routines.
-
At that point I can worry about how to implement Classes completely in Teufel, as well as a clean mechanism to call out of Teufel into C (and maybe Rust).
-
Close the holes in my type system using generics and a mini-language for expressing types, much like TypeScript and Java do. (Minus the type erasure: data structures and especially objects will treat type parameters as actual parameters, and by default check types every time a Datatype is created or an Object is mutated.)
-
With all that, I can reimplement a compiler in Teufel, which is apparently a milestone in Turing completeness or some such.
-
At that point I can revisit the threading / distributed execution implementation, implement a packaging and module system, and do all the other bits that would make Teufel not just another language. (Maybe I will have done Lua-CSP byt that time, so I can just replace the Lua interpreters with Teufel nodes.)
If I pursue this course, it will probably take me, at my current rate … a decade? A decade and a half? For something nobody asked for?
The Runtime Model
Teufel will compile to machine code, probably by first rendering the program into C. (Or sometimes Java. Maybe Rust?) Unlike C++ and other low-level object-oriented languages, the runtime will include enough structures to allow for higher-order functions, type-safe generic datatypes, and dynamic dispatch on objects.
Runtime Any
Tags
Each datatype instance will carry a “tag” – an index to a metadata table –
identifying what it is. For type safety, each routine manipulating a
datatype can check that tag to determine how to handle it, if the routine
should be handling it at all. All Routines will have a similar tag to
indicate the runtime should check its metadata for arguments and results.
All Objects will have tags denoting the specific Object_Def
metadata that
resolves that instance’s generic type parameters for all messages it supports.
Tags and type-checking can be turned off with a compiler flag to optimize the code, at some cost to safety.
Runtime Datatypes
Datatypes will be implemented internally as C-style struct
s or primitive
types in the underlying implementation language.
Runtime Numbers
Since the full numeric stack, not to mention Bit_Set
and Timestamp
,
may require overloading some operators, the language and runtime will have
to support one of the following solutions:
-
A Numeric Protocol, which allows any Class to behave as a
Number
by implementing a protocol. This breaks the separation between Datatypes and Objects, however, so I’m not in favor of it. -
Operator Overloading, in which a datatype can redefine the meaning of mathematical operators involving instances of itself and other Numeric types. This, however, can lead to travesties like C++ overloading the bit-shift operator to send data to a stream.
-
Explicit metatables like Lua, in which the datatype registers a function to act as its implementation or “+”, “-”, “<”, etc. This mechanism will be awkward, and may lead to ambiguity, e.g. if both
Complex
andRational
both have their own metatable entries for “+”. -
Special cases for built-in Numeric types, which will be added as the implementation progresses. This will be painful, particularly since some “atomic” datatypes like
Bit_Set
,Complex
,Rational
, andTimestamp
could be implemented entirely in a sufficiently advanced Teufel language. -
Promotion rules in which a
Fixed_Integer
becomes aBig_Integer
, anInteger
becomes aDecimal
, aDecimal
becomes aRational
, etc.Float
drops a fly into this ointment, since it can’t properly convert into aRational
. Do we need to add Scheme’s “exact” and “inexact” markers? Yikes. -
Demote
Decimal
,Rational
, andComplex
to ordinary Datatypes which rely on functions likedecimal_add
,decimal_subtract
, etc. plus explicit conversions. This is extremely awkward.
Runtime Routines
Routines will push tagged datatypes onto a call stack (not necessarily the C stack), manipulate them with compiled code, and push the results onto the same stack. (One could also use a virtual machine here, but I’d rather compile to real machine code.)
Routines will also include closures, which capture the definitions of variables lexically defined outside them but used in the procedures. The runtime will also support what Python calls “partials”, that is derived functions based on an existing function with arguments filled in.
In Teufel, Routines will be first class values.
Runtime Objects, Classes, and Protocols
To a certain extent objects in Teufel will resemble objects in Objective-C, Python, Ruby, and many other dynamic object-oriented languages. Each Object instance has a pointer to its Class. When a message comes in, the Class locates the correct implementation of the message, marshals arguments onto the call stack and calls the appropriate Routine or derefrences an instance variable. Once the call completes, it places the result(s) on the call stack, and continues execution of the caller.
In later versions of the compiler, it will detect that certain classes or instances are unreachable from outside a module and optimize this hashtable lookup into pure machine code using simple conditionals, as if it were a Datatype.
Runtime Processes and Threads
Every thread boundary, process boundary, or foreign runtime (e.g. C to Rust or C to Python) forms a Domain on either side of the boundary. All messages, datatypes, routine references, and object references have to be marshalled across the boundary through an interthread, interprocess, or runtime-based Channel. (Routines themselves cannot be marshalled unless the code is already present on both sides of the channel.)
To at least detect Protocol version mismatches, if not adapt to them, all messages will contain metadata, rather than a solid block of bytes. When serialized over a Channel messages, datatypes, and object/routine references will be translated into JSON, BSON, MessagePack or a similar interoperable format, translating structures as necessary. For example, JSON “Arrays” will replace tuples and lists, and JSON “Objects” will replace the rest of the struct according to the data structure’s metadata.
Since only Objects may reside on different threads, the Message Call syntax may be augmented to specify that a message is oneway or that the sender will only wait only a specified time for a response. The type checker will need to be aware whether an object reference is a remote object or not, maybe through a special keyword. Worst case, we may have to use two different syntaxes for intra-Domain and inter-Domain communication, e.g. JavaScript’s async/await.
-
After Eiffel’s ACE (Assembling Classes in Eiffel) file. ↩︎
-
A value that can be encoded in a pointer, rather than having a pointer to heap (or stack?) memory. (A trick from Ruby, and undoubtedly many Scheme and Smalltalk interpreters.) ↩︎ ↩︎ ↩︎ ↩︎
-
A term borrowed from Eiffel, Teufel’s sort of namesake. ↩︎
-
One reason C++ puts member variables in its header files is that anything that inherits from a C++ class has to reserve room for the superclass variables. (Objective-C had the same problem.) With no implementation inheritance, that problem goes away. Teufel datatypes would compile into C as
struct
s, protocols and classes into runtime data structures with associated code, and object instances into mutablestruct
s with a pointer to the class dispatch table. ↩︎ -
Java classes, for example, also have
protected
andpackage private
members to define access to class members from subclasses and from members of the same package. I’m disallowing subclasses, so the first is not needed. As for the second, I’d prefer to restrict visibility through an external “module” system analogous to Java 9 modules over something like Java packages or Eiffel’s potentially chaotic visibility through naming specific classes in the exporting code itself. ↩︎