Teufel Again | Frank Mitchell's Blog

MODIFIED (2024-11-22): Minor edits, further thoughts.

A video excerpt of Robert “Uncle Bob” Martin prompted some thoughts about the programming language I will probably never write, Teufel.

Encapsulation

In the video, Martin speaks of how the C programming language had strict encapsulation: a .h header file that provided the interface and a .c code file that provided the implementation. He then laments that nearly all object-oriented languages jumble interface and implementation together in one (or, worse, two) files, with access modifiers to make certain members “public”, “private”, “protected”, or other. (In Java that’s “package-private”.)

In the old days Martin was a force for greater professionalism in software development, but these days he comes off as a bit of a curmudgeon.

Nevertheless, I think he has a point. I’ve found myself writing Java libraries defined by publicly accessible interfaces and private implementations. As soon as I master the Java module syntax I’ll make the latter inaccessible from external code. Maybe it’s simply old C programming habits. Still, if you want truly modular code, the technique of hiding the gory details under a clean abstraction seems like the safest path.

Interface Inheritance

Java isn’t the only language with interfaces. Python has a sort of “duck typing” interface, and as I understand it Go was built with interfaces that a later programmer can retrofit onto existing code.

Python interfaces are frustrating, however, since there’s no necessary or explicit connection between an implementation class and its “protocol”. Furthermore, Python leaves all protocol conformance to external tools. Then again, all of Python’s type checking relies on external tools, so something like

foo: int = "this is not an int"

will load and run just fine, even if mypy and other tools squawk.¹

Honestly, I wish there were an interpreter mode that would insert assertions around typed code, so that any time you assigned a string to a variable declared as an int the runtime would throw an exception. Maybe that’s impossible in the Python bytecode. The best I’ve found is the @runtime_checkable decorator for Protocols, which allows one to test for protocol conformance using isinstance().
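A minimal sketch of what @runtime_checkable buys you. The Quacker protocol and the Duck and Rock classes are hypothetical names made up for illustration. Note that the isinstance() test only confirms that a method with the right name exists; it does not check the signature or the return type.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Quacker(Protocol):
    """Hypothetical protocol: anything with a quack() method."""

    def quack(self) -> str: ...


class Duck:
    # No explicit link to Quacker; conformance is purely structural.
    def quack(self) -> str:
        return "Quack!"


class Rock:
    pass


# isinstance() checks only that a 'quack' attribute exists,
# not its signature or its return type.
duck_conforms = isinstance(Duck(), Quacker)
rock_conforms = isinstance(Rock(), Quacker)
```

Without the decorator, calling isinstance() against a Protocol raises a TypeError, so this is about as close as the standard library gets to a runtime conformance test.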

Data Classes

Another convenience of Python is “dataclasses”, which use a decorator to declare a class’s (public) members. There’s even an option to declare them immutable.
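A short sketch of the immutable option, using a hypothetical Point class. With frozen=True, assigning to a member raises FrozenInstanceError at runtime, and dataclasses.replace() builds a modified copy instead.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Point:
    """Hypothetical immutable record with two public members."""

    x: int
    y: int


p = Point(1, 2)

try:
    p.x = 10  # assignment to a frozen instance raises at runtime
    mutated = True
except FrozenInstanceError:
    mutated = False

# "Changing" a frozen instance means building a new one.
q = replace(p, x=10)
```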

A common pattern in multithreaded applications called Communicating Sequential Processes (CSP) divides a program into single-threaded islands of mutable data that send “messages” containing (notionally) immutable data to each other. The entire Erlang language relies on this paradigm, and parts of the Go and Clojure languages implement CSP. Functional or what I call “function-oriented”² languages do this kind of thing extremely well, because of their emphasis on functions without side effects and transparent data structures.

CPU manufacturers can only optimize their silicon so far; their solutions involve putting more cores on each chip. GPUs are faster, but they rely on massively parallel operations. All this points to a future where programs will become increasingly parallel, and CSP seems like an ideal strategy to cut down on synchronization locks and automate the writing of high-performance programs. In traditional multithreading, synchronization happens on each block of mutable data as an arbitrary number of threads contend to change it. In CSP, synchronization happens only on either end of the pipe or queue between two processes.
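A rough sketch of the CSP style in Python, assuming standard-library threads and a queue.Queue standing in for the channel. The producer, consumer, and run_pipeline names and the DONE sentinel are inventions of this example. The only synchronization lives inside the queue; each thread otherwise touches only its own data.

```python
import queue
import threading

# Sentinel marking the end of the message stream (an assumption of this sketch).
DONE = None


def producer(out: queue.Queue, count: int) -> None:
    # Sends immutable messages (tuples); owns no shared mutable state.
    for i in range(count):
        out.put(("square", i))
    out.put(DONE)


def consumer(inbox: queue.Queue, results: list) -> None:
    # 'results' is touched only by this thread until it finishes.
    while True:
        message = inbox.get()
        if message is DONE:
            break
        _tag, n = message
        results.append(n * n)


def run_pipeline(count: int) -> list:
    channel: queue.Queue = queue.Queue()
    results: list = []
    workers = [
        threading.Thread(target=producer, args=(channel, count)),
        threading.Thread(target=consumer, args=(channel, results)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()  # after join, reading 'results' here is safe
    return results
```

Calling run_pipeline(5) squares the numbers 0 through 4 on the consumer thread without a single explicit lock in user code.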

As Python slowly retires its Global Interpreter Lock (GIL) maybe it will handle CSP-style applications efficiently. Maybe not.

Enter Teufel?

As I may have mentioned before, this language I may never write would gore some sacred cows of object orientation for the sake of enhancing encapsulation, ensuring type safety, and embracing a multi-threaded world.

Runtime entities would have an explicit public protocol and potentially hidden implementation details. (Details might be hidden by the programmer or by the language itself, e.g. memory management, implementation of built-in types.)
The compiler will assign a type to every runtime entity and every variable. (Ideally it will use implicit typing and type inference as much as possible, but some explicit type declarations will be necessary.) The compiler will perform type checking, but for extra security a compile-time switch can insert assertions of a runtime entity’s type.
To facilitate exact typing, the language will implement generic types and if necessary a type language for corner cases. Use of an “Any” type will be possible, but increasingly discouraged.
All declarations will be local to the enclosing scope (i.e. lexical scoping) unless declared otherwise. (“Globals” really complicate type checking.)
Constant declarations will be the default, but the syntax for declaring variables will not be onerous.
Operations called on a variable or constant must be defined over its type, or it will be a compile-time error.
A special assignment operator can narrow the type of an object at runtime to the type of the variable, or else return a none value if the types are incompatible. A second operator can test the type, which can be used in conditional expressions to narrow the type. I’m not sure which method would work better – type tests are more function-oriented, but assignment might suit procedural styles better – so I’d probably implement both.³
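The two narrowing styles in the last item can be approximated in Python. This is only a sketch: narrow() is a hypothetical helper standing in for the narrowing assignment operator, Python’s None stands in for the “none value”, and isinstance() plays the role of the type-test operator.

```python
from typing import Any, Optional, Type, TypeVar

T = TypeVar("T")


def narrow(value: Any, target: Type[T]) -> Optional[T]:
    """Return value if it is an instance of target, else None."""
    return value if isinstance(value, target) else None


def describe(value: Any) -> str:
    # Style 1: narrowing assignment, then a test for the "none" case.
    if (n := narrow(value, int)) is not None:
        return f"int {n + 1}"
    # Style 2: a type test inside a conditional narrows the type.
    if isinstance(value, str):
        return f"str {value.upper()}"
    return "unknown"
```

Either way, the failure case is handled locally rather than by catching an exception.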

The Teufel Type System

The Teufel type system would include these not quite disjoint types.

Any

Any is the supertype of all runtime entities below. If a type is declared Any, programs would need to narrow the type before they could use the value.

None

None is the type with no values. It’s roughly equivalent to void in C or to a None return annotation in Python. If the syntax requires a type but a Routine returns nothing, it’s implicitly or explicitly None.

The implementation may use a placeholder value equivalent to an empty Tuple or empty List in some cases. It will not have a single explicit null, nil, None, or undefined value; that will be left to Datatypes.

Datatype

Datatypes represent immutable data structures with public members. Syntactically a Datatype can only consist of references to Objects or other datatypes. Some may be “polymorphic” in the sense that a C union is polymorphic; all alternate forms are defined in the same spot.

Important subtypes of Datatype include built-in “atomic types”, e.g. Numbers, Strings⁴, Booleans, Bit Sets⁵, and Timestamps⁶. I might further subdivide those basic types into implementation-based subtypes, e.g. Fixed_Integer vs. Big_Integer⁷ vs. Float, or even restrict some variables to certain ranges of certain types as in Pascal. At runtime, though, operations on any atomic types produce a value general enough to accommodate the result.

Other built-in Datatypes might include:

List[T], which has the subtypes List.Empty and List.Cons(T, List[T]). It represents a (linked) sequence of immutable data.
Option[T], which has the subtypes Option.None and Option.Some[T]. It represents an immutable value which may be “null”.
Result[T, E], which has the subtypes Result.Ok[T] and Result.Err[string, E]. It represents a return value which may result in an error.
Tuple[T1, T2, …], which has innumerable subtypes. It represents a fixed-sized sequence of immutable data.

Except for Tuple, all of these (hopefully) can be defined within the language itself.
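To give a feel for how such a Datatype might be defined in-language, here is an approximation of Result[T, E] using frozen Python dataclasses. This is a sketch, not Teufel syntax: Ok and Err mirror Result.Ok[T] and Result.Err[string, E], while the field names and the safe_div and show functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

T = TypeVar("T")
E = TypeVar("E")


@dataclass(frozen=True)
class Ok(Generic[T]):
    value: T


@dataclass(frozen=True)
class Err(Generic[E]):
    message: str  # the "string" part of Result.Err[string, E]
    detail: E


def safe_div(a: int, b: int) -> Union[Ok[float], Err[int]]:
    # Report failure as a value instead of raising an exception.
    if b == 0:
        return Err("division by zero", a)
    return Ok(a / b)


def show(result: Union[Ok[float], Err[int]]) -> str:
    # The caller must look at which alternative it received.
    if isinstance(result, Ok):
        return f"ok: {result.value}"
    return f"error: {result.message} (numerator {result.detail})"
```

Option[T] would follow the same pattern, with a Some alternative carrying a value and a None alternative carrying nothing.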

Because they’re immutable, threads can copy or pass datatypes, both their values and their metadata, freely among themselves.

Object

Object types have an identity and contain mutable data. An Object’s full type is defined by exactly one Class and one or more Protocols. Built-in Object types include:

Ref[T], a mutable pointer that dereferences to Option[T].
Array[T], a mutable, resizable sequence of elements of type T.
Table[K,V], a mutable, resizable mapping from a key of type K to a value of type V. If K is an Object type, the Table hashes the identity of the object, not its value.

Because they have identity and mutable state, all object instances stay local to only one thread to reduce synchronization. Migrating between threads requires them to serialize their state and reinstantiate themselves on a new thread, which most applications would not need or want.

Routine

Routine types have a signature of input types and output types. (Plural). Routines may be further subdivided into:

Pure Functions, which manipulate only immutable types and must have a return value.
Predicates, which return only a boolean value.
Functions which return at least one value.
Procedures, which return no values.

Each Routine may also have:

zero or more preconditions, assertions on the routine’s arguments that all must pass for the Routine to produce sensible results.
zero or more postconditions, assertions on the routine’s arguments and results that all must pass to verify the Routine produced the expected results.

Why preconditions? Multiple routines might have the same signature, yet take very different data values. Design by Contract, while somewhat redundant with software unit testing, attaches assertions to the code itself rather than to external code. The code to test assertions can be disabled at compile-time or runtime.
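A minimal sketch of Design by Contract attached to the code itself, written as a Python decorator. Everything here is an assumption for illustration: the contract decorator, the CHECK_CONTRACTS switch (standing in for the compile-time or runtime option), and the integer_sqrt routine.

```python
import functools
import math
from typing import Callable

# Hypothetical switch; set to False to skip all contract checks.
CHECK_CONTRACTS = True


def contract(require: Callable[..., bool], ensure: Callable[..., bool]):
    """Attach one precondition and one postcondition to a routine."""

    def decorate(routine):
        @functools.wraps(routine)
        def wrapper(*args):
            if CHECK_CONTRACTS and not require(*args):
                raise AssertionError(f"precondition failed: {routine.__name__}{args}")
            result = routine(*args)
            # The postcondition sees the result first, then the arguments.
            if CHECK_CONTRACTS and not ensure(result, *args):
                raise AssertionError(f"postcondition failed: {routine.__name__}{args}")
            return result

        return wrapper

    return decorate


@contract(
    require=lambda n: n >= 0,
    ensure=lambda result, n: result * result <= n < (result + 1) * (result + 1),
)
def integer_sqrt(n: int) -> int:
    return math.isqrt(n)
```

Two routines with the signature Integer to Integer can now differ visibly in what values they accept, which a bare signature cannot express.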

All threads share the code and metadata of Routines, or at least as much as they need to.

Protocol

A Protocol consists of a name, zero or more inherited Protocols, zero or more Constants, and a table of Messages. Each Message, in turn, consists of:

a name, implemented as a unique string
a signature, as defined above.
zero or more preconditions, as defined above.
zero or more postconditions, as defined above.

Each Protocol also includes zero or more invariants, which are preconditions and postconditions on an implementation’s entire observable state.

Semantically, any Routine used to implement a Message must conform to the preconditions and postconditions. It may widen the preconditions (accept more inputs) and narrow the postconditions (promise more about its results), but not the other way around.

All threads share the metadata of Protocols, although keeping multiple distributed processes in sync can be a challenge.

Class

A Class, as stated previously, consists of one or more Protocols, zero or more generic type parameters (bound or filled), a set of Routines to create Class instances, and (effectively) a table that maps a Message to a “Feature”⁸, which is either a Routine or an instance variable.

Note that classes do not inherit from each other. At all. This forces programmers to choose composition over inheritance … a bit draconian, but it makes implementation much easier.⁹
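What this looks like in practice can be sketched in Python with a structural Protocol and a class that wraps, rather than extends, its implementation. The Stack protocol, the CountingStack class, and the drain function are hypothetical names for this example.

```python
from typing import List, Protocol


class Stack(Protocol):
    """Hypothetical public protocol: the only thing callers see."""

    def push(self, item: int) -> None: ...

    def pop(self) -> int: ...


class CountingStack:
    """Conforms to Stack without inheriting from any implementation class."""

    def __init__(self) -> None:
        self._items: List[int] = []  # composed, hidden implementation detail
        self.pushes = 0

    def push(self, item: int) -> None:
        self.pushes += 1
        self._items.append(item)  # delegate to the wrapped list

    def pop(self) -> int:
        return self._items.pop()


def drain(stack: Stack, count: int) -> List[int]:
    # Depends only on the protocol, never on the concrete class.
    return [stack.pop() for _ in range(count)]
```

Reuse comes from holding a list and forwarding to it, so the class layout never depends on a superclass.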

All Classes have at least one Protocol. If they do not explicitly inherit from one, the syntax will allow the source code to designate certain features “public”, creating a Protocol with the same name as the Class. Otherwise, all class features are “private”, accessed directly or indirectly through Protocol messages.¹⁰

All threads keep the exact implementation of Classes private, although in practice threads will share Class metadata and code.

Back to Reality?

Maybe there are existing languages like OCaml with enough features that I don’t have to write my own. Knowing me, though, I probably won’t be satisfied until I write my own weird Eiffel / Lua / SML / Objective-C / etc. hybrid.

Postscript: Naming Conventions

Today I just saw this video in the same series as the Uncle Bob one. I couldn’t agree more. If I ever write Teufel, the coding convention will be something like this¹¹:

protocol Some_Example                   -- Type name (Title Case)

    SOME_CONSTANT: Integer = 100        -- CONSTANT (all caps)

    some_property: Integer              -- property message (snake case)

    set_some_property(value: Integer)   -- property set message (snake case)
        alias
            "some_property="            -- Python/Ruby property setter?
        require
            maximum: value <= SOME_CONSTANT
            minimum: value >= 0
        ensure
            new_value: some_property = value

end

I never liked Java’s getX()/setX() convention. I guess it flagged methods as a property pair, but properly speaking a “property” is a message/function with no arguments (apart from the object itself) and (at least?) one return value. That’s what Python, Ruby, Eiffel, and other languages use. How much harder would it be to search for a second routine to see if the property is directly mutable? (As opposed to mutable through other methods.)
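For comparison, here is roughly how the protocol sketch above might map onto a Python property. This is an approximation under stated assumptions: SomeExample is a hypothetical class, plain assert statements stand in for the require and ensure clauses, and Python disables them when run with -O.

```python
class SomeExample:
    SOME_CONSTANT = 100

    def __init__(self) -> None:
        self._some_property = 0

    @property
    def some_property(self) -> int:
        # The property: a message with no arguments and one return value.
        return self._some_property

    @some_property.setter
    def some_property(self, value: int) -> None:
        # 'require' clauses from the protocol sketch.
        assert value <= self.SOME_CONSTANT, "maximum"
        assert value >= 0, "minimum"
        self._some_property = value
        # 'ensure' clause.
        assert self._some_property == value, "new_value"
```

The reader and writer share one name, and the presence of a setter is what tells you the property is directly mutable.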

Anyway, I’ve only watched a few videos in the series, and I don’t remember seeing a disappointing one. (Maybe because they’re too short?) Even if you disagree, you have to think about why you disagree.

Postscript 2: Other Languages

Many other languages provide some of the features I’m cramming into my vaporware product Teufel:

CORBA, not so much a language as a network protocol, defines an object model in which each message to an object includes a fully qualified interface name and the name of the message.
Erlang essentially embodies CSP and the concept of serialized messages over channels between threads. The data model is fairly simple: numbers, “atoms” (symbols), tuples, (immutable) maps, lists, strings, records (tuples with named members), and some constructs to manage ports and channels. (IIRC, every thread also has a mutable hashmap.)
Go can retrofit an interface onto an existing data structure, perhaps in ways beyond Python’s duck-typed protocols. (I only barely scratched the surface.)
Lua demonstrates that an interpreted scripting language can be fast, simply by using a data model of hashtables, “userdata” references, and atomic values.
Objective-C, the first O.O. language I ever learned, binds “messages” to unique strings called selectors, and uses a hashtable lookup to find an implementation.
Rust functions avoid throwing exceptions by returning an object that contains either the requested result or an error message, which gets passed up the call stack to someone who can handle it. Unlike exceptions, flow of control passes naturally, so every function must either pass the error on or explicitly swallow it.
SML/NJ was my second function-oriented language after Lisp/Scheme, and influenced a lot of my thinking about data structures in functional languages.

Originally I planned to implement Teufel in C with an initial Java compiler (to take advantage of ANTLR), but I might want to use one of the languages above as an implementation language. Go in particular is garbage-collected so I don’t have to write my own collector, Erlang has its threading model and GC, and Rust is supposed to be both type-safe and fast, neither of which are C’s strong points.

So many languages, so little time.

¹ I haven’t gotten far into TypeScript, but I wonder how well it works given that the DOM and other APIs are themselves untyped.
² As a parallel to “object-oriented”. At a certain level of abstraction and sophistication, functions and objects begin to resemble each other. Is it an object that accepts “messages” and manipulates instance variables, or is it a closure on a function whose first argument is a “message” that manipulates captured variables?
³ I’m not enamored with the Java “type cast” expression: it throws an exception rather than handling the common “false” case locally.
⁴ Immutable Unicode strings, with the basic operations of concatenation, indexing, and Python-like slicing. Notionally, as in Python, each multi-character String is made up of one-character Strings.
⁵ A potentially infinitely sized bit mask.
⁶ A distinct UNIX-like timestamp type, because a) every language acquires a “date”, “time”, or “datetime” type sooner or later, but b) dates and calendars are hard.
⁷ Infinite precision number, built in from the start.
⁸ A term borrowed from Eiffel, Teufel’s sort of namesake.
⁹ One reason C++ puts member variables in its header files is that anything that inherits from a C++ class has to reserve room for the superclass variables. (Objective-C had the same problem.) With no implementation inheritance, that problem goes away. Teufel datatypes would compile into C as structs, protocols and classes into runtime data structures with associated code, and object instances into mutable structs with a pointer to the class dispatch table.
¹⁰ Java classes, for example, also have protected and package private members to define access to class members from subclasses and from members of the same package. I’m disallowing subclasses, so the first is not needed. As for the second, I’d prefer to restrict visibility through an external “module” system analogous to Java 9 modules over something like Java packages or Eiffel’s potentially chaotic visibility through naming specific classes in the exporting code itself.
¹¹ Actual syntax may start off a bit noisier, e.g. to differentiate constants, protocol Messages / class implementations, and instance variables, all of which have significant implementation differences.