TDMA Routing on a Hypercube

Somehow, I invented a better connection machine.

My Thinkin’ Machine has 4,096 processors arranged at the vertices of a 12-dimensional hypercube. Each processor connects to 12 neighbors. There are no dedicated routers. Messages traverse the network using a time-division multiple access (TDMA) scheme. This message passing scheme has several interesting properties that turn my machine into something interesting:

Deterministic link ownership (no collisions on a link, ever)
Dimension-ordered routing baked into time (deadlock-avoidance and routing simplicity fall out)
Bounded latency that’s a function of the superframe length (24 phases) and phase duration (the entire machine is synchronous)

This page explains how.

The Problem

Nodes must pass messages to each other, but the naive way of doing this – blasting out data and hoping everything works – has significant drawbacks. Consider the constraints:

4,096 nodes, each with 12 single-wire, half-duplex links to neighbors
No dedicated router hardware routing logic runs on the node processors themselves
Single wire connection links between nodes are a single wire, open drain.
Single hardware UART per node must be remapped to different pins for different dimensions
No collision detection or backoff the links can’t do CSMA elegantly

The original Connection Machine CM-1 solved this with dedicated routing silicon. This was an exceptionally difficult problem because Thinking Machines had to pull a Nobel laureate in Physics out of retirement to get these routers to actually work. I do not have the luxury of building custom ASICs for this project; I’m building it with microcontrollers that cost less than a dollar.

The Insight

In a hypercube, each node’s address is a binary number. Node 0x2A3 is connected to every node that differs by exactly one bit: 0x2A2 (bit 0), 0x2A1 (bit 1), 0x2AB (bit 3), and so on. To route a message from source $S$ to destination $D$, XOR the addresses:

\[\Delta = S \oplus D\]

The set bits in $\Delta$ tell you which dimensions to traverse. If bit 7 is set, the message must cross dimension 7 at some point. The number of set bits equals the number of hops.

If you traverse dimensions in a fixed order, always dimension 0 before dimension 1 before dimension 2, and so on, you get deadlock-free routing for free. This is dimension-ordered routing, and it’s a known result in the literature (see Bertsekas 1991).

Divide that clock into $2n$ phases, where $n$ is the number of dimensions. Phase $2d$ activates dimension $d$ in the low→high direction; phase $2d+1$ activates the same dimension in the high→low direction.

Now collapse this back to hardware. We have a (slowish) global clock. Each chip has an binary address. On clock phase $2d$, if that chips address bit $d$ is $0$, we map the pin for dimension $d$ connections to the hypercube UART transmit. On clock phase $2x+1$, and if that chips address bit $d$ is $1$, we map the pin for dimension $d$ connections to the UART receive.

We multiplex the UART for each chips hypercube connections through time. That’s it, that’s the entire trick. This requires a few things:

A slow global clock, on the order of 1 to 10kHz. This is fairly easy even in a large machine because dealing with a 1kHz clock is a piece of cake.
A microcontroller where UART peripherals can be remapped to any pin. This is difficult, but there are chips that can do it.
Optionally, and to make things very easy, a second ‘sync’ clock that ensures the phase number is the same on all chips.

TDMA example waveforms

The Phase Schedule

A single communication cycle consists of 24 phases:

Phase	Dimension	Direction	Transmitting Nodes
0	0	0→1	Nodes where bit 0 = 0
1	0	1→0	Nodes where bit 0 = 1
2	1	0→1	Nodes where bit 1 = 0
3	1	1→0	Nodes where bit 1 = 1
4	2	0→1	Nodes where bit 2 = 0
5	2	1→0	Nodes where bit 2 = 1
…	…	…	…
22	11	0→1	Nodes where bit 11 = 0
23	11	1→0	Nodes where bit 11 = 1

In phase $2d$, nodes with address bit $d = 0$ transmit to their neighbor in dimension $d$. In phase $2d + 1$, nodes with address bit $d = 1$ transmit. Every link is active exactly twice per cycle — once in each direction — and there is never contention for a link.

Dimension-Ordered Routing

Given source $S$ and destination $D$, compute $\Delta = S \oplus D$. The message traverses dimensions in order, using the appropriate phase for each hop.

For each dimension $d$ from 0 to 11:

If bit $d$ of $\Delta$ is not set, skip this dimension
If bit $d$ of the current address is 0, transmit in phase $2d$
If bit $d$ of the current address is 1, transmit in phase $2d + 1$

Because dimensions are traversed in order and phases are ordered by dimension, the sequence of phases used is always monotonically increasing. A message never waits for a phase to “come around again.”

Worked Example & Bounded Latency

An Average-Case Routing Task: Route a message from node 0x2A3 (binary 0010 1010 0011) to node 0x91C (binary 1001 0001 1100).

\[\Delta = \texttt{0x2A3} \oplus \texttt{0x91C} = \texttt{0xB3F} = \texttt{1011 0011 1111}_2\]

The set bits are: 0, 1, 2, 3, 4, 5, 8, 9, 11. That’s 9 hops.

Hop	Current Node	Dimension	Current Bit	Phase	Next Node
1	0x2A3	0	1	1	0x2A2
2	0x2A2	1	1	3	0x2A0
3	0x2A0	2	0	4	0x2A4
4	0x2A4	3	0	6	0x2AC
5	0x2AC	4	0	8	0x2BC
6	0x2BC	5	1	11	0x29C
7	0x29C	8	0	16	0x19C
8	0x19C	9	1	19	0x11C
9	0x11C	11	0	22	0x91C

Phase sequence: 1, 3, 4, 6, 8, 11, 16, 19, 22. Strictly increasing. Message delivered in one cycle.

Worst-Case Routing: Route a message from node 0xFFF (binary 1111 1111 1111) to node 0x000 (binary 0000 0000 000). This is the maximum possible Hamming distance on a 12-bit address space. Every bit differs.

\[\Delta = \texttt{0xFFF} \oplus \texttt{0x000} = \texttt{0xFFF} = \texttt{1111 1111 1111}_2\]

Because the source starts with all bits = 1, every hop uses the odd phase for that dimension (the 1→0 direction): phase $2d+1$.

Hop	Current Node	Dimension	Current Bit	Phase	Next Node
1	0xFFF	0	1	1	0xFFE
2	0xFFE	1	1	3	0xFFC
3	0xFFC	2	1	5	0xFF8
4	0xFF8	3	1	7	0xFF0
5	0xFF0	4	1	9	0xFE0
6	0xFE0	5	1	11	0xFC0
7	0xFC0	6	1	13	0xF80
8	0xF80	7	1	15	0xF00
9	0xF00	8	1	17	0xE00
10	0xE00	9	1	19	0xC00
11	0xC00	10	1	21	0x800
12	0x800	11	1	23	0x000

Phase sequence: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23. Strictly increasing. Message delivered in one cycle.

Now route the opposite direction from node 0x000 (binary 0000 0000 000) to node 0xFFF (binary 1111 1111 1111).

\[\Delta = \texttt{0xFFF} \oplus \texttt{0x000} = \texttt{0xFFF} = \texttt{1111 1111 1111}_2\]

Because the source starts with all bits = 0, every hop uses the even phase for that dimension (the 0→1 direction): phase $2d$.

Hop	Current Node	Dimension	Phase	Next Node
1	0x000	0	0	0x001
2	0x001	1	2	0x003
3	0x003	2	4	0x007
4	0x007	3	6	0x00F
5	0x00F	4	8	0x01F
6	0x01F	5	10	0x03F
7	0x03F	6	12	0x07F
8	0x07F	7	14	0x0FF
9	0x0FF	8	16	0x1FF
10	0x1FF	9	18	0x3FF
11	0x3FF	10	20	0x7FF
12	0x7FF	11	22	0xFFF

Phase sequence: 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22. Strictly increasing. Message delivered in one cycle.

Bounded Latency

The maximum number of hops in a 12-dimensional hypercube is 12 (all bits differ). The maximum number of phases is 24. Every message — regardless of source, destination, or network load — completes within one 24-phase cycle. This isn’t worst-case latency, it’s the latency for every possible message between every possible pairs of nodes in the hypercube.

Wall-clock latency depends on phase duration:

Phase Rate	Phase Duration	Cycle Duration	12-Hop Latency
1 kHz	1 ms	24 ms	24 ms
10 kHz	100 µs	2.4 ms	2.4 ms
100 kHz	10 µs	240 µs	240 µs

At 10 kHz phase rate with 1 Mbps baud, each phase carries ~100 bits — enough for a 10-byte payload plus header. The entire machine can exchange data at 2+ Gbps aggregate bandwidth with deterministic 2.4 ms latency.

What Falls Out

The Connection Machine was famous beyond its blinkenlights for what fell out of the architecture. A bitonic sort algorithm is perfect for hypercube clusters of computers. Using the NEWS grid of the Connection Machine, intelligence services had the fastest image processing machine on the planet. This TDMA scheme is the same thing; really cool stuff just falls out of the architecture:

No collision detection. Two nodes never try to transmit on the same link simultaneously.

No backoff or retries. Collisions don’t happen, so recovery logic doesn’t exist.

No buffering beyond one message. A node receives at most one message per link per phase. The “buffer” is a single register.

No deadlock. Dimension-ordered routing is provably deadlock-free, and the phase schedule enforces dimension ordering automatically.

Trivial routing logic. The algorithm at each node:

void route_message(uint16_t dest, uint8_t *payload) {
    uint16_t delta = my_addr ^ dest;
    for (int dim = 0; dim < 12; dim++) {
        if (delta & (1 << dim)) {
            int phase = 2 * dim + ((my_addr >> dim) & 1);
            wait_for_phase(phase);
            transmit(dim, payload);
            return; // next hop handles remaining dimensions
        }
    }
}

That’s it. The complexity is in the schedule, not the silicon.

Why This Works Here

This scheme exists because of constraints. The project couldn’t afford dedicated router ASICs. I couldn’t find a microcontroller with 12 hardware UARTs. The one UART I had could be remapped to different pins at runtime. TDMA solves all of these problems simultaneously. One UART handles all 12 dimensions because we’re only using one dimension at a time. There’s no router hardware because the TDMA schedule is the router.

The original Connection Machine CM-1 used dedicated routing hardware because the constraints were different: custom silicon, DARPA budget, and a goal of supporting arbitrary asynchronous communication patterns. The routers were complex because they were solving the general case. My workloads are synchronous. Compute, exchange, compute, exchange. For that pattern, the general case is unnecessary overhead. Lockstep TDMA is a feature, not a limitation.

There is one gigantic, glaring problem with this idea. A naive interpretation of this routing scheme would complain that I just reduced the potential speed of the machine by a factor of 24. This is true, but it’s lacking context. This paper from NASA Ames in 1990 assessed the Connection Machine from a programming standpoint, and the takeaway is not good.

The NASA assessment of the Connection Machine says, “irregular communication through the router is very expensive and should be sparingly used.” Also, _“The router is the key to the CM. The current machine has a router that runs at roughly 1% of floating point speed. This makes router-intensive algorithms unattractive in terms of their performance.”

The CM-1’s general purpose router, the thing that made it flexible, was so slow that NASA’s assessment was basically, “don’t use it”. The machine could route messages arbitrarily and asynchronously but in practice nobody used it because it was a hundred times slower than compute.

The TDMA scheme gives up the theoretical capability of arbitrary asynchronous routing, but that was unusable to begin with. In exchange, I get structured communication that I can actually use every cycle.

My Context

This sounds smart, and it probably is, but I must admit I’m a bit out of my depth here. Not so much in that I couldn’t think of how to do this, but my depth doesn’t extend to why this wasn’t done before. I’ve checked the literature, and the ideas are there. Bertsekas goes through the dimension ordering of message passing through a hypercube, and Stout sort of, kind of, applies this ‘master scheduler clock scheme’. Kopetz says, ‘yeah man, just divide the communication up by time’. No one put these two ideas together to create a collision-free hypercube with bounded latency.

The original Connection Machine CM-1 solved a slightly different problem with its routers. The CM-1 solved general purposes communication under unpredictable traffic. This required buffers, arbitration, and deadlock avoidance. In the CM-1, there is a lot of silicon dedicated to simply passing messages between nodes.

This TDMA scheme requires effectively zero silicon, and only needs a relatively slow global clock. I lose the idea that ‘any node can talk to any dimension at any time’, but if you zoom out to the superset of the phase clock that doesn’t really matter at all. It’s just a small implementation detail.

Am I the first person to think of this? I have no idea.

References

D. P. Bertsekas et al., “Optimal Communication Algorithms for Hypercubes,” Journal of Parallel and Distributed Computing, 1991. PDF
Q. F. Stout, “Intensive Hypercube Communication: Prearranged Communication in Link-Bound Machines,” Journal of Parallel and Distributed Computing, 1990. Abstract
Kopetz, H., & Grünsteidl, G. (1994). “TTP—A Protocol for Fault-Tolerant Real-Time Systems.” IEEE Computer, 27(1), 14–23. PDF
Robert Schreiber, “An Assessment of the Connection Machine,” RIACS Technical Report 90.40 (June 1990) PDF

back to main project page

main