TDMA Routing on a Hypercube

Somehow, I invented a better connection machine.

My Thinkin’ Machine has 4,096 processors arranged at the vertices of a 12-dimensional hypercube. Each processor connects to 12 neighbors. There are no dedicated routers. Messages traverse the network using a time-division multiple access (TDMA) scheme. This message passing scheme has several interesting properties that turn my machine into something interesting:

This page explains how.

The Problem

Nodes must pass messages to each other, but the naive way of doing this – blasting out data and hoping everything works – has significant drawbacks.

Consider the constraints:

The original Connection Machine CM-1 solved this with dedicated routing silicon: buffers, arbitration logic, and deadlock avoidance built into custom ASICs. This is not an option when building a machine without routers, only using microcontrollers that cost under a dollar.

The Insight

In a hypercube, each node’s address is a binary number. Node 0x2A3 is connected to every node that differs by exactly one bit: 0x2A2 (bit 0), 0x2A1 (bit 1), 0x2AB (bit 3), and so on.

To route a message from source $S$ to destination $D$, XOR the addresses:

\[\Delta = S \oplus D\]

The set bits in $\Delta$ tell you which dimensions to traverse. If bit 7 is set, the message must cross dimension 7 at some point. The number of set bits equals the number of hops.

If you traverse dimensions in a fixed order — always dimension 0 before dimension 1 before dimension 2, and so on — you get deadlock-free routing for free. This is dimension-ordered routing, and it’s a known result in the literature (see Bertsekas 1991).

Divide that clock into $2n$ phases, where $n$ is the number of dimensions. Phase $2d$ activates dimension $d$ in the low→high direction; phase $2d+1$ activates the same dimension in the high→low direction. If link ownership is scheduled by dimension and direction, the routing algorithm collapses to: “wait for the right phase, then send.”

TDMA example waveforms

The Phase Schedule

A single communication cycle consists of 24 phases:

Phase Dimension Direction Transmitting Nodes
000→1Nodes where bit 0 = 0
101→0Nodes where bit 0 = 1
210→1Nodes where bit 1 = 0
311→0Nodes where bit 1 = 1
420→1Nodes where bit 2 = 0
521→0Nodes where bit 2 = 1
22110→1Nodes where bit 11 = 0
23111→0Nodes where bit 11 = 1

In phase $2d$, nodes with address bit $d = 0$ transmit to their neighbor in dimension $d$. In phase $2d + 1$, nodes with address bit $d = 1$ transmit. Every link is active exactly twice per cycle — once in each direction — and there is never contention for a link.

Dimension-Ordered Routing

Given source $S$ and destination $D$, compute $\Delta = S \oplus D$. The message traverses dimensions in order, using the appropriate phase for each hop.

For each dimension $d$ from 0 to 11:

Because dimensions are traversed in order and phases are ordered by dimension, the sequence of phases used is always monotonically increasing. A message never waits for a phase to “come around again.”

Worked Example & Bounded Latency

An Average-Case Routing Task: Route a message from node 0x2A3 (binary 0010 1010 0011) to node 0x91C (binary 1001 0001 1100).

\[\Delta = \texttt{0x2A3} \oplus \texttt{0x91C} = \texttt{0xB3F} = \texttt{1011 0011 1111}_2\]

The set bits are: 0, 1, 2, 3, 4, 5, 8, 9, 11. That’s 9 hops.

Hop Current Node Dimension Current Bit Phase Next Node
10x2A30110x2A2
20x2A21130x2A0
30x2A02040x2A4
40x2A43060x2AC
50x2AC4080x2BC
60x2BC51110x29C
70x29C80160x19C
80x19C91190x11C
90x11C110220x91C

Phase sequence: 1, 3, 4, 6, 8, 11, 16, 19, 22. Strictly increasing. Message delivered in one cycle.

Worst-Case Routing: Route a message from node 0xFFF (binary 1111 1111 1111) to node 0x000 (binary 0000 0000 000). This is the maximum possible Hamming distance on a 12-bit address space. Every bit differs.

\[\Delta = \texttt{0xFFF} \oplus \texttt{0x000} = \texttt{0xFFF} = \texttt{1111 1111 1111}_2\]

Because the source starts with all bits = 1, every hop uses the odd phase for that dimension (the 1→0 direction): phase $2d+1$.

Hop Current Node Dimension Current Bit Phase Next Node
10xFFF0110xFFE
20xFFE1130xFFC
30xFFC2150xFF8
40xFF83170xFF0
50xFF04190xFE0
60xFE051110xFC0
70xFC061130xF80
80xF8071150xF00
90xF0081170xE00
100xE0091190xC00
110xC00101210x800
120x800111230x000

Phase sequence: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23. Strictly increasing. Message delivered in one cycle.

Now route the opposite direction from node 0x000 (binary 0000 0000 000) to node 0xFFF (binary 1111 1111 1111).

\[\Delta = \texttt{0xFFF} \oplus \texttt{0x000} = \texttt{0xFFF} = \texttt{1111 1111 1111}_2\]

Because the source starts with all bits = 0, every hop uses the even phase for that dimension (the 0→1 direction): phase $2d$.

Hop Current Node Dimension Current Bit Phase Next Node
10x0000000x001
20x0011020x003
30x0032040x007
40x0073060x00F
50x00F4080x01F
60x01F50100x03F
70x03F60120x07F
80x07F70140x0FF
90x0FF80160x1FF
100x1FF90180x3FF
110x3FF100200x7FF
120x7FF110220xFFF

Phase sequence: 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22. Strictly increasing. Message delivered in one cycle.

Bounded Latency

The maximum number of hops in a 12-dimensional hypercube is 12 (all bits differ). The maximum number of phases is 24. Every message — regardless of source, destination, or network load — completes within one 24-phase cycle.

This isn’t worst-case latency. This is every-case latency. Not “usually fast, sometimes awful.” Not “fine until the network gets busy.” The schedule makes the network a metronome: every link gets a turn, every turn has an owner, and the longest possible route still fits inside a single superframe.

Latency stops being a statistical property and becomes a design constant. You can write software assuming, “a message sent now will be delivered by the end of this cycle,” the same way you assume a flip-flop will settle by the next clock edge. The whole system becomes synchronous at the network level. The machine isn’t just connected; it’s coordinated. It doesn’t chatter. It breathes.

Wall-clock latency depends on phase duration:

Phase Rate Phase Duration Cycle Duration 12-Hop Latency
1 kHz1 ms24 ms24 ms
10 kHz100 µs2.4 ms2.4 ms
100 kHz10 µs240 µs240 µs

At 10 kHz phase rate with 1 Mbps baud, each phase carries ~100 bits — enough for a 10-byte payload plus header. The entire machine can exchange data at 2+ Gbps aggregate bandwidth with deterministic 2.4 ms latency.

What Falls Out

The TDMA scheme eliminates entire categories of complexity:

No collision detection. Link ownership is determined by the schedule. Two nodes never try to transmit on the same link simultaneously.

No backoff or retries. Collisions don’t happen, so recovery logic doesn’t exist.

No buffering beyond one message. A node receives at most one message per link per phase. The “buffer” is a single register.

No deadlock. Dimension-ordered routing is provably deadlock-free, and the phase schedule enforces dimension ordering automatically.

Trivial routing logic. The algorithm at each node:

void route_message(uint16_t dest, uint8_t *payload) {
    uint16_t delta = my_addr ^ dest;
    for (int dim = 0; dim < 12; dim++) {
        if (delta & (1 << dim)) {
            int phase = 2 * dim + ((my_addr >> dim) & 1);
            wait_for_phase(phase);
            transmit(dim, payload);
            return; // next hop handles remaining dimensions
        }
    }
}

That’s it. The complexity is in the schedule, not the silicon.

Why This Works Here

This scheme exists because of constraints. The project couldn’t afford dedicated router ASICs. I couldn’t find a microcontroller with 12 hardware UARTs. The one UART I had could be remapped to different pins at runtime.

TDMA solves all of these problems simultaneously:

The original Connection Machine CM-1 used dedicated routing hardware because the constraints were different: custom silicon, DARPA budget, and a goal of supporting arbitrary asynchronous communication patterns. The routers were complex because they were solving the general case. My workloads are synchronous. Compute, exchange, compute, exchange. For that pattern, the general case is unnecessary overhead. Lockstep TDMA is a feature, not a limitation.

My Context

This sounds smart, and it probably is, but I must admit I’m a bit out of my depth here. Not so much in that I couldn’t think of how to do this, but my depth doesn’t extend to why this wasn’t done before. I’ve checked the literature, and the ideas are there. Bertsekas goes through the dimension ordering of message passing through a hypercube, and Stout sort of, kind of, applies this ‘master scheduler clock scheme’. No one put these two ideas together to create a collision-free hypercube with bounded latency.

The original Connection Machine CM-1 solved a slightly different problem with its routers. The CM-1 solved general purposes communication under unpredictable traffic. This required buffers, arbitration, and deadlock avoidance. In the CM-1, there is a lot of silicon dedicated to simply passing messages between nodes.

This TDMA scheme requires effectively zero silicon, and only needs a relatively slow global clock. I lose the idea that ‘any node can talk to any dimension at any time’, but if you zoom out to the superset of the phase clock that doesn’t really matter at all. It’s just a small implementation detail.

Am I the first person to think of this? I have no idea.

References

back to main project page

main