Standard Transformer models generate text purely autoregressively: each token is predicted from the previous tokens alone, like a stateless function whose only “memory” is the input sequence itself. The Free Transformer adds a learned latent-variable layer in the middle of the network that acts as hidden internal state the model can condition on during generation. Think of it as giving the model a small amount of working memory (16 bits per token) for making implicit decisions about generation strategy before committing to specific tokens. During training, an encoder network learns to set these latent variables appropriately for each training example, using a Variational Autoencoder framework. During inference they are sampled randomly, but the model has learned to use whatever random values it gets to organize its generation process more effectively. The practical result is that, with only 3% additional overhead (one extra transformer block for the encoder), the model shows 3-11% improvements on complex tasks such as code generation and mathematical reasoning, because it can effectively “plan” aspects of the output structure rather than having to reconstruct everything from the token sequence so far.
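To make the training/inference split concrete, here is a minimal sketch in plain Python. It is not the paper's code: the function names are hypothetical, and it assumes the 16-bit latent is built from 16 independent bits, with the encoder emitting one logit per bit and the inference-time prior being uniform over all 2^16 values. Only the "16 bits per token" figure comes from the text above.

```python
import math
import random

LATENT_BITS = 16  # from the text: 16 bits of latent state per token


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_prior_latent(rng):
    """Inference (assumed uniform prior): draw 16 random bits as one integer."""
    return rng.getrandbits(LATENT_BITS)


def sample_encoder_latent(bit_logits, rng):
    """Training (assumed form): sample each bit from sigmoid(logit).

    bit_logits stands in for the output of a hypothetical encoder that
    has seen the whole training sequence.
    """
    index = 0
    for logit in bit_logits:
        bit = 1 if rng.random() < sigmoid(logit) else 0
        index = (index << 1) | bit  # pack bits, most significant first
    return index


def kl_to_uniform_bits(bit_logits):
    """KL divergence, in bits, from the encoder's bit distribution to the
    uniform prior. This is the VAE penalty that limits how much information
    the encoder can pass through the latent."""
    total = 0.0
    for logit in bit_logits:
        p = sigmoid(logit)
        for q in (p, 1.0 - p):
            if q > 0.0:
                total += q * math.log2(q / 0.5)
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    print("inference latent:", sample_prior_latent(rng))
    logits = [2.0, -2.0] * (LATENT_BITS // 2)  # made-up encoder output
    print("training latent:", sample_encoder_latent(logits, rng))
    print("KL cost in bits:", kl_to_uniform_bits(logits))
```

Under these assumptions, an encoder that is indifferent about every bit (all logits zero) pays no KL cost and transmits nothing, while one that fixes every bit pays the full 16 bits. The decoder then conditions on whichever latent it is handed, whether it came from the encoder during training or from the random prior at inference.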

arxiv.org/pdf/2510….