Visually Explaining AlphaFold 2 & 3

AlphaFold 3 is one of the more architecturally interesting models I've come across. The paper is good but moves fast, and I found myself wanting a slower, visual walkthrough I could actually sit with. This is that walkthrough.

This post assumes you're comfortable with attention. If you need a refresher, The Illustrated Transformer is the best one out there. It also skips why protein structure prediction matters and what AlphaFold changed for biology; there's plenty written on that. The focus here is purely mechanical: how molecules are represented inside the model, and what operations turn them into a predicted 3D structure.

Overview

AF3 predicts the structure of a protein, optionally in complex with other proteins, nucleic acids, or small molecules, from sequence alone (with small molecules specified by their chemical identity rather than a sequence). That broader input space is the first thing that sets it apart from AF2. A quick note on terminology used throughout: a "token" is a single amino acid (for proteins), a single nucleotide (for DNA/RNA), or an individual atom for anything that doesn't fit those two.
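
To make the token definition concrete, here is a minimal sketch of that rule in Python. The `Entity` class, the `kind` labels, and the `tokenize` function are hypothetical names invented for illustration; they are not AF3's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    kind: str          # "protein", "dna", "rna", or "ligand" (illustrative labels)
    units: list[str]   # residues, nucleotides, or atom names


def tokenize(entities):
    """Return one (chain_id, level, unit) token per residue, nucleotide, or atom."""
    tokens = []
    for chain_id, entity in enumerate(entities):
        if entity.kind == "protein":
            level = "residue"
        elif entity.kind in ("dna", "rna"):
            level = "nucleotide"
        else:
            level = "atom"  # anything that is not a standard polymer unit
        for unit in entity.units:
            tokens.append((chain_id, level, unit))
    return tokens


complex_ = [
    Entity("protein", list("MKTAY")),
    Entity("dna", list("ACGT")),
    Entity("ligand", ["C1", "C2", "O1"]),
]
tokens = tokenize(complex_)
print(len(tokens))  # 5 residues + 4 nucleotides + 3 atoms = 12
```

The token count, not the atom count, is the $N$ that sets the size of the representations described below.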

The model has three stages: Input Preparation (embedding sequences and querying structural databases), Representation Learning (refining pair and single representations through stacked attention modules), and Structure Prediction (generating 3D coordinates via diffusion).

Two types of representation run through the model. A single representation ($N \times C$) stores one feature vector per token or atom, capturing individual identity. A pair representation ($N \times N \times C$) stores one feature vector per pair, capturing relationships between them. Most computation happens in the pair representation because structure prediction is about distances and orientations between entities, not individual properties alone.
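
The shapes are easier to hold in mind with a toy example. The sketch below uses nested Python lists and tiny made-up sizes; the outer-sum initialization of the pair tensor is an assumption chosen for simplicity, not the model's actual initialization.

```python
import random

N, C = 4, 3  # tiny illustrative sizes, not the model's real dimensions
rng = random.Random(0)

# Single representation: one C-dimensional vector per token -> shape (N, C).
single = [[rng.gauss(0, 1) for _ in range(C)] for _ in range(N)]

# Pair representation: one vector per ordered token pair -> shape (N, N, C).
# Seeded here with single[i] + single[j] as an illustrative stand-in only.
pair = [[[single[i][c] + single[j][c] for c in range(C)]
         for j in range(N)] for i in range(N)]


def shape(x):
    """Return the nested-list dimensions as a tuple."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


print(shape(single))  # (4, 3)
print(shape(pair))    # (4, 4, 3)
```

Note the quadratic growth: doubling the number of tokens doubles the single representation but quadruples the pair representation.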

AlphaFold 3: Full Architecture

From AF2 to AF3

AF2 established the pair and single representation framework that AF3 inherits. To understand what AF3 adds and why, it helps to know what AF2 did and what it could not do. AF2 predicted protein structures only, operated at the residue level throughout its trunk (one vector per amino acid, all the way through), and was deterministic: the same input always produced the same output. Its architecture had two main stages: a 48-block Evoformer that jointly refined an MSA representation and a pair representation, followed by a structure module that used residue-specific backbone frames and torsion angles to produce 3D coordinates. The structure module's predictions had to be consistent under global rotations and translations of the molecule, which led to the Invariant Point Attention (IPA) design.

AF3 keeps the pair representation and all the triangle-based operations that update it. These carry over largely unchanged into the Pairformer and are the most proven part of AF2. What AF3 replaces is everything around them. The 48-block Evoformer is replaced by a simpler 4-block MSA module followed by a 48-block Pairformer that operates on pairs and singles only. After those 4 MSA blocks, the MSA representation is discarded entirely. The structure module is replaced by a diffusion module that works directly on raw atom coordinates. The one-sentence version: AF3 keeps AF2's pair processing, cuts the MSA processing drastically, and replaces the structure module with diffusion.
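
The control flow of the trunk as described above can be sketched in a few lines. Everything here is a placeholder: `msa_block` and `pairformer_block` are hypothetical no-op functions standing in for the real modules, and only the block counts and the point where the MSA is dropped come from the text.

```python
MSA_BLOCKS = 4          # block counts as stated in the text
PAIRFORMER_BLOCKS = 48


def msa_block(msa, pair):
    # Placeholder: the real block would update both the MSA and the pair tensor.
    return msa, pair


def pairformer_block(single, pair):
    # Placeholder: stands in for the triangle updates and attention layers.
    return single, pair


def trunk(single, pair, msa):
    log = []
    for _ in range(MSA_BLOCKS):
        msa, pair = msa_block(msa, pair)
        log.append("msa")
    del msa  # the MSA representation is not used past this point
    for _ in range(PAIRFORMER_BLOCKS):
        single, pair = pairformer_block(single, pair)
        log.append("pairformer")
    return single, pair, log


single_out, pair_out, log = trunk(single="s0", pair="z0", msa="m0")
print(log.count("msa"), log.count("pairformer"))  # 4 48
```

The point of the sketch is the asymmetry: whatever the MSA contributes has to be written into the pair representation within the first four blocks, because nothing downstream can read the MSA.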

AF3 is generative, not deterministic

The most consequential change is that AF3 is a generative model. The diffusion module is trained to denoise atomic coordinates: given corrupted positions, predict the true ones. At inference time, random noise is sampled and iteratively denoised. The same input run twice with different random seeds will produce different structures. For most complex types, this variation is small and confidence-based ranking selects the best sample from a handful of seeds. For antibody-antigen complexes, performance keeps improving even up to 1,000 seeds, which reflects genuine geometric uncertainty at the interface rather than a quirk of sampling. This is a fundamental shift in what the model is doing: AF2 predicted a single answer, whereas AF3 draws from a distribution over plausible structures.
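
A toy sampler shows why different seeds can land on different structures. This is a heavily simplified sketch under stated assumptions: the coordinates are 1D, `toy_denoiser` is a hypothetical stand-in for the trained network that simply returns the nearest of two made-up "plausible structures", the noise schedule is invented, and the update is a plain deterministic Euler step rather than the sampler AF3 actually uses.

```python
import random

# Two hypothetical "plausible structures" (1D coordinates for three atoms).
MODES = [[0.0, 1.0, 2.0], [0.0, -1.0, -2.0]]


def toy_denoiser(x, sigma):
    """Stand-in for the trained network; sigma is ignored in this toy version."""
    def sq_dist(mode):
        return sum((a - b) ** 2 for a, b in zip(x, mode))
    return min(MODES, key=sq_dist)


def sample(seed, sigmas=(8.0, 4.0, 2.0, 1.0, 0.5, 0.0)):
    rng = random.Random(seed)
    # Start from pure noise at the highest noise level.
    x = [rng.gauss(0.0, sigmas[0]) for _ in range(3)]
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        denoised = toy_denoiser(x, sigma)
        # Euler step: keep a shrinking fraction of the remaining noise.
        ratio = sigma_next / sigma
        x = [d + ratio * (xi - d) for xi, d in zip(x, denoised)]
    return x


samples = {seed: sample(seed) for seed in range(8)}
for seed, coords in samples.items():
    print(seed, coords)
```

Each seed is reproducible on its own, but which mode a run ends in depends on where its initial noise happened to fall. In the real model a confidence-based ranking step would then pick among such samples.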

Equivariance is dropped

AF2's IPA was carefully designed so that the attention operation remained invariant to global rotations and translations of the molecule. This made sense when the model operated inside residue backbone frames and needed to reason about relative orientations. AF3 abandons rotational invariance entirely. The diffusion training objective does not require it, and dropping it simplifies handling arbitrary chemistry: ligands, nucleic acids, and modified residues all have molecular graphs that do not fit naturally into a residue-frame-based equivariant design. No IPA, no FAPE loss, no backbone frame representation.
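
The distinction is easy to check numerically. The sketch below, with made-up atom positions, rotates a small point set and compares raw coordinates against pairwise distances: the former change, the latter do not. A model that consumes raw coordinates therefore sees a different input after a rotation, and has to learn to cope with that from data rather than by construction.

```python
import math


def rotate_z(points, angle):
    """Rotate 3D points about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def pairwise_distances(points):
    return [[math.dist(p, q) for q in points] for p in points]


atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.3)]  # illustrative values
rotated = rotate_z(atoms, math.radians(40))

d0 = pairwise_distances(atoms)
d1 = pairwise_distances(rotated)

coords_changed = any(
    not math.isclose(a, b, abs_tol=1e-9)
    for p, q in zip(atoms, rotated) for a, b in zip(p, q)
)
distances_same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row0, row1 in zip(d0, d1) for a, b in zip(row0, row1)
)
print(coords_changed, distances_same)  # True True
```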

Atom-level representations are new

AF2 worked entirely at the residue level in its trunk. There was no atom-level computation until the structure module's final side-chain prediction step. AF3 introduces atom-level representations in the input embedder, specifically the atom single representation $c$, the atom pair representation $p$, and the Atom Transformer output $q$. These have no AF2 analog. They are necessary because ligands and modified residues cannot be compressed to a single residue vector the way standard amino acids can. The input embedder computes these atom-level features from conformer geometry, then aggregates them to the token level before the Pairformer runs, bridging the two scales.
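
The aggregation step can be sketched as a pooling operation over an atom-to-token index. This sketch assumes mean pooling and assumes every token owns at least one atom; the function name and the feature values are hypothetical, and the real embedder applies learned layers around this step.

```python
def aggregate_atoms_to_tokens(atom_features, atom_to_token, num_tokens):
    """Mean-pool per-atom feature vectors into one vector per token."""
    dim = len(atom_features[0])
    sums = [[0.0] * dim for _ in range(num_tokens)]
    counts = [0] * num_tokens
    for feats, tok in zip(atom_features, atom_to_token):
        counts[tok] += 1
        for k, value in enumerate(feats):
            sums[tok][k] += value
    # Assumes counts[t] > 0 for every token.
    return [[v / counts[t] for v in sums[t]] for t in range(num_tokens)]


# Token 0 is a residue with three atoms; tokens 1 and 2 are single ligand atoms.
atom_features = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [5.0, 5.0], [7.0, 1.0]]
atom_to_token = [0, 0, 0, 1, 2]
token_features = aggregate_atoms_to_tokens(atom_features, atom_to_token, 3)
print(token_features)  # [[2.0, 2.0], [5.0, 5.0], [7.0, 1.0]]
```

For ligand atoms the pooling is a no-op, since each atom is its own token; for residues it collapses many atoms into one vector.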

Confidence training required a new procedure

In AF2, the confidence head was trained by comparing the structure module output directly to the true structure at each training step. This does not transfer to diffusion: each training step denoises from a single sampled noise level and never produces a complete structure. AF3 solves this with a rollout procedure. During training, the full diffusion process is run at a coarser step size to generate a complete predicted structure, and that structure is used to supervise the confidence head. The confidence outputs are similar to AF2's (per-residue pLDDT and a predicted aligned error matrix, PAE) with one addition: a predicted distance error matrix (PDE) that estimates the error in the predicted distance matrix.
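
To see what the PDE is estimating, it helps to compute the quantity it targets. The sketch below takes a predicted structure (standing in for a rollout output) and a true structure, both with made-up coordinates, and computes the absolute difference between their distance matrices. How the confidence head represents this target, for example as binned values, is not covered here.

```python
import math


def distance_matrix(coords):
    return [[math.dist(p, q) for q in coords] for p in coords]


def distance_error(pred_coords, true_coords):
    """Absolute error between predicted and true pairwise distances."""
    d_pred = distance_matrix(pred_coords)
    d_true = distance_matrix(true_coords)
    n = len(pred_coords)
    return [[abs(d_pred[i][j] - d_true[i][j]) for j in range(n)]
            for i in range(n)]


true_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
pred_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 3.0, 0.0)]  # e.g. a rollout output

err = distance_error(pred_coords, true_coords)
for row in err:
    print([round(v, 3) for v in row])
```

Because it is built from distances only, this target needs no alignment between the predicted and true structures, unlike an aligned error.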

Hallucination and cross-distillation

Generative models tend to hallucinate: they invent compact, plausible-looking structure even in regions that are genuinely disordered. AF3 addresses this by enriching its training data with structures predicted by AlphaFold-Multimer v2.3. In those predictions, disordered regions appear as extended loops rather than compact globules. Training on them teaches AF3 to produce ribbon-like disorder with low confidence rather than incorrectly confident compact folds. This cross-distillation is not a small fix: without it, AF3 hallucination rates are substantially higher.

The MSA de-emphasis is a principled bet

Reducing MSA processing from 48 blocks to 4 is not just an efficiency cut. AF3 is making a claim: that coevolutionary signal is not required for cross-entity interactions like protein-ligand or protein-nucleic acid binding. These interactions are primarily governed by local chemistry and geometry rather than by how residues co-evolved across species. The results support this. AF3 substantially outperforms specialized docking tools on protein-ligand binding despite using far less MSA processing than AF2 used for protein-only prediction.