Architecture / Sequence Transduction
Seq2Seq and Encoder-Decoder: From Fixed Semantic Vectors to Attention Alignment
Map one sequence into another: the encoder reads the source, the decoder autoregressively generates the target, and attention reselects input evidence at every step.
Mechanism Lab
Animation: the encoder reads source, attention routes evidence, and the decoder writes target
The animation starts with source tokens and encoder states, shows the fixed-context bottleneck, opens the attention heatmap, and lets the decoder read dynamic context at each output step.
Encode
The encoder reads source tokens and writes position-level hidden states h_i.
h_i = f_enc(x_i, h_{i-1})
01 / Intuition
Core Intuition
Seq2Seq solves variable-length sequence-to-sequence problems: translation, summarization, code generation, table-to-text, and reviewer-comments-to-revision-plan workflows.
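The encoder turns x_1,...,x_T into hidden states. Early models passed only a single fixed-size context vector c (typically the final hidden state) to the decoder, so every source token had to fit into that one vector. This is the information bottleneck.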
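The bottleneck can be made concrete with a small sketch. The sizes, the random parameters, and the `encode` helper below are illustrative assumptions, not part of the original page: whatever the source length, the fixed context keeps the same shape, while the per-position states grow with T.

```python
import numpy as np

def encode(tokens, E, Wxh, Whh):
    """Run a tanh RNN over `tokens`; return all states with shape [T, d_h]."""
    h = np.zeros(Whh.shape[0])
    states = []
    for tok in tokens:
        h = np.tanh(E[tok] @ Wxh + h @ Whh)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
vocab, d_emb, d_h = 50, 4, 6          # hypothetical sizes
E = rng.normal(size=(vocab, d_emb))
Wxh = rng.normal(size=(d_emb, d_h))
Whh = rng.normal(size=(d_h, d_h))

short_src = rng.integers(0, vocab, size=4)
long_src = rng.integers(0, vocab, size=40)

states_short = encode(short_src, E, Wxh, Whh)
states_long = encode(long_src, E, Wxh, Whh)

# Fixed context: keep only the final state, whatever the source length.
c_short, c_long = states_short[-1], states_long[-1]

print(states_short.shape, states_long.shape)  # (4, 6) (40, 6)
print(c_short.shape, c_long.shape)            # (6,) (6,)
```

A 40-token source and a 4-token source are both squeezed into six numbers here, which is why later decoder steps can lose track of early input.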
The decoder generates output according to p(y_t | y_<t, x). Training often uses teacher forcing, which feeds the reference prefix at every step; at inference the decoder can only condition on the tokens it has generated itself.
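The difference between the two regimes is only in what gets fed back at each step. The sketch below uses a toy decoder; `step`, `BOS`, the sizes, and the random weights are hypothetical and chosen for illustration, and end-of-sequence handling is omitted to keep it short.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d_h = 8, 5                      # hypothetical sizes
BOS = 0                                # hypothetical start-of-sequence id
E = rng.normal(size=(vocab, d_h))
W = rng.normal(size=(d_h, d_h)) / np.sqrt(d_h)
Wo = rng.normal(size=(d_h, vocab)) / np.sqrt(d_h)

def step(prev_token, state):
    """One toy decoder step: returns the new state and next-token log-probs."""
    state = np.tanh(E[prev_token] + state @ W)
    logits = state @ Wo
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return state, log_probs

def teacher_forced_nll(gold, state):
    """Training: condition each step on the gold previous token."""
    prev, nll = BOS, 0.0
    for y in gold:
        state, log_probs = step(prev, state)
        nll -= log_probs[y]
        prev = y                       # feed the reference token, not the prediction
    return nll

def greedy_decode(state, max_len):
    """Inference: condition each step on the model's own previous output."""
    prev, out = BOS, []
    for _ in range(max_len):
        state, log_probs = step(prev, state)
        prev = int(np.argmax(log_probs))
        out.append(prev)
    return out

s0 = rng.normal(size=d_h)              # stands in for an encoder summary of x
gold = [3, 5, 2]
print("teacher-forced NLL:", round(teacher_forced_nll(gold, s0), 3))
print("greedy output:", greedy_decode(s0, max_len=3))
```

In `teacher_forced_nll` an early mistake never reaches later steps, whereas in `greedy_decode` every later step inherits it.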
Attention lets the decoder compute source-position weights alpha_{t,i} at every output step, replacing a fixed c with dynamic context c_t and improving long-input alignment.
02 / Math
Probability factorization, bottlenecks, and attention
01 / Conditional sequence probability
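The objective is to model the distribution of the target sequence y = (y_1, ..., y_M) given an input sequence x. The chain rule decomposes the probability of the whole target sequence into a product of next-token probabilities, each conditioned on the source and on the target prefix.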
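As a numeric sketch of this chain-rule factorization, assume a made-up table of next-token distributions for one fixed target prefix (the numbers are illustrative, not model output). The sequence probability is the product of the selected entries, and the sum of their logs gives the same quantity in log space.

```python
import numpy as np

# Hypothetical next-token distributions p(. | y_<t, x) for a 3-step target
# over a 4-word vocabulary; each row sums to 1.
step_probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.80, 0.10, 0.05],
    [0.25, 0.25, 0.40, 0.10],
])
target = np.array([0, 1, 2])           # y_1, y_2, y_3

per_step = step_probs[np.arange(len(target)), target]   # p(y_t | y_<t, x)
seq_prob = per_step.prod()                               # product form
seq_log_prob = np.log(per_step).sum()                    # sum-of-logs form

print(per_step)                 # [0.7 0.8 0.4]
print(round(seq_prob, 3))       # 0.224
```

Working in log space avoids numerical underflow when M is large, which is why training losses are written as sums of log-probabilities.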
p(y | x) = prod_{t=1}^{M} p(y_t | y_{<t}, x)
02 / Encoder states
An RNN encoder reads source tokens and produces one hidden state h_i per position. Bidirectional encoders concatenate forward and backward states.
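The bidirectional variant mentioned here can be sketched as two independent passes whose states are concatenated per position. The `run_rnn` helper, the sizes, and the random inputs are assumptions for illustration.

```python
import numpy as np

def run_rnn(inputs, Wxh, Whh):
    """Tanh RNN over `inputs` [T, d_emb]; returns states [T, d_h]."""
    h = np.zeros(Whh.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(x @ Wxh + h @ Whh)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(2)
T, d_emb, d_h = 4, 5, 6                # hypothetical sizes
X = rng.normal(size=(T, d_emb))        # stands in for E_x[x_1..x_T]

# Separate parameters per direction.
fwd = run_rnn(X, rng.normal(size=(d_emb, d_h)), rng.normal(size=(d_h, d_h)))
bwd_reversed = run_rnn(X[::-1], rng.normal(size=(d_emb, d_h)), rng.normal(size=(d_h, d_h)))
bwd = bwd_reversed[::-1]               # re-align so bwd[i] belongs to position i

H = np.concatenate([fwd, bwd], axis=1) # [T, 2 * d_h]
print(H.shape)                         # (4, 12)
```

Each row of `H` then summarizes both the left and the right context of its position, at the cost of doubling the state width that downstream layers must accept.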
h_i = f_enc(E_x[x_i], h_{i-1})
03 / Fixed-context bottleneck
Original encoder-decoder models often compressed the whole source into c, such as the final state h_T. Every decoder step then depends on the same c.
c = q(h_1, ..., h_T), s_t = f_dec(E_y[y_{t-1}], s_{t-1}, c)
04 / Attention scores
At target step t, attention scores the previous decoder state against each encoder state, then normalizes scores into alignment weights.
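The scoring function a(.,.) can take several forms. As a sketch, the snippet below compares a plain dot-product score with the additive form v . tanh(W_s s + W_h h) used by Bahdanau et al.; the sizes, the parameter names `W_s`, `W_h`, `v`, and the random values are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(3)
T, d_h, d_a = 4, 6, 8                  # hypothetical sizes; d_a is the scoring width
H = rng.normal(size=(T, d_h))          # encoder states h_1..h_T
s_prev = rng.normal(size=d_h)          # previous decoder state s_{t-1}

# Dot-product score: e_{t,i} = s_{t-1} . h_i  (needs matching dimensions)
e_dot = H @ s_prev

# Additive score: e_{t,i} = v . tanh(W_s s_{t-1} + W_h h_i)
W_s = rng.normal(size=(d_h, d_a))
W_h = rng.normal(size=(d_h, d_a))
v = rng.normal(size=d_a)
e_add = np.tanh(s_prev @ W_s + H @ W_h) @ v

alpha_dot, alpha_add = softmax(e_dot), softmax(e_add)
print(alpha_dot.round(3), alpha_dot.sum())
print(alpha_add.round(3), alpha_add.sum())
```

Both variants yield one score per source position and a weight vector that sums to 1; they differ only in how the decoder and encoder states are compared.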
e_{t,i} = a(s_{t-1}, h_i), alpha_{t,i} = softmax_i(e_{t,i})
05 / Dynamic context
The context vector becomes a weighted average of source hidden states, so different output tokens can look at different input positions.
c_t = sum_i alpha_{t,i} h_i
06 / Training and inference gap
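Training minimizes negative log likelihood with teacher forcing; inference uses greedy decoding or beam search over the model's own outputs. The mismatch creates exposure bias, because the model is never trained on its own mistakes, and summed log-probabilities create length bias, because every extra token lowers a hypothesis score.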
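Length bias is easy to see with two made-up hypotheses. The per-token probabilities below are hypothetical, and dividing by length is only one simple correction assumed here for illustration; it is not prescribed by the text.

```python
import numpy as np

# Hypothetical per-token log-probabilities for two finished hypotheses.
short_hyp = np.log([0.5, 0.4])                   # 2 tokens
long_hyp = np.log([0.6, 0.6, 0.6, 0.6])          # 4 tokens, each more confident

def raw_score(token_log_probs):
    """Summed log-probability, as in an unnormalized beam score."""
    return token_log_probs.sum()

def length_normalized(token_log_probs):
    """Average log-probability per token."""
    return token_log_probs.sum() / len(token_log_probs)

print(raw_score(short_hyp) > raw_score(long_hyp))                   # True
print(length_normalized(short_hyp) > length_normalized(long_hyp))   # False
```

The raw sum prefers the shorter hypothesis even though each of its tokens is less likely; per-token averaging reverses the ranking.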
L = -sum_t log p(y_t^* | y_{<t}^*, x)
03 / Code
NumPy demo: minimal Seq2Seq forward pass with multiplicative (bilinear) attention
This framework-free snippet makes encoder states, attention weights, dynamic context, and decoder logits explicit.
import numpy as np
def softmax(z):
    z = z - z.max()
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()
rng = np.random.default_rng(13)
vocab_in, vocab_out = 18, 20
d_emb, d_h = 5, 6
source = np.array([2, 5, 7, 11]) # x_1 ... x_T
target_in = np.array([1, 4, 8]) # <bos>, y_1, y_2 under teacher forcing
E_src = rng.normal(size=(vocab_in, d_emb)) / np.sqrt(d_emb)
E_tgt = rng.normal(size=(vocab_out, d_emb)) / np.sqrt(d_emb)
Wxh = rng.normal(size=(d_emb, d_h)) / np.sqrt(d_emb)
Whh = rng.normal(size=(d_h, d_h)) / np.sqrt(d_h)
bh = np.zeros(d_h)
Watt = rng.normal(size=(d_h, d_h)) / np.sqrt(d_h)
Wdec = rng.normal(size=(d_emb + d_h, d_h)) / np.sqrt(d_emb + d_h)
Ws = rng.normal(size=(d_h, d_h)) / np.sqrt(d_h)
bd = np.zeros(d_h)
Wo = rng.normal(size=(2 * d_h, vocab_out)) / np.sqrt(2 * d_h)
bo = np.zeros(vocab_out)
# Encoder: produce one hidden state per source token.
h = np.zeros(d_h)
encoder_states = []
for token in source:
    x_i = E_src[token]
    h = np.tanh(x_i @ Wxh + h @ Whh + bh)
    encoder_states.append(h)
encoder_states = np.stack(encoder_states) # [T, d_h]
# Decoder: teacher forcing plus attention at every output step.
s = encoder_states[-1]
logits = []
alignments = []
for token in target_in:
    y_prev = E_tgt[token]
    # Bilinear score of each encoder state against the previous decoder state.
    scores = encoder_states @ Watt @ s
    alpha = softmax(scores)  # alignment weights over source positions, sum to 1
    context = alpha @ encoder_states  # dynamic context c_t, shape [d_h]
    decoder_input = np.concatenate([y_prev, context])
    s = np.tanh(decoder_input @ Wdec + s @ Ws + bd)
    logits_t = np.concatenate([s, context]) @ Wo + bo
    logits.append(logits_t)
    alignments.append(alpha)
print("encoder states:", encoder_states.shape)
print("decoder logits:", np.stack(logits).shape)
print("attention weights for step 2:", alignments[1].round(3))
04 / Case
Case: translating one research language into another
- The classic Seq2Seq use case is machine translation, but in a StatsPAI setting it can translate a research task into draft code, a regression table into prose, or reviewer comments into a revision checklist.
- Suppose the input is four reviewer-comment blocks: data source, identification assumptions, robustness, and writing structure. The encoder creates one state per block; the decoder generates each response-letter sentence.
- Without attention, all evidence is compressed into one fixed vector and later sentences may forget early comments. With attention, each generated sentence can realign to the relevant comment block.
- For empirical-research assistants, the alignment heatmap becomes an audit trace: when a rebuttal sentence is generated, which comment or result table did the model mainly route through?
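The audit-trace idea in the last point can be sketched as a small post-processing step over an alignment matrix. The matrix values, the `audit_trace` helper, and the 0.5 threshold are hypothetical; a real model would supply the weights.

```python
import numpy as np

source_blocks = ["data source", "identification assumptions",
                 "robustness", "writing structure"]

# Hypothetical alignment matrix: one row per generated sentence,
# one column per comment block; each row sums to 1.
alignments = np.array([
    [0.70, 0.15, 0.10, 0.05],
    [0.10, 0.65, 0.20, 0.05],
    [0.05, 0.20, 0.70, 0.05],
])

def audit_trace(alpha, labels, threshold=0.5):
    """For each output step, report the most-attended block and flag diffuse rows."""
    trace = []
    for t, row in enumerate(alpha):
        top = int(np.argmax(row))
        trace.append({
            "step": t + 1,
            "block": labels[top],
            "weight": float(row[top]),
            "diffuse": bool(row[top] < threshold),
        })
    return trace

for entry in audit_trace(alignments, source_blocks):
    print(entry)
```

Rows flagged as diffuse are the ones worth reading by hand, since no single comment block dominates the generated sentence.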
05 / Risks
Common Pitfalls
References
- Sutskever, Vinyals, and Le (2014), Sequence to Sequence Learning with Neural Networks. https://arxiv.org/abs/1409.3215
- Cho et al. (2014), Learning Phrase Representations using RNN Encoder-Decoder. https://arxiv.org/abs/1406.1078
- Bahdanau, Cho, and Bengio (2014), Neural Machine Translation by Jointly Learning to Align and Translate. https://arxiv.org/abs/1409.0473
- Luong, Pham, and Manning (2015), Effective Approaches to Attention-based Neural Machine Translation. https://arxiv.org/abs/1508.04025