Seq2Seq and Encoder-Decoder: From Fixed Semantic Vectors to Attention Alignment

Map one sequence into another: the encoder reads the source, the decoder autoregressively generates the target, and attention reselects input evidence at every step.

Mechanism Lab

Animation: the encoder reads source, attention routes evidence, and the decoder writes target

The animation starts with source tokens and encoder states, shows the fixed-context bottleneck, opens the attention heatmap, and lets the decoder read dynamic context at each output step.

Encode

The encoder reads source tokens and writes position-level hidden states h_i.

h_i = f_enc(E_x[x_i], h_{i-1})

01 / Intuition

Core Intuition

Seq2Seq addresses problems where both input and output are sequences of variable, and generally different, length: translation, summarization, code generation, table-to-text, and reviewer-comments-to-revision-plan workflows.

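The encoder turns x_1,...,x_T into hidden states h_1,...,h_T. Early models passed only a single fixed-size context vector c to the decoder, so the same number of dimensions had to carry the whole source no matter how large T was. That is the information bottleneck.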

The decoder generates output according to p(y_t | y_<t, x). Training often uses teacher forcing, which feeds the gold previous tokens as y_<t; at inference no gold tokens exist, so the decoder can only condition on tokens it has generated itself.

Attention lets the decoder compute source-position weights alpha_{t,i} at every output step, replacing a fixed c with dynamic context c_t and improving long-input alignment.

02 / Math

Probability factorization, bottlenecks, and attention

01 / Conditional sequence probability

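The objective is to model the distribution of a target sequence y = (y_1,...,y_M) given an input sequence x = (x_1,...,x_T). The chain rule decomposes the sequence probability into a product of next-token probabilities, each conditioned on the source and on the target prefix.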

p(y|x) = prod_{t=1}^M p(y_t | y_{<t}, x)
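
As a quick numeric check of the factorization, the sketch below multiplies a few per-step probabilities and compares the result with the sum of their logs. The values in `step_probs` are invented for illustration; a real model would produce them step by step.

```python
import numpy as np

# Hypothetical per-step probabilities p(y_t | y_<t, x) for a target of length M = 3.
# These numbers are made up; they are not the output of any trained model.
step_probs = np.array([0.5, 0.8, 0.25])

# Chain rule: the sequence probability is the product of the step probabilities.
seq_prob = step_probs.prod()

# In practice the product is computed as a sum of logs to avoid numerical underflow.
seq_log_prob = np.log(step_probs).sum()

print("p(y|x):", seq_prob)
print("log p(y|x):", seq_log_prob)
```

Working in log space is the usual choice because long targets multiply many numbers below 1, which underflows quickly in floating point.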

02 / Encoder states

An RNN encoder reads source tokens and produces one hidden state h_i per position. Bidirectional encoders concatenate forward and backward states.

h_i = f_enc(E_x[x_i], h_{i-1})
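
The bidirectional variant mentioned above can be sketched as follows. This is a minimal illustration that assumes a plain tanh RNN for f_enc and random parameters; `make_params` and `run_rnn` are hypothetical helper names, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_h, T = 4, 3, 5

# Hypothetical embedded source sequence E_x[x_1..x_T]; random values for illustration.
X = rng.normal(size=(T, d_emb))

def make_params():
    # One independent parameter set per direction.
    return (rng.normal(size=(d_emb, d_h)) / np.sqrt(d_emb),
            rng.normal(size=(d_h, d_h)) / np.sqrt(d_h),
            np.zeros(d_h))

def run_rnn(inputs, params):
    # Simple tanh RNN: h_i = tanh(x_i W_xh + h_{i-1} W_hh + b).
    Wxh, Whh, b = params
    h = np.zeros(Whh.shape[0])
    states = []
    for x_i in inputs:
        h = np.tanh(x_i @ Wxh + h @ Whh + b)
        states.append(h)
    return np.stack(states)

forward = run_rnn(X, make_params())               # reads x_1 -> x_T
backward = run_rnn(X[::-1], make_params())[::-1]  # reads x_T -> x_1, then re-aligned to positions

# Position i gets [forward_i ; backward_i], so it summarizes both left and right context.
H = np.concatenate([forward, backward], axis=1)
print(H.shape)  # (5, 6)
```

Note that the concatenated state has width 2 * d_h, so anything that consumes the encoder states must be sized accordingly.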

03 / Fixed-context bottleneck

Original encoder-decoder models often compressed the whole source into c, such as the final state h_T. Every decoder step then depends on the same c.

c = q(h_1,...,h_T), s_t=f_dec(E_y[y_{t-1}], s_{t-1}, c)
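
To make the bottleneck concrete, here is a small sketch of a decoder that receives the same c at every step. It assumes q picks the final encoder state and f_dec is a single tanh cell over the concatenated inputs; both are illustrative choices, and all arrays are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
d_emb, d_h, T, M = 4, 3, 6, 3

# Hypothetical encoder states h_1..h_T and embedded decoder inputs E_y[y_0..y_{M-1}].
H = rng.normal(size=(T, d_h))
Y_prev = rng.normal(size=(M, d_emb))

# q(.) chosen here as "take the final encoder state"; a mean or max would also fit the formula.
c = H[-1]

# f_dec assumed to be a tanh cell over the concatenation [y_{t-1}; s_{t-1}; c].
W = rng.normal(size=(d_emb + d_h + d_h, d_h)) / np.sqrt(d_emb + 2 * d_h)

s = np.zeros(d_h)
decoder_states = []
for y_prev in Y_prev:
    s = np.tanh(np.concatenate([y_prev, s, c]) @ W)  # the same c at every step
    decoder_states.append(s)
decoder_states = np.stack(decoder_states)

print("context size:", c.shape)                 # stays d_h no matter how large T is
print("decoder states:", decoder_states.shape)
```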

04 / Attention scores

At target step t, a scoring function a compares the previous decoder state s_{t-1} with each encoder state h_i. A softmax over the source positions i then turns the scores into alignment weights that are non-negative and sum to 1.

e_{t,i}=a(s_{t-1},h_i), alpha_{t,i}=softmax_i(e_{t,i})
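
The formula leaves the scoring function a unspecified. Two common choices are sketched below: a plain dot product and an additive (tanh) score. The parameter names `W_s`, `W_h`, and `v` are assumptions made for this example, and all values are random.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
T, d_h, d_a = 4, 3, 5

H = rng.normal(size=(T, d_h))   # hypothetical encoder states h_1..h_T
s_prev = rng.normal(size=d_h)   # hypothetical previous decoder state s_{t-1}

# Option 1: dot-product score, a(s, h) = s . h (requires equal dimensions).
e_dot = H @ s_prev

# Option 2: additive score, a(s, h) = v . tanh(W_s s + W_h h).
# W_s, W_h, v would be learned; they are random here purely for illustration.
W_s = rng.normal(size=(d_h, d_a))
W_h = rng.normal(size=(d_h, d_a))
v = rng.normal(size=d_a)
e_add = np.tanh(s_prev @ W_s + H @ W_h) @ v

alpha_dot = softmax(e_dot)
alpha_add = softmax(e_add)
print(alpha_dot.round(3), alpha_dot.sum())
print(alpha_add.round(3), alpha_add.sum())
```

Either way the result is one weight per source position; only the way the raw score is computed differs.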

05 / Dynamic context

The context vector becomes a weighted average of source hidden states, so different output tokens can look at different input positions.

c_t = sum_i alpha_{t,i} h_i
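
A small worked example may help here. The encoder states and scores below are hand-picked (not produced by a model) so that the softmax weights come out as 0.25, 0.5, 0.25 up to floating-point rounding, which makes the weighted average easy to verify by hand.

```python
import numpy as np

# Three hypothetical encoder states (rows), d_h = 2, chosen so the arithmetic is easy to follow.
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Scores picked as logs so that softmax yields weights proportional to 1 : 2 : 1.
e = np.log(np.array([1.0, 2.0, 1.0]))
alpha = np.exp(e) / np.exp(e).sum()

# c_t = sum_i alpha_{t,i} h_i, written as a vector-matrix product.
c_t = alpha @ H

print("alpha:", alpha)
print("c_t:", c_t)
```

By hand: 0.25 * [1, 0] + 0.5 * [0, 1] + 0.25 * [1, 1] = [0.5, 0.75], which is what the product computes.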

06 / Training and inference gap

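Training minimizes negative log likelihood with teacher forcing, so the model only ever sees gold prefixes. Inference uses greedy decoding or beam search and conditions on the model's own outputs. This mismatch is called exposure bias: an early mistake produces a prefix the model never saw in training, and errors can compound. Separately, because every additional token multiplies in a probability below 1, ranking hypotheses by raw sequence probability tends to favor shorter outputs (length bias).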

L = -sum_{t=1}^M log p(y_t^* | y_{<t}^*, x)
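
The difference between the two regimes can be sketched with a toy model. The table `P` below is invented for illustration and stands in for p(y_t | y_<t, x); it conditions only on the previous token, which a real Seq2Seq decoder would not. `teacher_forced_nll` and `greedy_decode` are hypothetical helper names.

```python
import numpy as np

# Toy "model": next-token distribution depends only on the previous token.
# Rows = previous token, columns = next token. Vocabulary: 0=<bos>, 1=a, 2=b, 3=<eos>.
P = np.array([
    [0.0, 0.6, 0.4, 0.0],   # after <bos>
    [0.0, 0.1, 0.6, 0.3],   # after a
    [0.0, 0.2, 0.1, 0.7],   # after b
    [0.0, 0.0, 0.0, 1.0],   # after <eos>
])
BOS, EOS = 0, 3

def teacher_forced_nll(gold):
    # Condition each step on the gold previous token, as in training.
    prev = [BOS] + gold[:-1]
    return -sum(np.log(P[p, y]) for p, y in zip(prev, gold))

def greedy_decode(max_len=10):
    # Condition each step on the model's own previous choice, as at inference.
    out, prev = [], BOS
    for _ in range(max_len):
        prev = int(P[prev].argmax())
        out.append(prev)
        if prev == EOS:
            break
    return out

gold = [2, 1, 3]  # b a <eos>
print("NLL of gold:", teacher_forced_nll(gold))
print("greedy output:", greedy_decode())
```

Here the loss is evaluated along the gold path (b, a, <eos>), while greedy decoding follows a different path (a, b, <eos>). The prefixes visited at inference are not the ones scored during training, which is the gap described above.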

03 / Code

NumPy demo: minimal Seq2Seq forward pass with bilinear (multiplicative) attention

This framework-free snippet makes encoder states, attention weights, dynamic context, and decoder logits explicit.

import numpy as np

def softmax(z):
    z = z - z.max()
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

rng = np.random.default_rng(13)
vocab_in, vocab_out = 18, 20
d_emb, d_h = 5, 6

source = np.array([2, 5, 7, 11])      # x_1 ... x_T
target_in = np.array([1, 4, 8])       # <bos>, y_1, y_2 under teacher forcing

E_src = rng.normal(size=(vocab_in, d_emb)) / np.sqrt(d_emb)
E_tgt = rng.normal(size=(vocab_out, d_emb)) / np.sqrt(d_emb)

Wxh = rng.normal(size=(d_emb, d_h)) / np.sqrt(d_emb)
Whh = rng.normal(size=(d_h, d_h)) / np.sqrt(d_h)
bh = np.zeros(d_h)

Watt = rng.normal(size=(d_h, d_h)) / np.sqrt(d_h)
Wdec = rng.normal(size=(d_emb + d_h, d_h)) / np.sqrt(d_emb + d_h)
Ws = rng.normal(size=(d_h, d_h)) / np.sqrt(d_h)
bd = np.zeros(d_h)
Wo = rng.normal(size=(2 * d_h, vocab_out)) / np.sqrt(2 * d_h)
bo = np.zeros(vocab_out)

# Encoder: produce one hidden state per source token.
h = np.zeros(d_h)
encoder_states = []
for token in source:
    x_i = E_src[token]
    h = np.tanh(x_i @ Wxh + h @ Whh + bh)
    encoder_states.append(h)
encoder_states = np.stack(encoder_states)  # [T, d_h]

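# Decoder: teacher forcing plus attention at every output step.
s = encoder_states[-1]  # initialize the decoder state from the final encoder state
logits = []
alignments = []
for token in target_in:
    y_prev = E_tgt[token]                # embedding of the gold previous token
    scores = encoder_states @ Watt @ s   # bilinear score h_i^T W s_{t-1}, shape [T]
    alpha = softmax(scores)              # alignment weights over source positions
    context = alpha @ encoder_states     # dynamic context c_t, shape [d_h]

    decoder_input = np.concatenate([y_prev, context])
    s = np.tanh(decoder_input @ Wdec + s @ Ws + bd)     # new decoder state s_t
    logits_t = np.concatenate([s, context]) @ Wo + bo   # scores over the target vocabulary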

    logits.append(logits_t)
    alignments.append(alpha)

print("encoder states:", encoder_states.shape)
print("decoder logits:", np.stack(logits).shape)
print("attention weights for step 2:", alignments[1].round(3))

04 / Case

Case: translating one research language into another

  • The classic Seq2Seq use case is machine translation, but in a StatsPAI setting it can translate a research task into draft code, a regression table into prose, or reviewer comments into a revision checklist.
  • Suppose the input is four reviewer-comment blocks: data source, identification assumptions, robustness, and writing structure. The encoder creates one state per block; the decoder generates each response-letter sentence.
  • Without attention, all evidence is compressed into one fixed vector and later sentences may forget early comments. With attention, each generated sentence can realign to the relevant comment block.
  • For empirical-research assistants, the alignment heatmap becomes an audit trace: when a rebuttal sentence is generated, which comment or result table did the model mainly route through?
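
The audit-trace idea in the last bullet can be sketched as follows. The weights in `alignments` are invented for illustration and do not come from a trained model; the block names are the four comment blocks from the example above.

```python
import numpy as np

blocks = ["data source", "identification assumptions", "robustness", "writing structure"]

# Hypothetical attention weights: one row per generated response sentence,
# one column per reviewer-comment block.
alignments = np.array([
    [0.70, 0.15, 0.10, 0.05],
    [0.10, 0.65, 0.20, 0.05],
    [0.05, 0.30, 0.60, 0.05],
])
assert np.allclose(alignments.sum(axis=1), 1.0)  # each row is a distribution

# Audit trace: for each output sentence, record the block that received the most weight.
trace = [(t + 1, blocks[int(row.argmax())], float(row.max()))
         for t, row in enumerate(alignments)]

for step, block, weight in trace:
    print(f"sentence {step}: mainly attended to '{block}' (weight {weight:.2f})")
```

A trace like this only records where the weights concentrated. It is a starting point for human review, not evidence that the sentence is supported by that block.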

05 / Risks

Common Pitfalls

  • Assuming a fixed context vector is enough for long inputs. Long text, tables, and multi-block comments usually need attention or retrieval.
  • Ignoring the distribution gap between teacher forcing and autoregressive inference, which can produce low training loss but poor generation.
  • Making beam search too wide; larger beams can amplify generic short outputs, repetition, or length bias.
  • Forgetting to mask padding tokens, letting attention place probability mass on non-existent source positions.
  • Treating an attention heatmap as causal explanation. It helps audit routing but does not replace identification assumptions or human verification.
  • In research automation, optimizing fluency without verifying citations, data, code, and numerical results.
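
For the padding pitfall in particular, one possible fix is a masked softmax, sketched below. `masked_softmax` is a hypothetical helper, the scores are invented, and the sketch assumes every sequence has at least one real (unmasked) token.

```python
import numpy as np

def masked_softmax(scores, mask):
    # mask is True for real tokens and False for padding.
    # Padded positions are set to -inf so they get exactly zero weight after exp.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max()
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Hypothetical scores for a source padded from length 3 to length 5.
scores = np.array([1.2, 0.3, -0.5, 2.0, 2.0])
mask = np.array([True, True, True, False, False])

shifted = np.exp(scores - scores.max())
alpha_unmasked = shifted / shifted.sum()
alpha_masked = masked_softmax(scores, mask)

print("unmasked:", alpha_unmasked.round(3))  # padding absorbs most of the weight
print("masked:  ", alpha_masked.round(3))    # padding positions are exactly 0
```

The mask has to be applied before normalization. Zeroing the weights after the softmax would leave the remaining weights summing to less than 1.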

References