Policy Gradient vs. Double DQN: a head-to-head comparison of two reinforcement learning approaches on a deceptively complex game.
Connect-4 sounds simple. Drop pieces, get four in a row, win.
But the state space contains roughly 4.5 trillion reachable positions. You can’t brute-force it at game speed. You need something that actually learns — and that’s where reinforcement learning gets interesting.
We built two RL agents from scratch, let them fight each other and a baseline CNN model, and then ran a full round-robin tournament to find out which architecture actually wins.
The result surprised us.
Why Not Start From Zero?
Most RL tutorials begin with random initialization and watch the agent flail for thousands of episodes. We took a smarter approach: warm-start both agents from a pretrained CNN (M1) that had already learned to mimic Monte Carlo Tree Search move distributions.
The result is an agent that walks into RL training already knowing the basics of Connect-4. Policy Gradient just needs to shift probability toward winning moves. DDQN can skip the “learning that diagonal blocks exist” phase entirely.
Starting smart is dramatically more efficient than starting random.
Agent 1: Policy Gradient (REINFORCE)
The PG approach is conceptually clean: after each game, increase the probability of moves that appeared in wins, and decrease those from losses. No value estimation. No replay buffer. Just the policy, games, and gradient updates.
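The update rule above can be sketched in a few lines. This is a minimal, framework-free illustration rather than the training code used here: the function name `reinforce_loss` and the +1 / -1 / 0 outcome scheme are assumptions made for the example.

```python
import math

def reinforce_loss(action_probs, outcome):
    """REINFORCE loss for one finished game.

    action_probs: probability the policy gave to each move it actually played.
    outcome: +1.0 for a win, -1.0 for a loss, 0.0 for a draw (assumed scheme).
    """
    # Minimizing -outcome * log(p) raises p after a win and lowers it after a loss.
    log_probs = [math.log(p) for p in action_probs]
    return -outcome * sum(log_probs) / len(log_probs)

# Same two moves, opposite game results: the loss (and its gradient) flips sign.
loss_after_win = reinforce_loss([0.5, 0.25], outcome=1.0)
loss_after_defeat = reinforce_loss([0.5, 0.25], outcome=-1.0)
```

Because the whole game shares one outcome, every move in a won game is reinforced, including the mediocre ones. That is the source of REINFORCE's high variance.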
The Problem We Ran Into
The first version we ran without regularization failed badly. By iteration 200, the policy had collapsed — the agent was playing the same move in almost every situation. Entropy had dropped to near zero and loss had blown past -3.4.
The fix was entropy regularization: subtracting an entropy term from the loss, which rewards spread-out move distributions and penalizes overconfident ones.
    total_loss = pg_loss - 0.01 * tf.reduce_mean(entropy)
With α = 0.01, policy entropy decayed gradually from ~1.0 to ~0.25 over 2,000 iterations rather than collapsing within the first 200. This was the single most important change we made. Everything else was secondary.
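To make the entropy term concrete, here is a small sketch of how it could be computed from the policy's move distributions. It uses plain Python instead of TensorFlow so that it runs anywhere. The helper names `entropy` and `total_loss` are illustrative, and only the 0.01 coefficient comes from the run described above.

```python
import math

ALPHA = 0.01  # entropy coefficient from the article

def entropy(probs):
    """Shannon entropy (in nats) of one move distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def total_loss(pg_loss, batch_probs, alpha=ALPHA):
    """Policy-gradient loss minus alpha times the mean entropy of the batch."""
    mean_entropy = sum(entropy(p) for p in batch_probs) / len(batch_probs)
    return pg_loss - alpha * mean_entropy

uniform = [1 / 7] * 7            # maximally uncertain over the 7 columns
collapsed = [1.0] + [0.0] * 6    # always plays the same column
```

A uniform distribution over seven columns has entropy ln 7 ≈ 1.95 and a collapsed policy has entropy 0, so for the same `pg_loss` the collapsed policy ends up with the higher total loss.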
Self-Play With a Twist
Rather than training against a static opponent, we maintained an opponent pool that grows over time. Every 200 iterations, if the current agent beats a pool member more than 55% of the time, a snapshot of it is added to the pool as a new opponent.
This keeps the agent from overfitting to weak opponents and then stalling. The 55% gate threshold was well calibrated: tight enough to filter regressions, loose enough to let genuine progress through.
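The gating logic can be sketched as follows. The 200-iteration interval and the 55% threshold come from the text. The function name, the `evaluate` callable, and the choice of a random pool member as the gatekeeper are assumptions made for illustration.

```python
import random

GATE_EVERY = 200      # iterations between gate checks (from the article)
GATE_WIN_RATE = 0.55  # win rate required for promotion (from the article)

def maybe_promote(pool, candidate, iteration, evaluate):
    """Append `candidate` to `pool` if it passes the gate; return True if added.

    `evaluate(candidate, opponent)` is a hypothetical callable that returns the
    candidate's win rate against that opponent as a float in [0, 1].
    """
    if iteration % GATE_EVERY != 0:
        return False
    opponent = random.choice(pool)  # assumed: gate against a random pool member
    if evaluate(candidate, opponent) > GATE_WIN_RATE:
        pool.append(candidate)  # a real agent would store a frozen copy of its weights
        return True
    return False
```

In a real training loop `candidate` would be a frozen copy of the network weights, so that later updates to the live agent do not silently change an opponent already in the pool.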
Agent 2: Double Deep Q-Network (DDQN)
DDQN takes a fundamentally different angle. Instead of optimizing the policy directly, it learns a Q-function — an estimate of expected future reward for every (state, action) pair — and derives moves greedily from those values.
The “Double” part fixes a known bias in standard DQN: using the same network to both select and evaluate the best next action inflates Q-value estimates. DDQN decouples these two steps across an online and a target network.
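A minimal sketch of the Double DQN target for a single transition shows the decoupling. The discount factor of 0.99 and the function and argument names are assumptions for the example, not values taken from this project.

```python
GAMMA = 0.99  # discount factor; an assumed value, not taken from the article

def double_dqn_target(reward, done, q_online_next, q_target_next,
                      legal_moves, gamma=GAMMA):
    """Bellman target for one transition.

    q_online_next / q_target_next: next-state Q-values from each network.
    legal_moves: column indices that can still be played in the next state.
    """
    if done:
        return reward  # terminal transitions do not bootstrap
    # Select the next action with the online network...
    best = max(legal_moves, key=lambda a: q_online_next[a])
    # ...but evaluate that choice with the target network.
    return reward + gamma * q_target_next[best]

target = double_dqn_target(
    reward=0.0, done=False,
    q_online_next=[0.1, 0.9, 0.3],
    q_target_next=[0.2, 0.4, 0.8],
    legal_moves=[0, 1, 2],
)
```

Here the online network picks column 1, and the target network values that column at 0.4, giving a target of 0.99 × 0.4 = 0.396. Standard DQN would have taken the target network's own maximum (0.8) and produced a larger, more optimistic target.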
Three Design Decisions Worth Knowing
1. tanh over softmax
The final activation is tanh (range −1 to +1) rather than softmax. Q-values are independent estimates — forcing them to sum to 1 via softmax is mathematically wrong.
2. Define next-state correctly
We define s’ as the board after the opponent has also replied, not just after the DQN’s own move. This makes the MDP well-formed and prevents the Bellman update from incorrectly treating winning moves as non-terminal.
3. Custom masked loss
Standard MSE would push Q-values for unplayed columns toward zero on every update. We only have information about the column that was actually played, so the loss masks out every other column.
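A masked loss of this kind can be sketched as below. This is an illustrative plain-Python version with assumed names, not the project's actual loss function. In TensorFlow the same effect is commonly obtained by multiplying the squared error by a one-hot mask of the played column.

```python
def masked_mse(q_pred, actions, targets):
    """Mean squared error over only the column played in each sample.

    q_pred:  list of per-column Q-value lists, one list per sample.
    actions: index of the column actually played in each sample.
    targets: Bellman target for that played column.
    """
    # Unplayed columns never enter the sum, so they receive no gradient.
    errors = [(q[a] - t) ** 2 for q, a, t in zip(q_pred, actions, targets)]
    return sum(errors) / len(errors)

loss = masked_mse(
    q_pred=[[0.2, -0.5, 0.9], [0.0, 0.3, -0.1]],
    actions=[2, 1],
    targets=[1.0, 0.3],
)
```

Only the entries 0.9 and 0.3 contribute to the loss here. Changing any of the other four predictions leaves it unchanged.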
Training Metrics Said One Thing. The Tournament Said Another.
By training metrics, PG looked dominant: 73.2% win rate vs. the base model, stable entropy, faster convergence. DDQN was still improving after 9,000 games and posted only a 21.5% pool win rate.
Then we ran the round-robin tournament.
| Rank | Model | Wins | Win % |
|---|---|---|---|
| 🥇 1st | DDQN | 349 | 58.2% |
| 🥈 2nd | Policy Gradient | 330 | 55.0% |
| 3rd | Base CNN | 284 | 47.3% |
| 4th | Original M1 | 237 | 39.5% |
DDQN finished first, despite looking weaker on every training metric.
Why?
Training win rate and tournament win rate measure fundamentally different things. PG fine-tuned against a narrow set of pool opponents and may have developed patterns that work well against them specifically. DDQN, by learning a value function from a diverse replay buffer, generalized better across different opponent styles.
The takeaway: never use training metrics as a proxy for real performance. Always simulate the actual evaluation condition.
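A round-robin evaluation is cheap to set up. The sketch below is one way to tally it. The `play_game` callable and the side-alternation rule are assumptions, since the tournament harness itself is not shown here.

```python
from itertools import combinations

def round_robin(models, play_game, games_per_pair):
    """Play every pair `games_per_pair` times; return {name: (wins, win_rate)}.

    `play_game(first, second)` is a hypothetical callable returning the name
    of the winner, or None for a draw. Sides alternate so that each model
    moves first in half of the games of a pairing.
    """
    wins = {name: 0 for name in models}
    played = {name: 0 for name in models}
    for a, b in combinations(models, 2):
        for g in range(games_per_pair):
            first, second = (a, b) if g % 2 == 0 else (b, a)
            winner = play_game(first, second)
            played[a] += 1
            played[b] += 1
            if winner is not None:
                wins[winner] += 1
    return {name: (wins[name], wins[name] / played[name]) for name in models}
```

The percentages in the table above are consistent with 600 games per model (for example, 349 / 600 ≈ 58.2%), which with four models would correspond to 200 games per pairing. That figure is inferred from the table and not stated explicitly.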
The Bigger Picture
Both agents outperformed the CNN baseline in the tournament (58.2% and 55.0% vs. 47.3%), which is evidence that RL is worth the investment, even for games that might seem “solved.”
Policy Gradient is faster, more stable, and easier to reason about. DDQN is slower to converge but generalizes better in adversarial settings. The right choice depends on your compute budget and how diverse your opponent pool is likely to be.
For Connect-4, DDQN wins. For games with shorter horizons and richer feedback signals, PG might be the better call.