Overview
My MSc AI final project: a Deep Q-Network (DQN) agent that learns to play Snake from scratch. The agent sees the world through an 11-feature state vector, stores past experiences in a replay buffer, and uses epsilon-greedy exploration to balance trying new actions against exploiting what it already knows. I ran 12 experiments varying network width (256 vs 512 hidden units), depth, replay buffer size (10K to 200K transitions), and a wall-collision variant of the environment.
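A minimal sketch of the replay buffer idea, assuming a fixed-capacity first-in-first-out store that is sampled uniformly at random. The `Transition` tuple, the `ReplayBuffer` class and its method names are hypothetical, chosen for illustration rather than taken from the project code.

```python
import random
from collections import deque, namedtuple

# One stored experience (hypothetical field names, for illustration only)
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-capacity experience store; the oldest transitions are evicted first."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen silently drops its oldest item once full
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done) -> None:
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size: int) -> list:
        # Uniform random minibatch without replacement; this breaks up the
        # correlation between consecutive steps of the same game
        return random.sample(self.memory, batch_size)

    def __len__(self) -> int:
        return len(self.memory)
```

In this sketch, `capacity` is the value the buffer-size experiments would vary (10K to 200K).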
The Problem
Build an agent that figures out how to play Snake without me hard-coding any rules. Rewards are sparse (food only), the agent has to think more than one step ahead to avoid trapping itself, and there's the usual exploration vs exploitation trade-off to manage during training.
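Epsilon-greedy is one common way to manage that trade-off: act randomly with probability epsilon, otherwise pick the highest-valued action, and shrink epsilon as training progresses. The sketch below assumes an exponential per-episode decay with a floor; `epsilon_for_episode`, `select_action`, the action index mapping and the default values (`eps_start=1.0`, `eps_min=0.01`, `decay=0.99`) are illustrative assumptions, not the settings used in the experiments.

```python
import random
from typing import Sequence


def epsilon_for_episode(episode: int, eps_start: float = 1.0,
                        eps_min: float = 0.01, decay: float = 0.99) -> float:
    # Assumed schedule: exponential decay per episode, clipped at eps_min
    # so the agent never stops exploring entirely
    return max(eps_min, eps_start * decay ** episode)


def select_action(q_values: Sequence[float], epsilon: float, rng: random.Random) -> int:
    # Assumed action indices: 0 = straight, 1 = turn left, 2 = turn right
    if rng.random() < epsilon:
        # Explore: uniformly random action
        return rng.randrange(len(q_values))
    # Exploit: action with the highest estimated Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With epsilon at 0 the agent always takes the greedy action; with epsilon at 1 it ignores the Q-values entirely.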
The Approach
Deep Q-Learning with experience replay in PyTorch. The network takes the 11-feature state (danger in 3 directions, current heading, food position) and outputs Q-values for 3 actions (straight, turn left, turn right). Q-values are trained towards Bellman targets (the immediate reward plus the discounted best Q-value of the next state), and the exploration rate decays over time. I ran the 12 experiments to see what actually mattered rather than guessing.
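A minimal PyTorch sketch of the network and a single Bellman update on a minibatch. It assumes one hidden layer, a discount `gamma=0.9`, MSE loss and no separate target network (the description above doesn't say whether one was used); `QNetwork` and `train_step` are hypothetical names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_SIZE = 11  # danger (3) + heading + food position, as described above
N_ACTIONS = 3    # straight, turn left, turn right


class QNetwork(nn.Module):
    """11 -> hidden -> 3 MLP; `hidden` is the width knob (e.g. 256 or 512)."""

    def __init__(self, hidden: int = 256) -> None:
        super().__init__()
        self.fc1 = nn.Linear(STATE_SIZE, hidden)
        self.fc2 = nn.Linear(hidden, N_ACTIONS)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.relu(self.fc1(x)))


def train_step(model, optimizer, states, actions, rewards, next_states, dones, gamma=0.9):
    """One gradient step towards the Bellman target r + gamma * max_a' Q(s', a')."""
    # Q(s, a) for the actions actually taken (actions must be int64 indices)
    q_pred = model(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Assumption: the same network provides the bootstrap estimate (no target net)
        q_next = model(next_states).max(dim=1).values
        # Terminal transitions (done = 1) keep only the immediate reward
        target = rewards + gamma * q_next * (1.0 - dones)
    loss = F.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A minibatch here would come from the replay buffer, with states stacked into a `(batch, 11)` float tensor and `dones` stored as 0.0/1.0 floats.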
Outcome
The best configuration consistently scored 40+ after 200 training episodes. A few clear findings: wider beat deeper for this task, bigger replay buffers stabilised learning, and the wall-collision variant was noticeably harder and needed different hyperparameters. Everything is documented with training curves, architecture comparisons, and statistical summaries.
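When comparing width against depth, it helps to know how many parameters each variant carries. This hypothetical helper counts weights and biases for a fully connected network with 11 inputs and 3 outputs; the configurations listed are illustrative, since the exact depths tested aren't given here.

```python
def mlp_param_count(layer_sizes: list[int]) -> int:
    """Count weights + biases in a fully connected net, e.g. [11, 256, 3]."""
    # Each linear layer (n_in -> n_out) has n_in * n_out weights and n_out biases
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Illustrative configurations (hypothetical, not the project's exact list)
configs = {
    "1 x 256": [11, 256, 3],
    "1 x 512": [11, 512, 3],
    "2 x 256": [11, 256, 256, 3],
}
for name, sizes in configs.items():
    print(f"{name}: {mlp_param_count(sizes)} parameters")
# 1 x 256: 3843 parameters
# 1 x 512: 7683 parameters
# 2 x 256: 69635 parameters
```

Adding a second 256-wide hidden layer adds far more parameters than doubling a single layer's width, so width-vs-depth comparisons are not equal parameter budgets; that is worth keeping in mind when reading the architecture results.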
