{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Homework 6: Deep Q-Networks\n", "\n", "In this assignment you will implement deep Q-learning and test the algorithm on the FrozenLake environment. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import numpy as np\n", "import numpy.random as npr\n", "import random\n", "import matplotlib.pyplot as plt\n", "import copy\n", "from collections import defaultdict, namedtuple\n", "from itertools import count\n", "from more_itertools import windowed\n", "from tqdm import tqdm\n", "\n", "import tensorflow as tf\n", "import tensorflow.contrib.slim as slim\n", "\n", "from gym.envs.toy_text.frozen_lake import FrozenLakeEnv\n", "\n", "plt.style.use('ggplot')\n", "\n", "# Remove this if you want to use a GPU\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"] = ''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### $\\epsilon$-Greedy Decay" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "class LinearSchedule(object): \n", "    def __init__(self, schedule_timesteps, final_p, initial_p=1.0): \n", "        '''\n", "        Linear interpolation between initial_p and final_p over \n", "        schedule_timesteps. After this many timesteps pass, final_p is \n", "        returned. 
\n", "        \n", "        Args: \n", "        - schedule_timesteps: Number of timesteps for which to linearly anneal initial_p to final_p \n", "        - initial_p: initial output value \n", "        - final_p: final output value \n", "        ''' \n", "        self.schedule_timesteps = schedule_timesteps \n", "        self.final_p = final_p \n", "        self.initial_p = initial_p \n", "    \n", "    def value(self, t): \n", "        fraction = min(float(t) / self.schedule_timesteps, 1.0) \n", "        return self.initial_p + fraction * (self.final_p - self.initial_p) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Replay Buffer" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward', 'done'))\n", "\n", "class ReplayMemory(object):\n", "    def __init__(self, size):\n", "        '''\n", "        Replay buffer used to store transition experiences. Experiences will be removed in a \n", "        FIFO manner after reaching maximum buffer size.\n", "        \n", "        Args:\n", "        - size: Maximum size of the buffer.\n", "        '''\n", "        self.size = size\n", "        self.memory = list()\n", "        self.idx = 0\n", "    \n", "    def add(self, *args):\n", "        if len(self.memory) < self.size:\n", "            self.memory.append(None)\n", "        self.memory[self.idx] = Transition(*args)\n", "        self.idx = (self.idx + 1) % self.size\n", "    \n", "    def sample(self, batch_size):\n", "        return random.sample(self.memory, batch_size)\n", "    \n", "    def __len__(self):\n", "        return len(self.memory)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1 (\\# Points):\n", "\n", "Implement the necessary TensorFlow operations for a Deep Q-Network. This should include the following:\n", "1. Any number of fully connected layers which take the state as input and output the Q-values for each possible action. You should only need a few small fully connected layers as the environments are relatively simple.\n", "2. A prediction operation which returns the index of the best action.\n", "3. 
Operations to compute the loss.\n", "4. An optimizer to minimize the loss, e.g. SGD, RMSProp, or Adam. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "class DQN(object):\n", "    def __init__(self, state_shape, action_shape, lr=0.001):\n", "        '''\n", "        Deep Q-Network TensorFlow model.\n", "        \n", "        Args:\n", "        - state_shape: Input state shape \n", "        - action_shape: Output action shape\n", "        - lr: Learning rate for the Adam optimizer\n", "        '''\n", "        self.input = tf.placeholder(shape=[None, state_shape], dtype=tf.float32)\n", "        self.fc1 = slim.fully_connected(self.input, 32)\n", "        self.fc2 = slim.fully_connected(self.fc1, 64)\n", "        self.q_head = slim.fully_connected(self.fc2, action_shape, activation_fn=None)\n", "        self.predict = tf.argmax(self.q_head, 1)\n", "        \n", "        self.target_Q = tf.placeholder(shape=[None],dtype=tf.float32)\n", "        self.actions = tf.placeholder(shape=[None],dtype=tf.int32)\n", "        self.actions_onehot = tf.one_hot(self.actions, action_shape, dtype=tf.float32)\n", "        \n", "        self.Q = tf.reduce_sum(tf.multiply(self.q_head, self.actions_onehot), axis=1)\n", "        \n", "        self.td_error = tf.square(self.target_Q - self.Q)\n", "        self.loss = tf.reduce_mean(self.td_error)\n", "        self.trainer = tf.train.AdamOptimizer(learning_rate=lr)\n", "        self.updateModel = self.trainer.minimize(self.loss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2 (\\# Points):\n", "\n", "Implement the following method which will be used when optimizing the model. The optimize_model method should compute the target Q-values from the batch and use the optimizer op created in the previous exercise. **Note:** We are using a target network to compute the target Q-values in order to stabilize training."
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def optimize_model(session, policy_net, target_net, batch, gamma):\n", "    '''\n", "    Calculates the target Q-values for the given batch and uses them to update the model.\n", "    \n", "    Args:\n", "    - session: TensorFlow session\n", "    - policy_net: Policy DQN model\n", "    - target_net: DQN model used to generate target Q-values\n", "    - batch: Batch of experiences used to optimize model\n", "    - gamma: Discount factor\n", "    '''\n", "    batch_size = len(batch[0])\n", "    not_done_masks = np.logical_not(batch.done)\n", "    not_done_next_states = np.expand_dims(batch.next_state, axis=1)[not_done_masks]\n", "    \n", "    next_state_q = np.zeros(batch_size)\n", "    next_state_q[not_done_masks] = session.run(target_net.q_head, \n", "                                               feed_dict={target_net.input:not_done_next_states}).max(axis=1)\n", "    expected_q = batch.reward + (gamma * next_state_q)\n", "    \n", "    session.run(policy_net.updateModel, feed_dict={policy_net.input:np.vstack(batch.state),\n", "                                                   policy_net.target_Q:expected_q, \n", "                                                   policy_net.actions:batch.action})\n", "\n", "def update_target_graph_op(tf_vars, tau):\n", "    '''\n", "    Creates a TensorFlow op which updates the target model towards the policy model by a small amount.\n", "    \n", "    Args:\n", "    - tf_vars: All trainable variables in the TensorFlow graph\n", "    - tau: Amount to update the target model\n", "    '''\n", "    total_vars = len(tf_vars)\n", "    update_ops = list()\n", "    for idx,var in enumerate(tf_vars[0:total_vars//2]):\n", "        op = tf_vars[idx + total_vars//2].assign((var.value()*tau) + \\\n", "                                                 ((1-tau)*tf_vars[idx+total_vars//2].value()))\n", "        update_ops.append(op)\n", "    return update_ops\n", "\n", "def update_target(session, update_ops):\n", "    '''\n", "    Calls each update op to update the target model.\n", "    \n", "    Args:\n", "    - session: TensorFlow session\n", "    - update_ops: The update ops which move the target model towards the policy model\n", "    '''\n", "    for op in 
update_ops:\n", "        session.run(op)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3 (\\# Points):\n", "\n", "Implement the method below to train the model in the given environment. You should choose actions in an $\\epsilon$-greedy fashion while annealing $\\epsilon$ over time as we did in the previous assignment." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def train(env, num_episodes=500, gamma=0.99, batch_size=64,\n", "          annealing_steps=1000, s_epsilon=1.0, f_epsilon=0.1):\n", "    '''\n", "    DQN algorithm\n", "    \n", "    Args:\n", "    - env: The environment to train the agent on\n", "    - num_episodes: The number of episodes to train the agent for\n", "    - gamma: The discount factor\n", "    - batch_size: Number of experiences in a batch\n", "    - annealing_steps: The number of episodes to anneal epsilon over (the schedule is indexed by episode)\n", "    - s_epsilon: The initial epsilon value for e-greedy action selection\n", "    - f_epsilon: The final epsilon value for the e-greedy action selection\n", "    \n", "    Returns: (policy_net, episode_rewards)\n", "    - policy_net: Trained DQN model\n", "    - episode_rewards: List containing the total reward of each episode during training\n", "    '''\n", "    policy_net = DQN(1, env.action_space.n)\n", "    target_net = DQN(1, env.action_space.n)\n", "    target_ops = update_target_graph_op(tf.trainable_variables(), 0.1)\n", "    \n", "    memory = ReplayMemory(20000)\n", "    epsilon = LinearSchedule(annealing_steps, f_epsilon, s_epsilon)\n", "\n", "    pbar = tqdm(range(num_episodes))\n", "    pbar.set_description('Steps: 0 | Reward: 0.0 | Epsilon: {}'.format(s_epsilon))\n", "    \n", "    steps_done = 0\n", "    episode_rewards = list()\n", "    \n", "    session = tf.Session()\n", "    session.run(tf.global_variables_initializer())\n", "    for i_episode in pbar:\n", "        # Initialize the environment and state\n", "        state = env.reset()\n", "        total_reward = 0\n", "        for t in count():\n", "            # Select and perform an action\n", "            if npr.rand() > 
epsilon.value(i_episode):\n", "                action = session.run(policy_net.predict, feed_dict={policy_net.input:[[state]]})[0]\n", "            else:\n", "                action = npr.choice(env.action_space.n)\n", "            \n", "            next_state, reward, done, _ = env.step(action)\n", "            total_reward += reward\n", "\n", "            # Store the transition in memory\n", "            memory.add(state, action, next_state, reward, int(done))\n", "\n", "            # Move to the next state\n", "            state = copy.copy(next_state)\n", "            steps_done += 1\n", "\n", "            # Perform one step of the optimization\n", "            if len(memory) > batch_size:\n", "                transitions = memory.sample(batch_size)\n", "                batch = Transition(*zip(*transitions))\n", "                optimize_model(session, policy_net, target_net, batch, gamma)\n", "                \n", "                # Update the target network\n", "                update_target(session, target_ops)\n", "            \n", "            # Check if the episode is over and break if it is\n", "            if done:\n", "                episode_rewards.append(total_reward)\n", "                break\n", "        \n", "        # Update progress bar\n", "        pbar.set_description('Steps: {} | Reward: {} | Epsilon: {:.2f}' \\\n", "                             .format(t, episode_rewards[-1], epsilon.value(i_episode)))\n", "    return policy_net, episode_rewards" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test your implementation on the FrozenLake environment (both slippery and non-slippery). You should plot the episode rewards over time. It might be helpful to smooth this curve over a window of 100 episodes in order to get a clearer picture of the learning process." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Steps: 5 | Reward: 1.0 | Epsilon: 0.00: 100%|██████████| 2000/2000 [01:13<00:00, 27.23it/s] \n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "frozen_lake_env = FrozenLakeEnv(desc=None, map_name=\"4x4\",is_slippery=False)\n", "\n", "policy_net, rewards = train(frozen_lake_env, batch_size=64, num_episodes=2000, annealing_steps=100, f_epsilon=0.0, gamma=0.9)\n", "avg_reward = np.mean(list(windowed(rewards, 100)), axis=1)\n", "plt.plot(avg_reward)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 2 }