{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Homework 6: Deep Q-Networks\n", "\n", "In this assignment you will implement deep q-learning and test this algorithm on the Frozen-lake environment. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import numpy as np\n", "import numpy.random as npr\n", "import random\n", "import matplotlib.pyplot as plt\n", "import copy\n", "from collections import defaultdict, namedtuple\n", "from itertools import count\n", "from more_itertools import windowed\n", "from tqdm import tqdm\n", "\n", "import tensorflow as tf\n", "import tensorflow.contrib.slim as slim\n", "\n", "from gym.envs.toy_text.frozen_lake import FrozenLakeEnv\n", "\n", "plt.style.use('ggplot')\n", "\n", "# Remove this if you want to use a GPU\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"] = ''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### $\\epsilon$-Greedy Decay" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "class LinearSchedule(object): \n", " def __init__(self, schedule_timesteps, final_p, initial_p=1.0): \n", " '''\n", " Linear interpolation between initial_p and final_p over \n", " schedule_timesteps. After this many timesteps pass final_p is \n", " returned. \n", " \n", " Args: \n", " - schedule_timesteps: Number of timesteps for which to linearly anneal initial_p to final_p \n", " - initial_p: initial output value \n", " -final_p: final output value \n", " ''' \n", " self.schedule_timesteps = schedule_timesteps \n", " self.final_p = final_p \n", " self.initial_p = initial_p \n", " \n", " def value(self, t): \n", " fraction = min(float(t) / self.schedule_timesteps, 1.0) \n", " return self.initial_p + fraction * (self.final_p - self.initial_p) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Replay Buffer" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward', 'done'))\n", "\n", "class ReplayMemory(object):\n", " def __init__(self, size):\n", " '''\n", " Replay buffer used to store transition experiences. Experiences will be removed in a \n", " FIFO manner after reaching maximum buffer size.\n", " \n", " Args:\n", " - size: Maximum size of the buffer.\n", " '''\n", " self.size = size\n", " self.memory = list()\n", " self.idx = 0\n", " \n", " def add(self, *args):\n", " if len(self.memory) < self.size:\n", " self.memory.append(None)\n", " self.memory[self.idx] = Transition(*args)\n", " self.idx = (self.idx + 1) % self.size\n", " \n", " def sample(self, batch_size):\n", " return random.sample(self.memory, batch_size)\n", " \n", " def __len__(self):\n", " return len(self.memory)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1 (\\# Points):\n", "\n", "Implement the necessary Tensorflow operations for a Deep Q-Network. This should include the following:\n", "1. Any number of fully connected layers which take the state as input and outputs the q-values for each possible action. You should only need a few small fully conected layers as the environments are relatively simple.\n", "2. A prediction operation which returns the index of the best action.\n", "3. Operations to compute the loss.\n", "4. A optimizer to minimize the loss, i.e. SGD, RMSProp, Adam. 
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "class DQN(object):\n", " def __init__(self, state_shape, action_shape, lr=0.001):\n", " '''\n", " Deep Q-Network Tensorflow model.\n", " \n", " Args:\n", " - state_shape: Input state shape \n", " - action_shape: Output action shape\n", " '''\n", " self.input = tf.placeholder(shape=[None, state_shape], dtype=tf.float32)\n", " self.fc1 = slim.fully_connected(self.input, 32)\n", " self.fc2 = slim.fully_connected(self.fc1, 64)\n", " self.q_head = slim.fully_connected(self.fc2, action_shape, activation_fn=None)\n", " self.predict = tf.argmax(self.q_head, 1)\n", " \n", " self.target_Q = tf.placeholder(shape=[None],dtype=tf.float32)\n", " self.actions = tf.placeholder(shape=[None],dtype=tf.int32)\n", " self.actions_onehot = tf.one_hot(self.actions, action_shape, dtype=tf.float32)\n", " \n", " self.Q = tf.reduce_sum(tf.multiply(self.q_head, self.actions_onehot), axis=1)\n", " \n", " self.td_error = tf.square(self.target_Q - self.Q)\n", " self.loss = tf.reduce_mean(self.td_error)\n", " self.trainer = tf.train.AdamOptimizer(learning_rate=lr)\n", " self.updateModel = self.trainer.minimize(self.loss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2 (\\# Points):\n", "\n", "Implement the following method which will be used when optimizing the model. The optimize_model method should compute the target Q-values from the batch and use the optimizer op created in the previous exercise. **Note:** We are using a target network to compute the target q-values in order to stabalize training." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def optimize_model(session, policy_net, target_net, batch, gamma):\n", " '''\n", " Calculates the target Q-values for the given batch and uses them to update the model.\n", " \n", " Args:\n", " - session: Tensorflow session\n", " - policy_net: Policy DQN model\n", " - target_net: DQN model used to generate target Q-values\n", " - batch: Batch of experiences uesd to optimize model\n", " - gamma: Discount factor\n", " '''\n", " batch_size = len(batch[0])\n", " not_done_masks = np.logical_not(batch.done)\n", " not_done_next_states = np.expand_dims(batch.next_state, axis=1)[not_done_masks]\n", " \n", " next_state_q = np.zeros(batch_size)\n", " next_state_q[not_done_masks] = session.run(target_net.q_head, \n", " feed_dict={target_net.input:not_done_next_states}).max(axis=1)\n", " expected_q = batch.reward + (gamma * next_state_q)\n", " \n", " session.run(policy_net.updateModel, feed_dict={policy_net.input:np.vstack(batch.state),\n", " policy_net.target_Q:expected_q, \n", " policy_net.actions:batch.action})\n", "\n", "def update_target_graph_op(tf_vars, tau):\n", " '''\n", " Creates a Tensorflow op which updates the target model towards the policy model by a small amount.\n", " \n", " Args:\n", " - tf_vars: All trainable variables in the Tensorflow graph\n", " - tau: Amount to update the target model\n", " '''\n", " total_vars = len(tf_vars)\n", " update_ops = list()\n", " for idx,var in enumerate(tf_vars[0:total_vars//2]):\n", " op = tf_vars[idx + total_vars//2].assign((var.value()*tau) + \\\n", " ((1-tau)*tf_vars[idx+total_vars//2].value()))\n", " update_ops.append(op)\n", " return update_ops\n", "\n", "def update_target(session, update_ops):\n", " '''\n", " Calls each update op to update the target model.\n", " \n", " Args:\n", " - session: Tensorflow session\n", " - update_ops: The update ops which moves the 
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def optimize_model(session, policy_net, target_net, batch, gamma):\n", "    '''\n", "    Calculates the target Q-values for the given batch and uses them to update the model.\n", "\n", "    Args:\n", "    - session: Tensorflow session\n", "    - policy_net: Policy DQN model\n", "    - target_net: DQN model used to generate target Q-values\n", "    - batch: Batch of experiences used to optimize the model\n", "    - gamma: Discount factor\n", "    '''\n", "    batch_size = len(batch[0])\n", "    not_done_masks = np.logical_not(batch.done)\n", "    not_done_next_states = np.expand_dims(batch.next_state, axis=1)[not_done_masks]\n", "\n", "    next_state_q = np.zeros(batch_size)\n", "    next_state_q[not_done_masks] = session.run(target_net.q_head,\n", "                                               feed_dict={target_net.input: not_done_next_states}).max(axis=1)\n", "    expected_q = batch.reward + (gamma * next_state_q)\n", "\n", "    session.run(policy_net.updateModel, feed_dict={policy_net.input: np.vstack(batch.state),\n", "                                                   policy_net.target_Q: expected_q,\n", "                                                   policy_net.actions: batch.action})\n", "\n", "def update_target_graph_op(tf_vars, tau):\n", "    '''\n", "    Creates a Tensorflow op which updates the target model towards the policy model by a small amount.\n", "\n", "    Args:\n", "    - tf_vars: All trainable variables in the Tensorflow graph\n", "    - tau: Amount to update the target model\n", "    '''\n", "    total_vars = len(tf_vars)\n", "    update_ops = list()\n", "    for idx, var in enumerate(tf_vars[0:total_vars//2]):\n", "        op = tf_vars[idx + total_vars//2].assign((var.value()*tau) + \\\n", "                                                 ((1-tau)*tf_vars[idx+total_vars//2].value()))\n", "        update_ops.append(op)\n", "    return update_ops\n", "\n", "def update_target(session, update_ops):\n", "    '''\n", "    Calls each update op to update the target model.\n", "\n", "    Args:\n", "    - session: Tensorflow session\n", "    - update_ops: The update ops which move the target model towards the policy model\n", "    '''\n", "    for op in update_ops:\n", "        session.run(op)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3 (\# Points):\n", "\n", "Implement the method below to train the model in the given environment. You should choose actions in an $\epsilon$-greedy fashion while annealing $\epsilon$ over time as we did in the previous assignment." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def train(env, num_episodes=500, gamma=0.99, batch_size=64,\n", "          annealing_steps=1000, s_epsilon=1.0, f_epsilon=0.1):\n", "    '''\n", "    DQN algorithm\n", "\n", "    Args:\n", "    - env: The environment to train the agent on\n", "    - num_episodes: The number of episodes to train the agent for\n", "    - gamma: The discount factor\n", "    - batch_size: Number of experiences in a batch\n", "    - annealing_steps: The number of episodes over which to anneal epsilon\n", "    - s_epsilon: The initial epsilon value for e-greedy action selection\n", "    - f_epsilon: The final epsilon value for the e-greedy action selection\n", "\n", "    Returns: (policy_net, episode_rewards)\n", "    - policy_net: Trained DQN model\n", "    - episode_rewards: List containing the reward of each episode during training\n", "    '''\n", "    policy_net = DQN(1, env.action_space.n)\n", "    target_net = DQN(1, env.action_space.n)\n", "    target_ops = update_target_graph_op(tf.trainable_variables(), 0.1)\n", "\n", "    memory = ReplayMemory(20000)\n", "    epsilon = LinearSchedule(annealing_steps, f_epsilon, s_epsilon)\n", "\n", "    pbar = tqdm(range(num_episodes))\n", "    pbar.set_description('Steps: 0 | Reward: 0.0 | Epsilon: {}'.format(s_epsilon))\n", "\n", "    steps_done = 0\n", "    episode_rewards = list()\n", "\n", "    session = tf.Session()\n", "    session.run(tf.global_variables_initializer())\n", "    for i_episode in pbar:\n", "        # Initialize the environment and state\n", "        state = env.reset()\n", "        total_reward = 0\n", "        for t in count():\n", "            # Select and perform an action\n", "            if npr.rand() > epsilon.value(i_episode):\n", "                action = session.run(policy_net.predict, feed_dict={policy_net.input: [[state]]})[0]\n", "            else:\n", "                action = npr.choice(env.action_space.n)\n", "\n", "            next_state, reward, done, _ = env.step(action)\n", "            total_reward += reward\n", "\n", "            # Store the transition in memory\n", "            memory.add(state, action, next_state, reward, int(done))\n", "\n", "            # Move to the next state\n", "            state = copy.copy(next_state)\n", "            steps_done += 1\n", "\n", "            # Perform one step of the optimization\n", "            if len(memory) > batch_size:\n", "                transitions = memory.sample(batch_size)\n", "                batch = Transition(*zip(*transitions))\n", "                optimize_model(session, policy_net, target_net, batch, gamma)\n", "\n", "            # Update the target network\n", "            update_target(session, target_ops)\n", "\n", "            # Check if the episode is over and break if it is\n", "            if done:\n", "                episode_rewards.append(total_reward)\n", "                break\n", "\n", "        # Update progress bar\n", "        pbar.set_description('Steps: {} | Reward: {} | Epsilon: {:.2f}' \\\n", "                             .format(t, episode_rewards[-1], epsilon.value(i_episode)))\n", "    return policy_net, episode_rewards" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test your implementation on the FrozenLake environment (both slippery and not slippery). You should plot the episode rewards over time. It might be helpful to smooth this curve over a time window of 100 episodes in order to get a clearer picture of the learning process." ] },
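{ "cell_type": "markdown", "metadata": {}, "source": [ "The results cell below smooths the reward curve with more_itertools.windowed; the next cell is an optional, equivalent sketch that relies only on NumPy. The helper name smooth and the window size of 100 are illustrative choices rather than part of the assignment, and the cell assumes the import cell at the top of the notebook has been run." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def smooth(rewards, window=100):\n", "    '''Return the moving average of rewards over the given window.'''\n", "    kernel = np.ones(window) / window\n", "    return np.convolve(rewards, kernel, mode='valid')\n", "\n", "# Example usage, assuming rewards is the list returned by train:\n", "# plt.plot(smooth(rewards))\n", "# plt.show()" ] },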
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Steps: 5 | Reward: 1.0 | Epsilon: 0.00: 100%|██████████| 2000/2000 [01:13<00:00, 27.23it/s] \n" ] }, { "data": { "image/png": "<base64 PNG data omitted: plot of the 100-episode moving average of episode reward over the 2000 training episodes>", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "frozen_lake_env = FrozenLakeEnv(desc=None, map_name=\"4x4\",is_slippery=False)\n", "\n", "policy_net, rewards = train(frozen_lake_env, batch_size=64, num_episodes=2000, annealing_steps=100, f_epsilon=0.0, gamma=0.9)\n", "avg_reward = np.mean(list(windowed(rewards, 100)), axis=1)\n", "plt.plot(avg_reward)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 2 }