Lecture 4: Q-learning (table) exploit&exploration and discounted reward by Sung Kim

link: https://www.youtube.com/watch?v=MQ-3QScrFSI&list=PLlMkM4tgfjnKsCWav-Z2F-MMFRx-2gMGG&index=6

  • Problem with basic Q-learning: the agent only follows paths it has already experienced
  • Exploit VS Exploration
    • Exploit: use the Q-values the agent already knows
    • Exploration: take a risk and try a new path

Exploit VS Exploration: E-greedy

With probability e, the agent ignores the Q-table and takes a random action to explore a new path.

e = 0.1
if np.random.rand() < e:        # with 10% probability, explore: take a random action
    a = env.action_space.sample()
else:                           # with 90% probability, exploit: take the best-known action
    a = np.argmax(Q[s, :])

Exploit VS Exploration: decaying E-greedy

Act mostly at random early on, then follow known paths more often as training progresses.

for i in range(1000):
    e = 0.1 / (i + 1)           # e shrinks as episodes go on
    if np.random.rand() < e:    # explore with decaying probability
        a = env.action_space.sample()
    else:
        a = np.argmax(Q[s, :])  # otherwise exploit the current Q-table
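
The full FrozenLake example below applies this idea; instead of decaying e every step, it decays it every 100 episodes with e = 1.0 / ((i//100) + 1).
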
import gym
import numpy as np
import matplotlib.pyplot as plt
from gym.envs.registration import register

# Register FrozenLake with is_slippery=False (deterministic movement)
register(
    id='FrozenLake-v3',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name': '4x4', 'is_slippery': False}
)

env = gym.make('FrozenLake-v3')

"""E-greedy"""
# Initialize table with all zeros
Q = np.zeros([env.observation_space.n, env.action_space.n])  # 16, 4
# Discount factor
dis = 0.99
num_episodes = 2000

# Create list to contain the total reward per episode
rList = []
for i in range(num_episodes):
    # Reset environment and get first new observation
    state = env.reset()
    rAll = 0
    done = False

    e = 1.0 / ((i//100)+1)
    # The Q-Table learning algorithm
    while not done:
        # Choose an action by e-greedy
        if np.random.rand(1) < e:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state, :])

        # Get new state and reward from environment
        new_state, reward, done, _ = env.step(action)

        # Update Q-Table: immediate reward plus discounted max future value
        Q[state, action] = reward + dis * np.max(Q[new_state, :])

        rAll += reward
        state = new_state

    rList.append(rAll)

print("Success rate: " + str(sum(rList)/num_episodes))
print("Final Q-Table Values")
print(Q)
plt.bar(range(len(rList)), rList, color="blue")
plt.show()

Exploit VS Exploration: add random noise

Add random noise to the Q-values before taking the argmax.

Unlike E-greedy, which sometimes picks a completely random action, adding random noise tends to favor the actions with the second- or third-highest Q-values, so exploration stays biased toward promising moves.

for i in range(1000):
    # Noise shrinks as i grows, so exploration fades out over time
    a = np.argmax(Q[s, :] + np.random.randn(env.action_space.n) / (i + 1))
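
A minimal sketch (not from the lecture) of why this noise scheme favors near-best actions; the Q-values below are made-up numbers for illustration:

import numpy as np

np.random.seed(0)
q_row = np.array([0.0, 0.81, 0.9, 0.73])  # hypothetical Q-values for one state
counts = np.zeros(4, dtype=int)

for i in range(1000):
    # Noise is large at first, then shrinks, just like in the snippet above
    noisy = q_row + np.random.randn(4) / (i + 1)
    counts[np.argmax(noisy)] += 1

print(counts)  # the highest-Q action wins most often; the 2nd/3rd highest
               # get picked occasionally, the worst one rarely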

Learning Q(s,a) with discounted reward

Path 2 is more efficient than path 1.

  • When two paths yield the same reward, the agent cannot tell which one is better
  • A reward received now is worth more than the same reward received later
  • Discounted reward: multiply future rewards by gamma (between 0 and 1) to reduce their value

With discounted rewards, there is no longer any reason to take path 1.
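
A minimal sketch (not from the lecture) of how discounting makes a shorter path worth more; the path lengths and gamma here are illustrative. It follows the same update used in the code below, Q[state, action] = reward + dis * np.max(Q[new_state, :]).

gamma = 0.9             # illustrative discount factor (the code below uses 0.99)
short_path_steps = 6    # hypothetical number of steps along path 2
long_path_steps = 10    # hypothetical number of steps along path 1

# A goal reward of 1 propagated back to the start state is discounted once per step
print(gamma ** (short_path_steps - 1))  # ~0.59
print(gamma ** (long_path_steps - 1))   # ~0.39

The full FrozenLake example using random-noise exploration and this discounted update follows.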

import gym
import numpy as np
import matplotlib.pyplot as plt
from gym.envs.registration import register

# Register FrozenLake with is_slippery=False (deterministic movement)
register(
    id='FrozenLake-v3',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name': '4x4', 'is_slippery': False}
)

env = gym.make('FrozenLake-v3')

# Initialize table with all zeros
Q = np.zeros([env.observation_space.n, env.action_space.n])  # 16, 4
# Discount factor
dis = 0.99
num_episodes = 2000

# Create list to contain the total reward per episode
rList = []
for i in range(num_episodes):
    # Reset environment and get first new observation
    state = env.reset()
    rAll = 0
    done = False

    # The Q-Table learning algorithm
    while not done:
        action = np.argmax(Q[state, :] + np.random.randn(1, env.action_space.n) / (i+1))

        # Get new state and reward from environment
        new_state, reward, done, _ = env.step(action)

        # Update Q-Table: immediate reward plus discounted max future value
        Q[state, action] = reward + dis * np.max(Q[new_state, :])

        rAll += reward
        state = new_state

    rList.append(rAll)

print("Success rate: " + str(sum(rList)/num_episodes))
print("Final Q-Table Values")
print(Q)
plt.bar(range(len(rList)), rList, color="blue")
plt.show()

In this experiment, the random-noise approach performs better than E-greedy.
