Lecture 4: Q-learning (table) exploit&exploration and discounted reward by Sung Kim

link: https://www.youtube.com/watch?v=MQ-3QScrFSI&list=PLlMkM4tgfjnKsCWav-Z2F-MMFRx-2gMGG&index=6

  • Problem with basic Q-learning: the agent only follows paths it has already experienced
  • Exploit VS Exploration
    • Exploit: use the Q-values the agent already knows
    • Exploration: take a risk and try a new path

Exploit VS Exploration: E-greedy

With probability e, the agent ignores the Q-table and takes a random action to explore a new path.

e = 0.1
if np.random.rand() < e:        # with 10% probability, explore: take a random action
    a = env.action_space.sample()
else:                           # with 90% probability, exploit: take the best-known action
    a = np.argmax(Q[s, :])

Exploit VS Exploration: decaying E-greedy

Act mostly at random early on, then follow known paths more often as training progresses.

for i in range(1000):
    e = 0.1 / (i + 1)           # e shrinks as episodes go on
    if np.random.rand() < e:    # explore with decaying probability
        a = env.action_space.sample()
    else:
        a = np.argmax(Q[s, :])  # otherwise exploit the current Q-table
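
The full FrozenLake example below applies this idea; instead of decaying e every step, it decays it every 100 episodes with e = 1.0 / ((i//100) + 1).
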
import gym
import numpy as np
import matplotlib.pyplot as plt
from gym.envs.registration import register

# Register FrozenLake with is_slippery=False (deterministic movement)
register(
    id='FrozenLake-v3',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name': '4x4', 'is_slippery': False}
)

env = gym.make('FrozenLake-v3')

"""E-greedy"""
# Initialize table with all zeros
Q = np.zeros([env.observation_space.n, env.action_space.n])  # 16, 4
# Discount factor
dis = 0.99
num_episodes = 2000

# Create list to contain the total reward per episode
rList = []
for i in range(num_episodes):
    # Reset environment and get first new observation
    state = env.reset()
    rAll = 0
    done = False

    e = 1.0 / ((i//100)+1)
    # The Q-Table learning algorithm
    while not done:
        # Choose an action by e-greedy
        if np.random.rand(1) < e:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state, :])

        # Get new state and reward from environment
        new_state, reward, done, _ = env.step(action)

        # Update Q-Table: immediate reward plus discounted max future value
        Q[state, action] = reward + dis * np.max(Q[new_state, :])

        rAll += reward
        state = new_state

    rList.append(rAll)

print("Success rate: " + str(sum(rList)/num_episodes))
print("Final Q-Table Values")
print(Q)
plt.bar(range(len(rList)), rList, color="blue")
plt.show()

Exploit VS Exploration: add random noise

Add random noise to the Q-values before taking the argmax.

Unlike E-greedy, which sometimes picks a completely random action, adding random noise tends to favor the actions with the second- or third-highest Q-values, so exploration stays biased toward promising moves.

for i in range(1000):
    # Noise shrinks as i grows, so exploration fades out over time
    a = np.argmax(Q[s, :] + np.random.randn(env.action_space.n) / (i + 1))
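
A minimal sketch (not from the lecture) of why this noise scheme favors near-best actions; the Q-values below are made-up numbers for illustration:

import numpy as np

np.random.seed(0)
q_row = np.array([0.0, 0.81, 0.9, 0.73])  # hypothetical Q-values for one state
counts = np.zeros(4, dtype=int)

for i in range(1000):
    # Noise is large at first, then shrinks, just like in the snippet above
    noisy = q_row + np.random.randn(4) / (i + 1)
    counts[np.argmax(noisy)] += 1

print(counts)  # the highest-Q action wins most often; the 2nd/3rd highest
               # get picked occasionally, the worst one rarely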

Learning Q(s,a) with discounted reward

Path 2 is more efficient than path 1.

  • When two paths yield the same reward, the agent cannot tell which one is better
  • A reward received now is worth more than the same reward received later
  • Discounted reward: multiply future rewards by gamma (between 0 and 1) to reduce their value

With discounted rewards, there is no longer any reason to take path 1.
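
A minimal sketch (not from the lecture) of how discounting makes a shorter path worth more; the path lengths and gamma here are illustrative. It follows the same update used in the code below, Q[state, action] = reward + dis * np.max(Q[new_state, :]).

gamma = 0.9             # illustrative discount factor (the code below uses 0.99)
short_path_steps = 6    # hypothetical number of steps along path 2
long_path_steps = 10    # hypothetical number of steps along path 1

# A goal reward of 1 propagated back to the start state is discounted once per step
print(gamma ** (short_path_steps - 1))  # ~0.59
print(gamma ** (long_path_steps - 1))   # ~0.39

The full FrozenLake example using random-noise exploration and this discounted update follows.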

import gym
import numpy as np
import matplotlib.pyplot as plt
from gym.envs.registration import register

# Register FrozenLake with is_slippery=False (deterministic movement)
register(
    id='FrozenLake-v3',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name': '4x4', 'is_slippery': False}
)

env = gym.make('FrozenLake-v3')

# Initialize table with all zeros
Q = np.zeros([env.observation_space.n, env.action_space.n])  # 16, 4
# Discount factor
dis = 0.99
num_episodes = 2000

# Create list to contain the total reward per episode
rList = []
for i in range(num_episodes):
    # Reset environment and get first new observation
    state = env.reset()
    rAll = 0
    done = False

    # The Q-Table learning algorithm
    while not done:
        action = np.argmax(Q[state, :] + np.random.randn(1, env.action_space.n) / (i+1))

        # Get new state and reward from environment
        new_state, reward, done, _ = env.step(action)

        # Update Q-Table: immediate reward plus discounted max future value
        Q[state, action] = reward + dis * np.max(Q[new_state, :])

        rAll += reward
        state = new_state

    rList.append(rAll)

print("Success rate: " + str(sum(rList)/num_episodes))
print("Final Q-Table Values")
print(Q)
plt.bar(range(len(rList)), rList, color="blue")
plt.show()

In this experiment, the random-noise approach performs better than E-greedy.
