Playing Blackjack with Reinforcement Learning

2023-10-03 16:39 | Source: compiled from the web

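Example from Chapter 5 of Sutton's "Reinforcement Learning: An Introduction" (2nd edition). ref: https://github.com/ShangtongZhang/reinforcement-learning-an-introduction/edit/master/chapter05/blackjack.py This article walks through and annotates that example code.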

Rules

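Blackjack is a classic casino game in which the goal is to make the sum of your cards as large as possible without exceeding 21. Card values: an ace counts as either 1 or 11; J, Q and K each count as 10; there are no jokers. Many rule variants exist; the book specifies the following: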
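The card values above can be sketched as a small helper. This is only an illustration under stated assumptions: the name `hand_value` and the convention of encoding an ace as 1 are hypothetical and are not taken from the referenced repository.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a list of card values in 1..10 (1 = ace)."""
    total = sum(cards)  # count every ace as 1 first
    # Promote one ace to 11 only if that does not push the total past 21.
    usable_ace = 1 in cards and total + 10 <= 21
    if usable_ace:
        total += 10
    return total, usable_ace


print(hand_value([1, 10]))    # ace + 10-card
print(hand_value([1, 6, 9]))  # here the ace has to count as 1
```

At most one ace can ever count as 11, since two aces at 11 would already total 22.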

The object of the popular casino card game of blackjack is to obtain cards the sum of whose numerical values is as great as possible without exceeding 21. All face cards count as 10, and an ace can count either as 1 or as 11. We consider the version in which each player competes independently against the dealer. The game begins with two cards dealt to both dealer and player. One of the dealer’s cards is face up and the other is face down. If the player has 21 immediately (an ace and a 10-card), it is called a natural. He then wins unless the dealer also has a natural, in which case the game is a draw. If the player does not have a natural, then he can request additional cards, one by one (hits), until he either stops (sticks) or exceeds 21 (goes bust ). If he goes bust, he loses; if he sticks, then it becomes the dealer’s turn. The dealer hits or sticks according to a fixed strategy without choice: he sticks on any sum of 17 or greater, and hits otherwise. If the dealer goes bust, then the player wins; otherwise, the outcome—win, lose, or draw—is determined by whose final sum is closer to 21.
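The dealer's fixed strategy and the final comparison described above might be sketched as follows. The names `dealer_play`, `outcome` and `hand_value` are hypothetical, the +1 / 0 / -1 result encoding is an assumption of this sketch, and naturals get no special handling here.

```python
def hand_value(cards):
    """Best total for card values in 1..10 (1 = ace): one ace counts as 11 if it fits."""
    total = sum(cards)
    return total + 10 if 1 in cards and total + 10 <= 21 else total


def dealer_play(cards, draw_card):
    """Hit while the sum is below 17, stick on 17 or greater; return the final sum."""
    cards = list(cards)
    while hand_value(cards) < 17:
        cards.append(draw_card())
    return hand_value(cards)


def outcome(player_sum, dealer_sum):
    """+1 if the player wins, -1 if the player loses, 0 for a draw."""
    if player_sum > 21:   # the player busts first and loses immediately
        return -1
    if dealer_sum > 21:   # the dealer busts
        return 1
    return (player_sum > dealer_sum) - (player_sum < dealer_sum)


# A scripted "deck" keeps the example deterministic.
print(dealer_play([10, 6], lambda: 5))   # the dealer hits 16 and draws a 5
print(dealer_play([1, 6], lambda: 10))   # ace + 6 counts as 17, so the dealer sticks
print(outcome(20, dealer_play([10, 6], lambda: 10)))
```

Passing the card source in as `draw_card` is a design choice of the sketch: it lets the same function be driven by a random deck or by a fixed sequence for testing.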

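First come the definition of the blackjack game logic and a few preset policies. The player has only two actions, hit and stand (the book calls the latter "stick"), and needs to consider only the following three things when deciding: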

usable_ace: whether the player holds an ace that can be counted as 11 without busting.
The sum of the player's cards (12-21). Sums of 0-11 need not be considered: no card drawn can bust the hand, so the player always hits.
The dealer's face-up card (1, ..., 10; J, Q and K all count as 10).

That gives 2*10*10 = 200 states in total, so the policy table and the value table are both of shape 2*10*10.

    import numpy as np
    import matplotlib
    matplotlib.use('Agg')
    import matplotlib.pyplot as plt
    import seaborn as sns
    from tqdm import tqdm

    # actions: hit or stand
    ACTION_HIT = 0
    ACTION_STAND = 1  # "stick" in the book
    ACTIONS = [ACTION_HIT, ACTION_STAND]

    # policy for player
    # This is the policy the book specifies to demonstrate MC policy evaluation.
    # It is very simple: stand only when the player's sum is 20 or 21,
    # hit otherwise, and ignore everything else.
    # (np.int was removed in NumPy 1.24, so the builtin int is used here.)
    POLICY_PLAYER = np.zeros(22, dtype=int)
    for i in range(12, 20):
        POLICY_PLAYER[i] = ACTION_HIT
    POLICY_PLAYER[20] = ACTION_STAND
    POLICY_PLAYER[21] = ACTION_STAND

    ########### these two policies are prepared for the off-policy algorithm
    # function form of target policy of player
    def target_policy_player(usable_ace_player, player_sum, dealer_card):
        return POLICY_PLAYER[player_sum]

    # function form of behavior policy of player
    def behavior_policy_player(usable_ace_player, player_sum, dealer_card):
        if np.random.binomial(1, 0.5) == 1:
            return ACTION_STAND
        return ACTION_HIT
    ###########

    # policy for dealer
    # The dealer's policy is fixed by the rules: keep hitting until the sum is >= 17
    POLICY_DEALER = np.zeros(22)
    for i in range(12, 17):
        POLICY_DEALER[i] = ACTION_HIT
    for i in range(17, 22):
        POLICY_DEALER[i] = ACTION_STAND

    # get a new card
    def get_card():
        card = np.random.randint(1, 14)  # [1, 14)
        card = min(card, 10)  # J, Q and K all count as 10
        return card

    # get the value of a card (11 for ace).
    def card_value(card_id):
        return 11 if card_id == 1 else card_id

    # play a game: the core environment-interaction function. It returns
    # (state, reward, player_trajectory), where
    #   state = [usable_ace_player, player_sum, dealer_card1]
    #   reward is +1, -1 or 0
    #   player_trajectory is a sequence of
    #   [(usable_ace_player, player_sum, dealer_card1), action]
    # @policy_player: specify policy for player
    # @initial_state: [whether player has a usable Ace, sum of player's cards, one card of dealer]
    # @initial_action: the initial action
    def play(policy_player, initial_state=None, initial_action=None):
        # player status

        # sum of player
        player_sum = 0

        # trajectory of player
        player_trajectory = []

        # whether player uses Ace as 11
        usable_ace_player = False

        # dealer status
        dealer_card1 = 0
        dealer_card2 = 0
        usable_ace_dealer = False

        if initial_state is None:
            # generate a random initial state
            while player_sum ...

[The listing is cut off here in the source. It resumes partway through the dealer's turn:]

            while dealer_sum > 21 and ace_count:
                dealer_sum -= 10
                ace_count -= 1
            # dealer busts
            if dealer_sum > 21:
                return state, 1, player_trajectory
            usable_ace_dealer = (ace_count == 1)

        # compare the sum between player and dealer
        assert player_sum ...

[The listing ends here in the source; the complete code is in the repository linked above.]
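Because the `play` listing above is incomplete in this copy, the following is a self-contained sketch of how first-visit MC policy evaluation could fill the 2*10*10 value table for the "stand only on 20 or 21" policy. It is not the repository's code: `add_card`, `simple_episode` and `mc_evaluate` are hypothetical names, cards are drawn with replacement, naturals get no special handling, and the seed and episode count are arbitrary.

```python
import numpy as np

HIT, STAND = 0, 1
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

policy = np.zeros(22, dtype=int)  # indexed by the player's sum
policy[20:] = STAND               # stand on 20 and 21, hit otherwise


def draw():
    # ranks 1..13 are equally likely; J, Q and K count as 10
    return min(int(rng.integers(1, 14)), 10)


def add_card(total, usable_ace, card):
    """Add one card to a hand described by (total, usable_ace)."""
    if card == 1 and total + 11 <= 21:
        return total + 11, True       # the new ace can count as 11
    total += card
    if total > 21 and usable_ace:
        total -= 10                   # demote the ace from 11 to 1
        usable_ace = False
    return total, usable_ace


def simple_episode(policy):
    """Play one hand; return (visited states, final reward)."""
    player, ace = 0, False
    while player < 12:                # below 12 the player always hits
        player, ace = add_card(player, ace, draw())
    dealer_up = draw()
    dealer, dealer_ace = add_card(0, False, dealer_up)
    dealer, dealer_ace = add_card(dealer, dealer_ace, draw())

    states = []
    while True:                       # player's turn
        states.append((int(ace), player, dealer_up))
        if policy[player] == STAND:
            break
        player, ace = add_card(player, ace, draw())
        if player > 21:
            return states, -1
    while dealer < 17:                # dealer's turn
        dealer, dealer_ace = add_card(dealer, dealer_ace, draw())
    if dealer > 21:
        return states, 1
    return states, int(player > dealer) - int(player < dealer)


def mc_evaluate(policy, episodes):
    returns = np.zeros((2, 10, 10))   # [usable ace, player sum - 12, dealer card - 1]
    counts = np.zeros((2, 10, 10))
    for _ in range(episodes):
        states, reward = simple_episode(policy)
        # No discounting and a single terminal reward, so the return that
        # follows every visited state equals the final reward.
        for ace, player, dealer_up in states:
            returns[ace, player - 12, dealer_up - 1] += reward
            counts[ace, player - 12, dealer_up - 1] += 1
    return returns / np.maximum(counts, 1)  # unvisited states stay at 0


values = mc_evaluate(policy, 10_000)
print(values.shape)
print(values[0, 20 - 12, 10 - 1])  # no usable ace, player sum 20, dealer shows 10
```

A state cannot recur within one blackjack hand, so first-visit and every-visit averaging coincide in this sketch.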
