Note: $d^{\pi_\theta}(s)$ is the stationary distribution over states of the Markov chain induced by the current policy.
Another way of writing the policy objective; when expanded it equals the average reward per time-step, but this form emphasizes that we optimize by adjusting the policy parameters $\theta$.
Regarding this formulation, keep in mind:
This is a very commonly used numerical method, especially when the gradient function itself is hard to obtain. Concretely, for each component $\theta_k$ of the parameter vector $\theta$, the gradient is roughly estimated with the following formula:
$u_k$ is a unit vector whose value is 1 in the $k$-th dimension and 0 elsewhere, i.e. a one-hot vector.
The finite-difference method is simple, does not require the policy to be differentiable, and applies to arbitrary policies; however, the estimates are noisy and it is inefficient most of the time. Still, it can get the job done.
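As an illustration, here is a minimal NumPy sketch of this forward-difference estimate; `evaluate_policy` is a placeholder for whatever procedure returns (an estimate of) the objective $J(\theta)$ for a given parameter vector.

import numpy as np

def finite_difference_gradient(evaluate_policy, theta, epsilon=1e-2):
    """Estimate dJ/dtheta_k component by component via finite differences."""
    grad = np.zeros_like(theta)
    j_theta = evaluate_policy(theta)                  # J(theta), e.g. an empirical return
    for k in range(len(theta)):
        u_k = np.zeros_like(theta)                    # one-hot unit vector for dimension k
        u_k[k] = 1.0
        # (J(theta + eps * u_k) - J(theta)) / eps approximates the k-th gradient component
        grad[k] = (evaluate_policy(theta + epsilon * u_k) - j_theta) / epsilon
    return grad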
Derivation of the likelihood-ratio formula: gradient ascent requires the gradient of the optimization objective, so we need to derive it.
Personal understanding:
As mentioned earlier, $\pi_\theta (a_t \vert s_t)$ is the probability distribution over actions given a state. When the actions are discrete, a softmax function (or a neural network whose last layer is a softmax) is commonly used to represent the probability of each action.
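For a discrete action space, a minimal sketch of such a softmax policy (here with simple linear features; the parameter shapes are illustrative assumptions, not the network used later):

import numpy as np

def softmax_policy(theta, state):
    """pi_theta(a|s) for a linear-softmax policy; theta has shape (state_dim, action_count)."""
    logits = state @ theta
    logits = logits - logits.max()          # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

# Usage: sample an action from the resulting distribution
# probs = softmax_policy(theta, state)
# action = np.random.choice(len(probs), p=probs)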
Skipped.
The first algorithm to learn in policy gradient methods.
Cart Pole is one of the simpler games in OpenAI's gym simulator. There is a cart with a pole standing upright on it. The cart has to move left or right to keep the pole vertical. If the pole tilts by more than 15°, the game ends. The cart is also not allowed to move outside a fixed range (2.4 units from the center to either side). As shown in the figure below:
This is an implementation issue. In supervised-learning training we have labeled data: some form of error between the correct labels and our estimates serves as the cost function and is our optimization objective; we compute its gradient and update the model parameters by gradient descent.
Here, we take the pseudo-label approach instead.
To summarize: with pseudo labels we always try to make the sampled action more likely in the future, but actions with large returns are increased more (the parameter update is scaled by R), while actions with small returns are increased less. Because the probability distribution is normalized, the probability of actions with small returns is actually decreased.
This also reveals a potential improvement: actions with small returns (e.g. below the average return) should receive a negative gradient, and actions with large returns (e.g. above the average return) a positive gradient. This is in fact the basis of the improved REINFORCE with baseline algorithm.
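A toy NumPy sketch (made-up numbers, a two-action softmax policy rather than the network used later) that makes the scaling-by-R effect concrete: the same log-probability gradient step, scaled by a larger return, shifts more probability toward the sampled action.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(2)                # logits of a two-action softmax policy
action, lr = 0, 0.5                # suppose action 0 was sampled

for R in [1.0, 10.0]:              # a small return vs. a large return
    probs = softmax(theta)
    grad_log_pi = -probs           # d log pi(action) / d theta = one_hot(action) - probs
    grad_log_pi[action] += 1.0
    new_theta = theta + lr * R * grad_log_pi     # REINFORCE-style update, scaled by R
    print(R, softmax(new_theta)[action])         # larger R -> larger increase in probability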
Reference: http://scott.fortmann-roe.com/docs/BiasVariance.html
REINFORCE uses an unbiased estimate, i.e. it is a low-bias case, so we need to reduce the variance to improve performance.
Starting point: in the basic REINFORCE algorithm, the reward applied to every step of an episode is the same, namely the episode's total reward.
Improvement: use the per-step discounted return.
The return R at time t is the sum of the rewards from time t onward, where each subsequent step's reward is multiplied by one more factor of the discount coefficient gamma.
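The training loop below relies on a `discount_rewards` helper; a minimal sketch consistent with this definition (the course's actual helper may differ in details, e.g. normalization) could look like:

import numpy as np

def discount_rewards(r, gamma=0.99):
    """Per-step discounted return: R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    discounted_r = np.zeros_like(r)
    running_add = 0.0
    for t in reversed(range(len(r))):
        running_add = r[t] + gamma * running_add
        discounted_r[t] = running_add
    return discounted_r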
As described in the potential improvement mentioned in the section on the REINFORCE update,
Choice of baseline
Using the value function as the baseline
How to estimate $V_\phi$? A straightforward approach is as follows:
m episodes, each with T steps.
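Written out (in my own notation), this amounts to fitting $V_\phi$ by least squares against the observed returns collected over the m episodes:

$$\phi \leftarrow \arg\min_{\phi} \frac{1}{m} \sum_{i=1}^{m} \sum_{t=1}^{T} \left( V_\phi(s_{i,t}) - R_{i,t} \right)^2$$

where $R_{i,t}$ is the return observed from step $t$ of episode $i$.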
The concrete algorithm
Build the policy neural network
import cntk as C
from cntk.layers import Sequential, Dense
from cntk.logging import ProgressPrinter
import numpy as np
import gym
...
state_dim = env.observation_space.shape[0] # Dimension of state space
action_count = env.action_space.n # Number of actions
hidden_size = 128 # Number of hidden units
update_frequency = 20
# The policy network maps an observation to a probability of taking action 0 or 1.
observations = C.sequence.input_variable(state_dim, np.float32, name="obs")
W1 = C.parameter(shape=(state_dim, hidden_size), init=C.glorot_uniform(), name="W1")
b1 = C.parameter(shape=hidden_size, name="b1")
layer1 = C.relu(C.times(observations, W1) + b1)
W2 = C.parameter(shape=(hidden_size, action_count), init=C.glorot_uniform(), name="W2")
b2 = C.parameter(shape=action_count, name="b2")
layer2 = C.times(layer1, W2) + b2
output = C.sigmoid(layer2, name="output")
Set up the label (the later code shows it is a pseudo label), compute the discounted reward, define the loss function ($J(\theta)$), and create the optimizer.
# Label will tell the network what action it should have taken.
label = C.sequence.input_variable(1, np.float32, name="label")
# return_weight is a scalar containing the discounted return. It will scale the PG loss.
return_weight = C.sequence.input_variable(1, np.float32, name="weight")
# PG Loss
loss = -C.reduce_mean(C.log(C.square(label - output) + 1e-4) * return_weight, axis=0, name='loss')
# Build the optimizer
lr_schedule = C.learning_rate_schedule(lr=0.1, unit=C.UnitType.sample)
m_schedule = C.momentum_schedule(0.99)
vm_schedule = C.momentum_schedule(0.999)
optimizer = C.adam([W1, W2], lr_schedule, momentum=m_schedule, variance_momentum=vm_schedule)
Use a neural network to approximate the value function; the model's output is the estimated value, which serves as the baseline. This network is called the critic.
critic_input = 128   # number of hidden units in the critic network
critic_output = 1    # dimension of the critic's output (a scalar value estimate)
critic = Sequential([
    Dense(critic_input, activation=C.relu, init=C.glorot_uniform()),
    Dense(critic_output, activation=None, init=C.glorot_uniform(scale=.01))
])(observations)

# Define the target, squared-error loss, adam optimizer, and trainer for the critic.
critic_target = C.sequence.input_variable(1, np.float32, name="target")
critic_loss = C.squared_error(critic, critic_target)
critic_lr_schedule = C.learning_rate_schedule(lr=0.1, unit=C.UnitType.sample)
critic_optimizer = C.adam(critic.parameters, critic_lr_schedule, momentum=m_schedule, variance_momentum=vm_schedule)
critic_trainer = C.Trainer(critic, (critic_loss, None), critic_optimizer)
The main body of the algorithm:
...
for episode_number in range(max_number_of_episodes):
    states, rewards, labels = [], [], []
    done = False
    observation = env.reset()
    t = 1
    while not done:
        state = np.reshape(observation, [1, state_dim]).astype(np.float32)
        states.append(state)
        # Run the policy network and get an action to take.
        prob = output.eval(arguments={observations: state})[0][0][0]
        # Sample from the bernoulli output distribution to get a discrete action
        action = 1 if np.random.uniform() < prob else 0
        # Pseudo labels to encourage the network to increase
        # the probability of the chosen action. This label will be used
        # in the loss function above.
        y = 1 if action == 0 else 0  # a "fake label"
        labels.append(y)
        # step the environment and get new measurements
        observation, reward, done, _ = env.step(action)
        reward_sum += float(reward)
        # Record reward (has to be done after we call step() to get reward for previous action)
        rewards.append(float(reward))
        stats.episode_rewards[episode_number] += reward
        stats.episode_lengths[episode_number] = t
        t += 1

    # Stack together all inputs, hidden states, action gradients, and rewards for this episode
    epx = np.vstack(states)
    epl = np.vstack(labels).astype(np.float32)
    epr = np.vstack(rewards).astype(np.float32)

    # Compute the discounted reward backwards through time.
    discounted_epr = discount_rewards(epr)

    # Train the critic to predict the discounted reward from the observations,
    # then evaluate the critic on the same states to obtain the baseline b(s).
    critic_trainer.train_minibatch({observations: epx, critic_target: discounted_epr})
    baseline = critic.eval({observations: epx})

    # Compute the baselined returns: A = R - b(s). Weight the gradients by this value.
    baselined_returns = discounted_epr - baseline

    # Keep a running estimate over the variance of the discounted rewards (in this case baselined_returns)
    for r in baselined_returns:
        running_variance.add(r[0])

    # Forward pass
    arguments = {observations: epx, label: epl, return_weight: baselined_returns}
    state, outputs_map = loss.forward(arguments, outputs=loss.outputs,
                                      keep_for_backward=loss.outputs)

    # Backward pass
    root_gradients = {v: np.ones_like(o) for v, o in outputs_map.items()}
    vargrads_map = loss.backward(state, root_gradients, variables=set([W1, W2]))

    for var, grad in vargrads_map.items():
        gradBuffer[var.name] += grad

    # Only update every 20 episodes to reduce noise
    if episode_number % update_frequency == 0:
        grads = {W1: gradBuffer['W1'].astype(np.float32),
                 W2: gradBuffer['W2'].astype(np.float32)}
        updated = optimizer.update(grads, update_frequency)

        # reset the gradBuffer
        gradBuffer = dict((var.name, np.zeros(shape=var.shape))
                          for var in loss.parameters if var.name in ['W1', 'W2', 'b1', 'b2'])

        print('Episode: %d/%d. Average reward for episode %f. Variance %f' % (episode_number, max_number_of_episodes, reward_sum / update_frequency, running_variance.get_variance()))
        sys.stdout.flush()
        reward_sum = 0

    stats.episode_running_variance[episode_number] = running_variance.get_variance()
Experiments on the Cart Pole game
Results of REINFORCE without baseline
Results of REINFORCE with baseline
The results show that after introducing a critic as the baseline, the variance drops to around 1200 and performance becomes stable: after roughly 240 episodes the agent consistently reaches the maximum reward of 200.
The REINFORCE with baseline algorithm described above can also be viewed as an actor-critic method.
In the expression above there is still one term that we do not learn but instead estimate by repeated Monte Carlo sampling: the return $R_t$. Although the estimate is unbiased as m tends to infinity, this term has fairly large variance. How can we reduce it?
The advantage function is the difference between the Q value function and the V value function: $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$.
The gradient update formula becomes
How do we estimate A? One could build a Q network and a V network to estimate Q and V separately and subtract them to obtain A, but this would increase the bias.
The Asynchronous Advantage Actor-Critic (A3C) algorithm.
Pseudocode of the algorithm
Code for the n-step Q-value estimate
# Returns an array of n-step targets, one for each timestep:
# target[t] = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + ... + \gamma^n V(s_{t+n})
# where r_t is given by the episode rewards (epr) and V(s_{t+n}) by the baselines.
def compute_n_step_targets(epr, baselines, gamma=0.999, n=15):
    """ Computes an n-step target value for each timestep. """
    n_step_targets = np.zeros_like(epr)
    for t_start in range(len(epr)):
        # bootstrap from the critic's value estimate n steps ahead (or at the last step)
        t_end = min(t_start + n, len(baselines) - 1)
        R = baselines[t_end]
        # accumulate discounted rewards backwards from t_end-1 down to t_start
        for t in range(t_end - 1, t_start - 1, -1):
            R = epr[t] + gamma * R
        n_step_targets[t_start] = R
    return n_step_targets
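In the training loop, this helper would presumably replace the plain Monte Carlo return when computing the policy-gradient weights; a sketch of the wiring, following the baselined-return pattern used earlier (my reconstruction, not the course's exact code):

# n-step targets bootstrapped from the critic's value estimates
n_step_targets = compute_n_step_targets(epr, baseline, gamma=0.999, n=15)
# Advantage-style weights for the policy gradient: A ~ target - baseline
baselined_returns = n_step_targets - baseline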
Performance of Actor-Critic with n-step Q-value estimation
We can see that it still takes about 500 episodes to consistently reach the maximum reward of 200 steps, but the variance is further reduced from over 1000 (REINFORCE with baseline) to around 70.
In policy gradient algorithms, many problems share two characteristics: 1. rewards are sparse, and 2. the target is not fixed but obtained by sampling. During learning it is easy to take an update step that is too large; in other words, the update step size is hard to choose. One idea is to add an extra constraint that limits the policy update so that the policies before and after the update do not differ too much.
TRPO constrains this directly: the KL divergence between the policies before and after the update must not exceed a given threshold. This is a constrained optimization problem. TRPO solves it approximately with the conjugate gradient method (a quasi-Newton-style approach), which requires second derivatives of the KL divergence constraint.
PPO's main improvement is to simplify TRPO's optimization: it avoids second derivatives by putting the KL divergence between the new and old policies directly into the loss function.
For optimization problems that can be solved analytically in one shot: with no constraints we set the first derivative to zero; with equality constraints we use Lagrange multipliers; with inequality constraints we need the KKT conditions.
In supervised learning, cost functions with regularization terms are common, e.g. lasso and ridge regression: on top of the original cost (such as the mean squared error in regression), an L1 or L2 penalty term ($\lambda \Vert w \Vert_1$ or $\lambda \Vert w \Vert_2^2$, where $w$ denotes the model coefficients) is added, so that we optimize the mean squared error while also satisfying certain conditions.
Similarly, when optimizing the policy objective, PPO adds an extra penalty term proportional to the KL divergence between the new and old policies.
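Concretely, the KL-penalized objective from the PPO paper (Schulman et al., 2017) takes the form

$$L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[ \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\theta_{old}}(a_t \vert s_t)} \hat{A}_t - \beta \, \mathrm{KL}\left[ \pi_{\theta_{old}}(\cdot \vert s_t),\ \pi_\theta(\cdot \vert s_t) \right] \right]$$

where the coefficient $\beta$ is adapted during training so that the KL divergence stays close to a target value.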
Pseudocode of the PPO algorithm
DPPO is the distributed version of PPO.
In DeepMind's DPPO paper, besides using the KL divergence as an extra penalty, a clipped version of the objective can also be used, which is reported to work better.
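For reference, the clipped surrogate objective is

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta) \hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right) \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\theta_{old}}(a_t \vert s_t)}$$

which removes the incentive to push the probability ratio $r_t(\theta)$ outside the interval $[1-\epsilon,\, 1+\epsilon]$.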
Reference: Microsoft course "Reinforcement Learning Explained" on edx.org