PPO (Proximal Policy Optimization) is the reinforcement learning technique used by OpenAI and others.
As far as I can tell, it is a method in which updates to the model / nn are made using a ratio r(policy) between the current and previous policies. This I take to mean the outputs of the model / nn for a given state before and after being updated.
The policy for any given state is the output of the model at time (t): a probability distribution over the actions (rather than a Q value array). The ratio is taken between the probabilities the current and previous policies assign to the action that was actually taken, so it is a scalar per sample rather than a ratio of two arrays.
This ratio is then restricted / bounded to within limits, making the updated policy remain close / proximal to the existing policy.
Then it is multiplied by the Advantage, a measure of how much better the chosen action is in the current state compared with the average action there (i.e. compared with the state's value).
ratio(policy) = policy net (state) -> prob(oldAct) / previous policy net (state) -> prob(oldAct)
objective = ratio(policy) * Advantage previous(state, oldAct)
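As a minimal sketch of that ratio in code, assuming a tiny made-up policy network that outputs action probabilities via a softmax (the state size, action count and chosen action below are arbitrary):

    import torch

    torch.manual_seed(0)

    # hypothetical tiny policy net: 4-dim state in, probabilities over 2 actions out
    policy_net = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.Softmax(dim=-1))
    previous_policy_net = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.Softmax(dim=-1))

    state = torch.randn(4)
    old_act = 1  # the action that was actually taken under the previous policy

    # ratio of the probabilities the current and previous policies assign to that action
    prob_new = policy_net(state)[old_act]
    prob_old = previous_policy_net(state)[old_act].detach()  # treated as a constant
    ratio = prob_new / prob_old
    print(ratio.item())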
Advantage(s,a) = Q(s,a) - V(s)
              ~ reward + gamma * V(next state) - V(current state)   (one-step Bellman / TD estimate)
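A concrete (made-up) one-step estimate of that advantage, using a reward and a critic's value estimates in place of full Q values:

    # one-step TD estimate of the advantage: A(s,a) ~ r + gamma * V(s') - V(s)
    gamma = 0.99            # discount factor (assumed)
    reward = 1.0            # reward received for the action (made-up number)
    value_state = 2.0       # critic's estimate V(current state) (made-up)
    value_next_state = 2.5  # critic's estimate V(next state) (made-up)

    advantage = reward + gamma * value_next_state - value_state
    print(advantage)  # 1.475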
PPO:
r(policy) = probability ratio
r(policy) = policy net (state) -> prob(oldAct) / previous policy net (state) -> prob(oldAct)
r_clip(policy) = clip(r(policy), 1-e, 1+e)
Loss/Err = min( r(policy) * Advantage previous(state,oldAct), r_clip(policy) * Advantage previous(state,oldAct) )   (negated when minimized as a loss)
PPO restricts/clips the policy ratio to within the 1-e, 1+e limits.
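Putting the pieces together, a minimal sketch of the clipped loss with made-up ratios and advantages (e = 0.2 is the clip range used in the PPO paper; the minus sign appears because the objective is maximized):

    import torch

    epsilon = 0.2  # clip range e

    # made-up per-sample probability ratios and advantages
    ratio = torch.tensor([0.7, 1.0, 1.5])
    advantage = torch.tensor([1.0, -0.5, 2.0])

    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage

    # PPO maximizes the clipped objective, so the loss to minimize is its negative mean
    loss = -torch.min(unclipped, clipped).mean()
    print(loss.item())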
PPO is a descendant of TRPO (Trust Region Policy Optimization).
In TRPO a trust region around the previous policy is defined using the Kullback–Leibler divergence, as follows:
TRPO:
r(policy) = policy net (state) -> prob(oldAct) / previous policy net (state) -> prob(oldAct)
if DKL(previous policy(state) || policy(state)) <= delta
Loss/Err = r(policy)*Advantage previous(state,oldAct)
(the policy is the action distribution output by the nn for the current state)
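A sketch of that KL check for a discrete action distribution, with assumed action probabilities and an assumed trust-region size delta:

    import torch
    from torch.distributions import Categorical, kl_divergence

    previous_policy = Categorical(probs=torch.tensor([0.6, 0.3, 0.1]))   # assumed old action probabilities
    current_policy = Categorical(probs=torch.tensor([0.5, 0.35, 0.15]))  # assumed new action probabilities
    delta = 0.01                                                         # assumed trust-region size

    dkl = kl_divergence(previous_policy, current_policy)  # D_KL(old || new)
    # accept the update only if the KL stays inside the trust region
    print(dkl.item(), bool(dkl <= delta))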