PPO (Proximal Policy Optimization) is the reinforcement learning technique used by OpenAI and others.
As far as I can tell, it is a method in which updates to the model / nn are made using a ratio r(policy) between the current and previous policies. This I take to mean the outputs of the model / nn for a given state before and after being updated.
The policy for any given state is the output of the model at time (t): a probability distribution over the actions (rather than a Q value array). The ratio is taken between the probabilities the current and previous policies assign to the action that was actually taken, so it is a scalar per sample rather than a ratio of two arrays.
This ratio is then restricted / bounded to within limits, making the updated policy remain close / proximal to the existing policy.
Then it is multiplied by the Advantage, a measure of how much better the chosen action is in the current state compared with the average action there (i.e. compared with the state's value).
ratio(policy) = policy net (state) -> prob(oldAct) / previous policy net (state) -> prob(oldAct)
objective = ratio(policy) * Advantage previous(state, oldAct)
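As a minimal sketch of that ratio in code, assuming a tiny made-up policy network that outputs action probabilities via a softmax (the state size, action count and chosen action below are arbitrary):

    import torch

    torch.manual_seed(0)

    # hypothetical tiny policy net: 4-dim state in, probabilities over 2 actions out
    policy_net = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.Softmax(dim=-1))
    previous_policy_net = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.Softmax(dim=-1))

    state = torch.randn(4)
    old_act = 1  # the action that was actually taken under the previous policy

    # ratio of the probabilities the current and previous policies assign to that action
    prob_new = policy_net(state)[old_act]
    prob_old = previous_policy_net(state)[old_act].detach()  # treated as a constant
    ratio = prob_new / prob_old
    print(ratio.item())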
Advantage(s,a) = Q(s,a) - V(s)
              ~ reward + gamma * V(next state) - V(current state)   (one-step Bellman / TD estimate)
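A concrete (made-up) one-step estimate of that advantage, using a reward and a critic's value estimates in place of full Q values:

    # one-step TD estimate of the advantage: A(s,a) ~ r + gamma * V(s') - V(s)
    gamma = 0.99            # discount factor (assumed)
    reward = 1.0            # reward received for the action (made-up number)
    value_state = 2.0       # critic's estimate V(current state) (made-up)
    value_next_state = 2.5  # critic's estimate V(next state) (made-up)

    advantage = reward + gamma * value_next_state - value_state
    print(advantage)  # 1.475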
PPO:
r(policy) = probability ratio
r(policy) = policy net (state) -> prob(oldAct) / previous policy net (state) -> prob(oldAct)
r_clip(policy) = clip(r(policy), 1-e, 1+e)
Loss/Err = min( r(policy) * Advantage previous(state,oldAct), r_clip(policy) * Advantage previous(state,oldAct) )   (negated when minimized as a loss)
PPO restricts/clips the policy ratio to within the 1-e, 1+e limits.
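Putting the pieces together, a minimal sketch of the clipped loss with made-up ratios and advantages (e = 0.2 is the clip range used in the PPO paper; the minus sign appears because the objective is maximized):

    import torch

    epsilon = 0.2  # clip range e

    # made-up per-sample probability ratios and advantages
    ratio = torch.tensor([0.7, 1.0, 1.5])
    advantage = torch.tensor([1.0, -0.5, 2.0])

    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage

    # PPO maximizes the clipped objective, so the loss to minimize is its negative mean
    loss = -torch.min(unclipped, clipped).mean()
    print(loss.item())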
PPO is a descendant of TRPO (Trust Region Policy Optimization).
In TRPO a trust region around the previous policy is defined using the Kullback–Leibler divergence, as follows:
TRPO:
r(policy) = policy net (state) -> prob(oldAct) / previous policy net (state) -> prob(oldAct)
if DKL(previous policy(state) || policy(state)) <= delta
Loss/Err = r(policy)*Advantage previous(state,oldAct)
(the policy is the action distribution output by the nn for the current state)
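A sketch of that KL check for a discrete action distribution, with assumed action probabilities and an assumed trust-region size delta:

    import torch
    from torch.distributions import Categorical, kl_divergence

    previous_policy = Categorical(probs=torch.tensor([0.6, 0.3, 0.1]))   # assumed old action probabilities
    current_policy = Categorical(probs=torch.tensor([0.5, 0.35, 0.15]))  # assumed new action probabilities
    delta = 0.01                                                         # assumed trust-region size

    dkl = kl_divergence(previous_policy, current_policy)  # D_KL(old || new)
    # accept the update only if the KL stays inside the trust region
    print(dkl.item(), bool(dkl <= delta))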