Introduction
A cost function C(X,u) designates a weighted penalty for applying control input u at state X
In a controlled dynamic system, the Value-Function V(X) represents the cost of the system between the interval [t,T] if it is currently at state X and chooses subsequent control inputs optimally
The Q-Function Q(X,u) simiarly represents the optmal cost of the system between the interval [t,T] if it is current at state X IF it elects to use control input u
If we assume an optimal controller, then
uminQ(X,u)=V(X)
Bellmans Equation
Q(XT,uT)=C(XT,uT)+V(XT+1)
Defining a quadratic cost function
CTcTc(xT,uT)=[CxxCuxCxuCuu]=[CxCu]=21[XTuT]TCT[XTuT]+[XTuT]TcT
Populating Q function
Q(XT,uT)=21[XTuT]TCT[XTuT]+[XTuT]TcT+V(XT+1)=21[XTuT]T[CxxCuxCxuCuu][XTuT]+[XTuT]T[cxcu]+V(XT+1)
Expanding Q function
Q(XT,uT)=21XTTCxxXT+21XTTCxuuT+21uTTCuxXT+21uTTCuuuT+XTTcx+uTTcu+V(XT+1)
Collecting similar terms
Q(XT,uT)=21XTTCxxXT+uTTCuxXT+21uTTCuuuT+XTTcx+uTTcu+V(XT+1)
SPECIAL CASE WHEN AT TERMINAL TIME STEP (T=N)
Assumes quadratic value function; so we can predict how value changes with X
TV(XN+1)=N=0
Finding gradient of Q function wrt. control input: See matrix derivative cheat sheet
∇uQ(XT,uT)=CuxXT+CuuuT+uTTcu
Assign to zero to find minima
CuxXT+CuuuT+cuCuuuTuT=0=−cu−CuxXT=Cuu−1(−cu−CuxXT)
Assigning to control gains
uTKTkT=KTXT+kT=−Cuu−1Cux=−Cuu−1cu
Plug in expression for control input into Q function. There is now no dependency on control input and it is equivalent to the value function.
V(XT)=21[XTKTXT+kT]T[CxxCuxCxuCuu][XTKTXT+kT]+[XTKTXT+kT]T[cxcu]
Expand linear algebra (1)
V(XT)=21[XTTCxxXT+XTTCxu(KTXT+kT)+(KTXT+kT)TCuxXT+(KTXT+kT)TCuu(KTXT+kT)]+XTTcx+(KTXT+kT)Tcu
Expand transpose expressions (2)
V(XT)=21[XTTCxxXT+XTTCxu(KTXT+kT)+(XTTKTT+kTT)CuxXT+(XTTKTT+kTT)Cuu(KTXT+kT)]+XTTcx+(XTTKTT+kTT)cu
Multiply out (3)
V(XT)=21[XTTCxxXT+XTTCxuKTXT+XTTCxukT+XTTKTTCuxXT+kTTCuxXT+XTTKTTCuuKTXT+XTTKTTCuukT+kTTCuuKTXT+kTTCuukT]+XTTcx+XTTKTTcu+kTTcu
Group (4)
V(XT)=XTTCxukT+XTTKTTCuukT+21[XTTCxxXT+XTTCxuKTXT+XTTKTTCuxXT+XTTKTTCuuKTXT+kTTCuukT]+XTTcx+XTTKTTcu+kTTcu
Simplify
V(XT)VTvTcont=21XTTVTXT+XTTvT=Cxx+CxuKT+KTTCux+KTTCuuKT=CxukT+KTTCuukT+KTTcu+cx=kTTCuukT+kTTcu
Using the dynamics of the system
Linear
XT=FT−1[XT−1uT−1]+fT−1
Plugging into value expression
V(XT)=21(FT−1[XT−1uT−1]+fT−1)TVT(FT−1[XT−1uT−1]+fT−1)+(FT−1[XT−1uT−1]+fT−1)TvT
Expanding
V(XT)=21[[XT−1uT−1]TFT−1TVTFT−1[XT−1uT−1]+[XT−1uT−1]TFT−1TVTfT−1+fT−1TVTFT−1[XT−1uT−1]+fT−1TVTfT−1]+(FT[XT−1uT−1]+fT−1)TvT
Grouping
V(XT)=[XT−1uT−1]TFT−1TVTfT−1+21[XT−1uT−1]TFT−1TVTFT−1[XT−1uT−1]+([XT−1uT−1]TFT−1T+fT−1T)vT
Loop
Looping backeards through T=[N:−1:0]
Q(XT−1,uT−1)=21[XT−1uT−1]TCT−1[XT−1uT−1]+[XT−1uT−1]TcT−1+V(XT)=21[XT−1uT−1]TCT−1[XT−1uT−1]+[XT−1uT−1]TcT−1+[XT−1uT−1]TFT−1TVTfT−1+21[XT−1uT−1]TFT−1TVTFT−1[XT−1uT−1]+([XT−1uT−1]TFT−1T+fT−1T)vT=21[XT−1uT−1]T(CT−1+FT−1TVTFT−1)[XT−1uT−1]+[XT−1uT−1]T(cT−1+FT−1TVTfT−1+FT−1TvT)+fT−1TvT
Collecting terms
Q(XT−1,uT−1)QT−1qT−1=21[XT−1uT−1]TQT−1[XT−1uT−1]+[XT−1uT−1]TqT−1=CT−1+FT−1TVTFT−1=[Q(xT−1,xT−1)Q(uT−1,xT−1)Q(xT−1,uT−1)Q(uT−1,uT−1)]=cT−1+FT−1TVTfT−1+FT−1TvT=[q(xT−1)q(uT−1)]
Using same re-arranging as previously, finding gradient of Q function wrt. control input. This assumes that FT−1 is constant with X and u
uT−1KT−1kT−1=KT−1XT−1+kT−1=−Q(uT−1,uT−1)−1Q(uT−1,xT−1)=−Q(uT−1,uT−1)−1q(uT−1)
Plug in expression for control input into Q function. Since there is no dependency on control input, this is equivalent to a value function
V(XT−1)=21[XT−1KT−1XT−1+kT−1]T[Q(xT−1,xT−1)Q(uT−1,xT−1)Q(xT−1,uT−1)Q(uT−1,uT−1)][XT−1KT−1XT−1+kT−1]+[XT−1KT−1XT−1+kT−1]T[q(xT−1)q(uT−1)]
V(XT−1)VT−1vT−1cont=21XT−1TVT−1XT−1+XT−1TvT−1+V(XT)=Q(xT−1,xT−1)+Q(xT−1,uT−1)KT−1+KT−1TQ(uT−1,xT−1)+KT−1TQ(uT−1,uT−1)KT−1=Q(xT−1,uT−1)kT−1+KT−1TQ(uT−1,uT−1)kT−1+KT−1Tq(uT−1)+q(xT−1)=kT−1TCuukT−1+kT−1Tcu