Bellman optimality equation for Q
Q(s, a): the expected return from starting in state s and taking action a at time t.
r(s, a): the reward received for taking action a in state s.
max_{a'} Q(s', a'): the maximum expected return over actions a' in the next state s'. Evaluating it requires finding the action a' that maximizes Q(s', a').
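
The three terms defined above are usually combined as Q(s, a) = r(s, a) + gamma * max_{a'} Q(s', a'). The sketch below applies that backup repeatedly on a tiny made-up problem to show how the max over a' is used. Several things here are assumptions and not part of the notes: the discount factor gamma (0.9), deterministic transitions (each (s, a) leads to exactly one s'), the two-state tables NEXT_STATE and REWARD, and the function names.

```python
# Minimal sketch of the Bellman optimality backup for Q.
# Assumptions (not from the notes): discount factor GAMMA, deterministic
# transitions, and a hypothetical 2-state, 2-action toy problem.

GAMMA = 0.9  # assumed discount factor

# NEXT_STATE[s][a] is the state s' reached by taking action a in state s.
NEXT_STATE = [[0, 1], [0, 1]]
# REWARD[s][a] is r(s, a).
REWARD = [[0.0, 1.0], [0.0, 2.0]]


def bellman_backup(q, s, a, gamma=GAMMA):
    """Return r(s, a) + gamma * max over a' of Q(s', a')."""
    s_next = NEXT_STATE[s][a]
    return REWARD[s][a] + gamma * max(q[s_next])


def q_value_iteration(n_sweeps=500, gamma=GAMMA):
    """Apply the backup to every (s, a) pair for a fixed number of sweeps."""
    q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n_sweeps):
        q = [[bellman_backup(q, s, a, gamma) for a in range(2)]
             for s in range(2)]
    return q


def greedy_action(q, s):
    """Return the action a that maximizes Q(s, a)."""
    return max(range(len(q[s])), key=lambda a: q[s][a])


if __name__ == "__main__":
    q = q_value_iteration()
    for s in range(2):
        print(f"state {s}: Q = {q[s]}, greedy action = {greedy_action(q, s)}")
```

Once the values stop changing, each Q(s, a) equals its own backup, which is what the equation states. The greedy action in a state is the a' that the max picks out.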