2010 Exam


Question 1


Fully Observable/Partially Observable


The agent's sensors do not give it access to the complete state of the environment.


- List of actions
- Partially ordered plan

Question 2


P(A|B) = P(AB) / P(B)
P(B|A) = P(AB) / P(A)

P(A|B) * P(B) = P(B|A) * P(A)

P(A|B) = [ P(B|A) * P(A) ] / P(B)
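The derivation can be checked numerically. A tiny sketch, with made-up probabilities (the values of P(A), P(B|A), and P(B) below are assumed, not from the exam):

```python
# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
# All numbers are made-up illustrative values.
p_a = 0.3          # prior P(A)
p_b_given_a = 0.8  # likelihood P(B|A)
p_b = 0.5          # evidence P(B)

p_a_given_b = p_b_given_a * p_a / p_b  # = 0.48

# Consistency check of the middle step: P(A|B) * P(B) == P(B|A) * P(A)
assert abs(p_a_given_b * p_b - p_b_given_a * p_a) < 1e-12
```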


A Kalman filter is most appropriate because the sensor model is Gaussian.

Alternate answer: A discrete table of estimates is most appropriate because there is a small set of easily enumerated states over which we need to express probability densities.


A probability distribution over the discrete set of densities: 0, 2, 4, and 6


Initial state is uniform across all densities (0.25 for each density).


New probability distribution after multiplying by the sensor model and renormalizing.


Guess is 0 (the density with the highest probability in the distribution)


(g), (h): repeat steps (e) and (f)


This doesn't necessarily indicate a bug, because particle filters resample with replacement. Since samples are replaced, it's quite possible that the same sample could be chosen twice.
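The resampling step can be sketched in a couple of lines; the particle states and weights below are assumed values, not from the exam:

```python
import random

# Sampling with replacement: the same particle can legitimately be drawn
# more than once, so duplicates are expected rather than a bug.
particles = [2, 3, 4]          # assumed particle states (depths)
weights = [0.42, 0.34, 0.24]   # assumed normalized weights

new_particles = random.choices(particles, weights=weights, k=len(particles))
```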


The sensor model is $3d + N(3,100)$; since adding a constant to a Gaussian just shifts its mean, this equates to $N(3d+3, 100)$.

To calculate the weight of a particle with depth $d$, evaluate the density of $N(3d + 3, 100)$ at the sensor reading (here, 3).

For d=2:
$N(3, 9, 100) = 0.0333225$ (can easily calculate with wolframalpha)

For d=3:
$N(3,12,100) = 0.026609$

for d=4:
$N(3,15,100) = 0.0194186$


Simply normalize the weights: sum all the weights and divide each weight by the sum. The probabilities are:

d=2 -> 42%
d=3 -> 34%
d=4 -> 24%
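The weights above can be reproduced with a few lines of Python instead of Wolfram Alpha; normalizing them gives roughly 42/34/24 percent:

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of N(mean, variance) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

sensor_reading = 3
variance = 100
depths = [2, 3, 4]

# Weight each particle by the sensor model N(3d + 3, 100) at the reading.
weights = [gaussian_pdf(sensor_reading, 3 * d + 3, variance) for d in depths]
total = sum(weights)
probs = [w / total for w in weights]
# weights ~ [0.0333, 0.0266, 0.0194]; probs ~ [0.42, 0.34, 0.24]
```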

Question 3


It won't work: you can't predict the next location without velocity, so both need to be in the same filter. (?)

Alternate answer: As position and velocity are in fact mutually dependent, it will result in a significantly less robust filter. Events such as the ball hitting the ground will confuse the poor thing.

We would need to provide a very high variance for variables that can change suddenly, such as vertical velocity.
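A minimal sketch of why the two variables belong in one filter: the prediction step uses velocity to update position, so they must live in the same state vector. The timestep and state values are made-up numbers, and the covariance update is omitted:

```python
# Joint [position, velocity] state for the prediction step.
# dt and the initial state are assumed illustrative values.
dt = 0.1
state = [2.0, 1.5]  # [position, velocity]

def predict(state, dt):
    """Constant-velocity motion model: position advances by velocity * dt."""
    position, velocity = state
    return [position + velocity * dt, velocity]

state = predict(state, dt)  # position moves from 2.0 to 2.15 using velocity
```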

Question 5


With some probability choose the goal, otherwise choose a random point, and grow the tree toward whichever one has been chosen.
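One growth step of a goal-biased RRT can be sketched as follows; the bias probability, step size, workspace bounds, and goal are all assumed values:

```python
import random

GOAL_BIAS = 0.1   # assumed probability of aiming at the goal
STEP = 0.5        # assumed maximum extension per step
goal = (9.0, 9.0)

def sample_target():
    # Sometimes grow straight at the goal, otherwise at a random point.
    if random.random() < GOAL_BIAS:
        return goal
    return (random.uniform(0, 10), random.uniform(0, 10))

def extend(tree, target):
    # Grow the nearest node one STEP toward the chosen target.
    nearest = min(tree, key=lambda n: (n[0] - target[0]) ** 2 + (n[1] - target[1]) ** 2)
    dx, dy = target[0] - nearest[0], target[1] - nearest[1]
    dist = (dx * dx + dy * dy) ** 0.5
    if dist < 1e-9:
        return
    scale = min(1.0, STEP / dist)
    tree.append((nearest[0] + dx * scale, nearest[1] + dy * scale))

tree = [(0.0, 0.0)]
extend(tree, sample_target())
```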


You cannot guarantee finding a global minimum with a finite number of samples. See the (rough) diagram.


I would use Downhill Simplex in this case. Downhill Simplex is very robust and would operate faster in this situation. In addition, Conjugate Gradient is not suitable for stochastic environments.


A conjugate direction is one along which you can minimise without undoing the minimisation already performed along the initial line. It is defined more precisely as $d_1^T A d_2 = 0$ for direction vectors $d_1,~d_2$ with respect to some matrix $A$.
See: http://www.pha.jhu.edu/~neufeld/numerical/lecturenotes8.pdf
and [https://wiki.ece.cmu.edu/ddl/index.php/Powell%27s_method]
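The condition $d_1^T A d_2 = 0$ is easy to check numerically. A small sketch with an assumed 2x2 symmetric positive-definite $A$:

```python
def conjugate(d1, d2, A, tol=1e-12):
    """Check d1^T A d2 == 0 for 2-vectors and a 2x2 matrix A."""
    Ad2 = [A[0][0] * d2[0] + A[0][1] * d2[1],
           A[1][0] * d2[0] + A[1][1] * d2[1]]
    return abs(d1[0] * Ad2[0] + d1[1] * Ad2[1]) < tol

A = [[2.0, 0.0], [0.0, 1.0]]  # assumed symmetric positive-definite matrix

ok = conjugate([1.0, 0.0], [0.0, 1.0], A)   # True: axes are A-conjugate here
bad = conjugate([1.0, 1.0], [1.0, 0.0], A)  # False: d1^T A d2 = 2
```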

Question 6


For some policy $\pi$, state and action $s, a$,

$Q^\pi (s, a) = E_\pi \{ r_{t + 1} + \gamma V^\pi (s_{t + 1}) \mid s_t = s, a_t = a \}$

where $E_\pi$ is the expected value under policy $\pi$, $r_{t}$ is the reward at time $t$, and $\gamma$ is the discount factor.

See: [http://www.lsi.upc.edu/~mmartin/Ag4-4x.pdf]


Initial table:

Park 6.0 6.0 6.0 6.0
Drive 4.0 4.0 4.0 4.0
Clean 4.0 4.0 4.0 4.0
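One tabular Q-learning update against this table can be sketched as follows. The actions and initial values mirror the table above; the state labels, reward, learning rate, and discount factor are assumed for illustration:

```python
# Tabular Q-learning update sketch. Rows are actions, columns are the
# four (unnamed) states from the table above.
actions = ["Park", "Drive", "Clean"]
Q = {"Park": [6.0] * 4, "Drive": [4.0] * 4, "Clean": [4.0] * 4}

alpha, gamma = 0.5, 0.9  # assumed learning rate and discount factor

def q_update(s, a, reward, s_next):
    # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
    best_next = max(Q[b][s_next] for b in actions)
    Q[a][s] += alpha * (reward + gamma * best_next - Q[a][s])

q_update(0, "Drive", 2.0, 1)
# Q["Drive"][0] moves halfway from 4.0 toward the target
# 2.0 + 0.9 * 6.0 = 7.4, i.e. to 5.7.
```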

Question 7


Coal cart on a linear track, with a coal dispenser at one end and a coal chute at the other. If you cannot observe whether the cart has coal in it, you cannot determine whether you should be traveling to the chute or the dispenser at the start (given that you can start anywhere along the track). Consequently, any plan from any starting spot has a chance to be non-optimal.


Another example: Pendulum. When not releasing Drum 'n Bass, the pendulum swings. Assume you can only sense position, but not velocity. By keeping the current and previous pendulum positions, you can approximate the unobservable state variable (velocity). This can be incorporated into Q-learning. (Note: edge cases are ambiguous)
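The position-difference trick can be sketched in a few lines; the sampling interval and position readings below are made-up values:

```python
# Approximate the unobservable velocity from two successive position readings.
dt = 0.02                          # assumed sampling interval
prev_position, position = 0.48, 0.50  # assumed sensor readings

velocity_estimate = (position - prev_position) / dt

# The Q-learning state then becomes the pair (position, velocity_estimate)
# instead of position alone.
state = (position, velocity_estimate)
```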


State to Action Policies for a Partially Observable State Space

  • And another…
  • Presumably not Reinforcement Learning, for the reasons stated in 7a)
  • These could be incredibly wrong

Alternative answer to c question:
* (S -> [0,1]) -> A : belief state (a probability distribution over states) to action.
* (A x O)* -> A : action-observation history to action.
