The Evolving Face of GUI (I) – Preliminary
Preface: Why talk about MDPs for GUIs?
When we say GUI agent, we picture something magical: an AI clicking through menus, filling forms, fixing our settings, and perhaps someday doing our taxes. But underneath that sci-fi gloss lies a classic problem from reinforcement learning (RL). Each GUI screen is a state, every mouse click is an action, and that satisfying green checkmark at the end? That’s your reward.
In other words, a GUI agent is just a multi-turn agent wearing a UX costume. To understand how it learns to click like a human (without rage-quitting), we first need to revisit the foundations: MDPs and POMDPs.
1. What is an MDP?
A Markov Decision Process (MDP) describes a fully observable environment where the next state depends only on the current state and action—no memory of the past required (that’s the “Markov” part).
Formally, an MDP is a 5-tuple (some definitions drop the discount factor $\gamma$ and use a 4-tuple):
\[(S, A, T, R, \gamma)\]
- S: the set of possible states
- A: the set of possible actions
- T(s, a, s'): transition probability, i.e. $P(s_{t+1} = s' \mid s_t = s, a_t = a)$
- R(s, a): reward function
- γ: discount factor $\gamma \in [0, 1)$, controlling how much we value future rewards
A policy $\pi$ is a (potentially probabilistic) mapping from states to actions: $\pi: S \to A$. The agent’s goal is to find an optimal policy $\pi^*$ that maximizes the expected discounted return:
\[J(\pi) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]\]
The state-value function and action-value function are then:
\[V^\pi(s) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s \right]\]
\[Q^\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a \right]\]
The two are related via expectation over actions:
\[V^\pi(s) = \sum_a \pi(a|s)\, Q^\pi(s, a)\]
and the famous Bellman equation ties them recursively:
\[V^\pi(s) = \sum_a \pi(a|s) \left[ R(s, a) + \gamma \sum_{s'} T(s, a, s') V^\pi(s') \right]\]
This formulation assumes we can see the entire state $s_t$ with perfect clarity. But in GUI land? Forget it. Half your buttons are hidden in scroll views, and the other half are lazy-loaded async chaos.
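To make this concrete, here is a minimal sketch of iterative policy evaluation (repeatedly applying the Bellman expectation backup) on a tiny, hypothetical GUI flow. The states, actions, transitions, and rewards below are illustrative assumptions, not taken from any real environment.

```python
# Toy MDP for a hypothetical "fill a form" GUI flow; everything here is made up.
GAMMA = 0.9

states = ["home", "form", "done"]

# T[s][a] -> list of (next_state, probability)
T = {
    "home": {"open_form": [("form", 0.9), ("home", 0.1)], "submit": [("home", 1.0)]},
    "form": {"open_form": [("form", 1.0)], "submit": [("done", 0.8), ("form", 0.2)]},
    "done": {"open_form": [("done", 1.0)], "submit": [("done", 1.0)]},
}

# R[s][a] -> immediate reward (small penalties for wasted clicks, +1 for finishing)
R = {
    "home": {"open_form": 0.0, "submit": -0.1},
    "form": {"open_form": -0.1, "submit": 1.0},
    "done": {"open_form": 0.0, "submit": 0.0},
}

# A deterministic policy: pi(s) -> a
policy = {"home": "open_form", "form": "submit", "done": "open_form"}

def evaluate_policy(policy, n_iters=100):
    """Iterate V(s) <- R(s, pi(s)) + gamma * sum_s' T(s, pi(s), s') * V(s')."""
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        V = {
            s: R[s][policy[s]] + GAMMA * sum(p * V[s2] for s2, p in T[s][policy[s]])
            for s in states
        }
    return V

print(evaluate_policy(policy))
```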
2. POMDP: when your agent is half-blind
A Partially Observable Markov Decision Process (POMDP) adds the painful reality: the agent doesn’t get to see the full state. Instead, it receives an observation o from an observation space Ω, generated by an observation function that depends on the new state and the action just taken.
Formally, a POMDP is defined as a 7-tuple:
\[(S, A, T, R, \Omega, O, \gamma)\]
- S: state space (the hidden truth — e.g., every pixel, every background process, every invisible modal dialog)
- A: action space (clicks, types, swipes, system calls)
- T(s, a, s'): transition function $T: S \times A \times S \to [0, 1]$, i.e. $\Pr(s' \mid s, a)$
- R(s, a): reward function $R: S \times A \to \mathbb{R}$, representing task progress or penalties
- Ω: observation space (what the agent actually sees: screenshots, DOM snapshots, a11y trees)
- O(o | s', a): observation function $O: S \times A \times \Omega \to [0, 1]$, i.e. $\Pr(o \mid s', a)$, the probability of seeing observation $o$ after landing in state $s'$ via action $a$
- γ ∈ [0, 1): discount factor for future rewards
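If it helps to see the definition as a data structure, here is one way to carry the 7-tuple around in code. The container below is our own illustrative construction, not a standard library API; tabular functions keep it concrete.

```python
# A plain container for the POMDP 7-tuple; field names mirror the definition above.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class POMDP:
    states: Sequence[str]                              # S (the hidden truth)
    actions: Sequence[str]                             # A
    transition: Callable[[str, str, str], float]       # T(s, a, s') = Pr(s' | s, a)
    reward: Callable[[str, str], float]                # R(s, a)
    observations: Sequence[str]                        # Omega (what the agent can see)
    observation_fn: Callable[[str, str, str], float]   # O(o, s', a) = Pr(o | s', a)
    gamma: float = 0.99                                # discount factor
```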
Belief State: the agent’s best guess
The agent doesn’t see $s_t$ directly. Instead, it maintains a belief state $b_t(s)$ — a probability distribution over all possible true states:
\[b_t(s) = \Pr(s_t = s \mid o_{1:t}, a_{1:t-1})\]
The belief state is the agent’s “best guess” about where it actually is. But how does it update this belief when it takes an action and receives a new observation?
Belief Update: Bayesian filtering in action
While the belief state is formally defined as a probability distribution conditioned on the entire observation-action history $o_{1:t}, a_{1:t-1}$, we don’t need to recompute it from scratch at every timestep. Instead, we can update it recursively using only the previous belief, the last action, and the new observation.
Given the previous belief $b_{t-1}$, action $a_{t-1}$, and new observation $o_t$, the agent updates its belief using Bayes’ rule:
\[b_t(s') = \tau(b_{t-1}, a_{t-1}, o_t)(s') = \frac{O(o_t \mid s', a_{t-1}) \cdot \sum_{s \in S} T(s, a_{t-1}, s') \cdot b_{t-1}(s)}{\Pr(o_t \mid b_{t-1}, a_{t-1})}\]
Breaking it down, the numerator
\[O(o_t \mid s', a_{t-1}) \cdot \sum_{s \in S} T(s, a_{t-1}, s') \cdot b_{t-1}(s)\]
multiplies the observation likelihood (the probability of seeing $o_t$ given we land in $s'$) by the predicted probability of landing in $s'$ under the previous belief.
The normalizing constant (denominator) ensures the belief sums to 1:
\[\Pr(o_t \mid b_{t-1}, a_{t-1}) = \sum_{s' \in S} O(o_t \mid s', a_{t-1}) \cdot \sum_{s \in S} T(s, a_{t-1}, s') \cdot b_{t-1}(s)\]
Simply put: first, predict where you might land after taking action $a_{t-1}$ (weighted by transition probabilities). Then, use the new observation $o_t$ to update your belief about which of those predicted states you’re actually in.
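Here is a minimal sketch of that recursive update, assuming the tabular `POMDP` container from the earlier sketch and beliefs stored as plain dicts mapping state to probability.

```python
def update_belief(pomdp, belief, action, observation):
    """b_t(s') proportional to O(o | s', a) * sum_s T(s, a, s') * b_{t-1}(s)."""
    unnormalized = {}
    for s_next in pomdp.states:
        # Prediction step: how likely are we to have landed in s_next after `action`?
        predicted = sum(pomdp.transition(s, action, s_next) * belief[s] for s in pomdp.states)
        # Correction step: weight by how likely `observation` is from s_next.
        unnormalized[s_next] = pomdp.observation_fn(observation, s_next, action) * predicted
    # Normalizer = Pr(o_t | b_{t-1}, a_{t-1}); assumed nonzero for an observation we actually saw.
    norm = sum(unnormalized.values())
    return {s: p / norm for s, p in unnormalized.items()}
```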
From POMDP to Belief MDP
Here’s the magic trick: if we treat the belief state $b$ as the new “state,” the POMDP becomes a fully observable MDP — a Belief MDP.
A Belief MDP is a 5-tuple (or a 4-tuple without the discount factor):
\[(B, A, \tau, R_B, \gamma)\]
- B: the set of all possible belief states (distributions over $S$)
- A: same action space as the original POMDP
- τ: belief state transition function, $\tau(b, a, o)(s')$ computes the next belief given current belief $b$, action $a$, and observation $o$
- R_B: belief-based reward function, $R_B(b, a) = \sum_{s \in S} b(s) R(s, a)$
- γ: discount factor (same as in the original POMDP)
Now the agent can apply standard MDP techniques (value iteration, policy gradients, etc.) over the belief space instead of the hidden state space.
The catch? The belief space is continuous and infinite-dimensional — even if the original state space was finite. So while theoretically elegant, solving Belief MDPs exactly is often intractable. This is why modern GUI agents use approximations: neural networks to encode beliefs, particle filters, or even just the raw observation history as a proxy.
Optimal policy and value functions
The optimal policy now maps beliefs to actions: $\pi^*(a \mid b)$
The value function becomes:
\[V^*(b) = \max_a \left[ R_B(b, a) + \gamma \sum_{o \in \Omega} \Pr(o \mid b, a) V^*(\tau(b, a, o)) \right]\]
where $\Pr(o \mid b, a)$ is the probability of observing $o$ after taking action $a$ from belief $b$:
\[\Pr(o \mid b, a) = \sum_{s' \in S} O(o \mid s', a) \sum_{s \in S} T(s, a, s') b(s)\]
Key difference from MDP: in an MDP, the value function depends only on the current (fully observed) state. In a POMDP/Belief MDP, it depends on beliefs (distributions over states), and the agent must reason about future observations it hasn’t seen yet.
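As a sketch rather than a practical solver, here is a single Bellman backup over beliefs, reusing the `POMDP` container and `update_belief` from the earlier sketches. `V` stands in for whatever value approximator you have; exact value iteration over beliefs is intractable in general.

```python
def belief_backup(pomdp, belief, V):
    """One step of V*(b) = max_a [ R_B(b, a) + gamma * sum_o Pr(o | b, a) * V(tau(b, a, o)) ]."""
    best = float("-inf")
    for a in pomdp.actions:
        # Belief-based reward R_B(b, a) = sum_s b(s) * R(s, a)
        value = sum(belief[s] * pomdp.reward(s, a) for s in pomdp.states)
        for o in pomdp.observations:
            # Pr(o | b, a) = sum_s' O(o | s', a) * sum_s T(s, a, s') * b(s)
            p_o = sum(
                pomdp.observation_fn(o, s_next, a)
                * sum(pomdp.transition(s, a, s_next) * belief[s] for s in pomdp.states)
                for s_next in pomdp.states
            )
            if p_o > 0:
                value += pomdp.gamma * p_o * V(update_belief(pomdp, belief, a, o))
        best = max(best, value)
    return best
```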
Why GUI is a POMDP: The agent never knows which pop-ups will spawn, what scroll views are hiding, or what lazy-loaded content will appear next. It must reason under uncertainty using partial cues — making GUI control the textbook example of a POMDP in the wild.
3. Multi-turn RL: long horizons and delayed satisfaction
A GUI task rarely ends after one action. Filling a form, booking a ticket, or navigating a nested menu can take tens of steps. This is where multi-turn RL comes in.
The total return from time step $t$ is:
\[G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots\]
The advantage function tells us how much better an action is than average:
\[A_t = Q(s_t, a_t) - V(s_t)\]
Using the temporal difference (TD) error, we can approximate it from a single transition:
\[A_t \approx r_t + \gamma V(s_{t+1}) - V(s_t)\]
In policy gradient methods like PPO (Proximal Policy Optimization), we update the policy to increase the likelihood of good actions while keeping the new policy close to the old one:
\[\mathcal{L}^{\text{PPO}}(\theta) = \mathbb{E}_t \left[ \min\left(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) A_t\right) \right]\]
where $r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}$ is the probability ratio (not to be confused with the reward $r_t$).
This is how agents survive long GUI sequences without exploding gradients or catastrophic forgetting.
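For reference, here is a minimal sketch of the clipped surrogate loss in PyTorch, assuming you already have per-step log-probabilities under the new and old policies plus advantage estimates (e.g. from the TD formula above or from GAE).

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate objective, so it can be minimized with SGD."""
    ratio = torch.exp(logp_new - logp_old)                # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```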
4. Why GUI fits the POMDP paradigm
Because no agent ever sees the full GUI truth. Screens are partial, dynamic, and unpredictable. The agent acts based on observations, not states — making GUI reasoning a natural POMDP case.
Rewards are sparse, delayed, and occasionally deceptive (e.g., clicking a button might open a confirmation dialog instead of finishing the task). Thus, belief modeling and multi-turn reasoning are critical.
5. The Action Space: where the chaos begins
A GUI agent’s action space defines how it interacts with the world. Here are the major flavors:
(1) DOM Tree
The DOM (Document Object Model) provides a structural map of web elements. Every node (button, input, div) is an action candidate. It is semantically rich but often yields a huge search space; still, understanding this structure is vital for building capable, generalizable web agents.
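As a rough illustration, here is one way to enumerate DOM action candidates with BeautifulSoup, assuming the agent receives the page HTML as a string. The tag list is a simplification, not an exhaustive heuristic.

```python
from bs4 import BeautifulSoup

def dom_action_candidates(html: str):
    """Return a flat list of likely interactive elements from an HTML snapshot."""
    soup = BeautifulSoup(html, "html.parser")
    candidates = []
    for el in soup.find_all(["a", "button", "input", "select", "textarea"]):
        candidates.append({
            "tag": el.name,
            "text": el.get_text(strip=True),
            "id": el.get("id"),
        })
    return candidates
```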
(2) Accessibility (a11y) Tree
The a11y tree simplifies the DOM by focusing on accessible, labeled, and visible components. It offers structured semantics, making it ideal for agents that rely on language models or vision-language encoders.
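Here is a sketch of how an agent might flatten such a tree into actionable targets, assuming a simplified node format with `role`, `name`, and `children` fields; real a11y trees carry many more properties.

```python
ACTIONABLE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}

def actionable_nodes(node, path=""):
    """Depth-first walk that keeps labeled, actionable nodes with their tree path."""
    found = []
    if node.get("role") in ACTIONABLE_ROLES and node.get("name"):
        found.append({"role": node["role"], "name": node["name"], "path": path})
    for i, child in enumerate(node.get("children", [])):
        found.extend(actionable_nodes(child, f"{path}/{i}"))
    return found
```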
(3) ADB (Android Debug Bridge)
For Android, ADB lets agents simulate real user actions: tapping coordinates, swiping, typing text. The state space often comes from screenshots or UI XML dumps, making it realistic but noisy.
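A minimal sketch of wrapping the standard ADB input commands from Python; device selection, timing, and error handling are left to the caller.

```python
import subprocess

def adb(*args):
    """Run an adb command and raise if it fails."""
    return subprocess.run(["adb", *args], check=True, capture_output=True)

def tap(x: int, y: int):
    adb("shell", "input", "tap", str(x), str(y))

def swipe(x1, y1, x2, y2, duration_ms=300):
    adb("shell", "input", "swipe", str(x1), str(y1), str(x2), str(y2), str(duration_ms))

def type_text(text: str):
    adb("shell", "input", "text", text.replace(" ", "%s"))  # `input text` encodes spaces as %s

def screenshot(path="screen.png"):
    png = subprocess.run(["adb", "exec-out", "screencap", "-p"],
                         check=True, capture_output=True).stdout
    with open(path, "wb") as f:
        f.write(png)
```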
(4) pyautogui
The desktop automation route. It directly moves the mouse and types on the keyboard. It has zero semantic understanding—just pixel-level control. A fully blind, full-POMDP experience.
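A minimal sketch of what that pixel-level control looks like with pyautogui; the coordinates have to come from somewhere else, e.g. a grounding model run over `pyautogui.screenshot()`.

```python
import pyautogui

pyautogui.FAILSAFE = True                  # abort by slamming the mouse into a screen corner

def click_at(x: int, y: int):
    pyautogui.moveTo(x, y, duration=0.2)   # visible mouse movement, then click
    pyautogui.click()

def type_text(text: str):
    pyautogui.write(text, interval=0.05)   # keystroke-by-keystroke typing

screen = pyautogui.screenshot()            # PIL Image of the current screen
```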
6. Summary: RL behind the GUI curtain
GUI reasoning can be framed as RL—but it’s not the only way.
- MDP: idealized world with full observability.
- POMDP: the real GUI world with uncertainty and hidden elements.
- Multi-turn RL: one approach to surviving long sequences with delayed rewards.
Every GUI click, scroll, and hover is an action in a stochastic environment with hidden states. While the RL formulation provides the foundation, modern GUI agents enhance it by incorporating:
- Behavior cloning: learning from human demonstrations (imitation learning) to bootstrap the policy and sidestep sparse reward problems.
- LLM-based planning: leveraging large language models to decompose high-level tasks into executable action sequences.
- Hierarchical multi-agent design: breaking complex tasks into subtasks handled by specialized sub-agents (e.g., Mobile-Agent-v3 uses a planning agent + grounding agent + reflection agent architecture).
- Best-of-N sampling: Agent-S3 uses Behavior Best-of-N (bBoN), running multiple rollouts and selecting the best outcome, achieving a SOTA 69.9% success rate (see the generic sketch after this list).
- Search algorithms: Monte Carlo Tree Search for test-time scaling and exploration.
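As promised, here is a generic best-of-N sketch. It captures the general idea only, not Agent-S3's actual bBoN implementation; `run_episode` and `judge` are assumed callables you supply.

```python
def best_of_n(run_episode, judge, n: int = 8):
    """Run n independent rollouts and keep the trajectory the judge scores highest."""
    rollouts = [run_episode() for _ in range(n)]
    return max(rollouts, key=judge)
```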
The MDP/POMDP framework gives us a language to describe the problem, and these techniques show how to solve it effectively. Combine PPO with behavior cloning, use GPT-4 for hierarchical planning, or run MCTS over learned policies: that's where the research gets exciting.
Next: Part II – GUI Grounding Research
We’ll explore how modern multimodal models learn to ground GUI components into language, how they align vision and structure, and why grounding may be the missing key to truly intelligent computer-use agents.
(to be continued…)