publications
Selected publications and preprints.
2026
- Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking
  Preprint. Under review, 2026.
  Developed a controlled benchmark for diagnosing process-level failures in web agents, enabling fine-grained analysis of exploration, execution, and decision-making behaviors.
