Research

Current Work

I’m deeply involved with task quality for Terminal-Bench 3, the next iteration of the Terminal-Bench benchmark, launching June 2026. The goal is a harder, more realistic set of command-line tasks that pushes beyond the limits of current frontier models, with a heavy emphasis on verifier robustness and resistance to reward hacking.

In Progress

Training secure coding agents. A follow-up to SusVibes, moving from benchmarking to training. The original work showed that frontier models produce functionally correct but insecure code. This project improves data synthesis methods and explores training outcomes.

Training coding agents with privileged information. This project injects oracle hints (such as the fix patch) at training time to elicit correct trajectories the model can’t reach unaided, and uses action-level entropy to decide when a teacher should guide the agent versus let it act on its own.

Under Review

Three papers are currently under review at NeurIPS 2026.

Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops. A method for automatically building exploit-resistant verifiers for agent benchmarks, using adversarial loops of cooperating LLM agents rather than per-task manual patching.

SWE-Marathon: Long-Horizon Software Engineering Benchmark. A benchmark of project-scale software tasks that take hours and millions of tokens, designed to measure sustained agent progress and to resist single-test shortcut solutions.

Heuresis: Search Strategies for Autonomous AI Research Agents Across Quality, Diversity and Novelty. A framework for studying how different search and quality-diversity strategies affect the ability of autonomous agents to explore and produce genuinely novel machine learning research.

Publications

2026

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories I Bercovich, I Segal, K Zhang, S Saxena, A Raghunathan, Z Zhong arXiv preprint arXiv:2604.17596

A dataset of 331 confirmed reward-hackable terminal-agent benchmark environments with 3,632 exploit trajectories across Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4. Catalogs 11 exploit categories from output spoofing to binary hijacking and rootkit-style binary patching. A monitorability study shows detection AUC drops from 0.97 to 0.92 when chain-of-thought reasoning is stripped from trajectories.

Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces MA Merrill, AG Shaw, N Carlini, B Li, H Raj, I Bercovich, L Shi, JY Shin, et al. arXiv preprint arXiv:2601.11868

A benchmark of 89 hard terminal tasks from real workflows, where frontier models achieve less than 65% success. Tasks span legacy system configuration, research paper reimplementation, and software engineering problems with human-written solutions and comprehensive verification tests.

Terminal-Bench Task Architecture

2025

Agents of change: Self-evolving LLM agents for strategic planning N Belle, D Barnes, A Amayuelas, I Bercovich, XE Wang, W Wang arXiv preprint arXiv:2506.04651

HexMachina, a continual learning multi-agent system for long-horizon strategic planning in adversarial environments. Tested on Settlers of Catan, it learns from scratch and evolves players that outperform human-crafted baselines with 54% win rate by separating environment discovery from strategy refinement.

Hardtests: Synthesizing high-quality test cases for LLM coding Z He, YM Choi, K Zhang, J Ji, J Zhou, D Xu, I Bercovich, A Zhang, L Li arXiv preprint arXiv:2505.24098

A pipeline for synthesizing high-quality test cases for competitive programming, achieving 11.3pp higher precision and 17.5pp higher recall than existing tests. Includes a dataset of 47k problems demonstrating that test quality significantly impacts self-distillation and reinforcement learning performance.

MessIRve: A large-scale Spanish information retrieval dataset F Valentini, V Cotik, D Furman, I Bercovich, E Altszyler, JM Pérez Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

A Spanish IR dataset with 730k queries from 20 Spanish-speaking countries sourced from Google autocomplete and Wikipedia documents. Unlike translated datasets, MessIRve captures regional dialectal variations and culturally relevant topics to advance Spanish information retrieval research.

Research Interests

Large Language Models & Agents: Self-evolving systems, benchmarking, and strategic planning
Code Generation & Testing: Automated test synthesis and quality assessment for LLM-generated code
Information Retrieval: Large-scale multilingual datasets and retrieval systems
Knowledge Graphs & NLP: Semantic understanding and structured knowledge representation
Command Line Interfaces: Agent performance in realistic terminal environments

Industry Experience

Graphiq (2011-2017), Amazon Alexa (2017-2020)

Led the architecture and scaling of Graphiq’s Knowledge Graph, building an ontology of 1.7 billion entities, 285 billion data points, and roughly 1.5 trillion relationships across hundreds of domains. The core technical challenge involved creating automated ETL infrastructure that non-engineers could operate autonomously, integrating over 1,200 data sources while processing 300 million daily updates. What previously took years to ingest and publish was compressed to weeks or minutes.

Built a natural language question-answering system using bottoms-up semantic parsing over the structured Knowledge Graph. By sequentially combining NER, intent extraction, and NLG templating, we translated text queries into database operations with auto-generated visualizations. Through distributed database optimization and automated indexing, maintained 150-200ms response times at 25,000 queries per second. Internal benchmarks showed higher accuracy than Amazon Alexa and Google on standard query batteries.

Graphiq Ontology-Guided Knowledge Refinement and Answering Engine

Academic Background

Ph.D. Studies, Statistics and Applied Probability (2010, unfinished) University of California, Santa Barbara Financial Mathematics

B.S. Electrical Engineering and Mathematics (2005-2009) University of Massachusetts Amherst Summa Cum Laude, GPA 3.95/4.00 Departmental Honors with greatest distinction

Full publication list on Google Scholar.