Interactive digital agents (IDAs) leverage APIs of stateful digital environments to perform tasks in response to user requests. While IDAs powered by instruction-tuned large language models (LLMs) can react to feedback from interface invocations in multi-step exchanges, they have not been trained in their respective digital environments. Prior methods accomplish less than half of the tasks in sophisticated benchmarks such as AppWorld. We present a reinforcement learning (RL) approach that trains IDAs directly in their target environments. We formalize this training as a partially observable Markov decision process and derive LOOP, a data- and memory-efficient variant of proximal policy optimization. LOOP uses no value network and maintains exactly one copy of the underlying LLM in memory, making its implementation straightforward and as memory-efficient as fine-tuning a single LLM. A 32-billion-parameter agent trained with LOOP in the AppWorld environment outperforms the much larger OpenAI o1 agent by 9 percentage points (15% relative). To our knowledge, this is the first reported application of RL to IDAs that interact with a stateful, multi-domain, multi-app environment via direct API calls. Our analysis sheds light on the effectiveness of RL in this setting, showing that the agent learns to consult the API documentation, avoid unwarranted assumptions, minimize confabulation, and recover from setbacks.
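To make the "no value network, one copy of the LLM" claim concrete, here is a minimal sketch of a PPO-style update that replaces the learned critic with a leave-one-out baseline computed from K rollouts of the same task. The baseline choice, tensor shapes, and hyperparameters are assumptions for illustration, not details taken from the abstract.

```python
# Minimal sketch: value-network-free PPO update with a leave-one-out baseline.
# Assumption: K rollouts are sampled per task and old log-probs are cached,
# so only one copy of the LLM needs to be held in memory during the update.
import torch


def leave_one_out_advantages(returns: torch.Tensor) -> torch.Tensor:
    """returns: (K,) total reward of K rollouts for the same task.
    The baseline for rollout i is the mean return of the other K-1 rollouts,
    so no learned value network is required."""
    K = returns.numel()
    baseline = (returns.sum() - returns) / (K - 1)
    return returns - baseline


def clipped_surrogate_loss(logp_new: torch.Tensor,
                           logp_old: torch.Tensor,
                           advantages: torch.Tensor,
                           clip_eps: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped objective over the tokens of each rollout.
    logp_new, logp_old: (K, T) per-token log-probs under the current policy
    and the behavior policy (cached before the update)."""
    ratio = (logp_new - logp_old).exp()
    adv = advantages.unsqueeze(-1)  # broadcast each rollout's advantage to its tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.minimum(unclipped, clipped).mean()


# Hypothetical usage with K = 4 rollouts of T = 8 tokens each:
if __name__ == "__main__":
    returns = torch.tensor([1.0, 0.0, 1.0, 0.0])
    adv = leave_one_out_advantages(returns)
    logp_old = torch.randn(4, 8)
    logp_new = logp_old + 0.01 * torch.randn(4, 8)
    print(clipped_surrogate_loss(logp_new, logp_old, adv))
```

Because the baseline comes from sibling rollouts rather than a critic, the update needs only the policy LLM itself, which is consistent with the memory footprint described above.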