Recent advances in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. Whereas previous work focused on evaluating over stateless web services (RESTful APIs), based on either a single-turn user prompt or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that there is a significant performance gap between open-source and proprietary models, and that complex tasks such as State Dependency, Canonicalization, and Insufficient Information, as defined in ToolSandbox, are challenging even for the most capable SOTA LLMs, providing new insights into the tool-use capabilities of LLMs.