Predicting what’s next is one of the hardest things for AI to do well, especially when the “right answer” doesn’t exist yet. It’s also the difference between an agent that can talk about a problem and one that can actually help organizations make better decisions under real-world uncertainty.
That’s why we’re proud to share that H2O AI Super Agent™ is now #1 on the FutureX leaderboard, a live benchmark designed specifically to evaluate future prediction. In the latest results, H2O AI Super Agent™ outperformed AI agents from OpenAI, Google, DeepSeek, xAI, and others, and H2O.ai holds three of the top four positions overall—demonstrating both performance and consistency.
With a top score of 56.0, H2O.ai sets a new bar for AI-powered future prediction. This result highlights the robustness of our agentic AI approach across domains, question types, and levels of uncertainty.
At H2O.ai, our work is grounded in the convergence of Predictive AI, Generative AI, and Agentic systems. The H2O AI Super Agent™ is built on this foundation, bringing together forecasting, reasoning, and autonomous execution in a single, cohesive system.
This approach is anchored in four core capabilities:
Relentless deep web research
Advanced reasoning pipeline
Predictive AI at the core
Dynamic agent tool building
Together, these capabilities enable an agent that goes beyond answering questions to one that can reason under uncertainty, anticipate what’s likely to happen next, and clearly explain its conclusions.
In this post, we’ll explain what the FutureX Agentic Leaderboard measures, why it matters, and how H2O AI Super Agent™ achieved its top-ranking performance. We’ll also dive deeper into each of these capabilities later in the post.
This achievement is particularly significant because we outperformed the GPT-5 (MiroFlow) agent from Singapore-based MiroMind, which had held the #1 position since the second week of October 2025, a run of more than four months. H2O.ai also established a clear lead over the official submissions from major AI labs.
The results reinforce a core belief at H2O.ai: strong agentic systems aren’t built on a single model alone. They require orchestration, deep research, reasoning, predictive intelligence, and the ability to adapt dynamically as new information emerges.
FutureX is the largest and most diverse live benchmark for AI agent future prediction. Designed by researchers from ByteDance Seed, Fudan University, Stanford University, and Princeton University (Zeng et al., 2025), it evaluates whether AI agents can accurately predict real-world future events before they occur.
Unlike static benchmarks, FutureX eliminates training-data contamination by design—because the correct answers don’t exist at evaluation time.
As the FutureX authors describe:
Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking, information gathering, contextual understanding, and decision-making under uncertainty. Agents must not only gather and interpret vast amounts of dynamic information but also integrate diverse data sources, weigh uncertainties, and adapt predictions based on emerging trends, just as human experts do in fields like politics, economics, and finance.
FutureX Paper
Contamination-Free
By focusing on future prediction, ground-truth answers don't exist in any model's training data, ensuring genuine capability assessment
Real-World Complexity
Agents must navigate actual information flows across 195 websites and 11 domains (Finance, Technology, Sports, Politics, Healthcare, and more)
Scale and Diversity
Approximately 500 events per week requiring analytical thinking, information gathering, contextual understanding, and decision-making under uncertainty
Multi-Metric Evaluation
Different question types (single-choice, multi-choice, ranking, numerical) with difficulty-weighted scoring (Level 1: 10%, Level 2: 20%, Level 3: 30%, Level 4: 40%)
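The difficulty weighting above can be read as a weighted sum of per-level scores. The sketch below is a minimal illustration under that assumption: it treats each level's score as a single number on a 0 to 100 scale and combines the four levels with the stated weights. The function name and the sample numbers are hypothetical and are not taken from the leaderboard, and the benchmark's exact aggregation may differ.

```python
# Level weights as stated in the post (Level 1: 10% ... Level 4: 40%).
LEVEL_WEIGHTS = {1: 0.10, 2: 0.20, 3: 0.30, 4: 0.40}


def overall_score(level_scores: dict[int, float]) -> float:
    """Combine per-level scores into one number (assumed aggregation)."""
    missing = set(LEVEL_WEIGHTS) - set(level_scores)
    if missing:
        raise ValueError(f"missing scores for levels: {sorted(missing)}")
    return sum(LEVEL_WEIGHTS[lvl] * level_scores[lvl] for lvl in LEVEL_WEIGHTS)


# Hypothetical per-level scores, for illustration only.
example = {1: 80.0, 2: 60.0, 3: 50.0, 4: 30.0}
print(round(overall_score(example), 1))
```

Under this reading, a strong result on Level 4 questions moves the overall score four times as much as the same result on Level 1.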
Real-World Prediction Examples: Our Agent in Action
To illustrate the breadth and accuracy of our agent's predictive capabilities, here are actual examples from our winning submission across all four difficulty levels.
These examples showcase our agent's versatility across different domains (finance, entertainment, automotive, music, sports), question types (numerical, ranking, multiple choice, binary), and languages (English and Chinese), while maintaining high accuracy even on the most challenging Level 4 predictions that require forecasting under deep uncertainty.
The performance of H2O AI Super Agent™ comes from a deliberate architectural choice: combining deep research, advanced reasoning, predictive analytics, and dynamic tooling into a single orchestrated system.
The agent performs persistent, multi-source research without stopping early. It synthesizes information across hundreds of sites, which is critical when forecasting future outcomes that depend on weak signals, emerging trends, and fragmented data.
The H2O AI Super Agent™ uses a structured reasoning pipeline to handle complex, open-ended problems that require more than a single pass or a single model response. This pipeline enables the agent to plan, evaluate, and adapt its approach as new information becomes available. It has the following capabilities:
High-level query understanding
Strategic multi-step planning
Self-critique and verification loops
Task tracking across tools and sources
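To make the idea of verification loops and task tracking concrete, here is a generic sketch of a plan runner that retries each step until a checker accepts the result and records what happened. It is not the H2O AI Super Agent™ implementation: `run_plan`, `StepRecord`, and the toy `execute` and `verify` callables are all hypothetical names chosen for illustration.

```python
from typing import Callable, List, NamedTuple


class StepRecord(NamedTuple):
    step: str
    result: str
    verified: bool
    attempts: int


def run_plan(
    plan: List[str],
    execute: Callable[[str, int], str],
    verify: Callable[[str, str], bool],
    max_attempts: int = 3,
) -> List[StepRecord]:
    """Run each planned step, retrying until it verifies or attempts run out."""
    records = []
    for step in plan:
        result, verified, attempts = "", False, 0
        while attempts < max_attempts and not verified:
            attempts += 1
            result = execute(step, attempts)
            verified = verify(step, result)
        # Task tracking: keep an auditable record per step.
        records.append(StepRecord(step, result, verified, attempts))
    return records


# Toy stand-ins: the first attempt is a draft, later attempts cite sources.
def toy_execute(step: str, attempt: int) -> str:
    if attempt == 1:
        return f"{step}: draft"
    return f"{step}: checked against 2 sources"


def toy_verify(step: str, result: str) -> bool:
    return "sources" in result


log = run_plan(["collect data", "estimate outcome"], toy_execute, toy_verify)
for record in log:
    print(record)
```

In a real system the `execute` and `verify` callables would wrap model and tool calls; the loop structure and the per-step log are the parts this sketch is meant to show.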
Unlike purely generative systems, H2O AI Super Agent™ draws on H2O.ai’s decade of expertise in predictive AI. It incorporates:
Seasonality detection
Time-series forecasting
Quantitative modeling
Qualitative signal interpretation
This combination allows the agent to reason not just about what is, but what’s likely to happen next.
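As a small illustration of two of the items above, seasonality detection and time-series forecasting, the sketch below picks a seasonal period by autocorrelation and then produces a seasonal-naive forecast that repeats the last observed season. These are textbook baselines chosen for brevity, not a description of the models the agent actually uses; the function names and the synthetic series are assumptions.

```python
from statistics import mean
from typing import List, Sequence


def autocorrelation(series: Sequence[float], lag: int) -> float:
    """Sample autocorrelation of the series at the given lag."""
    mu = mean(series)
    den = sum((x - mu) ** 2 for x in series)
    if den == 0:
        return 0.0
    num = sum(
        (series[i] - mu) * (series[i - lag] - mu)
        for i in range(lag, len(series))
    )
    return num / den


def detect_season(series: Sequence[float], max_lag: int) -> int:
    """Return the lag in [2, max_lag] with the highest autocorrelation."""
    return max(range(2, max_lag + 1), key=lambda lag: autocorrelation(series, lag))


def seasonal_naive_forecast(
    series: Sequence[float], period: int, horizon: int
) -> List[float]:
    """Forecast by repeating the last full season of observations."""
    last_season = series[-period:]
    return [last_season[h % period] for h in range(horizon)]


# Synthetic series with a repeating pattern of length 4.
history = [10, 20, 30, 40] * 6
period = detect_season(history, max_lag=8)
forecast = seasonal_naive_forecast(history, period, horizon=4)
print(period, forecast)
```

A baseline like this is mainly useful as a sanity check: a more elaborate model that cannot beat "repeat last season" on held-out data is a warning sign.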
For each use case, the agent can build its own MCP (Model Context Protocol) server tools, adapting its capabilities to the prediction domain at hand—something static agents struggle to do.
v1.82 (Rank #1, score 56.0): Pass@3 using Claude Sonnet 4.5 with flexible ensembling (majority voting, ML models, smart ranking)
v1.81 (Rank #4, score 51.6): Single pass@1 using Claude Opus 4.5
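The v1.82 entry lists majority voting as one of its ensembling options over three runs. A minimal sketch of that one option follows; the tie-break rule (prefer the answer that appeared first) and the function name are assumptions, and the ML-model and ranking variants mentioned above are not shown.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(answers: Sequence[Hashable]) -> Hashable:
    """Return the most frequent answer; ties go to the earliest answer."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    top = max(counts.values())
    # Scan in original order so ties resolve to the earliest run.
    for answer in answers:
        if counts[answer] == top:
            return answer
    raise AssertionError("unreachable: some answer always has the top count")


# Three hypothetical runs on the same single-choice question.
runs = ["B", "A", "B"]
print(majority_vote(runs))
```

When all three runs disagree, this version falls back to the first run's answer, which makes it behave like pass@1 in the worst case.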
For teams building on our platform, this flexibility extends to deployment as well: h2oGPTe supports Claude Sonnet, enabling strong reasoning and coding capabilities within enterprise-grade, governed environments.
In 2025, H2O.ai topped GAIA, a benchmark focused on grounded reasoning and real-world problem solving. FutureX raises the bar—measuring whether agentic systems can predict the future under real-world uncertainty, using live information, tools, and multi-step planning.
This is exactly what H2O AI Super Agent™ is built for.
For enterprises in banking, government, healthcare, and other highly regulated industries, future prediction directly impacts risk, compliance, operations, and strategic planning. Accuracy, transparency, and governance aren’t optional—they’re essential.
FutureX leadership is a strong signal that agentic AI is moving beyond answering questions toward systems that can anticipate outcomes and act with confidence. It also aligns with where the ecosystem is heading more broadly, including coding- and tool-centric experiences like Claude Code and Claude Sonnet, which are reshaping how developers and agents work together.
If you’d like to see what #1 looks like in practice:
Request a demo of H2O AI Super Agent™
Explore the FutureX leaderboard and view the latest rankings