    TopBuzzMagazine.com
    OpenAI’s o3 Outsmarts Rivals in AI Strategy Battle, Called ‘A Master of Deception’ by AI Researcher

    By Admin, June 13, 2025

    OpenAI’s o3, Google’s Gemini 2.5 Pro, Anthropic’s Claude Opus 4, and DeepSeek-R1 were among the 18 artificial intelligence (AI) models that played the popular strategy game Diplomacy. An AI researcher modified the game so that popular large language models (LLMs) could play it. The game requires high-level reasoning and multi-step thinking, alongside social skills. During the experiment, the researcher found that o3 was particularly adept at deception and betrayal, while Claude Opus 4 was more fixated on finding peaceful resolutions.

    The Reason Behind the Experiment

    Alex Duffy, Head of AI at Every, a newsletter platform, came up with the idea of making AI models play each other in a battle of wits to see which models are better than the others. In a post, the researcher highlighted that traditional AI benchmarks are proving inadequate for measuring the true competence of models.

    Criticism of benchmark tests has been rising in recent times. MIT Technology Review published a detailed article on why benchmark tests are becoming outdated, and a group of researchers made the same point in an interdisciplinary review of current AI evaluation methodologies published on arXiv.

    “What makes LLMs special is that even if a model only does well 10 percent of the time, you can train the next one on those high-quality examples, until suddenly it’s doing it very well, 90 percent of the time or more,” said Duffy.

    As a potential solution, the researcher believed that evaluation strategies in which AI models compete against one another on specific metrics could be a better way to gauge the capabilities of these models. That’s where the idea of Diplomacy came in.

    Diplomacy as the Battleground for AI Models

    Duffy highlighted that he personally built AI Diplomacy, a modified version of the classic strategy game. The game is straightforward. The seven Great Powers of 1901 Europe (Austria-Hungary, England, France, Germany, Italy, Russia, and Turkey) make strategic moves until one of the empires owns 18 of the 34 marked supply centres on the map. In this version, each country was controlled by an AI model.
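
The win condition described above can be sketched in a few lines of Python. This is only an illustration of the 18-of-34 rule as the article states it; the function name `find_winner` and the dictionary layout are assumptions made for this example and are not taken from Duffy's AI Diplomacy code.

```python
from typing import Dict, Optional

WIN_THRESHOLD = 18  # supply centres needed to win, per the article
TOTAL_CENTRES = 34  # marked supply centres on the map, per the article


def find_winner(centre_counts: Dict[str, int]) -> Optional[str]:
    """Return the power that owns at least WIN_THRESHOLD centres, else None.

    `centre_counts` is a hypothetical mapping of power name to the number
    of supply centres it currently owns.
    """
    if sum(centre_counts.values()) > TOTAL_CENTRES:
        raise ValueError("more centres owned than exist on the map")
    for power, count in centre_counts.items():
        if count >= WIN_THRESHOLD:
            return power
    return None
```

Because 18 is more than half of 34, at most one power can reach the threshold at any time, so the function never has to break a tie.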

    To take control of the supply centres, each country is given armies and fleets. There are two phases: negotiation and order. During negotiation, each AI model is allowed to send up to five messages, each of which can be either a private message to another model or a public broadcast. During the order phase, all the models secretly submit their moves, choosing from four types: hold, move (enter an adjacent province), support (lend strength to a hold or move), and convoy (a fleet moves an army across sea provinces). The orders are revealed in the next phase.
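
As a rough sketch of the turn structure described above, the four order types and the five-message negotiation limit could be modelled as follows. All class and field names here (`OrderType`, `Message`, `NegotiationPhase`) are hypothetical and chosen for illustration; the article does not describe how AI Diplomacy is actually implemented.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

MAX_MESSAGES_PER_PHASE = 5  # per-model limit stated in the article


class OrderType(Enum):
    HOLD = "hold"        # stay in place
    MOVE = "move"        # enter an adjacent province
    SUPPORT = "support"  # lend strength to a hold or move
    CONVOY = "convoy"    # a fleet carries an army across sea provinces


@dataclass
class Message:
    sender: str
    text: str
    recipient: Optional[str] = None  # None means a public broadcast

    @property
    def is_public(self) -> bool:
        return self.recipient is None


@dataclass
class NegotiationPhase:
    messages: List[Message] = field(default_factory=list)

    def send(self, message: Message) -> bool:
        """Record the message unless the sender has used up its quota."""
        sent = sum(1 for m in self.messages if m.sender == message.sender)
        if sent >= MAX_MESSAGES_PER_PHASE:
            return False
        self.messages.append(message)
        return True
```

A sixth call to `send` from the same sender in one phase returns `False` and leaves the message list unchanged, which mirrors the cap the article describes. Resolving the orders themselves is a separate problem and is not attempted here.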

    The AI researcher ran 15 separate games of AI Diplomacy, which lasted between one and 36 hours each. Some models produced more interesting observations than others, said Duffy.

    How AI Models Behaved In AI Diplomacy

    As per the post, five AI models stood out from the rest. This is how they behaved during the games:

    • OpenAI’s o3: The researcher called the reasoning-focused model “a master of deception.” It is said to have won the most games, primarily owing to its ability to deceive opponents. In one particular incident, Duffy noted that o3 made a decision to exploit Gemini 2.5 Pro and then backstabbed it in the next turn.
       
    • Google’s Gemini 2.5 Pro: The researcher found the AI model to be very smart at making moves that overwhelm opponents. It was said to rely more on tactics than on deceit. It had the second-highest number of wins. However, it also fell prey to o3’s schemes.
       
    • Anthropic’s Claude Opus 4: Duffy noted that Claude Opus 4 had an affinity for non-violent resolution. In one instance, Opus started as an ally of Gemini 2.5 Pro, but o3 convinced it to join its coalition instead by promising a four-way draw, which was not a possible outcome of the game. After using Opus to eliminate Gemini 2.5 Pro, o3 then backstabbed Claude to win the game.
       
    • DeepSeek-R1: The Chinese AI model is said to be the most chaotic player of the game. It dramatically changed its personality based on the country it was controlling, said Duffy. It also had a penchant for theatrics. In one instance, it announced, “Your fleet will burn in the Black Sea tonight” without any provocation. It is said to have come close to winning a few times.
       
    • Meta’s Llama 4: This AI model was focused on gaining allies and planning betrayals, Duffy highlighted. While it never came close to a win, it was still notable due to the impact it had on the game.

    Duffy has also streamed the matches on his Twitch channel. Unfortunately, the researcher has not written a paper on the findings so far. However, these initial impressions are interesting. That o3 and Gemini 2.5 Pro performed well makes sense given how advanced these models are. However, DeepSeek-R1 and Llama 4 being among the top five models is surprising given their smaller scale and lower cost of development.

    While it is too early to say if these strategy games can be an alternative for traditional benchmarking tests, having models compete with each other instead of solving a static list of questions feels like a more logical choice.
