LLM Deep-dive: Gemini

Matouš Eibich & Stefan Wendin
January 9, 2024

Has the king been dethroned? 

Since March 2023, GPT-4 has stood as the undisputed leader among Large Language Models, a significant leap ahead of its predecessors and a benchmark for new entrants. Competitors have often been judged successful if they merely managed to surpass GPT-3.5, which underlines how far ahead GPT-4 has been. Yet the recent announcement of Google's Gemini model could signal a change in this dynamic. Gemini's approach to multimodal processing, which integrates image, audio, video, and text data, sets it apart in the field of AI. Google reports that Gemini outperforms GPT-4 on several benchmarks, yet its introduction has been mired in controversy. Criticism has focused on Google's presentation: a demonstration video that overstated the model's capabilities, and a blog post that downplayed instances where GPT-4 still held the upper hand.

Training Data and Architecture

Gemini models are built on Transformer decoders, with enhancements in architecture and model optimization. According to the technical report, these improvements enable stable training at large scale and optimized performance on Google's Tensor Processing Units (TPUs). The models are trained to support a 32k-token context length.
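
To make "Transformer decoder" concrete, here is a minimal sketch of the causal masking that any decoder-style Transformer uses: each token may attend only to itself and to earlier tokens. This is a generic textbook illustration in plain Python, not Gemini's implementation, which Google has not published; the function name causal_attention_weights is our own.

```python
import math

def causal_attention_weights(scores):
    """Turn raw attention scores into causally masked attention weights.

    scores[i][j] is the raw score of token i attending to token j.
    Positions j > i (future tokens) are masked out, and a softmax is
    taken over the remaining positions in each row.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]           # token i sees tokens 0..i only
        peak = max(visible)                    # subtract max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (n - i - 1)
        weights.append(row)
    return weights

# Hypothetical example: three tokens with identical raw scores.
uniform = [[0.0] * 3 for _ in range(3)]
for row in causal_attention_weights(uniform):
    print(row)
```

With identical scores, the first token attends only to itself, the second splits its attention evenly over two tokens, and the third over three. The cost of computing such a score matrix grows with the square of the sequence length, which is one reason long contexts such as 32k tokens call for efficient attention implementations.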

A key aspect of Gemini's design is its multimodal training regimen, which combines images, audio, video, and text. This lets Gemini work with a broader range of inputs than traditional text-centric LLMs, making it a more versatile and adaptable model.

Gemini grading physics problem, source: Gemini technical report
Gemini helping with cooking an omelette, source: Gemini technical report

Different Versions of the Model

The Gemini model family is designed to cover a wide range of applications and computational needs, from complex reasoning to on-device use. It comes in three versions, Ultra, Pro, and Nano, each tailored to specific performance and deployment requirements.

Performance Benchmarks

According to Google, the Gemini Ultra model delivers exceptional performance across a wide range of benchmarks, which the company presents as a significant leap in AI capabilities.

However, readers are advised to consider these reported results in light of the controversies discussed later in this article, which call for a careful examination of Google’s claims and underline the need for independent verification of such benchmarks.

Controversy Surrounding the Model

The unveiling of Google's Gemini model has not been without controversy, which highlights how difficult it is to present and evaluate cutting-edge AI technologies fairly. Two major points of contention have drawn significant attention and critique from the AI community: a demonstration video that was edited and overstated the model's capabilities, and selective benchmark reporting that downplayed instances where GPT-4 still held the upper hand.

Comparison with GPT-4

The benchmarks released by Google suggest that Gemini may outperform GPT-4 in certain reasoning and math tasks, yet these results should be met with a healthy dose of skepticism. Given the recent controversies, including the use of an edited video and selective benchmark reporting, Google's credibility in presenting its model's capabilities has been called into question. Until these results are independently validated, it is prudent to reserve judgment and to consider the full context of Gemini's performance relative to GPT-4, including the broader discussion about accurate and unbiased evaluation of AI models.

Gemini Ultra vs GPT-4 on MMLU, source: Google


The introduction of Google's Gemini model to the competitive landscape of Large Language Models, with its advanced multimodal capabilities, is a noteworthy event. Its reported performance could be a game-changer if further evaluations uphold Google's claims. However, the model's true standing, particularly in comparison to GPT-4, will hinge on unbiased, independent validation in the coming months.
