There is a class of AI tools that never trends on social media but quietly determines whether the AI applications you use every day actually work well. Embedding models live in this unglamorous corner of machine learning, and yet they are the reason a chatbot can find the right document in a sea of thousands, why semantic search feels like it reads your mind, and why retrieval-augmented generation hallucinates less often than it otherwise would. Google just released two of them in quick succession, and the implications stretch well beyond developer circles.

Gemini Embedding 2 and EmbeddingGemma are not competing products — they solve different problems for different environments. Understanding what each one does, why it matters, and where it fits is worth your time whether you are building AI applications professionally or simply trying to make sense of where this technology is heading.
What Is an Embedding Model, and Why Should You Care?
Before diving into the specifics, a quick grounding in what embeddings actually do is worth the detour.
When you type a sentence into a search bar, a computer does not understand language the way you do. What it can understand is numbers. An embedding model takes your text, image, audio clip, or video and converts it into a long list of numbers called a vector. That vector is not random — it encodes meaning. Two sentences that mean the same thing will produce vectors that sit close together in mathematical space. Two sentences about completely unrelated topics will sit far apart.
This is the difference between keyword search and semantic search. Keyword search looks for exact word matches. Semantic search finds meaning, even when the words do not match at all. Ask a keyword system about “heart attack” and it might miss a document that says “myocardial infarction.” A good embedding model will know those two phrases belong together.
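To make "close together in mathematical space" concrete, here is a minimal sketch of cosine similarity, the standard closeness measure for embedding vectors. The three-dimensional vectors are made up for illustration; real embedding models output hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by the
    # product of their lengths. 1.0 means same direction (same meaning),
    # values near 0 mean unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (illustrative values, not real model output).
heart_attack = [0.82, 0.51, 0.10]           # "heart attack"
myocardial_infarction = [0.79, 0.55, 0.12]  # "myocardial infarction"
stock_market = [0.05, 0.20, 0.97]           # unrelated topic

print(cosine_similarity(heart_attack, myocardial_infarction))  # close to 1.0
print(cosine_similarity(heart_attack, stock_market))           # much lower
```

A good embedding model places the two medical phrases near each other even though they share no words, which is exactly what keyword matching cannot do.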

Every time you use a RAG-powered chatbot, a document retrieval system, a recommendation engine, or a smart filing tool, there is an embedding model quietly doing the heavy lifting. The quality of that model determines how useful the whole system feels.
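The retrieval step inside all of those systems reduces to the same loop: embed the query, then rank the stored document vectors by similarity. A sketch with made-up vectors standing in for real embedding model output:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Pretend these vectors came from an embedding model (values are illustrative).
documents = {
    "refund policy": [0.90, 0.10, 0.20],
    "shipping times": [0.20, 0.90, 0.10],
    "returns and exchanges": [0.85, 0.15, 0.25],
}
query_vector = [0.88, 0.12, 0.22]  # e.g. the embedding of "how do I get my money back?"

# Rank every document by similarity to the query -- the core of semantic retrieval.
ranked = sorted(documents, key=lambda name: cosine(query_vector, documents[name]), reverse=True)
print(ranked)
```

Production systems swap the linear scan for a vector database, but the ranking principle is the same.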
Gemini Embedding 2: One Space to Rule Them All
What Google Actually Built Here
Gemini Embedding 2 is Google’s first natively multimodal embedding model, mapping text, images, video, audio, and documents into a single unified embedding space — enabling cross-modal retrieval and classification across different types of media. It is available now in public preview through the Gemini API and Vertex AI.
That single embedding space is the critical phrase. Before Gemini Embedding 2, if you wanted to build a system that could search across both text documents and videos, you needed separate models for each modality and some layer of engineering glue to connect their outputs. The outputs lived in different mathematical spaces and did not meaningfully communicate with each other. Gemini Embedding 2 eliminates that friction.
The Modality Breakdown
The model supports up to 8,192 input tokens for text (four times the limit of its predecessor), up to six images per request in PNG and JPEG formats, videos up to 120 seconds long, and PDF documents up to six pages. Each of these media types lands in the same vector space, meaning you can run a search query as text and retrieve results that are videos, images, or documents — not because a separate system matched them by category, but because their meanings are geometrically close.
The audio capability deserves its own mention. The model processes audio natively without converting it to text first. Most previous approaches rely on a speech-to-text step in between, which tends to lose information along the way. Gemini Embedding 2 skips that entirely. This matters more than it sounds. Tone, pacing, and speaker emphasis carry meaning that transcripts silently erase. A native audio embedding preserves more of what the speaker actually communicated.
Interleaved Input — The Underrated Feature
Google calls one of its capabilities “interleaved input,” where developers can mix multiple modalities in a single request, like pairing an image with a text description. Google says this helps the model pick up on relationships between different media types better than embedding each one on its own.
Think about what this unlocks. A user uploads a photo of a product with a text description of what they are looking for. Instead of matching the image to product images and the text to product descriptions separately, the model takes both together and understands what the user wants as a unified intent. The result should be dramatically more accurate retrieval than any pipeline stitching two single-modal models together.
Performance Against Competitors
Google published benchmark comparisons against Amazon’s Nova 2 Multimodal Embeddings and Voyage Multimodal 3.5. The gap is widest in text and video tasks, where Gemini Embedding 2 reaches up to 68.8 points while Amazon Nova 2 lands at 60.3 and Voyage Multimodal 3.5 at 55.2. In text-image comparisons, Google also leads with 93.4 versus Amazon’s 84.0.
These are Google’s own published numbers, so a degree of healthy skepticism is appropriate. Independent benchmarks will tell a more complete story over time. That said, the reported margins are not small.
Where It Lives and What It Costs
The model is accessible via the Gemini API and Vertex AI. It also integrates with LangChain, LlamaIndex, Haystack, Weaviate, QDrant, ChromaDB, and Vector Search, which means developers already invested in those ecosystems can drop it into an existing pipeline without starting over. Pricing follows Vertex AI’s standard structure, which ties cost to token volume rather than a flat per-request fee.
EmbeddingGemma: Powerful AI That Fits in Your Pocket
The Other Release That Did Not Get Enough Attention
While Gemini Embedding 2 targets cloud-scale applications, EmbeddingGemma is solving a completely different problem. Designed specifically for on-device AI, its highly efficient 308 million parameter design lets developers build applications using techniques such as retrieval-augmented generation and semantic search that run directly on a user's own hardware — delivering private, high-quality embeddings that work anywhere, even without an internet connection.

The “without an internet connection” part is the real provocation here. Every cloud-based AI model you use requires a network round-trip — your query leaves your device, reaches a data center, gets processed, and a response comes back. EmbeddingGemma breaks that dependency. The model runs entirely on the device in your hand or on your desk.
Under 200MB and Built to Actually Fit on a Phone
EmbeddingGemma is a 308 million parameter multilingual text embedding model based on Gemma 3, optimized for use in everyday devices such as phones, laptops, and tablets. It runs on less than 200MB of RAM with quantization and delivers embeddings in under 22ms on EdgeTPU.
200MB of RAM for a capable multilingual embedding model is a genuinely remarkable number. For reference, many smartphone apps use more memory than that during normal operation. Getting a model that can meaningfully understand language across over 100 languages into that footprint required architectural decisions that are worth briefly examining.
The Architecture Behind the Efficiency
EmbeddingGemma builds on the Gemma 3 transformer backbone, but modified to use bidirectional attention instead of causal (one-way) attention. This means earlier tokens in the sequence can attend to later tokens, effectively turning the architecture from a decoder into an encoder. Encoder models can outperform LLMs on embedding tasks like retrieval.
To put this in plain terms: most large language models read text in one direction, generating each word without the ability to go back and reconsider earlier words in light of what came later. EmbeddingGemma reads the whole input at once and uses every part of the sentence to understand every other part. This makes it much better at capturing the meaning of a full passage rather than just predicting what comes next.
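The difference can be pictured as attention masks: a causal (decoder) mask only lets each position look at earlier positions, while a bidirectional (encoder) mask lets every position look at every other. This toy construction is purely illustrative, not EmbeddingGemma's actual implementation.

```python
def causal_mask(n):
    # Position i may attend only to positions 0..i (1 = allowed, 0 = blocked).
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # Every position may attend to every other position.
    return [[1 for _ in range(n)] for _ in range(n)]

for row in causal_mask(4):
    print(row)  # lower-triangular pattern: each token sees only its past

for row in bidirectional_mask(4):
    print(row)  # all ones: each token sees the whole sequence
```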
The model outputs a 768-dimensional vector by default, with smaller options available at 512, 256, or 128 dimensions via Matryoshka Representation Learning, which allows users to truncate the output and re-normalize for efficient, accurate representation. This flexibility matters in practice — a mobile app might want 128-dimensional vectors because they search faster and take less storage, while a server-side application can afford the full 768-dimensional precision.
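The truncate-and-renormalize step that MRL permits is simple enough to show directly. A sketch, where the short input vector is a made-up stand-in for a full 768-dimensional model output:

```python
import math

def truncate_embedding(vector, dims):
    # MRL-style truncation: keep the first `dims` values, which carry the
    # most important semantic information, then re-normalize to unit length
    # so cosine similarity still behaves correctly.
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.4, -0.2, 0.1, 0.3, 0.05, -0.1, 0.02, 0.08]  # stand-in for a 768-dim output
small = truncate_embedding(full, 4)
print(len(small))                   # 4
print(sum(x * x for x in small))    # ~1.0 after re-normalization
```

With a non-MRL model this kind of truncation would discard meaning unpredictably; MRL training is what makes the leading dimensions safe to keep on their own.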
What You Can Actually Build With This
EmbeddingGemma unlocks use cases like searching across personal files, texts, emails, and notifications simultaneously without an internet connection, building personalized offline chatbots through RAG with Gemma 3, and classifying user queries into the relevant function calls so mobile agents can act on them.
Offline personal search is the one that feels most immediately transformative. Imagine a notes app that actually understands what you wrote three months ago, a local search over your entire email history that finds conceptually related messages even when you cannot remember the exact words, or a mobile assistant that works on a plane without Wi-Fi. None of this requires sending your private data to a server.
EmbeddingGemma is already integrated with sentence-transformers, llama.cpp, MLX, Ollama, LiteRT, Transformers.js, LMStudio, Weaviate, Cloudflare, LlamaIndex, and LangChain. The tooling ecosystem arrived essentially at launch, which shortens the path from idea to working application considerably.
EmbeddingGemma vs Gemini Embedding 2: Which One Do You Actually Need?
This question comes up immediately, and it deserves a direct answer.
If you are building a cloud-hosted application that processes multiple media types — documents, images, audio, video — and you need top benchmark performance with the ability to handle interleaved inputs, Gemini Embedding 2 is your tool. It lives behind an API, scales horizontally with your infrastructure, and benefits from Google’s data center hardware. You pay per token.
If you are building something that must run locally, must protect user privacy, must work offline, or must run efficiently on a phone or laptop, EmbeddingGemma is the right choice. It is open source, freely available on Hugging Face, Kaggle, and Vertex AI Model Garden, and it runs on hardware most people already own.
The two models are not rivals. They occupy different parts of the stack. A sophisticated application might actually use both — EmbeddingGemma for fast, private, on-device retrieval and Gemini Embedding 2 for deeper, cloud-based cross-modal search when connectivity is available and the task demands it.
What Does Matryoshka Representation Learning Mean and Why Does It Keep Coming Up?
Both models use a technique called Matryoshka Representation Learning, abbreviated as MRL, and it appears frequently enough in the documentation that it deserves a proper explanation.
A standard embedding model produces a fixed-size vector, say 1536 numbers. If you want a smaller vector for efficiency, you have to train a completely separate model. MRL changes this by nesting information inside the vector hierarchically. The first 128 numbers of the vector capture the most important semantic meaning. The next 128 add more nuance. The next 256 add more still, all the way up to the full 768 or 1536.
This means a single trained model can serve many use cases. A fast mobile search might use only the first 128 dimensions. A high-stakes document retrieval system uses all 768. You are not making a binary choice between a good model and a fast model — you tune the tradeoff at inference time without retraining anything.
It is an elegant solution to a real problem that previously required maintaining multiple models at different quality tiers. The fact that both of Google’s new embedding releases ship with MRL built in suggests it has become a non-negotiable feature in serious embedding infrastructure.
The Bigger Picture: Why Google Is Moving Fast Here
It would be easy to view these releases as routine model updates, but the timing and scope suggest something more deliberate. The embedding model market has gotten genuinely competitive. Cohere, Voyage AI, Amazon, and open-source communities have all produced strong models in the past year. OpenAI’s text-embedding-3-large held significant mindshare until recently. The MTEB leaderboard, which benchmarks text embedding quality, has become a competitive arena that major labs watch closely.
Google’s Gemini embedding model has consistently secured the top position on the Massive Text Embedding Benchmark (MTEB) leaderboard, a leading industry text-embedding benchmark. Maintaining that position while also expanding into multimodal territory and launching an open-source on-device option is a multi-front move that covers both enterprise cloud customers and independent developers.
There is also a strategic angle around the AI application layer. As more developers build RAG-powered products — which is most of what enterprise AI looks like today — the embedding model becomes the foundation those products depend on. A developer who builds their search infrastructure on Gemini Embedding 2 is, for practical purposes, inside the Google AI ecosystem. Offering EmbeddingGemma as an open-source on-device alternative lowers the barrier to entering that ecosystem without requiring API spend from day one.
How to Get Started With Each Model
Getting Started With Gemini Embedding 2
The model is available through the Gemini API under the identifier gemini-embedding-2-preview. A basic call to embed text looks like this in Python:
from google import genai

client = genai.Client()
result = client.models.embed_content(
model="gemini-embedding-2-preview",
contents="What caused the 2008 financial crisis?"
)
print(result.embeddings)
For multimodal inputs, you pass a list of content parts combining text, images, or audio in a single request. Google’s Colab notebooks provide working examples for each modality combination.
Getting Started With EmbeddingGemma
EmbeddingGemma is on Hugging Face under google/embeddinggemma-300m. Using it via Sentence Transformers takes about five lines:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")
embeddings = model.encode([
"What are the symptoms of dehydration?",
"Signs that you need to drink more water."
])
print(embeddings.shape) # (2, 768) — two 768-dim vectors
Those two sentences will produce vectors that sit very close to each other because the model understands they mean the same thing. That is the whole value proposition, and it runs locally in under 200MB of RAM.
Frequently Asked Questions About Google’s New Embedding Models
Is Gemini Embedding 2 free to use?
It is available in public preview through the Gemini API and Vertex AI. Preview access may include free tiers, but production workloads at scale will incur costs based on token volume. Check the current Vertex AI pricing page for the most accurate numbers, as preview pricing can change at general availability.
Can EmbeddingGemma handle languages other than English?
Yes. EmbeddingGemma supports over 100 languages and is trained on wide linguistic data specifically for multilingual understanding. The MMTEB benchmark, which tests multilingual embedding performance, is where the model achieved its best-in-class ranking for models under 500 million parameters.
What is the difference between EmbeddingGemma and Gemini Embedding 2 in simple terms?
EmbeddingGemma runs on your device, handles text, works offline, and is free and open source. Gemini Embedding 2 runs in the cloud, handles text, images, audio, video, and documents together, delivers higher peak performance, and costs money at scale. One is a local tool; the other is a cloud service.
Can I fine-tune EmbeddingGemma for my specific domain?
EmbeddingGemma can be fine-tuned for specific domains. Google demonstrated this by fine-tuning it on the Medical Instruction and Retrieval Dataset (MIRIAD), producing a model that outperformed models twice its size on medical document retrieval. The fine-tuning quickstart notebook is available in the Gemma Cookbook repository.
Does Gemini Embedding 2 work with BigQuery?
Yes. Google announced a major expansion of text embedding capabilities in BigQuery ML, allowing developers to use the Gemini embedding model and over 13,000 open-source models directly within BigQuery using simple SQL commands.
What This Means for Anyone Building AI Products in 2025 and Beyond
The release of two embedding models solving two different problems at the same time signals something worth paying attention to. The “smart” layer of AI applications is migrating downward — not just to data centers but onto the devices people carry. EmbeddingGemma is early evidence that high-quality semantic understanding no longer requires cloud infrastructure. That has real consequences for privacy, latency, and accessibility.
Meanwhile, Gemini Embedding 2 is evidence that the cloud-side models are not standing still. The ability to unify text, audio, image, and video in a single vector space is not a marginal improvement over the previous generation. It is a qualitative shift in what you can retrieve and how you can query it.
For developers, the practical implication is worth sitting with. The embedding layer of your application is not just a commodity component you pick and forget. The model you choose shapes what your users can find, how accurately your system understands their intent, and what data modalities your product can meaningfully work with. Both of these releases raise the floor for what “good” looks like.
The search for meaning in unstructured data — documents, conversations, recordings, videos — has always been the harder problem underneath every AI application that actually ships. Google just made two very capable bets on how to solve it.