Deploying Google’s Gemma 3 (4B) on Cloud Run: A Practical Guide

Mar 12, 2025
https://www.youtube.com/watch?v=UU13FN2Xpyw

Introduction

Google has just released Gemma 3, its newest generation of open language models, and it’s making waves in the AI community. As the official announcement states:

“Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling.”

This is a significant leap forward from previous versions, with Google claiming it’s now “the top open compact model in LMArena, with a score of 1338” thanks to its optimized training process using distillation, reinforcement learning, and model merging.

In this post, we’ll dive into how to deploy the Gemma 3 (4B parameter) model on Google Cloud Run using Ollama, creating a production-ready API that you can integrate with your own applications. While Google offers several deployment options including their official Cloud Run tutorial, we’ll explore a streamlined approach with some practical optimizations.

The deployment approach outlined here uses a consolidated script that handles the entire process, from setting up the Google Cloud environment to deploying the Gemma 3 model on Cloud Run with GPU acceleration.

Prerequisites

Before we begin, make sure you have the following:

  1. Google Cloud Platform account with billing enabled
  2. gcloud CLI installed and configured
  3. Google Cloud project with the Cloud Run API and Artifact Registry API enabled (see the command after this list)
  4. Basic knowledge of Docker and APIs
  5. Excitement about working with cutting-edge AI technology (this one’s important!)
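
If the required APIs are not yet enabled on your project, a single command takes care of them. Cloud Build is included here on the assumption that you’ll build the container image with gcloud builds submit, as sketched later in this post:

# Enable the APIs used in this guide (safe to re-run)
gcloud services enable run.googleapis.com artifactregistry.googleapis.com cloudbuild.googleapis.com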

Google has made Gemma 3 available through multiple platforms including Hugging Face, Kaggle, and Ollama, which is what we’ll be using in this tutorial.

Understanding the Architecture

Our deployment consists of a single service:

Gemma Ollama Service: This service hosts the Gemma 3 (4B) model using Ollama, an open-source model serving framework.

The service is deployed on Google Cloud Run, which offers a serverless compute platform for containerized applications with automatic scaling and pay-per-use pricing.

The Deployment Script

To simplify the deployment process, we’ve created a consolidated script called belha-deploy.sh that handles the entire workflow. Here's an overview of what the script does:

#!/bin/bash
# Set environment variables
PROJECT_ID=$(gcloud config get-value project)
REGION="us-central1"
REPOSITORY="gemma-repo"
OLLAMA_SERVICE_NAME="ollama-gemma"
SERVICE_ACCOUNT="ollama-service"
# Set up the Google Cloud environment
gcloud config set run/region $REGION
# Create the Artifact Registry repository and service account;
# "|| true" lets the script continue if they already exist
gcloud artifacts repositories create $REPOSITORY --repository-format=docker --location=$REGION || true
gcloud iam service-accounts create $SERVICE_ACCOUNT || true
# Build and deploy Gemma Ollama service
# [Dockerfile and build commands]

The script creates all necessary resources with consistent naming, so they are easy to identify and manage.
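
The bracketed placeholder at the end of the script stands for the image build step. Here is a minimal sketch of what it might look like, assuming Cloud Build and the variables defined above; the actual contents of belha-deploy.sh may differ:

# Build the image with Cloud Build and push it to Artifact Registry.
# Assumes the Dockerfile shown in the next section is in the current directory.
IMAGE="$REGION-docker.pkg.dev/$PROJECT_ID/$REPOSITORY/$OLLAMA_SERVICE_NAME"
gcloud builds submit --tag "$IMAGE" .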

Deploying Gemma 3 (4B) with Ollama

The heart of our setup is the Gemma 3 (4B) model deployed using Ollama. The script creates a Docker container that:

  1. Uses the Ollama base image
  2. Sets appropriate environment variables
  3. Pulls the Gemma 3 (4B) model
  4. Exposes the service on port 8080

Here’s the Dockerfile definition used for the Gemma Ollama service:

# Ollama base image pinned to a known version
FROM ollama/ollama:0.6.0
# Listen on 8080, the port Cloud Run routes traffic to
ENV OLLAMA_HOST 0.0.0.0:8080
# Store model weights inside the image so they ship with the container
ENV OLLAMA_MODELS /models
ENV OLLAMA_DEBUG false
# Keep the model loaded in memory indefinitely
ENV OLLAMA_KEEP_ALIVE -1
ENV MODEL gemma3:4b
# Start the server in the background, wait for it to come up, then pull the
# model so the weights are baked into the image layer at build time
RUN ollama serve & sleep 5 && ollama pull $MODEL
ENTRYPOINT ["ollama", "serve"]

This container is then deployed to Cloud Run with the following configuration:

  • 8 CPUs
  • 32GB memory
  • CPU boost enabled
  • No public access (authentication required)
  • Appropriate timeout and concurrency settings
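
In gcloud terms, that configuration looks roughly like the command below. This is a sketch modeled on Google’s official Ollama-on-Cloud-Run tutorial rather than the script’s exact contents; the concurrency, timeout, and instance values are illustrative, and the GPU flags reflect the GPU acceleration mentioned in the introduction:

# Deploy the container with the resource settings listed above (illustrative values)
gcloud run deploy $OLLAMA_SERVICE_NAME \
  --image "$IMAGE" \
  --service-account $SERVICE_ACCOUNT \
  --cpu 8 \
  --memory 32Gi \
  --cpu-boost \
  --no-cpu-throttling \
  --no-allow-unauthenticated \
  --concurrency 4 \
  --timeout 600 \
  --gpu 1 --gpu-type nvidia-l4 \
  --max-instances 1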

Using the Deployed Gemma API

Once deployed, you can interact with the Gemma 3 (4B) model through the Ollama API. Here’s an example of generating text:

curl -X POST "https://your-gemma-service-url.a.run.app/api/generate" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $(gcloud auth print-identity-token)" \
-d '{"model": "gemma3:4b", "prompt": "Write a short poem about AI", "stream": false}'

The API response includes the generated text along with model metadata:

{
  "model": "gemma3:4b",
  "created_at": "2025-03-12T09:14:19.068446Z",
  "response": "A silent mind, of code and light,\nBorn of logic, sharp and bright.\nIt learns and grows, a mimic's grace,\nReflecting patterns, time and space.\n\nNo heart it holds, no soul to claim,\nJust algorithms, a digital name.\nA tool, a wonder, strange and new,\nThe ever-evolving, AI view.",
  "done": true
}
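
For multi-turn conversations, Ollama also exposes a chat endpoint that accepts a message history. A minimal example against the same deployed service:

curl -X POST "https://your-gemma-service-url.a.run.app/api/chat" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $(gcloud auth print-identity-token)" \
-d '{"model": "gemma3:4b", "messages": [{"role": "user", "content": "What is Cloud Run?"}], "stream": false}'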

Performance Considerations

When deploying Gemma 3 (4B) on Cloud Run, it’s important to understand the resource requirements and performance characteristics:

  1. Memory Usage: The 4B parameter model requires a significant amount of memory (32GB is recommended for optimal performance)
  2. CPU Requirements: We’ve allocated 8 CPUs with CPU boost to ensure low-latency responses
  3. Cost Optimization: Cloud Run’s pay-per-use model means you only pay for the time your service is handling requests
  4. Cold Start Times: The first request after inactivity may experience a delay as the container spins up
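
If cold starts matter for your use case, Cloud Run can keep an instance warm at the cost of paying for idle time; note that this changes the pure pay-per-use economics from point 3:

# Keep one instance warm to avoid cold starts (billed while idle)
gcloud run services update ollama-gemma --min-instances 1 --region us-central1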

Scaling and Security

Our deployment includes several best practices for scaling and security:

  1. Authentication: All API endpoints are protected using Google Cloud’s identity tokens (see the IAM example after this list)
  2. Service Account: A dedicated service account with minimal permissions
  3. Auto-scaling: Cloud Run automatically scales based on traffic
  4. Resource Limits: CPU and memory limits prevent runaway costs
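
Because the service is deployed without public access, each caller needs the Cloud Run Invoker role. Granting it looks like this, where the member email is a hypothetical placeholder for your own identity or service account:

# Allow a specific identity to call the service (placeholder member)
gcloud run services add-iam-policy-binding ollama-gemma \
  --member="user:dev@example.com" \
  --role="roles/run.invoker" \
  --region us-central1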

Integrating with Your Applications

The Gemma 3 model can be integrated into a variety of applications:

  1. Web Applications: Call the API from frontend JavaScript or backend servers
  2. Chat Interfaces: Use it to power conversational agents
  3. Content Generation: Generate articles, summaries, or creative text
  4. Code Assistance: Get help with programming tasks
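
During local development, you can skip manual identity-token handling by tunneling through gcloud, which proxies the authenticated service to localhost:

# Proxy the authenticated service to localhost (handles auth for you)
gcloud run services proxy ollama-gemma --port 9090 --region us-central1

# In another terminal, call the model without an Authorization header
curl -X POST "http://localhost:9090/api/generate" \
-H "Content-Type: application/json" \
-d '{"model": "gemma3:4b", "prompt": "Hello", "stream": false}'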

Conclusion

Gemma 3 represents a significant advancement in open language models, bringing capabilities that were previously only available in much larger, closed models. As Google notes in their release:

“Gemma’s pre-training and post-training processes were optimized using a combination of distillation, reinforcement learning, and model merging. This approach results in enhanced performance in math, coding, and instruction following.”

Deploying Gemma 3 (4B) on Cloud Run provides a powerful, scalable, and cost-effective way to leverage this advanced language model in your applications. By using the consolidated deployment script and Ollama, we’ve simplified what could otherwise be a complex process, allowing you to focus on building amazing applications with this cutting-edge technology.

The experience of interacting with a model that understands over 140 languages, can handle 128k context windows, and offers significantly improved reasoning capabilities — all running on your own infrastructure — is truly transformative for developers looking to build the next generation of AI-powered applications.

In future posts in this series, we’ll explore:

  • Fine-tuning Gemma 3 for specific domains
  • Performance optimization and cost management techniques
  • Advanced prompt engineering for Gemma models
  • Exploring Gemma 3’s multimodal capabilities

Written by Timothy

Software / DevOps Engineer | Google Developer Expert for Cloud | https://timtech4u.dev/