Try Gemini 1.5 models, the latest multimodal models in Vertex AI, and see what you can build with up to a 2M token context window.

Quota limits for legacy models

This page gives you information about quotas and limits of legacy models. The models in a legacy model family are no longer updated with new stable versions. For details, see Legacy model information.

Google Cloud uses quotas to help ensure fairness and reduce spikes in resource use and availability. A quota restricts how much of a Google Cloud resource your Google Cloud project can use. Quotas apply to a range of resource types, including hardware, software, and network components. For example, quotas can restrict the number of API calls to a service, the number of load balancers used concurrently by your project, or the number of projects that you can create. Quotas protect the community of Google Cloud users by preventing the overloading of services. Quotas also help you to manage your own Google Cloud resources.

The Cloud Quotas system does the following:

Monitors your consumption of Google Cloud products and services
Restricts your consumption of those resources
Provides a means to request changes to the quota value

In most cases, when you attempt to consume more of a resource than its quota allows, the system blocks access to the resource, and the task that you're trying to perform fails.

Quotas generally apply at the Google Cloud project level. Your use of a resource in one project doesn't affect your available quota in another project. Within a Google Cloud project, quotas are shared across all applications and IP addresses.

Quotas by region and model

The requests per minute (RPM) quota applies to a base model and all versions, identifiers, and tuned versions of that model. For example, a request to text-bison and a request to text-bison@002 are counted as two requests toward the RPM quota of the base model, text-bison. The same applies to tuned models, so a request to chat-bison@002 and a tuned model based on chat-bison@002 named my-tuned-chat-model are counted as two requests toward the base model, chat-bison.
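The counting rule above can be illustrated with a small sketch. It is not how the quota system is implemented. It assumes that a version suffix follows an `@` sign, as in the examples above, and that you keep your own lookup table (the hypothetical `TUNED_FROM`) from tuned model names to the model they were tuned from.

```python
from collections import Counter

# Hypothetical lookup table: tuned model name -> the versioned model it was tuned from.
TUNED_FROM = {"my-tuned-chat-model": "chat-bison@002"}


def base_model(model_id: str) -> str:
    """Resolve a model identifier to the base model whose RPM quota it consumes."""
    # A tuned model counts against the model it was tuned from.
    model_id = TUNED_FROM.get(model_id, model_id)
    # Drop the version suffix, for example "text-bison@002" -> "text-bison".
    return model_id.split("@", 1)[0]


def count_by_base_model(requested_models):
    """Tally requests per base model, the unit the RPM quota applies to."""
    return Counter(base_model(model_id) for model_id in requested_models)


requests_sent = ["text-bison", "text-bison@002", "chat-bison@002", "my-tuned-chat-model"]
usage = count_by_base_model(requests_sent)
print(usage)
```

In this sketch the four requests are tallied as two against text-bison and two against chat-bison, matching the examples in the paragraph above.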

The quotas apply to Generative AI on Vertex AI requests for a given Google Cloud project and supported region.

To view the quotas in the Google Cloud console, do the following:

In the Google Cloud console, go to the IAM & Admin Quotas page.

In the Filter field, specify the dimension or metric.
- Dimension: The model identifier. For example, base_model:gemini-1.0-pro or base_model:text-bison.
- Metric: The quota identifier for PaLM 2 models is aiplatform.googleapis.com/online_prediction_requests_per_base_model


Batch quotas

The following quotas and limits are the same across the regions for Generative AI on Vertex AI batch prediction jobs:

Quota	Value
`text_bison_concurrent_batch_prediction_jobs`	4
`code_bison_concurrent_batch_prediction_jobs`	4
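One way to stay within the concurrent job quota is to plan submissions in groups no larger than the limit. The following is a minimal sketch of that planning step only. The job names are placeholders, the helper `plan_waves` is hypothetical, and the code does not call any Vertex AI API. It also assumes that you wait for one group to finish before starting the next, which is more conservative than the quota requires.

```python
# Value from the table above; it applies separately to text-bison and code-bison jobs.
MAX_CONCURRENT_BATCH_JOBS = 4


def plan_waves(job_names, limit=MAX_CONCURRENT_BATCH_JOBS):
    """Split job names into consecutive groups of at most `limit` jobs each."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [job_names[i:i + limit] for i in range(0, len(job_names), limit)]


# Placeholder names for ten batch prediction jobs against the same model.
jobs = [f"batch-job-{n}" for n in range(10)]
waves = plan_waves(jobs)
for index, wave in enumerate(waves, start=1):
    print(f"wave {index}: {wave}")
```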

Custom-trained model quotas

The following quotas apply to Generative AI on Vertex AI tuned models for a given project and region:

Quota	Supported region *	Value
Restricted image training TPU V3 pod cores per region	europe-west4	64
Restricted image training Nvidia A100 80GB GPUs per region	us-central1	8
Restricted image training Nvidia A100 80GB GPUs per region	us-east4	2

* Tuning scenarios have accelerator reservations in specific regions. Quotas for tuning are supported and must be requested in specific regions.

Online evaluation quotas

The online evaluation service uses the text-bison model as an autorater, together with Google-proprietary prompts and mechanisms, to ensure consistent and objective evaluation for model-based metrics.

A single evaluation request for a model-based metric might result in multiple underlying requests to the online prediction service. Each model's quota is calculated on a per-project basis, which means that all requests directed to the text-bison model, whether for model inference or for model-based evaluation, count toward the same quota. Quotas differ from model to model. The following table shows the quota for the evaluation service and the quota for the underlying autorater model.

Request quota	Default quota
Online evaluation service requests per minute	1,000 requests per project per region
Online prediction requests per minute per base model (base_model: `text-bison`)	1,600 requests per project per region
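Because evaluation traffic and ordinary inference traffic draw on the same text-bison quota, it can help to estimate the combined load before you run an evaluation. The following is a rough sketch of that arithmetic. The number of autorater calls per evaluation request is not stated on this page, so `autorater_calls_per_eval` is an assumption that you would need to measure for your own metrics.

```python
EVAL_SERVICE_RPM = 1_000  # default online evaluation service quota from the table above
TEXT_BISON_RPM = 1_600    # default text-bison online prediction quota from the table above


def text_bison_rpm_needed(eval_rpm, autorater_calls_per_eval, inference_rpm):
    """Estimate text-bison requests per minute from evaluation plus inference traffic."""
    return eval_rpm * autorater_calls_per_eval + inference_rpm


def fits_default_quotas(eval_rpm, autorater_calls_per_eval, inference_rpm):
    """Return True if the estimated load stays within both default quotas."""
    needed = text_bison_rpm_needed(eval_rpm, autorater_calls_per_eval, inference_rpm)
    return eval_rpm <= EVAL_SERVICE_RPM and needed <= TEXT_BISON_RPM


# Assumed workload: 300 evaluation requests and 500 inference requests per minute.
print(fits_default_quotas(300, autorater_calls_per_eval=3, inference_rpm=500))
print(fits_default_quotas(300, autorater_calls_per_eval=4, inference_rpm=500))
```

With these assumed numbers, a fan-out of 3 needs 1,400 text-bison requests per minute and fits, while a fan-out of 4 needs 1,700 and exceeds the default quota.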

If you receive an error related to quotas while using the evaluation online service, you might need to file a quota increase request. See View and Manage Quotas for more information.

Limit	Value
Online evaluation service request timeout	60 seconds

The first time you use the online evaluation service in a new project, you might experience an initial setup delay, generally of up to two minutes. This is a one-time process. If your first request fails, wait a few minutes and then retry. Subsequent evaluation requests typically complete within 60 seconds.
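The wait-and-retry advice above can be sketched as a small helper. This is an illustration under assumptions, not part of any Vertex AI client library: `call_with_retry` and `fake_evaluation_request` are hypothetical names, the 120-second wait is one reading of "a few minutes", and a real caller would catch the client's specific error type instead of `Exception`.

```python
import time


def call_with_retry(send_request, attempts=3, wait_seconds=120.0, sleep=time.sleep):
    """Call `send_request`, waiting `wait_seconds` between failed attempts."""
    last_error = None
    for attempt in range(attempts):
        try:
            return send_request()
        except Exception as error:  # assumption: narrow this to the real client error
            last_error = error
            if attempt < attempts - 1:
                sleep(wait_seconds)
    raise last_error


# Demo with a stand-in request that fails once, as a first call in a new project might.
calls = {"count": 0}


def fake_evaluation_request():
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("service still initializing")
    return "ok"


# Record the requested waits instead of sleeping so that the demo finishes immediately.
waits = []
result = call_with_retry(fake_evaluation_request, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the helper testable. In real use, leave the default so that the helper pauses between attempts.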

For model-based metrics, the maximum number of input and output tokens is limited by the model used as the autorater. See Model information for the limits of the relevant models.

Pipeline evaluation quotas

If you receive an error related to quotas while using the evaluation pipelines service, you might need to file a quota increase request. See View and Manage Quotas for more information.

The evaluation pipelines service uses Vertex AI Pipelines to run PipelineJobs. See relevant quotas for Vertex AI Pipelines. The following are general quota recommendations:

Service	Quota	Recommendation
Vertex AI API	Concurrent LLM batch prediction jobs per region	Pointwise: 1 * num_concurrent_pipelines; Pairwise: 2 * num_concurrent_pipelines
Vertex AI API	Evaluation requests per minute per region	1000 * num_concurrent_pipelines

Additionally, when calculating model-based evaluation metrics, the autorater might hit quota issues. The relevant quota depends on which autorater was used:

Tasks	Quota	Base model	Recommendation
`summarization`, `question_answering`	Online prediction requests per base model per minute per region	`text-bison`	60 * num_concurrent_pipelines
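The recommendations in the two tables above are simple multiples of the number of concurrent pipelines, so you can compute them before you file a quota increase request. The following is a minimal sketch of that arithmetic. The function name and dictionary keys are illustrative labels, not official quota identifiers.

```python
def recommended_quotas(num_concurrent_pipelines, pairwise=False, uses_text_bison_autorater=True):
    """Compute the recommended quota values for a number of concurrent pipelines."""
    if num_concurrent_pipelines < 1:
        raise ValueError("num_concurrent_pipelines must be at least 1")
    n = num_concurrent_pipelines
    recommendation = {
        # Pointwise evaluation needs 1 batch job per pipeline, pairwise needs 2.
        "concurrent_llm_batch_prediction_jobs": (2 if pairwise else 1) * n,
        "evaluation_requests_per_minute": 1000 * n,
    }
    if uses_text_bison_autorater:
        # Applies to the summarization and question_answering tasks.
        recommendation["text_bison_online_prediction_requests_per_minute"] = 60 * n
    return recommendation


# Example: three pairwise evaluation pipelines running at the same time.
for name, value in recommended_quotas(3, pairwise=True).items():
    print(f"{name}: {value}")
```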

Vertex AI Pipelines

Each tuning job uses Vertex AI Pipelines. For more information, see Vertex AI Pipelines quotas and limits.

Quota increases

If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see Work with quotas.

What's next

Learn about Vertex AI quotas and limits.
Learn more about Google Cloud quotas and limits.