No technology has dominated recent headlines more than AI, most notably large language models (LLMs) such as ChatGPT. A critical enabler of LLMs is powerful clusters of GPUs, which are used to train AI models to classify content and predict new content with a reasonable degree of accuracy.

Training is the process by which an AI model learns how to respond correctly to users' queries. Inference is the process by which a trained AI model responds to users' queries.

These large clusters of GPUs are expensive to procure, deploy and operate, and are too costly for many enterprises. Cloud services offer alternative, simpler ways to train and use generative AI, albeit with compromises on control and customization. In the future, most enterprises will use multiple deployment methods for training and inference depending on the application’s specific requirements.

Table 1 shows a summary of possible deployment methods for AI workloads broken out by inference and training. The columns represent different deployment methods. The rows represent layers of workload management and show which party, either the customer (C) or the provider (P), is responsible for each layer.

The methods toward the right of Table 1 are less customizable, but easier to implement. The methods toward the left are more customizable, but harder to implement. This report provides an overview of these different deployment methods.

Table 1 Deployment methods for AI training and inference workloads
Source: Uptime Institute

Training deployment methods

Provider-trained

In a provider-trained method, the cloud provider manages the model's training and inference. There are two variations:

Software as a service (SaaS). The provider delivers an end-user web application. End users interact with the application for inference. Customers are usually charged a flat fee per user, as in the case of LLM-based applications such as ChatGPT and Microsoft 365 Copilot, and the AI image generator Midjourney.

Platform as a service (PaaS). The provider delivers access to AI services via application programming interface (API) calls. The customer architects an application that uses these API calls in its underlying code to access AI capabilities. For example, an airline's website includes a chatbot for customer support. The airline manages the application, which sends and receives requests to and from a chatbot service offered by a cloud provider and configured by the customer.

Customers are usually charged according to the number and size of requests and responses. Common PaaS services include language translation (e.g., Google Cloud Translation API), speech-to-text transcription (e.g., Microsoft Azure Speech to Text), chatbots (e.g., Amazon Lex), image recognition (e.g., Amazon Rekognition) and sentiment analysis (e.g., IBM Watson Natural Language Understanding).
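As an illustration of the PaaS pattern, the sketch below shows how an application (such as the airline chatbot described above) might forward a user's message to a provider-hosted bot service. It uses Amazon Lex V2 via the boto3 SDK; the bot, alias and session identifiers are placeholders, and the surrounding application logic is assumed.

```python
import boto3

# Runtime client for Amazon Lex V2 (the provider-hosted chatbot service)
lex = boto3.client("lexv2-runtime", region_name="us-east-1")

def ask_support_bot(user_message: str, session_id: str) -> str:
    """Forward an end-user message to the provider's bot and return its reply."""
    response = lex.recognize_text(
        botId="EXAMPLEBOTID",        # placeholder: ID of the customer-configured bot
        botAliasId="EXAMPLEALIAS",   # placeholder: alias (deployment stage) of the bot
        localeId="en_US",
        sessionId=session_id,        # keeps conversational context per user
        text=user_message,
    )
    # The provider returns one or more messages generated by the bot
    messages = response.get("messages", [])
    return " ".join(m.get("content", "") for m in messages)
```

The customer is billed by the provider for each such request and response; the application itself, and the infrastructure it runs on, remain the customer's responsibility.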

A provider may offer both PaaS and SaaS services based on the same model. For example, OpenAI provides inference through its own web front end, ChatGPT (SaaS), and through the OpenAI API (PaaS).
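To make the distinction concrete, the same underlying model that end users reach through the ChatGPT web application can also be reached programmatically from a customer's own code. A minimal sketch using the OpenAI Python SDK is shown below; the model name is illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# The customer's application embeds this call in its own code (the PaaS pattern);
# end users never see the API, only the customer's application.
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": "Summarize our refund policy in one sentence."}],
)
print(response.choices[0].message.content)
```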

In both SaaS and PaaS, the provider:

  • Trains the model. It provisions the data center space, sets up a hardware cluster, deploys the model to the cluster, and then executes middleware to train the model using training data. It retrains the model periodically.
  • Manages inference on the model. The provider provisions the data center space, sets up appropriate hardware, deploys the model to the hardware, and then runs inference queries on this model.

In both cases, the cloud provider owns and updates the model. The customer can configure the PaaS and SaaS to a limited degree, but this customization does not impact the underlying model. There is a single model accessed by many customers.

PaaS and SaaS are the easiest methods for accessing AI-based functionality, but offer the least customization.

Self-trained

In a self-trained generative AI deployment, the customer can mix and match different methods for training and inference.

Customers may choose to develop a model from scratch or use a foundation model. A foundation model is an AI model that has been pre-trained by a software vendor. Examples include Google’s Gemini, Anthropic’s Claude, IBM’s Granite, and OpenAI’s GPT-4.

The value of a foundation model is that it provides a starting point for further development, avoiding the complexity and cost of training a model from scratch. The foundation model has been pre-trained for general-purpose use; it is the customer's fine-tuning of the foundation model that makes the tuned model suitable for a particular use case. The customer is responsible for procuring the model, deploying it to infrastructure, and fine-tuning it into a tuned model.
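As a minimal sketch of what fine-tuning involves, the example below uses the Hugging Face transformers and datasets libraries (one of several possible toolchains) to adapt a pre-trained causal language model with the customer's own text data. The model identifier and data file are placeholders; a production fine-tune would also involve evaluation, checkpointing and considerably more data and compute.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE_MODEL = "example-org/foundation-model"   # placeholder: the procured foundation model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # some base models ship without a pad token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# The customer's domain-specific training data (placeholder path)
dataset = load_dataset("text", data_files={"train": "customer_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tuned-model", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()                      # fine-tune the foundation model on customer data
trainer.save_model("tuned-model")    # the resulting tuned model is held by the customer
```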

Usually, the intellectual property of a model that a customer has fine-tuned from a foundation model for a specific use case belongs to that customer.

Customers have several options for training:

Total control. The customer uses an on-premises or colocation data center to host a dedicated server cluster. The customer trains a model from scratch, or retrains a foundation model, using its own choice of middleware and training data.

There are costs associated with hardware, power, data center space, other infrastructure, and specialist skills. Generative AI training generally requires GPU clusters that can cost millions of dollars to purchase, with the high power consumption of clusters also driving operating costs.

This is the most customizable and controllable training method, but it is often costly and complex. Foundation models are usually created by hyperscalers or large software vendors using this method and then provided to customers for fine-tuning.
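As a sketch of the "own choice of middleware" aspect of this method, the example below uses PyTorch's DistributedDataParallel to spread training across the GPUs of a cluster node; the model, data and hyperparameters are placeholders standing in for a real generative AI workload.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train():
    # One process per GPU; launched with `torchrun --nproc_per_node=<gpus> train.py`
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)     # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    for step in range(100):                                   # placeholder training loop
        inputs = torch.randn(32, 1024, device=local_rank)     # placeholder training data
        targets = torch.randn(32, 1024, device=local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()                                        # gradients synchronized across GPUs
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    train()
```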

Total control, apart from the infrastructure. The customer uses cloud infrastructure as a service (IaaS), including CPUs (e.g., AWS EC2), GPUs (e.g., Alibaba Cloud Elastic GPU Service) and storage (e.g., Oracle Cloud Object Storage), to train their own model using their own choice of middleware and training data. Costs depend on the capacity consumed and the period for which it is used. Customers may also need to pay for storage, data transfer and transactions.
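As an illustration of the IaaS approach, the sketch below uses the AWS boto3 SDK to launch a GPU instance on which the customer would then install and run its own training middleware. The machine image, key pair and instance type are placeholders; equivalent APIs exist at other IaaS providers.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a single GPU instance; the customer installs its own training
# middleware (e.g., PyTorch) on it and pays for the time it is running.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder: a deep learning machine image
    InstanceType="p4d.24xlarge",       # illustrative GPU instance type
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",             # placeholder SSH key pair
)
print(response["Instances"][0]["InstanceId"])
```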

Use training as a service. The customer uses a development platform that automates and simplifies some aspects of training. For example, the platform may manage the distribution of the model to IaaS resources, scale these resources to meet requirements, and provide access to foundation models from a library. There are costs associated with the consumption of on-demand cloud services for both model training and the development platform itself. Foundation models may be priced by the number of tokens (units of text, such as parts of words) fed into and produced by the model. Examples of platforms include Amazon SageMaker, Google Vertex AI and Microsoft Azure Machine Learning.
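As a sketch of the training-as-a-service pattern, the example below uses the Amazon SageMaker Python SDK to run a customer-supplied training script on managed GPU instances. The script, IAM role, instance type, framework version and data location are placeholders; Google Vertex AI and Azure Machine Learning offer analogous constructs.

```python
from sagemaker.pytorch import PyTorch

# The platform provisions the instances, distributes the customer's training
# script to them and releases the capacity when the job completes.
estimator = PyTorch(
    entry_point="train.py",                                 # placeholder: customer's training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",    # placeholder IAM role
    instance_count=2,                                        # scale out across managed instances
    instance_type="ml.p4d.24xlarge",                         # illustrative GPU instance type
    framework_version="2.1",                                 # assumed supported PyTorch version
    py_version="py310",
)

# Training data staged in object storage (placeholder bucket)
estimator.fit({"training": "s3://example-bucket/training-data/"})
```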

Inference. Depending on the training method and software used, the trained model can be deployed to various locations for inference. A model trained using one method can often be deployed for inference using a different method. As inference is a component of a larger application, it is usually deployed close to the rest of the application’s resources to reduce latency. The trained model is a package of code and weights that calculates the output (a classification or prediction) from any input, such as a user query.

Deployment to dedicated equipment

Training often requires powerful servers running GPUs. However, inference can be optimized and simplified to run on less powerful devices, including standard servers, end-user devices, custom application-specific integrated circuits (ASICs), and edge devices. In this method, the customer manages the inference hardware, middleware, and model.
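As an illustration, a trained model is often exported to a portable format such as ONNX and executed on modest hardware with a lightweight runtime. The sketch below assumes a model already exported to model.onnx with a single input tensor; the input shape is a placeholder.

```python
import numpy as np
import onnxruntime as ort

# Load the exported model; ONNX Runtime can run on CPUs, edge devices and
# accelerators without the full training stack being installed.
session = ort.InferenceSession("model.onnx")

input_name = session.get_inputs()[0].name
sample = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder input (e.g., an image)

outputs = session.run(None, {input_name: sample})            # run one inference query
print(outputs[0])                                            # classification or prediction scores
```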

Deployment to IaaS

For inference, machine learning models can be installed on IaaS. A virtual machine typically hosts the code that executes the inference, with access to GPU or ASIC resources if required. The customer is responsible for scaling the infrastructure with changing demand. In addition to virtual machine costs, there may also be costs related to storage, data transfer, and transactions.
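As a sketch of this arrangement, the inference code might be wrapped in a small web service running on the virtual machine, as below. The model-loading and prediction logic are placeholders, and the customer remains responsible for scaling the VMs behind the service.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    features: list[float]   # placeholder input schema

# In practice the trained model's weights would be loaded here, e.g., from object storage.
def predict(features: list[float]) -> float:
    return sum(features) / len(features)   # placeholder for the real model's forward pass

@app.post("/predict")
def predict_endpoint(request: PredictRequest):
    # The application calls this endpoint; the VM (with GPU or ASIC access if
    # required) executes the inference and returns the result.
    return {"prediction": predict(request.features)}

# Run on the VM with, for example: uvicorn inference_service:app --host 0.0.0.0 --port 8000
```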

Deployment to serverless, inference as a service

A serverless platform allows customers to execute code without control of (or visibility into) the underlying infrastructure. The customer provides the cloud operator with the code; the cloud operator takes responsibility for executing that code in line with rules and triggers set up by the customer.

In a serverless inference service, the cloud provider will run the model, usually responding to API queries from the customer’s application to the serverless platform. Customers are charged for these queries and the model's storage. Examples include Amazon SageMaker Serverless Inference and Azure AI Model Inference.
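As a sketch, a customer application might call a serverless inference endpoint as follows. The endpoint name and payload format are placeholders; Azure offers an analogous invocation pattern for its serverless inference services.

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

# The provider provisions capacity on demand; the customer is billed per
# request (and for model storage), not for idle infrastructure.
response = runtime.invoke_endpoint(
    EndpointName="example-serverless-endpoint",      # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps({"features": [0.2, 0.4, 0.6]}),  # placeholder inference payload
)
print(json.loads(response["Body"].read()))
```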