How to deploy a model in Python using TensorFlow Serving?

Deploying machine learning models is crucial for making AI applications functional in production environments. TensorFlow Serving provides a robust, high-performance solution for serving trained models efficiently to handle real-time requests.

In this article, we will explore how to deploy a TensorFlow model using TensorFlow Serving, from installation to testing the deployed model.

What is TensorFlow Serving?

TensorFlow Serving is a flexible, high-performance serving system for machine learning models designed for production environments. It allows you to deploy new algorithms and experiments while keeping the same server architecture and APIs.

Installation and Setup

Installing TensorFlow Serving

Install the TensorFlow Serving API using pip:

pip install tensorflow-serving-api

Installing TensorFlow Serving via Docker

For a complete setup, pull the TensorFlow Serving image with Docker:

docker pull tensorflow/serving

Preparing and Saving Your Model

Before deployment, save your trained model in the SavedModel format that TensorFlow Serving understands:

import tensorflow as tf
from tensorflow import keras
import numpy as np

# Create a simple model for demonstration
model = keras.Sequential([
    keras.layers.Dense(10, activation='relu', input_shape=(4,)),
    keras.layers.Dense(3, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Generate sample training data
X_train = np.random.random((100, 4))
y_train = np.random.randint(0, 3, (100,))

# Train the model
model.fit(X_train, y_train, epochs=5, verbose=0)

# Save the model in SavedModel format
model_path = "./saved_model/1"
tf.saved_model.save(model, model_path)
print(f"Model saved to {model_path}")
Running this prints only the final message, since verbose=0 suppresses the per-epoch training output:

Model saved to ./saved_model/1
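Before pointing the server at the export, it can be worth a quick sanity check: a valid SavedModel directory contains a saved_model.pb file and a variables/ subdirectory. A minimal helper (the function name is our own) might look like this:

```python
import os

def is_saved_model_dir(path):
    """Return True if `path` looks like a SavedModel export directory."""
    return (os.path.isfile(os.path.join(path, "saved_model.pb"))
            and os.path.isdir(os.path.join(path, "variables")))
```

For example, `is_saved_model_dir("./saved_model/1")` should return True after the export above succeeds.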

Starting TensorFlow Serving

Start the TensorFlow Serving server using Docker, publishing both the REST port (8501) and the gRPC port (8500, used in the gRPC example below):

docker run -p 8501:8501 -p 8500:8500 \
  --mount type=bind,source=$(pwd)/saved_model,target=/models/my_model \
  -e MODEL_NAME=my_model \
  -t tensorflow/serving

Making Predictions via REST API

Once the server is running, you can make predictions using the REST API:

import requests
import json
import numpy as np

# Prepare sample input data
input_data = np.random.random((1, 4)).tolist()

# Create the request payload
data = {
    "signature_name": "serving_default",
    "instances": input_data
}

# Send POST request to TensorFlow Serving
url = "http://localhost:8501/v1/models/my_model:predict"
response = requests.post(url, data=json.dumps(data))

if response.status_code == 200:
    predictions = response.json()["predictions"]
    print("Prediction:", predictions[0])
else:
    print("Error:", response.status_code, response.text)
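Besides the :predict endpoint, TensorFlow Serving's REST API exposes a model-status endpoint (GET /v1/models/{model}, optionally /versions/{n}) that is useful for debugging and health checks. A small URL builder, assuming the default host and port from the Docker command above (the helper function itself is our own sketch):

```python
def model_status_url(model_name, host="localhost:8501", version=None):
    """Build the TensorFlow Serving REST model-status URL."""
    base = f"http://{host}/v1/models/{model_name}"
    if version is not None:
        base += f"/versions/{version}"
    return base
```

With the server running, `requests.get(model_status_url("my_model")).json()` returns the state of each loaded version (e.g. AVAILABLE).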

Making Predictions via gRPC

For better performance, use the gRPC protocol:

import grpc
import numpy as np
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
import tensorflow as tf

# Create gRPC channel
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Prepare request
request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'
request.model_spec.signature_name = 'serving_default'

# Add input data (the input key depends on your model's signature;
# inspect it with: saved_model_cli show --dir saved_model/1 --all)
input_data = np.random.random((1, 4)).astype(np.float32)
request.inputs['dense_input'].CopyFrom(
    tf.make_tensor_proto(input_data, shape=input_data.shape))

# Get prediction (the output key also depends on the model's signature)
response = stub.Predict(request, 10.0)  # 10-second timeout
output = tf.make_ndarray(response.outputs['dense_1'])
print("Prediction:", output[0])

Model Versioning

TensorFlow Serving supports model versioning by organizing models in numbered directories:

saved_model/
├── 1/          # Version 1
│   ├── saved_model.pb
│   └── variables/
└── 2/          # Version 2
    ├── saved_model.pb
    └── variables/
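By default the server automatically loads and serves the highest-numbered version it finds, so deploying a new model is just a matter of exporting into the next numbered directory. A helper that computes that path (our own sketch) could be:

```python
import os

def next_version_dir(base_dir):
    """Return the path for the next numeric model version under base_dir."""
    if os.path.isdir(base_dir):
        versions = [int(d) for d in os.listdir(base_dir) if d.isdigit()]
    else:
        versions = []
    return os.path.join(base_dir, str(max(versions, default=0) + 1))
```

For example, with versions 1/ and 2/ already present, `next_version_dir("./saved_model")` yields "./saved_model/3", which can be passed to tf.saved_model.save().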

Comparison of Serving Methods

Method            Protocol   Performance        Best For
REST API          HTTP       Good               Web applications, debugging
gRPC              gRPC       Excellent          High-throughput applications
TensorFlow Lite   N/A        Mobile-optimized   Mobile and edge devices

Monitoring and Scaling

For production deployments, consider these strategies:

  • Load Balancing: Use multiple TensorFlow Serving instances behind a load balancer
  • Containerization: Deploy using Docker and Kubernetes for scalability
  • Monitoring: Track metrics like latency, throughput, and error rates
  • Health Checks: Implement health check endpoints for monitoring
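Request batching is another useful scaling lever: TensorFlow Serving can group concurrent requests into a single batch when started with --enable_batching and a batching parameters file. A minimal configuration sketch (the values here are illustrative and should be tuned to your workload):

```
max_batch_size { value: 32 }
batch_timeout_micros { value: 1000 }
max_enqueued_batches { value: 100 }
num_batch_threads { value: 4 }
```

The file is passed to the server with --batching_parameters_file, alongside --enable_batching.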

Conclusion

TensorFlow Serving provides a robust solution for deploying machine learning models in production. Use REST API for web applications and gRPC for high-performance scenarios. Proper model versioning and monitoring ensure reliable production deployments.

Updated on: 2026-03-27T07:30:08+05:30
