ML model optimization using AWS Neuron SDK


I had a chance to work with AWS Neuron SDK to optimize ML models to enable running inference endpoints Inferentia based instances. In this post, I give some insights about Neuron and how to optimize models using Neuron SDK.

A quick intro about Neuron SDK, AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and AWS Trainium based instances. It supports customers in their end-to-end ML development lifecycle to build new models, train and optimize these models, and then deploy them for production. 1

AWS Neuron is tailored to leverage the full potential of AWS Inferentia chips, which are custom hardware accelerators for machine learning inference. By using Neuron, you can achieve high-throughput and low-latency inference performance, which is crucial for real-time applications.

It support popular machine learning frameworks such as TensorFlow, PyTorch, and MXNet. It’s easy to deploy models developed in these frameworks onto Inferentia chips without significant changes to your existing codebase.

Running inference endpoints on AWS Inferentia chips can be more cost-effective compared to using general-purpose GPUs or CPUs, especially at scale. The Neuron SDK helps in optimizing your models to run efficiently on Inferentia, potentially lowering your cloud infrastructure costs.

To have the performance of Inferentia and Trainium based instances, trained models needs to be optimized with Neuron SDK. For now Neuron SDK is only supported on Linux x86_64 and the best way to use Neuron for optimization is running it on a Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) based image. The linux packages, Neuron SDK and other required packages are already available in this image. Let’s create an EC2 image and optimize a pre-trained Huggingface model.

Here we have a small Python module that uses bert-base-cased-finetuned-mrpc from transformers library:

import os

import torch
import torch.neuron
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Setting up NeuronCore groups for inf1.6xlarge with 16 cores
num_cores = 4  # This value should be 4 on inf1.xlarge and inf1.2xlarge
os.environ["NEURON_RT_NUM_CORES"] = str(num_cores)

# Build tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased-finetuned-mrpc", return_dict=False

# Setup some example inputs
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

max_length = 128
paraphrase = tokenizer.encode_plus(
not_paraphrase = tokenizer.encode_plus(

# Run the original PyTorch model on compilation exaple
paraphrase_classification_logits = model(**paraphrase)[0]

# Convert example inputs to a format that is compatible with TorchScript tracing
example_inputs_paraphrase = (
example_inputs_not_paraphrase = (

# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron
model_neuron = torch.neuron.trace(model, example_inputs_paraphrase)

# Verify the TorchScript works on both example inputs
paraphrase_classification_logits_neuron = model_neuron(*example_inputs_paraphrase)
not_paraphrase_classification_logits_neuron = model_neuron(

# Save the TorchScript for later use"")

Upload this file to EC2 instance after creating the instance is created and started running. Once we connect to instance via SSH we see a screen like below


We need to activate the Python virtual environment that has the Neuron modules installed

source /opt/aws_neuron_venv_pytorch_inf1/bin/activate

Then we can run our optimization module, it downloads the pre-trained tokenizer and model then starts optimizing the model. The output model artifact should be ready in same directory.


I hope you find this post interesting and useful :)

Enter your instance's address

More posts like this

Machine Learning Model Deployment Techniques

2024-01-01 | #deployment #machine learning #ml model #mlops #techniques

In this first post of 2024, I wanted to give some quick insights about the details of MLOps and ML model deployment, and then different techniques used in deployment. MLOps, short for Machine Learning Operations, is a practice within the field of data science and machine learning that brings together the principles of DevOps and applies them to the unique challenges of machine learning model development and deployment.

Continue reading 