Amazon SageMaker: What Tutorials Don’t Teach

At Fetch we reward you for taking pictures of store and restaurant receipts. Our app needs to read and understand images that are crumpled, dark, smudged, warped, skewed, creased (you get the “picture”), taken in cars, in your lap, on the way out, while walking the dog, taking out the trash, doing your nails, etc. Not surprisingly, we do a lot of machine learning to understand those images, and we do it with the SageMaker Training Toolkit.

I’m not yet tapping into a lot of the capabilities of this toolkit, so we will be learning together in the series of posts that follow.

The goal is not to regurgitate many available tutorials that get you started (I’m grateful to all of them), but to share the little “gotchas” they never talk about. It is kind of like learning a language from the textbook vs hearing what the native speakers are saying. And so…

Trust but Verify

Never has this advice been more salient. Available documentation can be outdated, incomplete, misleading, too chunky or too thin. This includes this post as well. We just keep trying.

Training Flow

There are two stages to this. In the first, we develop and debug our data processing, training, and evaluation “locally”, i.e. in an environment where we can step through our code and debug it. In the second, we move all we can to the SageMaker cloud.

Local

  1. Write & debug your training script locally (or on a VM instance in the cloud)
  2. Use frameworks with sklearn type estimators or “Trainers” wherever possible, like PyTorch Lightning (you can use GridAI cloud to quickly debug training scripts in VSCode) or HuggingFace. There are lots of goodies packed in those, so you can:
    • Eliminate repetitive ceremonious code (looping through epochs, maintaining gradient accumulation, loss & optimizer, running validation, etc.)
    • Refactor data batching and separate it from the model code cleanly
    • Support distributed training easily
    • Aid with optimization, such as learning rate scheduling
    • Hook into popular tracking engines like MLFlow and Tensorboard seamlessly
    • Oftentimes – they’ll wrap your actual framework of choice
    • Things I forgot or am not aware of yet

Cloud

1. Use SageMaker script mode to send your code on its way with a pre-built container

The great thing about it is that once the code has been debugged, the only thing needed to execute it in the cloud is a command like:

from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(
  entry_point='dist_train.sh',
  source_dir='./docker/scripts',
  hyperparameters=hyperparameters,
  image_uri=img_uri,
  role=role,
  instance_type='ml.p3.16xlarge',
  instance_count=1,
  volume_size=200,
  use_spot_instances=False,
  max_run=48*60*60,
)

The cool parts about this are:

  • entry_point – it does not have to be a '.py' file. Shell scripts are also supported, although bash does not appear to work. The entry_point is a script (Python or shell, or a Python module) in the source_dir that SageMaker will run to train your model.
  • source_dir – anything relevant to your training goes here, starting with the entry_point script. The directory is copied to /opt/ml/code on the SageMaker instance, preserving your directory structure. If your entry_point script is a Python file, SageMaker will also install the dependencies listed in requirements.txt if it finds one in this directory. Pretty neat!
  • image_uri – I prefer this to specifying such parameters as “version of Python”, “version of the framework”, etc. These other parameters are simply hints that help SageMaker find the right image_uri, which is a Docker image of your framework of choice. Some combinations of these parameters may not find anything, so I prefer to find the image beforehand (more about it later) and be certain that my call to estimator instantiation succeeds.
  • volume_size – the size of the storage volume in GB; pick a value that makes sense for your data. The default is 30, and it is possible to run out of space.
  • max_run – maximum time to run, in seconds. Defaults to 24 hours.
  • hyperparameters – this is how you pass parameters to your training script.
hyperparameters = {"mlflow_tracking_uri": "URI", 
   "mlflow_experiment": "MyExperiment"}

SageMaker does this:

./dist_train.sh --mlflow_tracking_uri "URI" --mlflow_experiment MyExperiment

You handle them from here in 2 ways:

  • parse them in your shell script, and pass them down to the .py script

— OR —

  • my preferred way, since I cannot really write shell code (especially since bash does not appear to work), use environment variables to pass the information to your Python script. For each hyperparameter SageMaker creates a variable named SM_HP_<uppercase(parameter_name)>, and so in the Python code:
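import os
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument('--mlflow_experiment', 
    help="MLFlow experiment name", 
    # .get() avoids a KeyError when the script runs outside SageMaker
    default=os.environ.get("SM_HP_MLFLOW_EXPERIMENT"))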

Important: The names of these parameters contain underscores rather than more eye-pleasing dashes (mlflow_experiment, NOT mlflow-experiment). This is because SageMaker does not normalize parameter names when it creates environment variables, and so mlflow-experiment would be named SM_HP_MLFLOW-EXPERIMENT (with a dash!) and environment variables cannot easily have dashes in their names.
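To make the naming rule concrete, here is a minimal sketch (mine, not something SageMaker ships). hp_env_name and parse_args are hypothetical helper names. The sketch assumes the SM_HP_<uppercase(parameter_name)> rule described above, and it falls back to None when a variable is missing so that the same script can also run on a local machine.

```python
import argparse
import os


def hp_env_name(name):
    """Name of the environment variable expected to hold a hyperparameter."""
    return "SM_HP_" + name.upper()


def parse_args(argv=None, env=None):
    """Parse CLI arguments, using SM_HP_* variables as defaults when present."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    for name in ("mlflow_tracking_uri", "mlflow_experiment"):
        # An explicit command-line value wins; otherwise use the env variable.
        parser.add_argument("--" + name, default=env.get(hp_env_name(name)))
    return parser.parse_args(argv)
```

Passing env as a plain dict (instead of reading os.environ) is only there to make the function easy to try out locally, e.g. parse_args([], env={"SM_HP_MLFLOW_EXPERIMENT": "MyExperiment"}).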

There is a 3rd way to do this since SageMaker also stores hyperparameters in a JSON config file, but that would mean writing special code to accommodate SageMaker, and I’m not a fan of special cases when they can be avoided.

2. Launch by passing data to the estimator:

pytorch_estimator.fit({'images': images_s3, 
    'labels': labels_s3})

images_s3 and labels_s3 are S3 URIs (a bucket plus an optional prefix) pointing at the uploaded data.

Data will be copied by SageMaker and placed into /opt/ml/input/data/<bucket_dictionary_key>, so labels in the above example will go under /opt/ml/input/data/labels.

Of course there is an environment variable for this! SM_INPUT_DIR contains the name of the input root directory (/opt/ml/input), the channels live in its data subdirectory, and SM_CHANNEL_<uppercase(channel_name)> is where each channel's data will end up: SM_CHANNEL_LABELS = /opt/ml/input/data/labels in this example.
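Here is a small sketch of how a training script might resolve a channel directory so that it works both in SageMaker and on a local machine. channel_dir and the local "data" fallback folder are my own assumptions for illustration, not part of the toolkit.

```python
import os
from pathlib import Path


def channel_dir(channel, env=None, local_root="data"):
    """Return the directory for a data channel.

    Uses SM_CHANNEL_<CHANNEL> when it is set (inside SageMaker) and
    falls back to <local_root>/<channel> otherwise (local debugging).
    """
    env = os.environ if env is None else env
    value = env.get("SM_CHANNEL_" + channel.upper())
    return Path(value) if value else Path(local_root) / channel
```

With the fit() call above, channel_dir("labels") would pick up SM_CHANNEL_LABELS in the cloud and fall back to data/labels locally.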

3. Examine / download the trained model locally.

Once the above call is issued, training starts in the cloud and the local machine has done its part. After training is done, the trained model is uploaded to S3, and its location is available as pytorch_estimator.model_data. If you turned off your machine or went to do other things while the model was training for the next 24 hours, the same URL can be found in the SageMaker Web console entry for the training job.
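Once you have the archive on your machine, unpacking it takes a few lines. This is a sketch under two assumptions: that model_data points at the usual gzipped tar archive (typically model.tar.gz), and that you have already downloaded it, for example with S3Downloader from sagemaker.s3. unpack_model is a hypothetical helper name.

```python
import tarfile
from pathlib import Path


def unpack_model(archive_path, target_dir):
    """Extract a downloaded model archive and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Only do this with archives produced by your own training jobs.
        tar.extractall(target)
        return sorted(tar.getnames())
```

What ends up inside the archive is whatever your training script saved into the model directory, so the returned names depend entirely on your own code.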

Stuff You Find Out the Hard Way

Getting image_uris

As mentioned above image_uri is a sure way to get the estimator instantiation to succeed. It is possible to give it clues as to framework, CUDA, and Python versions, but success is not guaranteed. A good algorithm for finding the image_uri is:

  1. Go to Available Deep Learning Container Images and either find what you are looking for directly, copy/paste to your code or, better:
  2. Use image_uris.retrieve() to find what you are looking for based roughly on what is available:
import sagemaker

img_uri = sagemaker.image_uris.retrieve(
    framework='pytorch',
    region=aws_region,
    image_scope='training',
    version="1.9",
    instance_type='ml.p3.16xlarge',
    py_version='py38')

This call may require different sets of parameters for different frameworks. A bit of a headache but you only need to do it once!

Environment Variables

They are your friends. Get to know them.
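One easy way to get to know them is to print them at the very start of a training run and then read the job log. A minimal sketch, assuming (as in the examples above) that the variables of interest start with SM_; sagemaker_env is a hypothetical helper name.

```python
import os


def sagemaker_env(env=None):
    """Return the SM_* variables from the environment, sorted by name."""
    env = os.environ if env is None else env
    return {name: env[name] for name in sorted(env) if name.startswith("SM_")}
```

Calling something like print(sagemaker_env()) at the top of the training script is enough to see what the container actually provides. Outside SageMaker the result is most likely an empty dict.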

sagemaker Python Package

sagemaker.s3 should not be ignored. Makes it super-easy to manipulate your S3 data:

from sagemaker.s3 import S3Uploader
S3Uploader.upload(local_folder_name, s3_bucket_uri)

Packaging Data

If you are downloading all of the data to your training instance(s), make sure to zip it up if there are a lot of files. Size does not matter as much as quantity. I was running training on ~600,000 JPEG files, each ~2-3 KB in size. Downloading all this to the training instance was taking forever, so I had to interrupt it. Once the files were gzipped into a couple of archives, downloading them all took almost no time.

Something like this will help create an archive with flat structure (no subfolders in the .gz file!)

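import io
import tarfile

# encoded_bytes holds one encoded JPEG; the archive name is illustrative
with tarfile.open("images.tar.gz", "w:gz") as tar_file:
    with io.BytesIO(encoded_bytes) as f:
        # a bare file name (no path) keeps the archive flat
        info = tarfile.TarInfo("image_name.jpg")
        info.size = len(encoded_bytes)
        tar_file.addfile(info, f)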
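If the images are already files on disk rather than bytes in memory, the same idea can be sketched as below. pack_flat, the images_NNN.tar.gz naming, and the files_per_archive default are my own assumptions for illustration. Because directory parts are dropped, two files with the same name in different folders would collide inside an archive.

```python
import tarfile
from pathlib import Path


def pack_flat(files, out_dir, files_per_archive=100_000):
    """Pack files into flat .tar.gz shards and return the archive paths."""
    files = sorted(Path(f) for f in files)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for start in range(0, len(files), files_per_archive):
        index = start // files_per_archive
        archive = out_dir / f"images_{index:03d}.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            for path in files[start:start + files_per_archive]:
                # arcname=path.name drops the folders, keeping the archive flat
                tar.add(path, arcname=path.name)
        archives.append(archive)
    return archives
```

The resulting folder can then be uploaded in one go with S3Uploader.upload, as shown earlier.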

Then in your entry_point script:

# save current directory before un-tarring
CWD=$(pwd)

cd "$SM_CHANNEL_IMAGES"
# extract every archive in the channel directory, one at a time
for f in *.gz; do tar -xvf "$f"; done
cd "$CWD"
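If your entry_point is a Python file instead of a shell script, roughly the same step can be done with the standard library. A sketch, assuming the channel directory contains gzipped tar archives like the ones built above; extract_archives is a hypothetical helper name.

```python
import tarfile
from pathlib import Path


def extract_archives(channel_dir):
    """Extract every *.gz archive in channel_dir into channel_dir itself."""
    channel_dir = Path(channel_dir)
    archives = sorted(channel_dir.glob("*.gz"))
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            # Only extract archives you created yourself.
            tar.extractall(channel_dir)
    return [archive.name for archive in archives]
```

In the training script this would be called with the value of SM_CHANNEL_IMAGES before the data loaders are created.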

Script Mode or Your Own Docker Container?

I vote strongly in favor of the former (this is when you use a pre-built container like PyTorch, TensorFlow, or HuggingFace, topping it with your own files). Here is why:

  • You get to modify things easily, skipping a step where you need to build & push your own docker image. Just change things in your source_dir and send everything over to train
  • You don’t need to worry about building a GPU-based container that won’t work.

GPU-based images are hard to get right: too many versions of too many components need to match, to say nothing of CUDA drivers, architectures, and other hardware related complexities. Existing estimator devs have already solved all of this for SageMaker so you don’t have to.

Here is why you may consider bringing your own image:

  • Dependencies required for your model to train take too long to install

In an extreme case of that, it makes sense to pre-build everything, to avoid waiting for the dependencies to be installed every time, but there are other considerations: how long does installation of dependencies take compared to the entire training process? In my case it takes 1hr to build everything up and then 15 hrs to fine-tune the model on 8 GPUs. I will probably take the extra hour, especially if I have already tried to build a compatible image and failed.
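The trade-off is easy to put a number on. A back-of-the-envelope sketch using the figures from my case (1 hour of setup, 15 hours of training); setup_overhead is a hypothetical helper name.

```python
def setup_overhead(setup_hours, training_hours):
    """Fraction of the total job time spent installing dependencies."""
    return setup_hours / (setup_hours + training_hours)


share = setup_overhead(1, 15)
print(f"{share:.2%}")  # 6.25%
```

At around 6% of the total run time, the extra hour is cheap compared to the risk of fighting with a custom GPU image.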
