Viral F#

HuggingFace. Models. Spaces. Part 2

fierval — Sun, 06 Aug 2023 23:26:41 +0000

HuggingFace Spaces by Midjourney

This concludes the two-part blog entry on turning HuggingFace into a deep learning playground (and we have not even talked about all the LMs they host!). Mostly it will be about Spaces, but we are starting with models.

Models

We train tons of models as we experiment with hyperparameters or redesign model structures and do all sorts of studies. It’s easy to lose yourself completely in all the artifacts we accumulate. HuggingFace Models repos alleviate some of this burden.

I store different model artifacts for any and all models on HuggingFace. Often, if the model is very lightweight, I don’t even bother to stage it in the S3 bucket for SageMaker training. One headache less. Besides the ubiquitous .from_pretrained, it is easy to download a model artifact the same way we download datasets (see previous post).

cached_model_file_path = \
    hf_hub_download(model_repo, file_path_in_repo, token=auth_token)

The model will be downloaded into the local cache directory only once (unless you change it), so the next call will resolve almost instantly.

To upload model artifacts:

api = HfApi()
api.create_repo(model_repo, token=auth_token, private=True, exist_ok=True)
api.upload_file(path_or_fileobj=model_file, 
                    path_in_repo=model_file_name, 
                    repo_id=model_repo, token=auth_token)

A Word about Cache

HuggingFace APIs cache datasets and models in a local cache directory.

Sometimes, when you deal with a large dataset and things take a while, you may want to take your data operations to the cloud, e.g., an AWS SageMaker Notebook instance. So, you grab a notebook instance and, with foresight, allocate an extra 300 GB of storage to it. Then you start loading the data with load_dataset and quickly find out that you have run out of storage.

This is because you have added storage as a separate volume, not the one where the dataset is being downloaded and cached!

Fortunately load_dataset as well as hf_hub_download have a cache_dir parameter, so you can redirect your cache to the volume that has enough allocated space. In the SageMaker notebook case something like /home/ec2-user/SageMaker/hf_caches could be the choice.
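A small helper along these lines can guard against the wrong-volume surprise. This is only a sketch: pick_cache_dir is a hypothetical name, and the commented-out usage assumes the datasets package and a placeholder repo name.

```python
import shutil
from pathlib import Path


def pick_cache_dir(candidates, required_bytes):
    """Return the first candidate directory whose volume has enough free space."""
    for candidate in candidates:
        path = Path(candidate).expanduser()
        path.mkdir(parents=True, exist_ok=True)
        # disk_usage reports on the volume that actually holds this path
        if shutil.disk_usage(path).free >= required_bytes:
            return path
    raise RuntimeError("No candidate volume has enough free space")


# Usage (not executed here; needs the `datasets` package and network access):
# cache_dir = pick_cache_dir(["/home/ec2-user/SageMaker/hf_caches"], 300 * 1024**3)
# dataset = load_dataset("my_account/my_dataset", cache_dir=str(cache_dir))
```

Checking free space before the download starts is cheaper than discovering the problem an hour into it.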

Spaces

Streamlit (and Gradio) were invented to let data scientists become app developers. It is essential for us to write apps if we want a way to play with our models: change parameters and see the effects, and share our results with co-workers and teams.

Streamlit (Gradio) does all this, but you still need a platform to share the app. This is the missing piece HuggingFace Spaces provide.

I have been using Streamlit for almost two years now and it is spreading more and more at Fetch, so this entry is going to be about Streamlit, but it applies to Gradio as well.

Spaces and Their Flavors

From the user perspective, a Space is a Git repo from which your application just runs. That’s all.

There are several flavors of spaces: Gradio, Streamlit, Docker, Static. Docker has a collection of templates available. In reality, since there is no magic, whatever template you pick (Gradio or Streamlit), it will still be wrapped into a Docker container which will then be launched. So, the Docker option just offers more flexibility, assuming “you know what you are doing”.

I found that, while the Streamlit option gives you what you want almost all the time, more flexibility is sometimes required. In my case, I really wanted Streamlit's new nested columns feature, which had been released but was not yet available in Streamlit Spaces: at the time I was writing my app, they only supported v0.17 of the Streamlit SDK.

Since then, I have been using Docker Spaces for my Streamlit apps: it is a rare case of getting all the flexibility at practically no cost.

Creating A Docker Space running Streamlit

Here is an example of a Streamlit Space:

And here is the same Space re-created using the Blank Docker template:

You can download these from the Files section of the Space.

Dockerfile

FROM pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
WORKDIR /app

COPY ./requirements.txt /app/requirements.txt
COPY ./packages.txt /app/packages.txt

RUN apt-get update && xargs -r -a /app/packages.txt apt-get install -y && \
    rm -rf /var/lib/apt/lists/*

RUN pip3 install --no-cache-dir -r /app/requirements.txt
RUN pip3 install --no-cache-dir jinja2==3.0.1
# User
RUN useradd -m -u 1000 user
USER user
ENV HOME /home/user
ENV PATH $HOME/.local/bin:$PATH

WORKDIR $HOME
RUN mkdir app
WORKDIR $HOME/app
COPY . $HOME/app

EXPOSE 8501
CMD streamlit run app.py

This is pretty much boilerplate. The only thing I had to add is on line 11. For some reason, my app was refusing to run on a lesser version of jinja2. I have not seen it in other examples, so perhaps it is no longer necessary.

On line 23 we expose the port on which Streamlit apps run by default and on line 24 we launch it, assuming our code is in app.py.

Also, I am using PyTorch with CUDA as a base container, but really anything that makes sense could be used.

README.md

In the world of Spaces, README describes some key parameters that determine how the Space is run. In particular:

sdk: docker
app_port: 8501

Other settings are related to appearance, but the 2 above are the most important.

config.toml

Under the .streamlit folder, this file describes appearances but also some key application properties:

[theme]
primaryColor="#e5ab00"
font="sans serif"

[server]
maxUploadSize = 200
enableXsrfProtection = false

The [theme] section gives your app the HuggingFace themed look and feel, while the [server] section ensures the Streamlit file upload feature works.

requirements.txt

Specify everything you need there. E.g., a version of streamlit newer than the one supported by streamlit spaces.

I have also found out recently that the latest version of the HuggingFace datasets API (2.14.2) does not work in Streamlit Docker containers. Specify 2.13.1 if using datasets:

streamlit==1.24.1
datasets==2.13.1

HuggingFace Token and Other Secrets

You will definitely need your secret token in order to pull data and models in the HuggingFace Spaces.

For this, you can set private environment variables in your Space on its Settings page.
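Inside the app, a secret set this way can be read like any other environment variable. A minimal sketch, assuming the secret is named HF_AUTH_TOKEN (the name is an assumption, use whatever you called yours) and with get_hf_token as a hypothetical helper:

```python
import os


def get_hf_token(var_name="HF_AUTH_TOKEN"):
    """Read the HuggingFace token from the environment, failing loudly if it is missing."""
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return token
```

Failing early with a clear message beats a confusing authentication error deep inside a download call.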

Local Debugging and Testing

You can first debug your app locally as you would a regular Streamlit app (e.g., using VSCode facilities to attach a debugger).

I usually put these lines somewhere at the top of my app.py:

try:
    import ptvsd
    ptvsd.enable_attach()
    ptvsd.wait_for_attach()
except ImportError:
    pass

I don’t distribute ptvsd to the Space environment, so the exception will always be caught and suppressed there, while in my local environment the app will pause and wait for a debugger to be attached.

VSCode launch.json entry for debugging looks like this:

{
    "name": "Streamlit",
    "type": "python",
    "request": "attach",
    "connect": {
        "host": "localhost",
        "port": 5678
    },
    "pathMappings": [
        {
            "localRoot": "${workspaceFolder}",
            "remoteRoot": "."
        }
    ],
    "justMyCode": false
}

For docker – I simply build:

docker build -t my_space .

To make sure everything runs:

docker run --rm -p 8501:8501 my_space

Or, if I want to examine the insides of the container:

docker run -it --rm my_space /bin/bash

Conclusion

While my HuggingSpace lab doesn’t solve all the problems, it takes a ton of headaches away. Now, every time I start a new project, it is as if I am 30% done, because I know where my data is going to be stored and what API I will use to access it, how I will deal with artifacts and how I will observe and share my experiments. I know I am not tied to any particular environment, but wherever I go, my data and my models will follow.

Perhaps this will make your life a bit easier as well.

HuggingFace. The Perfect Lab. Part 1

fierval — Mon, 31 Jul 2023 18:26:29 +0000

I have recently posted a video briefly explaining how we use GNNs (Graph Neural Nets) at Fetch to identify duplicate receipts. I have mentioned HuggingFace in passing there, but in fact I use it quite heavily to make my machine learning life simpler.

The Problem: Suitcases Without Handles

AI workloads are data heavy, involving large amounts of images, annotations, and interconnected data, stored in disparate files and databases across local machines and cloud services. The data requires cleaning, division into training, validation, and testing sets, and multiplies with each training cycle.

Additionally, training numerous models with varying hyperparameters generates further artifacts to track.

As if that were not enough, we create applications to monitor and compare our models’ performance, visualize and analyze data, adding to the pile of accumulated artifacts.

This leads to two primary challenges:

tracking all artifacts and
maintaining the developed solutions.

These issues become more pronounced when transitioning the solution to different people or machines, even with cloud sharing services like AWS S3. And this is how managing AI workloads resembles handling heavy luggage without handles: infuriating to schlep around but impossible to discard.

Courtesy of Midjourney

The HuggingFace Solution

For the last year, I have been using HuggingFace to bring some order into this chaos. HuggingFace has practically become my lab for all things deep learning. The most attractive features are a comprehensive playground as well as the ease of use through their UX and API. HuggingFace provides facilities for:

Datasets – for storing & sharing data
Models – for keeping track of models
Spaces – for developing apps for visualization and showcasing the models

Below is the overview of some basic tech I use.

Datasets

Creating

Datasets API works great for making your data follow you wherever you go. In my work at Fetch I mostly deal with images and text. Regardless of what I end up using for training (images, text, graphs, etc.) once data is extracted, I create and upload a HuggingFace dataset.

HuggingFace supports a few data formats out of the box. It also allows for creating custom ones. I like to reuse the former rather than create my own. Imagefolder serves my purposes perfectly.

To store images of documents (pictures of receipts), I collect all the images in a single folder and create a metadata.jsonl file in the same folder. This file connects all the non-image data I need with each image. The only stricture here is that there should be a 1-1 correspondence between image files in the directory and lines in the metadata file.

Each line in the file is a JSON object. I create each object with 2 fields: file_name, the image file name (required by the Imagefolder format), and ground_truth, where I stuff all of the non-image features. I believe it is possible to have more than 2 fields (never checked that), as long as the number of fields remains constant throughout the file, since the data is going to be packaged into the columnar Apache Arrow format, which implies a constant number of columns.

{"file_name": "1.jpg", "ground_truth": "\"{some JSON object serialized to string}\""}
{"file_name": "2.jpg", "ground_truth": "\"{some other JSON object serialized to string}\""}

So, to store additional data, I package the entire structure into a Python object and then flatten it into a string:

import json

metadatas = []
for fn in images:
    # everything that is not the image goes into a single dict...
    metadata = {}
    metadata["user_id"] = user_id
    metadata["some_data"] = some_data
    metadata["and_more_data"] = more_data
    # ...which is then flattened into a string
    gt_str = json.dumps(metadata).encode("utf-8", "ignore").decode("utf-8")

    metadata_str = json.dumps({"file_name": fn, "ground_truth": gt_str}) + "\n"
    metadatas += [metadata_str]

with open(metadata_fn, "w") as f:
    f.writelines(metadatas)
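Since the format expects a 1-1 correspondence between images and metadata lines, it can be worth checking the folder before uploading. The following is a sketch only: check_imagefolder and IMAGE_SUFFIXES are hypothetical names, and the list of image extensions is an assumption to adjust to your data.

```python
import json
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumption: adjust to your data


def check_imagefolder(folder):
    """Return (images missing from metadata, metadata entries without an image)."""
    folder = Path(folder)
    images = {p.name for p in folder.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES}
    with open(folder / "metadata.jsonl", encoding="utf-8") as f:
        listed = {json.loads(line)["file_name"] for line in f if line.strip()}
    return sorted(images - listed), sorted(listed - images)
```

Two empty lists mean the folder and the metadata file agree.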

Uploading

To upload the dataset to your organization account (or your own), you need to be either logged in with huggingface-cli or, my preferred way, use a token. The token is easily obtained by going to Settings -> Access Tokens in the HuggingFace profile.

Manage this token through your environment variables, .env files and python-dotenv package. If, for example, your .env file contains something like:

HF_AUTH_TOKEN="some_alphanumeric_sequence"

do something similar in Python:

import os
from dotenv import load_dotenv

load_dotenv()
hf_token = os.environ.get("HF_AUTH_TOKEN", None)

After that, you can create your dataset storage (which is a Git repo) on HuggingFace and upload the data:

from datasets import load_dataset
from huggingface_hub import HfApi

my_repo = "my_account/my_dataset"
api = HfApi()
api.create_repo(repo_id=my_repo, token=hf_token, repo_type="dataset", private=True, exist_ok=True)

dataset = load_dataset("imagefolder", data_dir=images_and_metadata_root)
dataset.push_to_hub(my_repo, token=hf_token)

Here datasets and huggingface_hub are modules contained in the respective Python packages. I believe you only need to install datasets, huggingface_hub will be installed as part of its dependencies.

Here lines 5-6 will create a private dataset repo if it doesn’t already exist, while the final 2 lines will package the data in the Apache Arrow format, shard it if necessary, and upload it.

Train/Validation/Test Splits

There are facilities to create splits before or after the dataset has been uploaded (i.e., after downloading it locally).

I found it practical to pre-split the data locally and then upload everything in one shot. The only thing that changes is data organization: the root directory (images_and_metadata_root) needs to have train, validation, and test subdirectories, each containing its own images and its own metadata.jsonl file. Then the code above can be used to upload everything in one shot, maintaining the splits.
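One way to decide which file goes into which subdirectory is a seeded shuffle. This is a sketch under my own assumptions: split_file_names is a hypothetical helper, and the 10%/10% fractions are illustrative defaults, not something the workflow requires.

```python
import random


def split_file_names(file_names, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle file names reproducibly and cut them into train/validation/test lists."""
    names = sorted(file_names)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(names)
    n_val = int(len(names) * val_fraction)
    n_test = int(len(names) * test_fraction)
    return {
        "validation": names[:n_val],
        "test": names[n_val:n_val + n_test],
        "train": names[n_val + n_test:],
    }
```

Each list is then copied into its own subdirectory, and a metadata.jsonl is written per subdirectory as described above.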

Single File Storage

Sometimes it may be convenient to package a dataset as a single file. For example, PyTorch Geometric, the Graph Neural Network library, uses PyTorch to store data as binary files. Since HuggingFace repositories are just Git repositories, it is easy to adapt them for storing these files as well. The code is very similar to the one above:

import os
from huggingface_hub import HfApi

api = HfApi()

# local .pt file containing the data
local_fn = my_graph_dataset_file

# name without any path
out_fn = os.path.basename(local_fn)

# where to store it in 'my_repo'
path_in_repo = f"{subfolder}/graphs/{out_fn}"

api.upload_file(path_or_fileobj=local_fn, repo_id=my_repo, repo_type="dataset", path_in_repo=path_in_repo, token=hf_token)

Download & Transform

Now that the dataset is in the cloud and can follow you anywhere you go, it is easy to download and transform it into anything you want.

HuggingFace uses local cache to store the data, so you only take a perf hit the first time you download it.

from datasets import load_dataset

hf_dataset = load_dataset("my_account/my_dataset", use_auth_token=hf_token)

We can easily recover all the fields we packed into a string and convert the result to whatever we like.

import json
import pandas as pd

# Unpack the serialized ground_truth string into separate columns
features = hf_dataset.map(lambda x: json.loads(x["ground_truth"]))
feats_dict = [{"user_id": x["user_id"],
               "some_data": x["some_data"],
               "and_more_data": x["and_more_data"]}
              for x in features["train"]]

# Convert to a pandas DataFrame
df = pd.DataFrame(feats_dict)

To be continued

I’ll cover Models and Spaces in the second part of this post.

Amazon SageMaker: Distributed Training

fierval — Sat, 28 Jan 2023 02:45:33 +0000

No training implementation is complete until it allows training on a cluster where each machine has multiple GPUs.

Multi-node/Multi-GPU Training with PyTorch Lightning

SageMaker does a great job enabling this in Script Mode, and all we have to do is write code that supports SageMaker SMDDP implementation of the distributed training DDP protocol.

PyTorch Lightning is also an obvious choice to abstract our training loop: it supports everything we need and wraps it all up nicely, so you don’t need to gather validation results or make sure logging is activated correctly on the appropriate process.

This PyTorch Lightning introduction to their distributed API is a great starting point.

Training Script Modifications

Just follow instructions from AWS to enable your training scripts.

I like to wrap it all into a set of functions:

is_win = sys.platform.startswith("win")

def get_trainer_env():
  env = LightningEnvironment()
  env.world_size = lambda: int(os.environ.get("WORLD_SIZE", 1))
  env.global_rank = lambda: int(os.environ.get("RANK", 0))
  return env

def get_initialization_info():
  '''
  Initialize the distributed training environment and return the data relevant to
  Lightning Trainer initialization.
  '''
  world_size = num_nodes = 1
  ddp = None
  num_gpus = int(os.environ.get("SM_NUM_GPUS", 1))

  if not is_win and num_gpus > 1:
    import smdistributed.dataparallel.torch.torch_smddp
    
    # For DDP with sagemaker
    env = get_trainer_env()
    
    ddp = DDPStrategy(
      cluster_environment=env,
      process_group_backend="smddp",
    )

    world_size = int(os.environ["WORLD_SIZE"])
    num_nodes = int(world_size/num_gpus)
    
    logging.info(f"Training with {num_gpus} GPUs/node on {num_nodes} nodes")
    
  return ddp, num_gpus, num_nodes

def get_global_rank():
  return int(os.environ.get("RANK", 0))

def get_local_rank():
  return int(os.environ.get("LOCAL_RANK", 0))

The get_initialization_info function can be called from the training script to return all the data needed for distributed or non-distributed training initialization. So, this is either an SMDDP run or a non-distributed training run.

Since DDP is not supported on Windows, we are making doubly sure to not enable it if we are running in that environment.

The import statement on line 19 will only work inside the SageMaker script mode container, so we tuck it safely under the if statement to prevent it from executing in a non-SageMaker environment.

The purpose of the DDPStrategy instance defined on line 24 is to hook up PyTorch Lightning with the protocol SageMaker uses to pass necessary data about the world size and rank designation to the participating processes. Rank can be local or global. Global rank is an integer in the half-open range [0, WORLD_SIZE) that uniquely designates each process, while local rank is in [0, NUM_LOCAL_GPUS) and identifies a process within its node.
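To make the relation between the two kinds of rank concrete, here is a small sketch. It assumes ranks are numbered node by node, which is the usual convention but not something stated above; global_rank and node_rank are hypothetical names used for illustration only.

```python
def global_rank(node_rank, local_rank, gpus_per_node):
    """Global rank under a node-by-node numbering convention (an assumption)."""
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank must be in [0, gpus_per_node)")
    return node_rank * gpus_per_node + local_rank


# 2 nodes with 8 GPUs each: WORLD_SIZE is 16 and global ranks cover 0..15
ranks = [global_rank(n, l, 8) for n in range(2) for l in range(8)]
```

In practice SageMaker sets RANK and LOCAL_RANK for each process, so the training code only reads them, as get_global_rank and get_local_rank do.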

Initializing the PyTorch Lightning trainer is then straightforward:

ddp, num_gpus, num_nodes = dist.get_initialization_info()

trainer = pl.Trainer(
    accelerator="cuda",
    strategy=ddp,
    devices=num_gpus,
    num_nodes=num_nodes,
    max_epochs=args.epochs,
    val_check_interval=args.val_check_interval,
    check_val_every_n_epoch=args.check_val_every_n_epoch,
    gradient_clip_val=1.0,
    precision=16, # we'll use mixed precision
    num_sanity_val_steps=0,
    logger=logger,
    callbacks=[checkpoint],
)

Handling I/O Conflicts

With multiple processes per node (each assigned to its own GPU), we may have a situation where the same data is being written/downloaded multiple times on the same node, which may cause conflicts and possibly crashes.

To avoid that, we can use the PyTorch Lightning DataModule facility with its prepare_data hook, which takes care of the initial download safely by running it on a single process per node. Set self.prepare_data_per_node=True during module initialization and execute the downloading code in a prepare_data override. See the example of this in the prepare_data documentation for PyTorch Lightning.

For instance, if downloading or creating a HuggingFace dataset:

def prepare_data(self, stage=None):
    load_dataset(self.dataset_name_or_path)

Here we are only downloading the dataset into the local cache; the actual Dataset object is created in the setup override of the Lightning DataModule.

SageMaker Notebook Modifications

from sagemaker.pytorch import PyTorch

# if running on a single instance with a single GPU
#instance_type = 'ml.p3.2xlarge'

# Recommend one of these instances for multi-GPU cluster training
#instance_type = 'ml.p4d.24xlarge'
instance_type = 'ml.p3dn.24xlarge'

# base job for easy identification in SageMaker console
base_job_name = 'donut-ddp-mult-instance-smddp'

# Pick one distribution setting; the SMDDP backend is the one in effect
#distribution = None
#distribution = {"pytorchddp": {"enabled": True}}
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

pytorch_estimator = PyTorch(
    source_dir='finetuning',
    entry_point='train.py',
    base_job_name=base_job_name,
    hyperparameters=hyperparameters,
    framework_version="1.12.1",
    py_version='py38',
    role=role,
    instance_type=instance_type,
    instance_count=4,
    volume_size=200,
    use_spot_instances=False,
    max_run=48 * 60 * 60,
    security_group_ids=["My-SecurityGroup"],
    distribution=distribution,
)

distribution is set to the SMDDP backend. AWS docs indicate that the PyTorch native DDP backend is also fully supported, but I haven’t tried it.

It’s a good idea to let the estimator figure out the appropriate version of the PyTorch image, since a lot is riding on different frameworks being compatible. So I am not specifying image_uri, but requesting versions of PyTorch and Python instead.

Don’t forget the requirements.txt file in the source_dir, which should at the minimum contain the line:

pytorch-lightning==1.7.7

I found this version of PyTorch Lightning to play well with SageMaker.

One more thing to not forget: the effective learning rate will be the chosen lr * world_size, so set it accordingly in the hyperparameters:

hyperparameters = {
    "epochs": 30,
    "batch": 4,
    "lr": 1e-7 * 8 * 4,
}
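The 8 * 4 factor in the snippet above matches the world size of this job: 8 GPUs per ml.p3dn.24xlarge instance times instance_count=4. A tiny sketch of the same arithmetic, with scaled_lr as a hypothetical helper name:

```python
def scaled_lr(base_lr, gpus_per_node, num_nodes):
    """Multiply a base learning rate by the world size, as in the hyperparameters above."""
    world_size = gpus_per_node * num_nodes
    return base_lr * world_size


lr = scaled_lr(1e-7, gpus_per_node=8, num_nodes=4)
```

Keeping the factors explicit makes it harder to forget to adjust the value when the instance type or count changes.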

A Gotcha

For dessert we have this doozy of a gotcha. If you have specified a security group for the estimator through security_group_ids, make sure the group has appropriate permissions for inbound and outbound communications as described in this article. This one had me stumped until an AWS Support specialist pointed out the solution.

Happy (distributed) training!

Amazon SageMaker: What Tutorials Don’t Teach

fierval — Sun, 30 Jan 2022 01:37:52 +0000

At Fetch we reward you for taking pictures of store and restaurant receipts. Our app needs to read and understand crumpled, dark, smudged, warped, skewed, creased (you get the “picture”) images, taken in cars, in your lap, on the way out, while walking the dog, taking out the trash, doing your nails, etc., etc. Not surprisingly, we do a lot of machine learning trying to understand those images, and we do it with SageMaker Training Toolkit.

I’m not yet tapping into a lot of the capabilities of this toolkit, so we will be learning together in the series of posts that follow.

The goal is not to regurgitate many available tutorials that get you started (I’m grateful to all of them), but to share the little “gotchas” they never talk about. It is kind of like learning a language from the textbook vs hearing what the native speakers are saying. And so…

Trust but Verify

Never has this advice been more salient. Available documentation can be outdated, incomplete, misleading, too chunky or too thin. This includes this post as well. We just keep trying.

Training Flow

There are 2 stages to this. In the first, we develop and debug our data processing, training, and evaluation “locally”, i.e. in an environment where we can step through our code and debug it. In the second, we move all we can to the SageMaker cloud.

Local

Write & debug your training script locally (or on a VM instance in the cloud)
Use frameworks with sklearn type estimators or “Trainers” wherever possible, like PyTorch Lightning (you can use GridAI cloud to quickly debug training scripts in VSCode) or HuggingFace. There are lots of goodies packed in those, so you can:
- Eliminate repetitive ceremonious code (looping through epochs, maintaining gradient accumulation, loss & optimizer, running validation, etc.)
- Refactor data batching and separate it from the model code cleanly
- Support distributed training easily
- Aid with optimization, such as learning rate scheduling
- Hook into popular tracking engines like MLFlow and Tensorboard seamlessly
- Oftentimes – they’ll wrap your actual framework of choice
- Things I forgot or am not aware of yet

Cloud

1. Use SageMaker script mode to send your code on its way with a pre-built container

The great thing about it is that once the code has been debugged, the only thing needed to execute it in the cloud is a command like:

pytorch_estimator = PyTorch(
  entry_point='dist_train.sh',
  source_dir='./docker/scripts',
  hyperparameters=hyperparameters,
  image_uri=img_uri,
  role=role,
  instance_type='ml.p3.16xlarge',
  instance_count=1,
  volume_size=200,
  use_spot_instances=False,
  max_run=48*60*60,
)

The cool parts about this are:

entry_point – it does not have to be a '.py' file, shell scripts are also supported, but not bash. The entry_point is a script (python or shell, or python module) in the source_dir that SageMaker will run to train your model.
source_dir – anything that is relevant to your training goes here, starting with the entry_point script and will be copied to /opt/ml/code on the SageMaker instance preserving your directory structure. If your entry_point script is a Python file, SageMaker will also install dependencies in requirements.txt if it finds it in this directory. Pretty neat!
image_uri – I prefer this to specifying such parameters as “version of Python”, “version of the framework”, etc. These other parameters are simply hints that help SageMaker find the right image_uri, which is a Docker image of your framework of choice. Some combinations of these parameters may not find anything, so I prefer to find the image beforehand (more about it later) and be certain that my call to estimator instantiation succeeds.
volume_size – some value that makes sense, in GB. The default is 30 and it is possible to run out of space.
max_run – maximum time to run, in seconds. Defaults to 24 hours.
hyperparameters – this is how you pass parameters to your training script.

hyperparameters = {"mlflow_tracking_uri": "URI", 
   "mlflow_experiment": "MyExperiment"}

SageMaker does this:

./dist_train.sh --mlflow_tracking_uri "URI" --mlflow_experiment MyExperiment

You handle them from here in 2 ways:

parse them in your shell script, and pass them down to the .py script

— OR —

my preferred way, since I cannot really write shell code (especially since bash does not appear to work): use environment variables to pass the information to your Python script. For each hyperparameter SageMaker creates a variable named SM_HP_<NAME>, where <NAME> is the upper-cased hyperparameter name, and so in the Python code:

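import os
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument('--mlflow_experiment', 
    help="MLFlow experiment name", 
    # .get() keeps the script usable outside SageMaker, where the variable is not set
    default=os.environ.get("SM_HP_MLFLOW_EXPERIMENT"))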

Important: The names of these parameters contain underscores rather than more eye-pleasing dashes (mlflow_experiment, NOT mlflow-experiment). This is because SageMaker does not normalize parameter names when it creates environment variables, and so mlflow-experiment would be named SM_HP_MLFLOW-EXPERIMENT (with a dash!) and environment variables cannot easily have dashes in their names.
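The naming rule can be captured in a couple of helpers. This is a sketch that simply follows the behavior described above (prefix plus upper-cased name, no normalization); hp_env_name and read_hp are hypothetical names, not part of the sagemaker package.

```python
import os


def hp_env_name(hyperparameter):
    """Environment variable name for a hyperparameter, per the scheme described above."""
    return "SM_HP_" + hyperparameter.upper()


def read_hp(hyperparameter, default=None):
    """Read a hyperparameter from the environment, rejecting names with dashes."""
    if "-" in hyperparameter:
        raise ValueError(f"Use underscores, not dashes: {hyperparameter!r}")
    return os.environ.get(hp_env_name(hyperparameter), default)
```

For example, read_hp("mlflow_experiment") looks up SM_HP_MLFLOW_EXPERIMENT and falls back to the default when running outside SageMaker.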

There is a 3rd way to do this since SageMaker also stores hyperparameters in a JSON config file, but that would mean writing special code to accommodate SageMaker, and I’m not a fan of special cases when they can be avoided.

2. Launch by passing data to the estimator:

pytorch_estimator.fit({'images': images_s3, 
    'labels': labels_s3})

images_s3 and labels_s3 are S3 URIs pointing to where the data has been deployed.

Data will be copied by SageMaker and placed into /opt/ml/input/data/<channel_name>, so labels in the above example will go under /opt/ml/input/data/labels.

Of course there is an environment variable for this! SM_INPUT_DIR contains the name of the input root directory (/opt/ml/input), and SM_CHANNEL_<CHANNEL_NAME> is where the data for each channel will end up: SM_CHANNEL_LABELS = /opt/ml/input/data/labels in this example.
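A training script that should run both locally and in SageMaker can resolve its data directories through these variables. A minimal sketch, where channel_dir and the ./data local fallback are my own assumptions:

```python
import os


def channel_dir(channel, local_root="./data"):
    """Directory for an input channel: SageMaker's if set, else a local fallback."""
    return os.environ.get("SM_CHANNEL_" + channel.upper(),
                          os.path.join(local_root, channel))
```

With the fit() call above, channel_dir("labels") resolves to /opt/ml/input/data/labels inside the training container and to ./data/labels on a development machine.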

3. Examine / download the trained model locally.

Once the above call is issued, training starts in the cloud and the local machine has done its part. The trained model is uploaded to S3 after training is done, and its location is available as pytorch_estimator.model_data. If you turned off your machine or went to do other things while the model is training for the next 24 hours, the URL can be located in the SageMaker Web console entry for the training job.

Stuff You Find Out the Hard Way

Getting `image_uri`s

As mentioned above image_uri is a sure way to get the estimator instantiation to succeed. It is possible to give it clues as to framework, CUDA, and Python versions, but success is not guaranteed. A good algorithm for finding the image_uri is:

Go to Available Deep Learning Container Images and either find what you are looking for directly, copy/paste to your code or, better:
Use image_uris.retrieve() to find what you are looking for based roughly on what is available:

img_uri = sagemaker.image_uris.retrieve(framework='pytorch', 
            region=aws_region, 
            image_scope='training', 
            version="1.9", 
            instance_type='ml.p3.16xlarge', 
            py_version='py38')

This call may require different sets of parameters for different frameworks. A bit of a headache but you only need to do it once!

Environment Variables

They are your friends. Get to know them.

`sagemaker` Python Package

sagemaker.s3 should not be ignored. Makes it super-easy to manipulate your S3 data:

from sagemaker.s3 import S3Uploader
S3Uploader.upload(local_folder_name, s3_bucket_uri)

Packaging Data

If you are downloading all of the data to your training instance(s), make sure to zip it up if there are a lot of files. Size does not matter as much as quantity. I was running training on ~600,000 JPEG files, each ~2-3 KB in size. Downloading them individually to the training instance was taking so long that I had to interrupt it, while downloading the same files gzipped into a couple of archives took almost no time.

Something like this will help create an archive with a flat structure (no subfolders in the .gz file!):

with io.BytesIO(encoded_bytes) as f:
    info = tarfile.TarInfo("image_name.jpg")
    info.size = len(encoded_bytes)
    f.seek(0, io.SEEK_SET)
    tar_file.addfile(info, f)
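For completeness, here is a self-contained version of the same idea. It is a sketch: pack_flat_archive is a hypothetical helper, and it assumes the images are already available in memory as a dict of file name to bytes.

```python
import io
import tarfile


def pack_flat_archive(archive_path, images):
    """Write {file_name: bytes} into a gzipped tar with no subfolders."""
    with tarfile.open(archive_path, "w:gz") as tar_file:
        for name, encoded_bytes in images.items():
            info = tarfile.TarInfo(name)  # bare file name keeps the archive flat
            info.size = len(encoded_bytes)
            tar_file.addfile(info, io.BytesIO(encoded_bytes))
```

Splitting a large collection across a few such archives keeps each download to a handful of big objects instead of hundreds of thousands of small ones.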

Then in your entry_point script:

# save current directory before un-tarring
CWD=$(pwd)

cd $SM_CHANNEL_IMAGES
ls *.gz | xargs -n1 tar -xvf
cd $CWD

Script Mode or Your Own Docker Container?

I vote strongly in favor of the former (this is when you use a pre-built container image like PyTorch, Tensorflow, or HuggingFace, topping it with your own files). Here is why:

You get to modify things easily, skipping a step where you need to build & push your own docker image. Just change things in your source_dir and send everything over to train
You don’t need to worry about building a GPU-based container that won’t work.

GPU-based images are hard to get right: too many versions of too many components need to match, to say nothing of CUDA drivers, architectures, and other hardware related complexities. Existing estimator devs have already solved all of this for SageMaker so you don’t have to.

Here is why you may consider bringing your own image:

Dependencies required for your model to train take too long to install

In an extreme case of that, it makes sense to pre-build everything, to avoid waiting for the dependencies to be installed every time, but there are other considerations: how long does installation of dependencies take compared to the entire training process? In my case it takes 1hr to build everything up and then 15 hrs to fine-tune the model on 8 GPUs. I will probably take the extra hour, especially if I have already tried to build a compatible image and failed.

A Deep Reinforcement Learning Journey Home.

fierval — Tue, 27 Aug 2019 05:30:57 +0000

On a warm afternoon of October 23, 1975, one of the last days of Indian summer that year, Yozhik (aka Ejik, The Little Hedgehog) set out to visit his friend Medvezhonok (The Bear Cub). He was joining him, as he did every evening, for a night of stargazing. Only he never made a return journey home. Until now. But today the nightmares lurking in the mist the other night have become reality. Will Ejik survive?

The inspiration behind this project came from the article Deep Reinforcement Learning Doesn’t Work Yet. While whatever little experience I have with DRL (Deep Reinforcement Learning) supports the article’s conclusions it also charts a clear path for experimenting with DRL algorithms.

One advantage DRL has over vision or NLU is that its algorithms create their own labelled data, which, if you are doing research or playing in the sandbox (like me) usually comes from environments created by organizations and communities like Gym from OpenAI or ML-Agents from Unity.

While these are excellent, they don’t allow for an end-to-end exploration: the task, observations, actions, and the reward function are already given, while according to the author of the above article this is where quite a lot of fun (and frustration) of DRL happens.

And so I set out to create an environment of my own and picked Unity ML-Agents for the task. The setup is such that you create a game in Unity and then you can outfit your game characters with a “brain” (ML-Agents terminology), through which you can have them act autonomously. Unity has created several environments which show how these brains can be trained using DRL to perform various tasks. Also check out this course from Udacity to get started on DRL.

The Game

After acquiring some remedial game development skills by mostly following this course, I have created the simple Ejik Goes Home top-down shooter where the goal is simple: just stay alive for as long as possible. In reality Ejik is not going anywhere, all he can do is defend himself against 3 categories of enemies.

Even though the game can be played by a human, it was developed with Ejik in mind: he would acquire enough intelligence through DRL to play it himself.

The game is available HERE

The source is HERE

Adding Intelligence to the Main Character

Our character initially has no idea what is going on and has no access to the internal state of the game. By taking action in the environment and collecting rewards or incurring costs he is eventually able to achieve the goal the environment has set for him.

Environment and Observations

Unity environment for Ejik is simply the main and only level of the game. It can be taken over by the ML-Agent Python API with the help of which Ejik’s brain is outfitted with a trained deep neural net implemented in PyTorch.

Ejik’s view of the environment is obtained from a camera that is attached to the player (its position is always (0, 0) relative to the player himself) and gives a 300×200 pixels resolution view of the surroundings. We extract the grey scale image to use as an observation. The camera is positioned in such a way so it does not see the background.

3D View of Ejik Camera

EjikBrain properties are also set at design time. It also defines the action space explained below:

EjikBrain Settings: observation and action spaces

It is important not to forget to check the Control box when adding the brain to the Academy (object that oversees autonomous actions in the environment)

This will make sure the brain can be driven by an external process.

This notebook is handy for exploring the environment from Ejik’s point of view.

This is an example of a single frame that the Ejik-attached camera sees. The camera itself is rendering to RenderTexture (a very convenient feature of ML-Agents). This setup implies that the user is responsible for the actual rendering of the image into its RenderTexture every time a new observation is required.

Observations are sent to the external brain on FixedUpdate, so this is a good place to call renderCamera.Render() where renderCamera is the camera attached.

Details on defining observations are covered in this Unity ML-Agents doc.

The motivation behind using visual rather than vector observations was:

Ease of implementation: just render to a texture and done
Easy to debug by simply visualizing
Felt more interesting: the brain is going to be seeing what the character is seeing (kinda) and so whatever it learns should be all the more impressive.

Actions

We need to enable 3 types of actions:

movement
swinging the weapon
shooting

So, how do we define the action space? Should it be continuous, discrete, or hybrid? While the shooting action is clearly discrete (shoot/don’t shoot), swinging is continuous rather than discrete: the angle can be anything in but we can discretize it. Movement can be either discrete or continuous depending on what we want to do: we can represent it as a direction in degrees in or a vector (x, y) that would yield the same direction in , and then it would be continuous, or we can mimic game controls and represent it as a vector of “Up”, “Down”, “Left”, “Right” discrete directions.

Eventually I decided to represent movement as a continuous action in , mainly because it’s easy to represent swinging the weapon as continuous, now shooting had to be made continuous or a hybrid space could be implemented. This paper looked promising in its treatment of hybrid action spaces, however, for the sake of simplicity in this first experiment, I made the shooting action continuous by simply mapping it over : positive value means “shoot” non-positive – “don’t shoot”.

The Game Arithmancy. Reward Function.

The player starts with 0.5 health points and faces 3 kinds of enemies: Raven, Moth, and Horse. Raven and Moth attack at set intervals and the attack lasts for some time, causing damage of 0.1 (-0.1 reward). These values (interval, duration) in sec are (5, 1) for Raven and (2, 2) for Moth. Horse does not attack, it spawns Moth. Once player health falls below 0, the player dies. The death blow cost is 0.5 instead of 0.1.

Ejik’s dandelion can shoot projectiles (with a 0.1 sec interval) and its blossom is deadly. The weapon inflicts a 0.1 damage on the enemies. Killing an enemy adds to Ejik’s (player’s) health (the player is rewarded the same number of points): 0.02 for Moth and Raven, 0.05 for Horse. Enemies possess health points as well: 0 for Moth and 0.1 for Raven. Death occurs for everyone in the game when health < 0.

Thus we get our reward function.

Ejik moves with the speed of 5, enemies with max speed of 3 for Raven and 2 for Moth. Horse doesn’t move. The A* Pathfinding component is in charge of guiding the enemies towards player.

An episode consists of 3000 steps, for which the current goal is to survive. A more complex goal would be to get some positive reward, but that’s for later.

Rewards correlate directly with health, but while initial health = 0.5, initial reward = 0, so it is possible to survive an episode and get a negative reward. Dying before the episode ends will always cause the reward to be < -0.9 (lose all health and get a -0.5 bonus on top of it). Minimal possible reward is -1.0.
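
The bookkeeping above can be sketched as a small tally. This is only an illustration built from the numbers in the text, not the game's actual code: the event names and the function are hypothetical, and I am assuming the death blow is the hit that would push health below 0. Values are kept in hundredths of a point so the arithmetic stays exact.

```python
# All values are in hundredths of a point (50 == 0.5 health points).
START_HEALTH = 50
HIT_COST = 10        # 0.1 per enemy attack
DEATH_COST = 50      # 0.5 for the death blow, instead of 0.1
KILL_BONUS = {"Moth": 2, "Raven": 2, "Horse": 5}


def episode_reward(events):
    """Tally the reward for a list of events: "hit" or the name of a killed enemy.

    Assumption: the death blow is the hit that would push health below 0.
    Returns (reward, alive), with reward in hundredths of a point.
    """
    health, reward = START_HEALTH, 0
    for event in events:
        if event == "hit":
            if health - HIT_COST < 0:
                return reward - DEATH_COST, False
            health -= HIT_COST
            reward -= HIT_COST
        else:
            health += KILL_BONUS[event]
            reward += KILL_BONUS[event]
    return reward, True


print(episode_reward(["hit"] * 6))              # (-100, False): -1.0, the minimum
print(episode_reward(["hit", "Raven", "hit"]))  # (-18, True): alive, negative reward
```

Under this reading the reward equals health minus 0.5 for as long as Ejik is alive, which is why a death always lands between -1.0 and -0.9.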

The number of enemies is parameterized: enemies are generated spontaneously once there are too few of them in the game, and generation stops once there are enough. This only applies to Raven and Horse, since Moth is spawned by Horse at fixed intervals while Horse exists. Enemies appear at random points in the scene outside a certain radius away from the player. Training was done with 1 <= Raven <= 3 and a maximum of 1 Horse; during evaluation, however, the number of Ravens was increased to the range [3, 6].

Model Input

Each of the 3000 steps in one episode occurs on a Unity physics update (FixedUpdate), which happens every 0.02 seconds. The input to the model consists of 6 game frames, each grabbed at 3-frame intervals, so one model input spans 6 × 3 = 18 steps. Thus during 3000 steps there are 3000 / 18 ≈ 167 decisions (rounded up), i.e. action vectors that the model outputs, or fewer if Ejik dies prematurely.
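
The arithmetic behind the number of decisions per episode can be checked in a few lines. The variable names are mine, and I am assuming a new decision is made once per non-overlapping stack of 6 frames, which matches the maximum episode length of 167 reported further down.

```python
import math

EPISODE_STEPS = 3000    # FixedUpdate steps per episode
STEP_SECONDS = 0.02     # Unity physics update interval
FRAMES_PER_INPUT = 6    # frames stacked into one model input
FRAME_INTERVAL = 3      # steps between grabbed frames

steps_per_decision = FRAMES_PER_INPUT * FRAME_INTERVAL
decisions_per_episode = math.ceil(EPISODE_STEPS / steps_per_decision)
seconds_per_decision = steps_per_decision * STEP_SECONDS  # about 0.36 s of game time

print(steps_per_decision, decisions_per_episode)  # 18 167
```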

Once the episode ends, the environment resets, i.e. Ejik is returned to the (0,0) coordinate of the level and his health is set to 0.5.

Random Play

Let’s check out what happens in the environment if we are making random decisions.

Random Game

If we collect game stats over 1000 random episodes we will get:

Mean random reward: -0.6317, std: 0.367, episode length: 131.582

Histogram of survival time (maximum 167) on the left, distribution of rewards – on the right.

Thus played at random we mostly die: the -1.0 “death bar” is prominently featured on the reward distribution and certain survival happens less than half the time (number of steps in an episode =167).

Unfortunately it turns out that beating the random game is easy. While our random play was:

actions = np.random.randn(4)

We need but minimal determinism: shoot for sure at every step:

actions = np.r_[np.random.randn(3), [0.5]]

Mean shooting reward: 0.0592, std: 0.294, episode length: 165.329

Immediately we start surviving rather than dying with virtual certainty.

Still it would be interesting to see if at the very least a DRL algorithm can learn this simple strategy. After all, mapped to the spectrum between Tic-tac-toe and Go, we are closer to the former, so “simple” is not unexpected.

Model and Algorithm

The PPO algorithm seems an obvious choice due to:

ease of implementation
promise of fast convergence

This is a policy-based Actor Critic algorithm where we assess the advantage (critic) of an old policy and guide the evolving policy (actor) towards maximizing this advantage. I also like it because it is on-policy, so we can go through a sequence of steps and discard our data right away, no need for a replay buffer with its memory demands.
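
As a rough illustration of guiding the new policy towards a larger advantage without letting it drift too far from the old one, here is a NumPy sketch of the clipped surrogate objective commonly used with PPO. This is an assumption on my part: the clipped variant with clip_eps = 0.2 is shown for illustration and may differ from the actual implementation; the function and parameter names are hypothetical.

```python
import numpy as np


def ppo_clip_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate objective (to be maximized), averaged over samples."""
    # Probability ratio between the evolving policy and the old policy
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum removes the incentive to move the ratio outside the clip range
    return np.minimum(unclipped, clipped).mean()
```

Because the objective only needs log-probabilities under the old and new policies for the same batch of steps, the batch can be discarded after the update, which is the on-policy property mentioned above.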

The code for this implementation is on my GitHub.

After some experimentation, the model I picked was this (details are in drl\PPO\model.py):

Both Actor and Critic share a CNN feature extractor:

    def hidden_layers(self):
        return [
            nn.Conv2d(self.state_dim[0], 16, 4, stride=4),
            nn.LeakyReLU(),
            nn.Conv2d(16, 32, 3, stride=2),
            nn.LeakyReLU(),
            nn.Conv2d(32, 64, 3, stride=2),
        ]

Here self.state_dim[0] is the number of stacked frames (6 in our case). The CNN wants its input tensors in the NCHW format: N is the batch size, C the number of channels (6 in our case), H the height (200), and W the width (300).
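
To see what these three convolutions do to a 6 × 200 × 300 input, the output shapes can be traced by hand. The sketch below uses the standard output-size formula for an unpadded, undilated convolution (the nn.Conv2d defaults). I am assuming the resulting flattened size is what conv_size refers to in the model; that is my reading, not something stated in the source.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded, undilated convolution along one axis."""
    return (size - kernel) // stride + 1


height, width = 200, 300
channels = 6
# (out_channels, kernel, stride) for the three Conv2d layers above
for out_channels, kernel, stride in [(16, 4, 4), (32, 3, 2), (64, 3, 2)]:
    height = conv_out(height, kernel, stride)
    width = conv_out(width, kernel, stride)
    channels = out_channels
    print(channels, height, width)

flat_size = channels * height * width
print(flat_size)  # 12672
```

The intermediate shapes are 16 × 50 × 75, 32 × 24 × 37 and 64 × 11 × 18.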

The rest of the layers for Critic and Actor are:

        self.fc_hidden = self.hidden_layers()
        fc_critic = self.fc_hidden \
            + [Flatten(), 
                nn.Linear(conv_size, conv_size // 2),
                nn.LeakyReLU(),
                nn.Linear(conv_size // 2, 1)]

        fc_actor = self.fc_hidden \
            + [Flatten(), 
                nn.Linear(conv_size, conv_size // 2),
                nn.Tanh(),
                nn.Linear(conv_size // 2, self.action_dim), 
                nn.Tanh()]

        self.actor = nn.Sequential(*fc_actor)
        self.critic = nn.Sequential(*fc_critic)
        self.log_std = nn.Parameter(torch.zeros(1, act_size))

Since the action output is a vector of means of a multivariate Gaussian, we also need a vector to represent the variances, so we can fully represent a distribution from which final actions will be computed.
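
A minimal NumPy sketch of how a mean vector and a log-standard-deviation vector define the distribution actions are drawn from. The diagonal covariance is an assumption, as are the function name and the example mean; this is not the code from model.py.

```python
import numpy as np


def sample_action(mean, log_std, rng):
    """Sample from a diagonal Gaussian and return (action, log_prob)."""
    std = np.exp(log_std)
    action = mean + std * rng.standard_normal(mean.shape)
    # Sum of per-dimension Gaussian log-densities
    log_prob = np.sum(
        -0.5 * ((action - mean) / std) ** 2 - log_std - 0.5 * np.log(2.0 * np.pi)
    )
    return action, log_prob


rng = np.random.default_rng(0)
mean = np.array([0.1, -0.3, 0.0, 0.5])  # stand-in for the actor's Tanh output
log_std = np.zeros(4)                   # zeros, like the initial log_std parameter
action, log_prob = sample_action(mean, log_std, rng)
```

Learning log_std rather than the standard deviation itself keeps the standard deviation positive without any constraint on the parameter.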

I used Adam optimizer with a slowly decaying learning rate which started at . The details are in drl\PPO\driver.py

Training and Evaluation

The model as described trains for ~10-12 hours while the game runs at 10 times the regular speed. So 600 games/hr are played and the brain catches on after about 6000 games and doesn’t improve after that. The entire process can be observed in Tensorboard. Tensorboard consumable events are output to the runs subdirectory.

We evaluate our model, as before, by playing 1000 games. Here is the comparison of the resulting brain activity with random and minimally-deterministic strategies described above (all charts are generated by the Ejik Analysis notebook):

Mean trained reward: -0.0002, std: 0.316, episode length: 163.675

The resulting weights can be downloaded here. drl\PPO\eval.py can be used to run the environment after building the MainScene in Unity.

In the Academy object, EjikAcademy component, Inference Configuration / Time Scale needs to be changed to 1 before building for real-time replay.

Conclusions

Clearly, we have achieved our survival goals. The brain powered by our AI survives almost as well as the minimally-deterministic strategy. It is disappointing, though, that we didn’t do better, considering the simplicity of our non-AI strategy.

For the future, it would be interesting to investigate a more “strict” reward function: assign cost to shooting, disable the weapon ability to cause touch damage, make enemies more threatening by decreasing attack intervals, etc.

It would also be interesting to investigate the hybrid action model, where shooting decision remains discrete.

Finally, it would be interesting to make the game purely DRL-based, where all participants truly “know not what they do” initially but gradually fall into their roles as training progresses.

Or perhaps raising the stakes all around is the way to go. Perhaps a brand-new more complex game is in order!

Supercharging Object Detection in Video: from Glacial to Lightning Speed

fierval — Mon, 25 Mar 2019 15:35:19 +0000

In the following series I will explore different tools and techniques for doing object detection in streaming video in real time or faster, starting with the baseline Python detector running slowly and gradually picking up speed.

In this series

In the course of these posts we will explore optimizing object detection in videos. We will use an SSD detector and trace its evolution from the baseline performance of ~19 fps to the vertiginous 200 fps. We will need a powerful GPU for this: the one I benchmarked was Titan V, but any other powerful CUDA-enabled device will do. Here is the sequence of the posts:

Introduction: Baseline
Setup – how to build everything we need for these experiments
First App – first optimization results
Optimizing Decoding and Graph Feeding – address video decoding and network setup issues to extract performance gains
TensorRT 5 – achieving our turbo goals with the NVIDIA TensorRT framework

Baseline

Suppose you are streaming a video to your local machine with a Volta GPU (mine is Titan V if I’m doing it at home or a V-100 if I’m grabbing an Azure VM). You picked Tensorflow as your framework and you have trained an object detector, let’s say an Inception V2 SSD with transfer learning, or even forget training. You are using a stock COCO dataset trained Inception V2 SSD detector.

The result you want is something like the video above, where the objects are detected and marked in real time. If you google Python code for it, you will end up with something like this:

class TfOD:
    '''
    Detectors
    '''
    pred_keys = ['num_detections', 'detection_boxes', 'detection_scores', 'detection_classes', 'detection_masks']
    
    def __init__(self, detpath=None, labelpath=None):
        self.detpath = detpath
        self.labelpath = labelpath

        if self.detpath is not None:
            self.load_model()
        if self.labelpath is not None:
            self.get_categories()

        self.sess = tf.InteractiveSession(graph=self.G)

    def __enter__(self):
        return self

    def __exit__(self,  exc_type, exc_val, exc_tb):
        if self.sess is not None:
            self.sess.close()

    def load_model(self):
        self.G = tf.Graph()
        with self.G.as_default():
            od_graph_def = tf.GraphDef()
            with tf.gfile.GFile(self.detpath, 'rb') as fid:
                serialized_graph = fid.read()
                od_graph_def.ParseFromString(serialized_graph)
                tf.import_graph_def(od_graph_def, name='')

    def get_categories(self, NUM_CLASSES=90):
        label_map = label_map_util.load_labelmap(self.labelpath)
        categories = label_map_util.convert_label_map_to_categories(
            label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
        self.categories = categories

    def detect(self, images, threshold=0.8, denormalize_coordinates=True):
        '''
        Detection generator
        
        Parameters:
            images: list of numpy arrays representing (height, width, 3) images
            threshold: detection threshold
            denormalize_coordinates: should coordinates be converted from relative (normalized) to absolute pixel values
        Returns:
            iterable of dictionaries of pred_keys detections (per image)
        '''

        # Get handles to input and output tensors
        ops = tf.get_default_graph().get_operations()
        all_tensor_names = {
            output.name for op in ops for output in op.outputs}
        tensor_dict = {}

        for key in self.pred_keys:
            tensor_name = key + ':0'
            if tensor_name in all_tensor_names:
                tensor_dict[key] = tf.get_default_graph().get_tensor_by_name(tensor_name)

        image_tensor = tf.get_default_graph().get_tensor_by_name('image_tensor:0')

        for i, image in enumerate(images):
            ts = perf_counter()
            tfpred = self.sess.run(tensor_dict, feed_dict={image_tensor: np.expand_dims(image, 0)})
            te = perf_counter()
            
            # extract actual predictions
            tfpred = {k : v[0] for k, v in six.iteritems(tfpred)}
            
            tfpred['detection_classes'] = tfpred['detection_classes'].astype(np.int32)
            
            # threshold
            thresh = tfpred['detection_scores'] >= threshold
            tfpred = {k: v[thresh] for k, v in six.iteritems(tfpred) if k != 'num_detections' and k != 'time'}
            tfpred['num_detections'] = len(np.nonzero(thresh)[0])
            
            if denormalize_coordinates:
                width = image.shape[1]
                height = image.shape[0]
                
                detections_boxes = []
                for r in tfpred['detection_boxes']:
                    y1, x1, y2, x2 = r
                    detections_boxes.append([int(y1 * height), int(x1 * width), int(y2 * height), int(x2 * width)])
                tfpred['detection_boxes'] = np.array(detections_boxes)
                
            tfpred['time'] = te - ts

            yield tfpred
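
The per-box denormalization loop in detect() can also be written as a single vectorized NumPy operation. The helper below is a hypothetical alternative, not part of the detector above; it assumes boxes arrive as normalized (y1, x1, y2, x2) rows, as they do in the loop.

```python
import numpy as np


def denormalize_boxes(boxes, height, width):
    """Scale normalized (y1, x1, y2, x2) boxes to integer pixel coordinates."""
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    scale = np.array([height, width, height, width], dtype=np.float64)
    # astype() truncates towards zero, like int() in the original loop
    return (boxes * scale).astype(np.int32)


print(denormalize_boxes([[0.25, 0.5, 0.75, 1.0]], height=720, width=1280))
# [[ 180  640  540 1280]]
```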

You can instantiate it and run detect() on every frame of your video. The frames will be extracted using OpenCV:

import time

import cv2
import numpy as np
from object_detection.utils import visualization_utils as vis_utils

# video_path, frozen_graph_path, label_file_path, category_index and
# non_max_suppression_with_tf are assumed to be defined earlier in the notebook

cap = cv2.VideoCapture(video_path)
nFrame = 30  # report fps every nFrame frames
iFrame = 0

start = time.time()
pred_ims = []
with TfOD(frozen_graph_path, label_file_path) as detector:
    while iFrame < 500:
        ret, frame = cap.read()
        if not ret:
            break

        iFrame += 1
        if iFrame % nFrame == 0:
            end = time.time()
            total = float(end - start)
            fps = float(nFrame) / total
            print("fps: {:.2f}".format(fps))
            start = time.time()

        threshold = 0.1
        # we really suck with the current model...

        for i, (pred, im) in enumerate(zip(detector.detect([frame], threshold=threshold), [frame])):
            rects = pred['detection_boxes']
            scores = pred['detection_scores']
            classes = pred['detection_classes']
            if len(rects) > 0:
                rects, scores = non_max_suppression_with_tf(detector.sess, rects, scores, 5, threshold)
                rects = np.array(rects)

                img = vis_utils.visualize_boxes_and_labels_on_image_array(
                    im,
                    rects, classes, scores, category_index,
                    instance_masks=None, use_normalized_coordinates=False,
                    line_thickness=8, min_score_thresh=threshold
                )
                pred_ims.append(img)

cap.release()

Then you launch it on a 720 HD video stream running at 30 fps (frames per second) and performance you get is:

fps: 8.24
fps: 22.16
fps: 22.28
fps: 20.58
fps: 20.63
fps: 19.28
fps: 19.74
fps: 18.91
fps: 18.93
fps: 19.08
fps: 19.01
fps: 18.91
fps: 16.68
fps: 17.71
fps: 17.55
fps: 17.51

(From my Jupyter Notebook, collecting data every 30 frames). Notice that I’m clocking:

Decoding and reading a video frame
Running the object detector on it.

All the auxiliary stuff (in this case – collecting individual frames with detection results drawn on them) is not included in the timing.

OK, not quite glacial, and more than halfway to real time: averaged over the readings listed above, we get about 18.6 fps.

Let’s look at one of the frame grabs:

So we are doing the right thing.

Gearing up

In the course of these posts we will achieve an order of magnitude performance increase on a single stream coming into a single GPU without sacrificing accuracy (too much).

We can then take this project to the next level with NVIDIA DeepStream and edge devices where up to 30 streams can be piped into a single GPU and still keep the real time detection performance on each stream.

The purpose of this is learning, observing, touching things with our own hands, so let’s get technical!

Supercharging Object Detection in Videos: Setup

fierval — Mon, 25 Mar 2019 15:29:39 +0000

We started from the Python object detector performance as baseline (~ 19 fps).

Next we ditch Python and all our pre-installed libraries and custom build everything. C++ will become the development environment not just because it’s more “bare bones” than Python and thus more performant but also to access functionality not available in Python.

Environment

NVIDIA CUDA supporting GPU
Ubuntu 16.04 LTS
NVIDIA Driver v410
CUDA 10 (or 9 for those feeling less adventurous)
TensorRT 5.0.2
Anaconda Python latest release
CMake 3.8+ (for CUDA kernel compilation)
Tensorflow r1.12+ (with Bazel 0.19.2 to build it)
OpenCV 3.3+
Inception SSD V2 Object Detector frozen Tensorflow graph

I assume that all libraries and build tools (gcc v5, etc) necessary are already installed or will be while installing the above toolkits.

Drivers

Install v396 for CUDA 9 or v410 for CUDA 10.

Install the required NVIDIA driver

IMPORTANT: After installation create symlinks to codec libraries:

sudo ln -s /usr/lib/nvidia-396/libnvcuvid.so /usr/lib/libnvcuvid.so
sudo ln -s /usr/lib/nvidia-396/libnvcuvid.so.1 /usr/lib/libnvcuvid.so.1

NVIDIA Toolkits

Install CUDA 9 (CUDA 10 for those who like only the newest and shiniest toys) and TensorRT.

NOTE: If installing CUDA 10, you need to copy all the files called dynlink_*.hpp from /usr/local/cuda/include in CUDA 9 toolkit into the same directory of CUDA 10. NVIDIA has removed these from the toolkit as the codec is being separated into its own SDK.

Anaconda Python

Install the latest Python 3 Anaconda distribution

Once installed, create a new Python 3.6 environment.

conda create -n py36 python=3.6 anaconda
source activate py36

This will activate the newly installed Python 3.6

Tensorflow

Follow instructions to build tensorflow from source. Skip to “Install Bazel” section. Install Bazel 0.19.2. Do it all from the Anaconda prompt above with the py36 environment active. Checkout r1.12 (or anything later than r1.10).

You may need to download dependencies (Eigen, Protobuf) into the build tree by running tensorflow/tensorflow/tree/master/tensorflow/contrib/makefile/download_dependencies.sh. This is not always a good idea: I had the latest (3.3.7) version of Eigen downloaded this way break during compilation. No big deal, the components download into the tensorflow/tensorflow/tree/master/tensorflow/contrib/makefile/downloads directory and can be deleted from there if you already have a suitable version of Eigen (3.3.6) or protobuf installed. (Eigen has an “xcopy” installation, you will just need to copy Eigen and unsupported directories from the distribution to /usr/bin/include).

When running ./configure

Point at directories of your py36 environment created above.

Make sure you answer “yes” to CUDA support. Answer “Yes” or “No” to TensorRT support; it’s not going to matter for this exercise. Select the appropriate architecture for your GPU (7.0 for Volta): this makes things run much faster when executing inference code.

After running ./configure, fixing your bazel configuration may be required if the build does not start. Locate .bazelrc in your /tensorflow directory. Add the following line at the top of it:

import /tensorflow/tools/bazel.rc

When building pip package or any tensorflow related target you do not need to specify --config=cuda. So skip to “Build pip package” and follow instructions under CPU-only. It will do the right thing and build with CUDA support.

Validate that everything works. In a new terminal, from a location other than where tensorflow code resides:

source activate py36
python
import tensorflow as tf
a = tf.constant([1, 2, 3])
b = tf.constant([4, 5, 6])
c = a + b
sess = tf.Session()
sess.run(c)
sess.close()
exit()

If this works, build the C++ library

bazel build //tensorflow:libtensorflow_cc.so

CMake

CMake package installed on Ubuntu 16.04 by default is v3.5. sudo apt remove it if installed. We are going to need v3.8+, so download the latest from CMake site and install it. After installation the easiest thing is to create symlinks to all the new CMake executables in /usr/bin

OpenCV

Follow instructions to install. You can probably skip the “[compiler]” sub-section of the Required Packages section. If you want to install OpenGL GTK hooks that may be useful later, install OpenGL libraries first and then:

sudo apt-get install libgtkglext1 libgtkglext1-dev

Checkout 3.3.1 (or later) in both opencv and opencv_contrib repos.

Build OpenCV using cmake-gui

Set OPENCV_EXTRA_MODULES_PATH to /modules
BUILD_PERF_TESTS and BUILD_TESTS are unchecked
WITH_CUDA is checked
BUILD_opencv_cudacodec and WITH_NVCUVID are checked
CMAKE_BUILD_TYPE set to Debug if you want to step into OpenCV code later. Or leave blank.
Hit Configure
Hit Generate

make -j7 #runs 7 jobs in parallel
sudo make install
sudo ldconfig

Stage for Build

Finally, copy tensorflow include files and libraries we built above to the location where our future builds will pick them up.

cd 
sudo mkdir /usr/local/tensorflow
sudo mkdir /usr/local/tensorflow/include

sudo cp -r tensorflow/contrib/makefile/downloads/eigen/Eigen /usr/local/tensorflow/include/
sudo cp -r tensorflow/contrib/makefile/downloads/eigen/unsupported /usr/local/tensorflow/include/

sudo cp tensorflow/contrib/makefile/downloads/nsync/public/* /usr/local/tensorflow/include/
sudo cp -r bazel-genfiles/tensorflow /usr/local/tensorflow/include/
sudo cp -r tensorflow/cc /usr/local/tensorflow/include/tensorflow
sudo cp -r tensorflow/core /usr/local/tensorflow/include/tensorflow

sudo mkdir /usr/local/tensorflow/include/third_party
sudo cp -r third_party/eigen3 /usr/local/tensorflow/include/third_party/

sudo mkdir /usr/local/tensorflow/lib
sudo cp bazel-bin/tensorflow/libtensorflow_*.so /usr/local/tensorflow/lib

All done! We will validate the installation in the next post

Supercharging Object Detection in Video: First App

fierval — Mon, 25 Mar 2019 15:28:57 +0000

Tensorflow C++ Video Detector

It is time to validate all this arduous setup work, run our first C++ detector and reap the first benefits. You may clone this repository, which is a fork of this repository, modified and adapted to the modern times.

Ensuring the Right Build Paths

Note the following excerpt from CMakeLists.txt:

set(MYHOME $ENV{HOME})

# IMPORTANT: Protobuf includes. Depends on the anaconda path
# This is Azure DLVM (not sure if DSVM is the same)
include_directories("/data/anaconda/envs/py36/lib/python3.6/site-packages/tensorflow/include/")
# This is a standard install of Anaconda with p36 environment
include_directories("${MYHOME}/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/include/")

The include path we set here is used for imported tensorflow includes; it solves protobuf compilation problems. In theory, if the right protobuf is installed, this is not necessary. But I can’t quite get it right, so I use this as a band-aid. Definitely something to fix properly.

OpenCV Mat -> Tensorflow Tensor

In Python we used numpy arrays for storing decoded frames and passing this data to Tensorflow. Here we need to give it a second thought.

Check out this function that bridges between video frames decoded by OpenCV and Tensorflow tensors:

Status readTensorFromMat(const Mat &mat, Tensor &outTensor) {

    // Trick from https://github.com/tensorflow/tensorflow/issues/8033
    uint8_t *p = outTensor.flat<uint8_t>().data();
    Mat fakeMat(mat.rows, mat.cols, CV_8UC3, p);
    mat.copyTo(fakeMat);

    return Status::OK();
}

Apparently tensors and Mats have a compatible structure, so we can just fill the tensor with the right data.
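
The idea of wrapping the tensor's buffer in a Mat header can be mimicked in NumPy, which makes it easier to see why no extra copy is needed. This is only an analogy under stated assumptions: the arrays below stand in for the OpenCV and Tensorflow objects, and the (1, H, W, 3) uint8 tensor shape is assumed rather than taken from this function.

```python
import numpy as np

height, width = 4, 6
frame = np.arange(height * width * 3, dtype=np.uint8).reshape(height, width, 3)

# Stand-in for the (1, H, W, 3) uint8 input tensor
tensor = np.zeros((1, height, width, 3), dtype=np.uint8)

# Stand-in for fakeMat: an (H, W, 3) view over the tensor's buffer, no copy made
fake_mat = tensor.reshape(height, width, 3)

# Stand-in for mat.copyTo(fakeMat): writing into the view fills the tensor
np.copyto(fake_mat, frame)

print(np.shares_memory(tensor, fake_mat))  # True
print(np.array_equal(tensor[0], frame))    # True
```

Both layouts are contiguous rows of interleaved 3-byte pixels, so a single copy of the frame into the shared buffer is all that is required.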

Build and Run

We can now build the app:

cd 
mkdir build
cd build
cmake .. # cmake -DCMAKE_BUILD_TYPE=Debug ..
make

Cloning the repository downloads frozen_inference_graph.pb as well as classes.pbtxt for Inception V2 SSD detector into the demo/ssd_inception_v2 subfolder. A sample video is downloaded into the same folder as well. You can change these values at the top of the main function, or better still, set up command line parameter parsing.

string ROOTDIR = "../";
string LABELS = "demo/ssd_inception_v2/classes.pbtxt";
string GRAPH = "demo/ssd_inception_v2/frozen_inference_graph.pb";
string VIDEO_FILE = "demo/ssd_inception_v2/ride_2.mp4";

We are now better than real time @ ~34 fps on Titan V. To make this official:

Next Steps

Time to get serious.
While our video decoding is efficient we are doing a few completely unnecessary memory copies: GPU -> System upon decoding and then System -> GPU once we start running it through the Tensorflow graph. Incidentally, in all regular Tensorflow pipelines it is assumed that Tensorflow graph is fed from the system memory.

The next step is to eliminate these moves and learn to feed Tensorflow graph from the GPU.

Supercharging Object Detection in Video: Optimizing Decoding and Graph Feeding

fierval — Mon, 25 Mar 2019 15:28:28 +0000

In the previous post we validated our install and ran a simple detector in C++. It is now time to start optimizing it. Source code for the finished project is here.

Optimizing Video Decoding

If we build and run the video_reader.cpp OpenCV sample, we will observe a staggering performance improvement available in OpenCV for decoding and reading video.

It is somewhat tricky to make the actual sample work, so I summarized the necessary steps gleaned from some wise folks on GitHub Issues in this repo.

As the screenshot above shows, we have an order of magnitude performance improvement by decoding the video and leaving frames on the GPU. At this point this is our performance increase potential: not only will it allow us to skip unnecessary and expensive memory copies, but also will set the stage for TensorRT which consumes data already on the GPU.

The first step towards this goal is to optimize feeding the Tensorflow graph.

Feeding Tensorflow Graph from the GPU

We are now working with the final version of this application from this repo. The first thing to do is to allocate a GPU tensor and fill it with decoded data, which, at this point, is also residing on the GPU in a GpuMat structure. Let’s deal with this copy first. Here we are just as lucky as we were with bridging Mat with Tensorflow tensors.

Status readTensorFromGpuMat(const cv::cuda::GpuMat& g_mat, Tensor& outTensor) {
    tensorflow::uint8 *p = outTensor.flat<tensorflow::uint8>().data();
    cv::cuda::GpuMat fakeMat(g_mat.rows, g_mat.cols, CV_8UC3, p);

    // comes in with 4 channels -> 3 channels
    cv::cuda::cvtColor(g_mat, fakeMat, COLOR_BGRA2RGB);

    return Status::OK();
}

A noteworthy bit here is on line 6: the decoded frame has 4 channels, we use cvtColor to drop the transparency channel our network does not use.
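
What the BGRA to RGB conversion does to each pixel can be shown with plain NumPy indexing. This is an illustrative CPU-side sketch with a hypothetical helper name, not the CUDA call used above.

```python
import numpy as np


def bgra_to_rgb(frame):
    """Drop the alpha channel and reorder B, G, R, A -> R, G, B."""
    return frame[..., [2, 1, 0]]


pixel = np.array([[[10, 20, 30, 255]]], dtype=np.uint8)  # B=10, G=20, R=30, A=255
print(bgra_to_rgb(pixel))  # [[[30 20 10]]]
```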

Allocating CUDA Tensor

Careful study of the Tensorflow source code yields the following way to allocate a tensor on the GPU:

// GPU allocator
#include "tensorflow/core/common_runtime/gpu/gpu_id.h"
#include "tensorflow/core/common_runtime/gpu/gpu_id_utils.h"
#include "tensorflow/core/common_runtime/gpu/gpu_init.h"
#include "tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.h"

const string gpu_device_name = GPUDeviceName(session.get());

// allocate tensor on the GPU
tensorflow::TensorShape shape = tensorflow::TensorShape({1, height, width, 3});

tensorflow::PlatformGpuId platform_gpu_id(0);

tensorflow::GPUMemAllocator *sub_allocator =
    new tensorflow::GPUMemAllocator(
        tensorflow::GpuIdUtil::ExecutorForPlatformGpuId(platform_gpu_id).ValueOrDie(),
        platform_gpu_id, false /*use_unified_memory*/, {}, {});

tensorflow::GPUBFCAllocator *allocator =
    new tensorflow::GPUBFCAllocator(
        sub_allocator, shape.num_elements() * sizeof(tensorflow::uint8), "GPU_0_bfc");

inputTensor = Tensor(allocator, tensorflow::DT_UINT8, shape);

To confirm the tensor is indeed residing on the GPU:

bool IsCUDATensor(const Tensor &t)
{
    cudaPointerAttributes attributes;
    cudaError_t err =
        cudaPointerGetAttributes(&attributes, t.tensor_data().data());
    if (err == cudaErrorInvalidValue)
        return false;
    CHECK_EQ(cudaSuccess, err) << cudaGetErrorString(err);
#if CUDART_VERSION >= 10000
    return (attributes.type == cudaMemoryTypeDevice);
#else
    return (attributes.memoryType == cudaMemoryTypeDevice);
#endif
}

CUDA 10 deprecates the memoryType attribute, so the conditional compilation avoids compiler warnings.

Feeding Tensorflow Graph from the GPU

Doing this is not standard. See a long discussion on GitHub.
The API for it is experimental, so things will probably change, but as of release 1.12 it still works. This is a sample from Google.

In our case this works:

CallableOptions opts;
std::unique_ptr<Session> session;
Session::CallableHandle feed_gpu_fetch_cpu;

const string inputLayer = "image_tensor:0";
const vector<string> outputLayer = {"detection_boxes:0", "detection_scores:0", "detection_classes:0", "num_detections:0"};

opts.add_feed(inputLayer);
for (auto const &value : outputLayer)
{
    opts.add_fetch(value);
}

const string gpu_device_name = GPUDeviceName(session.get());
opts.clear_fetch_devices();
opts.mutable_feed_devices()->insert({inputLayer, gpu_device_name});

auto runStatus = session->MakeCallable(opts, &feed_gpu_fetch_cpu);
if (!runStatus.ok())
{
    LOG(ERROR) << "Failed to make callable";
}
runStatus = session->RunCallable(feed_gpu_fetch_cpu, {inputTensor}, &outputs, nullptr);
....

We can compare the results by looking at NVIDIA Profiler results for our previous app and the current one:

Feeding from the CPU

Feeding from the GPU

(See the regions framed in deep pink on images above, marking large chunks of memory moved from host to device in the top snapshot)

The profile was taken over 20 seconds, and we can see the difference in bytes moved back and forth. We can also see individual bursts of 2.76 MB moved from host to device on the “CPU” profile that do not appear on the “GPU” one. It is easy enough to calculate that 2.76 MB is the size of a decoded frame.
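
The arithmetic behind that number can be sketched as follows. The resolution is an assumption, since the post does not state it: a 1280×720 frame with 3 one-byte channels (matching the DT_UINT8 input tensor with 3 channels) comes out at about 2.76 MB. frameBytes is a hypothetical helper name.

```cpp
#include <cstddef>

// Bytes needed for one decoded frame stored as 8-bit samples.
constexpr std::size_t frameBytes(std::size_t width, std::size_t height,
                                 std::size_t channels) {
    return width * height * channels;
}

// Assumed resolution: 1280x720 (not stated in the post), 3 channels (RGB).
// 1280 * 720 * 3 = 2,764,800 bytes, i.e. roughly 2.76 MB.
static_assert(frameBytes(1280, 720, 3) == 2764800,
              "a 720p RGB frame is about 2.76 MB");
```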

Performance Gain

So, how much did we gain through all this? A whopping 10%. We expected more for all this work; however, we will use what we learned to enable bigger gains down the line. It is now time to move to TensorRT.

Supercharging Object Detection in Video: TensorRT 5

fierval — Mon, 25 Mar 2019 15:28:00 +0000

Source code for the finished project is here.

NVIDIA TensorRT is a framework used to optimize deep networks for inference by performing surgery on graphs trained with popular deep learning frameworks: Tensorflow, Caffe, etc.

Preparing the Tensorflow Graph

Our code is based on the Uff SSD sample installed with TensorRT 5.0. The guide, together with the README in the sample directory, describes the steps to convert the frozen Tensorflow graph to the UFF format used by TensorRT. Follow these steps to create a UFF file from the Tensorflow frozen graph.

OpenCV GPU to TensorRT Input

We already have sample code dealing with Inception V2 SSD, we have created a TensorRT-parsable graph from our Tensorflow graph, and we can decode video on the GPU using OpenCV. The only major challenge left is to format decoded frames stored in a GpuMat structure to match TensorRT input requirements.

This is a good opportunity for a segue on how images are stored in memory. Abstracting away the low-level memory, OpenCV stores them in a channel-last structure: HWC (Height, Width, Channel). cuDNN, used by TensorRT, is optimized for the channel-first format (CHW). Regardless, our consciousness stores them as 3 parallel planes in a 3d space joining together into one image, so we don’t really care about actual formats as long as the tools let us operate within this abstraction.

Eventually, however, we need to flatten all of this data into a 1d sequence which is our real-world address space. At this moment we need to agree on the order in which elements are marshaled. In numpy as well as C/C++, it is the “last-dimension-first” order. So, if we are storing things the TensorRT way, we wrap columns, then rows, then channels. The first picture below is roughly visualizing this process. In this case, images look like 3 different RGB (HSV, etc) planes flattened into one dimension fairly intuitively.

Channel first storage (cuDNN)

In the OpenCV world (channel-last, so channel will be the fastest changing dimension to get wrapped) the image below roughly shows the order of this 3d -> 1d mapping. You can think of it as each plane of the bottom image representing one image matrix row, of which each column consists of a three member tuple of channel values:

Channel last storage (OpenCV)

This is kind of unnatural; however, we never have to care, thanks to the 3d abstraction in Python (or the Mat and GpuMat abstractions in C++). We only start caring when we need to step outside the boundaries of our framework of choice (OpenCV). This is one such case.
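
The two flattening orders can be written down as index formulas. This is a sketch with hypothetical helper names (chwOffset, hwcOffset), and it assumes tightly packed buffers with no row padding, which is not guaranteed for a real GpuMat.

```cpp
#include <cstddef>

// Flat offset of element (c, h, w) in a channel-first (CHW) buffer:
// columns wrap first, then rows, then channels.
inline std::size_t chwOffset(std::size_t c, std::size_t h, std::size_t w,
                             std::size_t H, std::size_t W) {
    return (c * H + h) * W + w;
}

// Flat offset of the same element in a channel-last (HWC) buffer:
// channels wrap first, then columns, then rows.
inline std::size_t hwcOffset(std::size_t c, std::size_t h, std::size_t w,
                             std::size_t W, std::size_t C) {
    return (h * W + w) * C + c;
}
```

For a 2×3 image with 3 channels, the element at channel 1, row 0, column 2 sits at offset 8 in the CHW buffer and at offset 7 in the HWC buffer.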

CUDA kernels make quick work of these types of conversions, so we barely give back any of our performance gains (the decoded frame is already on the GPU). While rearranging the bytes, we also normalize the frame for our network so that each value lies in [-1, 1].
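
The project does this in a CUDA kernel; the CPU reference below only sketches the logic under stated assumptions. hwcToChwNormalized is a hypothetical name, the input is assumed tightly packed, and the mapping x * 2 / 255 - 1 is one common way to reach [-1, 1], not necessarily the exact formula used in the repo.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference: reorder an interleaved HWC uint8 image into planar CHW
// floats and scale each value from [0, 255] to [-1, 1] in the same pass.
// The scaling formula is an assumption made for illustration.
std::vector<float> hwcToChwNormalized(const std::vector<std::uint8_t>& hwc,
                                      std::size_t H, std::size_t W,
                                      std::size_t C) {
    assert(hwc.size() == H * W * C);  // tightly packed input expected
    std::vector<float> chw(hwc.size());
    for (std::size_t h = 0; h < H; ++h) {
        for (std::size_t w = 0; w < W; ++w) {
            for (std::size_t c = 0; c < C; ++c) {
                const std::size_t src = (h * W + w) * C + c;  // HWC offset
                const std::size_t dst = (c * H + h) * W + w;  // CHW offset
                chw[dst] = hwc[src] * (2.0f / 255.0f) - 1.0f;
            }
        }
    }
    return chw;
}
```

In a CUDA version each thread would typically handle one pixel, so the three nested loops collapse into a single index computation per thread.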

We get an impressive performance boost to 143 fps. We have almost reached our original goal of an order of magnitude speedup over basic Python.

INT8 Precision Mode

In order to finally nail the goal of a ten-fold performance increase we need to run our TensorRT graph with INT8 precision. You can do a trial run with ride_2.mp4.

INT8 mode requires calibration before running. Calibration is attempted automatically when the app runs if the calibration table file is not found in the application's current directory.
A calibration table is already present in the source directory, so running on the provided sample video does not require calibration.

If you want to run on your own set of videos, you may need to run your own calibration first.

Performing Your Own Calibration

NOTE: I recommend changing the ownership of the /usr/src/tensorrt directory to your own user for convenience: chown -R <your-user> /usr/src/tensorrt

To calibrate on a subset of videos:

Extract a few frames from each of the chosen videos in PPM format into /usr/src/tensorrt/data/ssd/.
Resize them to 300×300 (the size of our input tensor).
Pick a subset of frames at random and merge their file names without extension into /usr/src/tensorrt/data/ssd/list.txt, one name per line.
Run the application as usual, but do it from the /usr/src/tensorrt directory, so that it is set as the application's current directory. The sample code, which I have not changed, relies on hard-coded locations; this definitely needs fixing.

The following script performs required transformations for a single video. You can run it over a set of videos. NVIDIA recommends at least 500 frames for calibration.

ffmpeg -i ride_1.mp4 -vf fps=3,scale=300:300 /usr/src/tensorrt/data/ssd/ride_1_0%4d.ppm
ls -1 /usr/src/tensorrt/data/ssd/ \
| grep ride_1_ \
| sort -R \
| tail --line=200 \
| sed -e s'/\..*$//' >>/usr/src/tensorrt/data/ssd/list.txt

The size of the calibration dataset, as well as how it is split into batches, is controlled by the constants in infer_with_trt.cpp (lines 25 and 26). The current settings indicate that calibration will be performed on 2 batches of 50.

static constexpr int CAL_BATCH_SIZE = 50;
static constexpr int FIRST_CAL_BATCH = 0, NB_CAL_BATCHES = 2;
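
As a rough sketch of what these settings imply, the snippet below works out how many frames calibration consumes and how large one batch is. It assumes each batch draws CAL_BATCH_SIZE distinct names from list.txt and that the network input is 3×300×300 FP32; calibrationFrames and batchElements are hypothetical helpers, not part of the sample.

```cpp
#include <cstddef>

static constexpr int CAL_BATCH_SIZE = 50;
static constexpr int NB_CAL_BATCHES = 2;

// Frames consumed by calibration (assumes no frame is reused across batches).
constexpr int calibrationFrames() { return CAL_BATCH_SIZE * NB_CAL_BATCHES; }

// Float elements in one calibration batch for a C x H x W input.
constexpr std::size_t batchElements(std::size_t C, std::size_t H, std::size_t W) {
    return static_cast<std::size_t>(CAL_BATCH_SIZE) * C * H * W;
}

static_assert(calibrationFrames() == 100, "2 batches of 50 frames");
static_assert(batchElements(3, 300, 300) * sizeof(float) == 54000000,
              "about 54 MB of FP32 input per batch");
```

With these values, calibration uses 100 frames, so reaching the 500 frames NVIDIA recommends would mean raising NB_CAL_BATCHES (or the batch size) and listing enough frames in list.txt.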

Calibration takes a fairly long time, so there is no need to panic if the app appears unresponsive. Once calibration succeeds, it outputs CalibrationTableSSD in the application directory, which is then read on every subsequent execution.

Final Result

At 200 fps on the calibrated dataset, we have achieved our goal of an order of magnitude performance improvement over the basic Python detector.
