HuggingFace. Models. Spaces. Part 2

HuggingFace Spaces by Midjourney

This concludes the two-part blog entry on turning HuggingFace into a deep learning playground (and we have not even talked about all the LMs they host!). This part is mostly about Spaces, but we start with models.

We train tons of models as we experiment with hyperparameters, redesign model structures, and do all sorts of studies. It's easy to get completely lost in all the artifacts we accumulate. HuggingFace Models repos alleviate some of this burden.

I store artifacts for any and all of my models on HuggingFace. Often, if the model is very lightweight, I don't even bother to stage it in the S3 bucket for SageMaker training. One less headache. Besides the ubiquitous .from_pretrained, it is easy to download a model artifact the same way we download datasets (see the previous post):

from huggingface_hub import hf_hub_download

cached_model_file_path = \
    hf_hub_download(model_repo, file_path_in_repo, token=auth_token)

The model is downloaded into the local cache directory only once (unless the file changes in the repo), so subsequent calls resolve almost instantly.

To upload model artifacts:

from huggingface_hub import HfApi

api = HfApi()
api.create_repo(model_repo, token=auth_token, private=True, exist_ok=True)
api.upload_file(path_or_fileobj=model_file,
                path_in_repo=model_file_name,
                repo_id=model_repo, token=auth_token)

HuggingFace APIs cache datasets and models in a local cache directory.

Sometimes, when you deal with a large dataset and things take a while, you may want to take your data operations to the cloud, e.g., an AWS SageMaker notebook instance. So you grab a notebook instance and, with foresight, allocate an extra 300 GB of storage to it. Then you start loading the data with load_dataset and quickly find out that you have run out of storage.

This is because the extra storage is attached as a separate volume, not the volume where the dataset is being downloaded and cached!

Fortunately, load_dataset as well as hf_hub_download have a cache_dir parameter, so you can redirect your cache to the volume with enough allocated space. In the SageMaker notebook case, the extra volume is mounted at /home/ec2-user/SageMaker, so something like /home/ec2-user/SageMaker/hf_caches is a good choice.
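For example, a minimal sketch (the repo names are placeholders, and auth_token is your HuggingFace token):

from datasets import load_dataset
from huggingface_hub import hf_hub_download

cache_dir = "/home/ec2-user/SageMaker/hf_caches"  # on the large attached volume

# placeholder repo names, just for illustration
dataset = load_dataset("my-org/my-dataset", cache_dir=cache_dir)
model_file = hf_hub_download("my-org/my-model", "model.pt",
                             cache_dir=cache_dir, token=auth_token)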

Streamlit (and Gradio) were invented to let data scientists become app developers. Writing apps is essential if we want a way to play with our models: change parameters and see the effects, and share our results with co-workers and teams.

Streamlit (Gradio) does all this, but you still need a platform to share the app. This is the missing piece that HuggingFace Spaces provides.

I have been using Streamlit for almost two years now, and it is spreading more and more at Fetch, so this entry is going to be about Streamlit, but most of it applies to Gradio as well.

From the user perspective, a Space is a Git repo from which your application just runs. That’s all.
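Concretely, you clone it and push to it like any other Git repo, and every push rebuilds and redeploys the app (the user and Space names here are made up):

git clone https://huggingface.co/spaces/my-user/my-demo
cd my-demo
# edit app.py, then:
git commit -am "update the app" && git push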

There are several flavors of Spaces: Gradio, Streamlit, Docker, and Static. Docker has a collection of templates available. In reality, since there is no magic, whichever template you pick (Gradio or Streamlit), it will still be wrapped into a Docker container, which is then launched. So the Docker option just offers more flexibility, assuming “you know what you are doing”.
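As a side note, you can create a Space programmatically with the same huggingface_hub API we used for model repos; a minimal sketch (the repo name is made up):

from huggingface_hub import HfApi

api = HfApi()
# repo_type="space" creates a Space instead of a model repo;
# space_sdk is one of "gradio", "streamlit", "docker", "static"
api.create_repo("my-user/my-demo", repo_type="space",
                space_sdk="docker", token=auth_token, private=True)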

I found that while the Streamlit option gives you what you want almost all the time, sometimes more flexibility is required. In my case, I really wanted Streamlit's new nested columns feature, which had been released but was not yet available in Streamlit Spaces, since Spaces only supported v1.17 of the Streamlit SDK at the time I was writing my app.

Since then, I have been using Docker Spaces for my Streamlit apps, because it is a rare case where you get all the flexibility at practically no cost.
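As a toy illustration of that feature (assuming a Streamlit version that supports nesting), columns can now be placed inside columns:

import streamlit as st

left, right = st.columns(2)
with left:
    # columns nested inside a column: the feature in question
    a, b = st.columns(2)
    a.metric("loss", 0.12)
    b.metric("accuracy", 0.97)
with right:
    st.write("No nesting on this side.")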

Here is an example of a Streamlit Space:

And here is the same Space re-created using the Blank Docker template:

You can download all of these files from the Files section of the Space.

FROM pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
WORKDIR /app

COPY ./requirements.txt /app/requirements.txt
COPY ./packages.txt /app/packages.txt

# Install the system packages listed in packages.txt
RUN apt-get update && xargs -r -a /app/packages.txt apt-get install -y && \
    rm -rf /var/lib/apt/lists/*

RUN pip3 install --no-cache-dir -r /app/requirements.txt
RUN pip3 install --no-cache-dir jinja2==3.0.1
# Run as a non-root user (uid 1000), as recommended for Spaces
RUN useradd -m -u 1000 user
USER user
ENV HOME /home/user
ENV PATH $HOME/.local/bin:$PATH

WORKDIR $HOME
RUN mkdir app
WORKDIR $HOME/app
COPY . $HOME/app

# Streamlit's default port; must match app_port in the README
EXPOSE 8501
CMD streamlit run app.py

This is pretty much boilerplate. The only thing I had to add is the jinja2 pin: for some reason, my app refused to run with any other version of jinja2. I have not seen this pin in other examples, so perhaps it is no longer necessary.

The EXPOSE instruction exposes the port on which Streamlit apps run by default, and the final CMD launches the app, assuming our code is in app.py.

Also, I am using PyTorch with CUDA as the base image, but really anything that makes sense could be used.

In the world of Spaces, the README's front matter describes some key parameters that determine how the Space is run. In particular:

sdk: docker
app_port: 8501

Other settings are related to appearance, but the two above are the most important.
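For reference, the full front matter block at the top of the README looks roughly like this (title, emoji, and colors here are placeholders):

---
title: My Demo
emoji: 🚀
colorFrom: yellow
colorTo: gray
sdk: docker
app_port: 8501
pinned: false
---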

Under the .streamlit folder, config.toml describes appearance but also some key application properties:

[theme]
primaryColor="#e5ab00"
font="sans serif"

[server]
maxUploadSize = 200
enableXsrfProtection = false

The [theme] section gives your app the HuggingFace-themed look and feel, while the [server] section ensures the Streamlit file upload feature works.

In requirements.txt, specify everything you need, e.g., a version of Streamlit newer than the one supported by Streamlit Spaces.

I have also found out recently that the latest version of the HuggingFace datasets API (2.14.2) does not work in Streamlit Docker containers. Specify 2.13.1 if you are using datasets:

streamlit==1.24.1
datasets==2.13.1

You will definitely need your secret token in order to pull data and models from within your HuggingFace Space.

For this, you can set secrets (private environment variables) in your Space on its Settings page.
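Spaces exposes these secrets to your app as ordinary environment variables, so somewhere in app.py you can read the token like this (HF_TOKEN here is just whatever name you gave the secret):

import os

# "HF_TOKEN" must match the secret name set on the Space's Settings page
auth_token = os.environ.get("HF_TOKEN")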

You can first debug your app locally as you would a regular Streamlit app (e.g., using VSCode facilities to attach a debugger).

I usually put these lines somewhere at the top of my app.py:

try:
    import ptvsd
    ptvsd.enable_attach()    # listens on port 5678 by default
    ptvsd.wait_for_attach()  # block until the debugger connects
except Exception:
    pass

I don’t distribute ptvsd to the Space environment, so the exception will always be caught and suppressed there, while in my local environment the app will pause and wait for the debugger to attach.

The VSCode launch.json entry for debugging looks like this:

{
    "name": "Streamlit",
    "type": "python",
    "request": "attach",
    "connect": {
        "host": "localhost",
        "port": 5678
    },
    "pathMappings": [
        {
            "localRoot": "${workspaceFolder}",
            "remoteRoot": "."
        }
    ],
    "justMyCode": false
}

For Docker, I simply build:

docker build -t my_space .

To make sure everything runs:

docker run --rm -p 8501:8501 my_space

Or, if I want to examine the insides of the container:

docker run -it --rm my_space /bin/bash

While my HuggingFace lab doesn’t solve all the problems, it takes a ton of headaches away. Now, every time I start a new project, it is as if I am already 30% done: I know where my data will be stored and what API I will use to access it, how I will deal with artifacts, and how I will observe and share my experiments. I am not tied to any particular environment; wherever I go, my data and my models will follow.

Perhaps this will make your life a bit easier as well.
