Creative application of computer vision neural networks

In this tutorial, we are going to learn how to set up and run a pre-trained deep learning model and use it creatively. In particular, we will work with a semantic segmentation network capable of recognizing different elements of an image or video and generating a mask from them.

Get the model

The model used in this project performs semantic segmentation, i.e. it splits an image into different regions associated with semantic categories, such as sky, tree, or person. It has been developed by researchers at the MIT Computer Science & Artificial Intelligence Lab using PyTorch and trained on the ADE20K dataset.

First of all, you need to fetch the source code for the model. The version used here has been modified to run on video content instead of individual images. You can download it from GitHub by running:

While it is possible to train the model yourself, this requires a large amount of computational resources and time. Fortunately, MIT provides the pre-trained weights here. There are multiple variations of the model to choose from, which are listed in the Performance section of the README: the one we are interested in is called UPerNet50. The pre-trained weights, sometimes also called checkpoints, must be placed inside a ckpt directory within the repository:

Note: the tools used above (git and wget) can be installed on OS X or Windows using the package managers Homebrew or Chocolatey respectively. Alternatively, you can simply download the repository and pre-trained weights manually.

Set up the environment

Once we have all the necessary files, it is time to install the software tools needed for the code to run. Python’s built-in package manager is called pip, and it allows us to easily install any Python package. However, to keep these packages isolated from the rest of your system and to properly account for shared dependencies between them, we are going to use conda, which also takes care of installing any missing system libraries and the like. You can download it from here (make sure to select the Python 3.x version) and find installation instructions here.

Once conda is installed, we must create a virtual environment. An environment is like a sandbox where all of the necessary libraries and software can be installed without “polluting” your system. The following two commands will create a virtual environment called ml4mt and enter it:

Finally, we can install the actual software dependencies. In most Python projects, they are listed in the requirements.txt file. In our case, these include:

  • Numpy, Scipy, and Pandas: numerical, scientific, and statistical computing libraries
  • PyTorch: a deep learning library
  • Torchvision: a set of utilities, model architectures, and datasets for computer vision
  • OpenCV: a popular library for computer vision tasks
  • YACS and TQDM: utilities for managing configuration files and progress bars
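To see which of these packages are already available in the active environment, a quick check along the following lines can help. This is only a sketch: the function name check_dependencies is hypothetical, and the import names are assumptions based on how these packages are usually imported (for instance, OpenCV as cv2 and PyTorch as torch).

```python
from importlib.util import find_spec

# Assumed import names for the packages listed above.
# Note that OpenCV is imported as "cv2" and PyTorch as "torch".
REQUIRED_MODULES = {
    "Numpy": "numpy",
    "Scipy": "scipy",
    "Pandas": "pandas",
    "PyTorch": "torch",
    "Torchvision": "torchvision",
    "OpenCV": "cv2",
    "YACS": "yacs",
    "TQDM": "tqdm",
}


def check_dependencies(modules=REQUIRED_MODULES):
    """Map each package name to True if its module can be found, else False."""
    return {name: find_spec(module) is not None for name, module in modules.items()}


if __name__ == "__main__":
    for name, available in check_dependencies().items():
        print(f"{name:12s} {'ok' if available else 'missing'}")
```

Any package reported as missing still needs to be installed in the environment before the notebook or the script can run.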

With conda, installing them is as easy as running:

Note: Depending on your operating system and the availability of CUDA-enabled devices (i.e. if you have an Nvidia GPU), you might want to install a different version of PyTorch. Visit this page for more detailed instructions. Furthermore, if you want to try the model in an interactive way using Jupyter notebooks, you must also run:

Run the model

With all the dependencies in place, we are finally ready to run the model. This can be done in the two ways described below.
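Before running the model, it can be useful to check whether PyTorch can see a CUDA device, since that determines whether the prediction runs on the GPU or on the CPU. The sketch below uses a hypothetical helper name, pick_device, and falls back to the CPU when PyTorch is not installed; it is not taken from the repository.

```python
def pick_device():
    """Return "cuda" if PyTorch is installed and sees a CUDA device, else "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Predictions would run on: {pick_device()}")
```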

Interactive (Jupyter notebook)

The Jupyter notebook is based on the one available in the original repository, except it has been adapted to support video input and masking. It allows you to quickly try out different combinations of parameters, view the results, and perform custom data manipulation. However, it is a bit messier than the command-line script. To use it, you must first run Jupyter:

This will open Jupyter in a new browser tab; alternatively, you can copy-paste the link shown in the terminal into your browser. Either way, use the left sidebar to open the DemoSegmenter.ipynb notebook. You will be presented with snippets of code along with their output. In notebooks, the code is arranged in blocks, also known as cells, which can be run sequentially. The output of each block is displayed below it; the data and images currently shown in the notebook are from the last execution and will be updated once you run it yourself.

The cells in this notebook are meant to be run from top to bottom and are organized in sections describing the different operations such as loading the model, loading the data, running the prediction, and so forth. For a more detailed overview of these operations, feel free to jump to the next section of this tutorial. Otherwise, here are some of the variables you might want to edit before executing the cells:

  • file_path: the path of your input video file.
  • step: the number of frames skipped. If 1, the semantic segmentation will run on each frame (might take a long time!).
  • res: if different from None, the input video will be rescaled to this resolution.
  • mask_siz: the resolution of the calculated mask, currently defined as a ratio of the input resolution. Lower resolution means faster prediction.
  • dest_path, dest_orig_path: paths where the binary mask and masked video frames will be stored, respectively.
  • classes: list of IDs of semantic labels to isolate in the mask. A full list of available elements is printed at the beginning of the notebook.
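To get a feel for how step and the mask ratio affect the workload, the following sketch computes which frames would be processed and how large the mask would be. Both function names are hypothetical, and the code assumes that step acts as a stride (one frame kept every step frames) and that the mask resolution is the input resolution multiplied by the ratio; the notebook may implement these details differently.

```python
def select_frame_indices(n_frames, step=1):
    """Indices of the frames to process when keeping one frame every `step`."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return list(range(0, n_frames, step))


def mask_resolution(width, height, ratio):
    """Mask size when it is defined as a ratio of the input resolution."""
    return max(1, round(width * ratio)), max(1, round(height * ratio))


if __name__ == "__main__":
    # A 10-frame clip processed with step=3.
    print(select_frame_indices(10, step=3))
    # A Full HD input with a mask at a quarter of its resolution.
    print(mask_resolution(1920, 1080, 0.25))
```

With a larger step and a smaller ratio, fewer and smaller images are fed to the model, which is what makes the prediction faster.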

Programmatic (command-line script)

A script is provided, offering a command-line interface to the same functionality as the notebook. This might be useful when processing a large amount of data, or when running the code from a non-graphical environment (e.g. a remote workstation or server) that supports CUDA acceleration. A help text documenting the script and its parameters can be accessed by running:

This will show the following content, along with the list of available semantic labels:

An example execution may look like this:

This will load the video in path/to/video.mp4, process each frame, 8 frames at a time, downscaled by a factor of 2.5, mask out humans (label 12) wherever the confidence of the model exceeds a threshold of 50%, and store the resulting binary masks and masked video. When not specified, the destination directories will appear in the output, along with some additional information on the status of the execution.
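Processing frames "8 at a time" means splitting the sequence of frames into consecutive batches, where the last batch may be smaller than the others. A minimal sketch of that idea, using a hypothetical make_batches helper rather than the script's actual code, could look like this:

```python
def make_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    frame_indices = list(range(20))  # stand-in for 20 video frames
    for batch in make_batches(frame_indices, batch_size=8):
        print(len(batch), batch)
```

Larger batches generally make better use of a GPU but need more memory, so the batch size is usually tuned to the available hardware.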

How it works

Both the notebook and the script perform approximately the same operations. Indeed, these tasks are quite common in any deep learning or computer vision pipeline, and are summarized below:

  • Import relevant packages and utilities: here we declare functions for extracting each frame from the video, showing the semantic segmentation results, and generating a mask based on a probability threshold.
  • Load labels and colors: this is only useful to show you the list of available labels and to show each detected class with a different shade.
  • Load model: import the model checkpoints that have been downloaded and stored accordingly.
  • Load and pre-process data: here we load our input video, split it into frames, and normalize their pixel values to approximate a standard Gaussian distribution. Furthermore, the script divides the input frames into smaller batches that can be processed sequentially.
  • Run the prediction: one or more batches of frames are fed into the model. For each frame, the model returns a 3-dimensional matrix of prediction probabilities for each of the labels, for each of the pixels of the rescaled input image. Thus, our output score “frame” will be of size: [width / scale, height / scale, n_labels].
  • Extract the binary mask and apply it to the original frames: given a 3D matrix of scores, isolate the specified labels, and assign white where their combined probability exceeds the threshold and black otherwise. Subsequently, use the binary mask, appropriately scaled back to the same resolution as the input, to blend the original frame with a black “placeholder” image.
  • Store output: finally, store the binary masks and masked video frames in different subfolders.
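The normalization, thresholding, and blending steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the code from the repository: the function names are hypothetical, the scores are laid out as [rows, columns, n_labels], the mean and standard deviation are passed in by the caller, and the mask is assumed to have already been rescaled to the resolution of the frame.

```python
import numpy as np


def normalize(frame, mean, std):
    """Scale an 8-bit frame to [0, 1], then standardize each channel."""
    scaled = frame.astype(np.float32) / 255.0
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    return (scaled - mean) / std


def binary_mask(scores, classes, threshold=0.5):
    """White (255) where the summed probability of `classes` exceeds `threshold`."""
    combined = scores[..., list(classes)].sum(axis=-1)
    return np.where(combined > threshold, 255, 0).astype(np.uint8)


def apply_mask(frame, mask):
    """Keep the frame where the mask is white and use black everywhere else."""
    keep = (mask > 0)[..., np.newaxis]  # add a channel axis for broadcasting
    return np.where(keep, frame, np.zeros_like(frame))


# Toy example: a 2x2 "frame" with per-pixel probabilities for 3 labels.
scores = np.array(
    [[[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]],
     [[0.2, 0.3, 0.5], [0.9, 0.05, 0.05]]],
    dtype=np.float32,
)
mask = binary_mask(scores, classes=[1, 2], threshold=0.5)
frame = np.full((2, 2, 3), 200, dtype=np.uint8)
masked = apply_mask(frame, mask)
```

In the toy example, labels 1 and 2 together exceed the 50% threshold only in the left column, so only those pixels of the frame are kept.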

For an in-depth explanation of how the underlying deep learning model works, refer to this paper; for a more intuitive explanation of semantic segmentation using convolutional neural networks, refer to this article.

How to use the output

The outputs from the script can be used in a variety of ways. For example, they can be fed into a video inpainting deep learning model such as this one, to remove specific elements from the source video. Alternatively, they can be loaded into VJ software such as TouchDesigner or Resolume, composited with other visual content, and mapped onto physical surfaces.

You can convert either sequence of frames back into a video with the following command:

Note: FFmpeg should already be available within the conda environment since it is a dependency of OpenCV. Otherwise, it can be installed using conda, Homebrew, or Chocolatey, depending on your operating system.

Lastly, if you wish to merge the audio content from the original video into the one exported using the command above, you can run:

In this case, it is important to export the frames using the exact framerate of the original video. You can gather that information with:
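Media tools such as ffprobe usually report the frame rate as a fraction (for example 30000/1001 for NTSC video) rather than as a decimal number. The helper below is a small sketch, with a hypothetical function name, of how such a value could be converted without losing precision before passing it back to the export step.

```python
from fractions import Fraction


def parse_frame_rate(text):
    """Convert a frame rate such as "30000/1001" or "25" into an exact Fraction."""
    return Fraction(text.strip())


if __name__ == "__main__":
    rate = parse_frame_rate("30000/1001")
    print(rate, float(rate))
```

Keeping the value as a fraction avoids rounding it to something like 29.97, which could slowly drift out of sync with the original audio on long videos.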
