Using TensorFlow¶
About TensorFlow¶
TensorFlow is an end-to-end open source platform for machine learning (ML). It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications. Official TensorFlow documentation can be found here: TensorFlow Guide.
To use TensorFlow, you may either (a) load the module files for the TensorFlow versions that are installed on the cluster, or (b) install the TensorFlow version of your choice into your local Python library collection.
Using the TensorFlow Modules¶
On the Great Lakes cluster, there are three TensorFlow modules currently available for use. The versions are represented by tensorflow/1.?.? and tensorflow/2.?.?, where ?.? indicates the versioning of the given release. To determine the exact versions of the TensorFlow modules available, use:
$ module spider tensorflow
A specific version can then be loaded with the module load command. As an example, to load the TensorFlow module for version 2.5.0, you would enter the following command:
$ module load tensorflow/2.5.0
You can confirm what has been loaded with the module list command. For this particular version of TensorFlow, the list command would show all of the dependent modules which have been loaded in addition to the TensorFlow module:
1) python3.8-anaconda/2021.05 2) cuda/11.2.1 3) cudnn/11.2-v8.1.0 4) tensorflow/2.5.0
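Once the module is loaded, a quick way to confirm that the module-provided TensorFlow is importable is a short Python check. This is a minimal, illustrative sketch; the file name is arbitrary, and the version and device list will depend on which module you loaded and on whether you are on a GPU node.
# check_tf_module.py -- quick sanity check after loading the tensorflow module
import tensorflow as tf

print("TensorFlow version:", tf.__version__)                    # e.g. 2.5.0
print("GPU devices:", tf.config.list_physical_devices("GPU"))   # empty list on a non-GPU node
# (for a 1.x module, tf.test.is_gpu_available() provides a similar GPU check)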
If you need a Python package that is not provided by the modules, you can install it into your personal library using pip with the --user tag:
$ pip install --user <package_name>
You only need to install Python packages once for each cluster on which you wish to use the library and, separately, for each version of Python that you use.
Installing TensorFlow¶
As an alternative to the TensorFlow modules, you may wish to install a specific version of TensorFlow into your personal Python library collection. As explained above, you will need to install Python packages once for each cluster on which you wish to use the library and, separately, for each version of Python that you use.
Version 2¶
The most recent version of Anaconda that is compatible with TensorFlow 2, at the time of this writing, is that which provides Python version 3.8. To install TensorFlow 2, you must first load the python3.8-anaconda module as follows
$ module load python3.8-anaconda
With the python3.8-anaconda module loaded, you can install Python packages into your personal library using the pip command with the --user tag, which will, by default, place packages in $HOME/.local/lib/python3.8/site-packages. When a different version of Python is used, the path reflects the given version number in place of 3.8. The library will then be available to you for this and future sessions.
To install the TensorFlow 2 package, the pip install command is
$ pip install --user "tensorflow > 2"
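To confirm that the user-installed copy is the one Python picks up, you can check the reported version and install location. This is a minimal check under the assumptions above; the file name is arbitrary, and the expected path is the default --user location described earlier.
# verify_install.py -- confirm the pip --user installation is the one in use
import tensorflow as tf

print(tf.__version__)   # should report a 2.x version
print(tf.__file__)      # should point under $HOME/.local/lib/python3.8/site-packages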
Version 1¶
The most recent version of Anaconda that is compatible with TensorFlow 1 is that which provides Python version 3.7. To install TensorFlow 1, you must first load the python3.7-anaconda module as follows
$ module load python3.7-anaconda
With the python3.7-anaconda module loaded, you will then be able
to install Python packages into your personal library using the
pip command with the --user
tag as described above.
To install the most recent TensorFlow 1 package (version 1.15.5 at the time of this writing), a separate package must be installed for GPU use. For CPU and GPU capability respectively, the pip install commands are:
$ pip install --user "tensorflow < 2"
$ pip install --user "tensorflow-gpu < 2"
Beginning with TensorFlow 2, installation of a separate package for use with a GPU device is no longer necessary. TensorFlow installation for any version >=2 can be completed in one step, as described above.
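As a rough illustration of that difference (a sketch, not one of the cluster's example scripts), a version-agnostic GPU check could look like the following; with TensorFlow 1 the GPU build comes from the separate tensorflow-gpu package, while TensorFlow 2 reports GPUs directly.
# gpu_check.py -- report whether TensorFlow can see a GPU (TensorFlow 1.x or 2.x)
import tensorflow as tf

if tf.__version__.startswith("1."):
    # requires the separate tensorflow-gpu package for GPU support
    print("GPU available:", tf.test.is_gpu_available())
else:
    # GPU support is built into the single tensorflow package
    print("GPU devices:", tf.config.list_physical_devices("GPU"))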
TensorFlow Test¶
To ensure that your TensorFlow package is working properly, run the short test script tf-2.py, located in the examples directory, from a GPU node. If testing TensorFlow 1, modify the test to use the tf-1.py script, also found in the examples directory, instead. The following modules must be loaded to use TensorFlow with a GPU device: Anaconda3, CUDA, and cuDNN.
- Anaconda provides a Python environment with over 200 packages pre-installed
- CUDA is a parallel computing platform and programming model for computing on GPUs
- cuDNN is a GPU-accelerated library of primitives for deep neural networks
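The contents of tf-2.py are not reproduced here; a minimal test in the same spirit (an illustrative sketch only, not the actual example script) would log device placement and print the result of a small tensor addition, similar to the sample output shown further below.
# tf2-sketch.py -- illustrative stand-in for the tf-2.py example (not the actual script)
import tensorflow as tf

tf.debugging.set_log_device_placement(True)    # log which device each op is placed on

a = tf.constant([[1, 2, 3], [1, 2, 3]])
b = tf.constant([[3, 4, 5], [3, 4, 5]])
print(tf.add(a, b).numpy())                    # prints [[4 6 8] [4 6 8]]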
The below Slurm script will initiate a job on a GPU node and run the test script.
#!/bin/bash
#SBATCH --job-name=tf_test
#SBATCH --account=<your-account>
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --time=15:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=5gb
#SBATCH --mail-type=FAIL
# Load these modules if you are testing your own tensorflow installation
module load python3.8-anaconda/2021.05
module load cuda/11.2.1 cudnn/11.2-v8.1.0
# Comment out the above two module load commands and uncomment the below
# module load command to run the test with the tensorflow module instead
# module load tensorflow/2.5.0
module list
# Run the test
python3 /sw/examples/tensorflow/tf-2.py
Copy and paste the text above into a new Slurm batch script file such as tf-test.sbat, put your Slurm account name in place of <your-account>, and submit the script with sbatch:
$ sbatch tf-test.sbat
The last few lines of output produced from running the Slurm script on a GPU node, excluding possible warning messages, should include content similar to the following:
$ tail slurm-<jobID>.out | grep -v deprecated
2019-10-24 11:04:55.073023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326]
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with
15022 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id:
0000:d8:00.0, compute capability: 7.0)
[[4 6 8]
[4 6 8]]
Specifically, the output should identify a GPU device and show the calculation result. Standard output will print to a file with the default naming convention of slurm-<jobID>.out, or on the command line for an interactive bash job. If the example runs without errors, everything is good!
If you are using TensorFlow without a GPU, the output of the example test will not include a line with the GPU specs. Instead, the last couple of lines will be as follows:
2019-10-24 17:18:08.036518: I tensorflow/compiler/xla/service/service.cc:176]
StreamExecutor device (0): Host, Default Version
[[4 6 8]
[4 6 8]]