
By: Venkatesh Yadav
TensorFlow on AWS GPU instance
In this tutorial, we show how to set up TensorFlow on an AWS GPU instance and run the H2O TensorFlow deep learning demo.
Pre-requisites:
To get started, request an AWS EC2 instance with GPU support. We used a single g2.2xlarge instance running Ubuntu 14.04. To set up TensorFlow with GPU support, the following software should be installed:
- Java 1.8
- Python pip
- Unzip utility
- CUDA Toolkit (>= v7.0)
- cuDNN (v4.0)
- Bazel (>= v0.2)
- TensorFlow (v0.9)
To run the H2O TensorFlow deep learning demo, the following software should be installed:
- IPython notebook
- Scala
- Spark
- Sparkling water
Software Installation:
Java:
#To install Java, run the commands below. Type 'Y' at the installation prompt:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
#Update JAVA_HOME in ~/.bashrc
#Add JAVA_HOME to PATH:
export PATH=$PATH:$JAVA_HOME/bin
# Execute following command to update current session:
source ~/.bashrc
#Verify version and path:
java -version
echo $JAVA_HOME
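The JAVA_HOME update above can be sketched as the following lines appended to ~/.bashrc; the path is the default used by the oracle-java8-installer package and may differ on other systems:

```shell
# Append to ~/.bashrc. /usr/lib/jvm/java-8-oracle is the default
# install location of the oracle-java8-installer package.
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH="$PATH:$JAVA_HOME/bin"
# Confirm the variable is set in the current session:
echo "JAVA_HOME=$JAVA_HOME"
```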
Python:
#The Ubuntu AMI on an AWS EC2 instance has Python installed by default. Verify that Python 2.7 is available:
python -V
#Install pip
sudo apt-get install python-pip
#Install IPython notebook
sudo pip install "ipython[notebook]"
#To run the H2O example notebooks, install these additional packages:
sudo pip install requests
sudo pip install tabulate
Unzip utility:
#Execute the following command to install unzip:
sudo apt-get install unzip
Scala:
#Follow the steps below. Type 'Y' at the installation prompt:
sudo apt-get install scala
#Update SCALA_HOME in ~/.bashrc and execute the following command to update the current session:
source ~/.bashrc
#Verify version and path:
scala -version
echo $SCALA_HOME
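The SCALA_HOME update can be sketched similarly; /usr/share/java is where the Ubuntu scala package keeps its jars, so adjust the path if you installed Scala elsewhere:

```shell
# Append to ~/.bashrc. /usr/share/java holds the jars installed by
# the Ubuntu scala package; other installs may use a different path.
export SCALA_HOME=/usr/share/java
echo "SCALA_HOME=$SCALA_HOME"
```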
Spark:
#Java and Scala should be installed before installing Spark.
#Get the Spark 1.6.1 binary pre-built for Hadoop 2.6:
wget http://apache.cs.utah.edu/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
#Extract the file:
tar xvzf spark-1.6.1-bin-hadoop2.6.tgz
#Update SPARK_HOME in ~/.bashrc and execute the following command to update the current session:
source ~/.bashrc
#Add SPARK_HOME to PATH:
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
#Verify the variable:
echo $SPARK_HOME
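The SPARK_HOME update can be sketched as follows, assuming the tarball was extracted in the ubuntu user's home directory:

```shell
# Append to ~/.bashrc. The path assumes the tarball was extracted
# under /home/ubuntu.
export SPARK_HOME=/home/ubuntu/spark-1.6.1-bin-hadoop2.6
export PATH="$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin"
echo "SPARK_HOME=$SPARK_HOME"
```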
Sparkling Water:
#The latest Spark pre-built for Hadoop should be installed, with SPARK_HOME pointing to the installation:
export SPARK_HOME="/path/to/spark/installation"
#To launch a local Spark cluster with 3 worker nodes, each with 2 cores and 1024 MB of memory, export the MASTER variable:
export MASTER="local-cluster[3,2,1024]"
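The local-cluster master string packs three numbers: worker count, cores per worker, and memory per worker in MB. A quick sketch:

```shell
# local-cluster[N,C,M]: N workers with C cores and M MB of memory each.
# 3 workers x 2 cores x 1024 MB:
export MASTER="local-cluster[3,2,1024]"
echo "$MASTER"
```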
#Download and run Sparkling Water
wget http://h2o-release.s3.amazonaws.com/sparkling-water/rel-1.6/5/sparkling-water-1.6.5.zip
unzip sparkling-water-1.6.5.zip
cd sparkling-water-1.6.5
bin/sparkling-shell --conf "spark.executor.memory=1g"
CUDA Toolkit:
#In order to build or run TensorFlow with GPU support, both NVIDIA's CUDA Toolkit (>= 7.0) and cuDNN (>= v2) need to be installed.
#To install CUDA toolkit, run:
wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1410/x86_64/cuda-repo-ubuntu1410_7.0-28_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1410_7.0-28_amd64.deb
sudo apt-get update
sudo apt-get install cuda
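Once the packages are installed, a quick check confirms the compiler is reachable; this sketch assumes the default install prefix /usr/local/cuda and is harmless to run on a machine without CUDA:

```shell
# Print the CUDA compiler version if nvcc is on PATH, otherwise a hint.
if command -v nvcc >/dev/null 2>&1; then
  CUDA_STATUS=$(nvcc --version)
else
  CUDA_STATUS="nvcc not found; add /usr/local/cuda/bin to PATH"
fi
echo "$CUDA_STATUS"
```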
cuDNN:
#To install cuDNN, download cudnn-7.0-linux-x64-v4.0-prod.tgz from NVIDIA (free registration and a short questionnaire are required).
#Then transfer it to your EC2 instance's home directory.
tar -zxf cudnn-7.0-linux-x64-v4.0-prod.tgz &&
rm cudnn-7.0-linux-x64-v4.0-prod.tgz
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
#Reboot the system
sudo reboot
#Update environment variables as shown below:
export CUDA_HOME=/usr/local/cuda
export CUDA_ROOT=/usr/local/cuda
export PATH=$PATH:$CUDA_ROOT/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_ROOT/lib64
Bazel:
#To install Bazel (>= v0.2), run:
sudo apt-get install pkg-config zip g++ zlib1g-dev
wget https://github.com/bazelbuild/bazel/releases/download/0.3.0/bazel-0.3.0-installer-linux-x86_64.sh
chmod +x bazel-0.3.0-installer-linux-x86_64.sh
./bazel-0.3.0-installer-linux-x86_64.sh --user
TensorFlow:
#Download and install TensorFlow:
wget https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.9.0rc0-cp27-none-linux_x86_64.whl
sudo pip install --upgrade tensorflow-0.9.0rc0-cp27-none-linux_x86_64.whl
#To build TF from source, first clone the TensorFlow repository and check out the release branch, then configure with GPU support enabled:
git clone -b r0.9 https://github.com/tensorflow/tensorflow
cd tensorflow
./configure
#To build TensorFlow, run:
bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
#Install the wheel produced in /tmp/tensorflow_pkg (the exact filename depends on the version you checked out):
sudo pip install --upgrade /tmp/tensorflow_pkg/tensorflow-0.8.0-py2-none-any.whl
Run the H2O TensorFlow deep learning demo:
#Since we want to open the IPython notebook remotely, we use the IP and port options. To start the notebook:
cd sparkling-water-1.6.5/
IPYTHON_OPTS="notebook --no-browser --ip='*' --port=54321" bin/pysparkling
#Note that the port specified in the above command must be open in the instance's security group.
Open http://PublicIP:54321 in a browser to reach the IPython notebook console.
Click on TensorFlowDeepLearning.ipynb
Refer to the demo video for details.
#Sample .bashrc contents:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export SCALA_HOME=/usr/share/java
export SPARK_HOME=/home/ubuntu/spark-1.6.1-bin-hadoop2.6
export MASTER="local-cluster[3,2,1024]"
export PATH=$PATH:$JAVA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin
export CUDA_HOME=/usr/local/cuda
export CUDA_ROOT=/usr/local/cuda
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/jvm/java-8-oracle/bin:/home/ubuntu/spark-1.6.1-bin-hadoop2.6/bin:/home/ubuntu/spark-1.6.1-bin-hadoop2.6/sbin:/usr/local/cuda/bin:/home/ubuntu/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
Troubleshooting:
1) ERROR: Getting java.net.UnknownHostException while starting spark-shell
Solution:
Make sure /etc/hosts has an entry for the hostname.
E.g.: 127.0.0.1 hostname
2) ERROR: Getting "Could not find .egg-info directory in install record" error during IPython installation
Solution:
sudo pip install --upgrade setuptools pip
3) ERROR: Can’t find swig while configuring TF
Solution:
sudo apt-get install swig
4) ERROR: “Ignoring gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5”
Solution:
Specify 3.0 when ./configure prompts for the list of Cuda compute capabilities.
Please note that each additional compute capability significantly increases your build time and binary size.
5) ERROR: Could not insert ‘nvidia_352’: Unknown symbol in module, or unknown parameter (see dmesg)
Solution:
sudo apt-get install linux-image-extra-virtual
6) ERROR: Cannot find ./util/python/python_include
Solution:
sudo apt-get install python-dev
7) Find the public IP address of the instance
Solution:
curl http://169.254.169.254/latest/meta-data/public-ipv4
Demo Videos