python linux tensorflow conda miniconda

What is the correct way of setting up Tensorflow on Linux, after all?


I'm struggling with conflicting information about Tensorflow. There is lots of info in lots of places, and none of it is complete enough.

I have my system set up with CUDA 8.0 and cuDNN, and I have Keras + Theano working fine with Python 2.7. I'm trying to move to Tensorflow.

As I had compatibility problems with numpy and other packages when I tried to install it in the same environment, I installed miniconda2, created a virtual environment for it with the command conda create -n tensorflow pip, and activated it, as instructed here: https://www.tensorflow.org/install/install_linux#InstallingAnaconda

The environment seems operational.

Afterwards, I installed Tensorflow from https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.2.1-cp27-none-linux_x86_64.whl and also Keras, only to notice that some modules were duplicated in conda list, some marked with a version string, others marked with <pip> only. In particular, I got both tensorflow-gpu 1.2.1 and tensorflow 1.1.0. The older version just comes along with Keras.
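One way to spot such duplicates without reading the listing by eye is to parse the text that conda list prints. This is only a minimal sketch: the function name find_duplicates and the sample listing are hypothetical (only the two tensorflow versions come from the description above), and it assumes the older conda layout in which pip-installed packages show <pip> in the third column.

```python
from collections import defaultdict

# Hypothetical listing for illustration; only the tensorflow versions
# are taken from the question, the other rows are made up.
SAMPLE = """
# packages in environment at /path/to/env:
#
numpy                     1.12.1                   py27_0
numpy                     1.13.1                    <pip>
tensorflow                1.1.0                     <pip>
tensorflow-gpu            1.2.1                     <pip>
keras                     2.0.5                     <pip>
"""


def find_duplicates(conda_list_text):
    """Group packages by name and return the groups with more than one entry.

    The "-gpu" suffix is dropped when grouping so that tensorflow and
    tensorflow-gpu are reported together.
    """
    seen = defaultdict(list)
    for line in conda_list_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comments
        parts = line.split()
        if len(parts) < 2:
            continue
        name, version = parts[0].lower(), parts[1]
        source = "pip" if "<pip>" in parts[2:] else "conda"
        family = name[:-4] if name.endswith("-gpu") else name
        seen[family].append((name, version, source))
    return {family: rows for family, rows in seen.items() if len(rows) > 1}


if __name__ == "__main__":
    for family, rows in sorted(find_duplicates(SAMPLE).items()):
        print("%s: %s" % (family, rows))
```

To use it on a real environment, the output of conda list can be saved to a file and its contents passed to find_duplicates.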

Also, there's a myriad of warnings about Tensorflow not being compiled to use certain CPU instruction sets. There's this answer, How to compile Tensorflow with SSE4.2 and AVX instructions?, about compiling it using bazel, but I can't find any information about where to put the source code or which files to move where after running that bazel command line.

To make matters worse, whenever I run a simple 20x20 matrix multiplication with "/gpu:0" as the device, the code lists those horrendous warnings and correctly detects the presence of a GTX 1070, but never really confirms that it was used to do the calculations. And it runs faster on "/cpu:0". How I miss Theano...

Could someone point me to where I can find:

  1. which version of Tensorflow to download that is current (not necessarily the latest)?
  2. concise steps to get it done, and how to test whether those steps went right?

I'm using Linux Mint 18.


Solution

  • I have used conda and installed Tensorflow=1.1.0, but it never seemed to work correctly within Python. I also read in GitHub issues that Anaconda is currently working on the Tensorflow GPU version, so no matter what I tried in Anaconda, it never used my NVIDIA Tesla P100-SXM2-16GB card and used only the CPU.

    I suggest you use the normal (non-conda) environment until they get tensorflow-gpu working properly in Anaconda.

    To check that tensorflow-gpu works, I used the Inception v3 model with TF 0.12 / TF 1.0.

    This is the process I go through to install Tensorflow 1.0:

    Step 0.

    sudo -i
    apt-get update
    apt-get install aptitude
    aptitude install software-properties-common
    apt-get install libcupti-dev python-pip
    apt-get upgrade libc6
    

    Step 1. Install the NVIDIA components. I think you already have these installed.

    Download the NVIDIA cuDNN 5.1 for CUDA 8.0 from https://developer.nvidia.com/rdp/cudnn-download (Registration in NVIDIA's Accelerated Computing Developer Program is required)

    cuDNN 5.1 works well with most of the architectures and operating systems out there.
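Before installing Tensorflow it can help to confirm that the CUDA runtime and cuDNN libraries are visible to the dynamic linker. The following is a minimal sketch using only the standard library. It assumes the short names cudart and cudnn correspond to libcudart.so and libcudnn.so, and the helper name find_nvidia_libs is made up for illustration.

```python
import ctypes.util
import os


def find_nvidia_libs(names=("cudart", "cudnn")):
    """Map each short library name to what the linker resolves, or None."""
    return {name: ctypes.util.find_library(name) for name in names}


if __name__ == "__main__":
    for name, found in sorted(find_nvidia_libs().items()):
        print("lib%s: %s" % (name, found or "not found by the dynamic linker"))
    # Tensorflow loads these libraries at import time, so the search path matters.
    print("LD_LIBRARY_PATH = %s" % os.environ.get("LD_LIBRARY_PATH", "(unset)"))
```

A result of None is not proof that a library is missing: depending on the Python version, libraries that are reachable only through LD_LIBRARY_PATH may not be reported, so treat this as a first hint rather than a verdict.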

    Step 2. Install bazel and tensorflow

    apt-get install bazel
    

    You can go to https://pypi.python.org/pypi/tensorflow-gpu/1.1.0rc0, pick the wheel that matches your Python version, and run

    pip install <python-wheel-version>
    

    If you have both Python 2.7 and Python 3.* installed, use pip2 to install for Python 2.7.
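Picking the right wheel comes down to matching the Python tag in the wheel's file name (for example cp27) with the interpreter that pip belongs to. A minimal sketch of that check follows. The helper names cpython_tag and wheel_matches are hypothetical, the file name in the example is the one quoted in the question, and the sketch assumes a CPython interpreter and the standard wheel naming scheme (name-version-pythontag-abitag-platform.whl).

```python
import sys


def cpython_tag(version_info=sys.version_info):
    """Return the CPython wheel tag for an interpreter, e.g. (2, 7) -> 'cp27'."""
    return "cp%d%d" % (version_info[0], version_info[1])


def wheel_matches(wheel_filename, tag=None):
    """Check whether the Python tag in a wheel file name equals the given tag.

    When no tag is given, the tag of the running interpreter is used.
    """
    if not wheel_filename.endswith(".whl"):
        return False
    tag = tag or cpython_tag()
    parts = wheel_filename[:-len(".whl")].split("-")
    # The last three fields are python tag, ABI tag and platform tag.
    return len(parts) >= 5 and parts[-3] == tag


if __name__ == "__main__":
    wheel = "tensorflow_gpu-1.2.1-cp27-none-linux_x86_64.whl"
    print("interpreter: %s (%s)" % (sys.executable, cpython_tag()))
    print("wheel fits this interpreter: %s" % wheel_matches(wheel))
```

Running the script with the same interpreter that pip2 (or pip3) installs into shows whether a given wheel is meant for it.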

    Step 3. Install openjdk

    apt-get install openjdk-8-jdk
    

    Step 4. git clone the Inception model code

    git clone https://github.com/tensorflow/models.git
    cd models
    git checkout master
    cd inception
    

    This is where bazel comes into the picture. See Bazel's Getting Started docs for a more detailed explanation of what a target is. So, if you run

    ls -lstr
    

    you might see five bazel-related symbolic links

    bazel-bin  bazel-genfiles  bazel-inception  bazel-out  bazel-testlogs 
    

    These are the target directories into which you build your specific model.

    Assuming you're in the models/inception directory

    bazel build inception/imagenet_train
    

    This build is what creates (or refreshes) the bazel-* symbolic links in the workspace.
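To see where the build output actually went, the symbolic links can be resolved from Python. This is a small sketch with a hypothetical helper name, bazel_symlinks. It only lists entries in a directory whose names start with bazel- and that are symbolic links, and makes no assumption about what bazel puts behind them.

```python
import os


def bazel_symlinks(workspace="."):
    """Return {link name: link target} for bazel-* symlinks in a directory."""
    links = {}
    for entry in sorted(os.listdir(workspace)):
        path = os.path.join(workspace, entry)
        if entry.startswith("bazel-") and os.path.islink(path):
            links[entry] = os.readlink(path)
    return links


if __name__ == "__main__":
    found = bazel_symlinks(".")
    if not found:
        print("no bazel-* symlinks here; has bazel build been run in this directory?")
    for name, target in found.items():
        print("%s -> %s" % (name, target))
```

Run it from the models/inception directory after the build; an empty result usually means the build has not run there yet.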

    NOTE: For imagenet_train.py to work, you need to prepare the ImageNet dataset. You can either skip this part or go through the following:

    Step 5. Prepare the ImageNet dataset

    Before you run the training script for the first time, you will need to download and convert the ImageNet data to native TFRecord format. To begin, you will need to sign up for an account with ImageNet to gain access to the data. Look for the sign-up page, create an account and request an access key to download the data.

    After you have USERNAME and PASSWORD, you are ready to run our script. Make sure that your hard disk has at least 500 GB of free space for downloading and storing the data. Here we select DATA_DIR=$HOME/imagenet-data as such a location but feel free to edit accordingly.

    When you run the below script, please enter USERNAME and PASSWORD when prompted. This will occur at the very beginning. Once these values are entered, you will not need to interact with the script again.

    #location of where to place the ImageNet data 
    DATA_DIR=$HOME/imagenet-data
    

    Here $HOME is /root
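Since the download needs roughly 500 GB, it is worth checking the free space on the target filesystem before starting. A minimal sketch, assuming a Unix-like system (os.statvfs is not available on Windows); the helper names free_gb and has_room are made up for illustration, and the 500 GB threshold is the figure quoted above.

```python
import os

REQUIRED_GB = 500  # free space recommended in the text above


def free_gb(path):
    """Return the space available to the current user on path's filesystem, in GiB."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / float(1024 ** 3)


def has_room(path, required_gb=REQUIRED_GB):
    """True if the filesystem holding path has at least required_gb GiB free."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    # DATA_DIR lives under $HOME in the text, so check the home directory.
    home = os.path.expanduser("~")
    print("%.1f GiB free under %s" % (free_gb(home), home))
    print("enough room for ImageNet: %s" % has_room(home))
```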

    # build the preprocessing script.
    bazel build inception/download_and_preprocess_imagenet
    
    # run it
    bazel-bin/inception/download_and_preprocess_imagenet "${DATA_DIR}"
    # Place the tensor records at /root/dataset
    

    Step 6. Source bazel and tensorflow

    This step is very important. It activates the Python packages, and I think you may be getting errors because the Python package for Tensorflow is not activated. If you have skipped Step 5, you might want to go to

    /models/inception/sample
    

    and run the gpu.py script

    python gpu.py
    

    This should verify that your Tensorflow version works with your GPU.
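If that script is not at hand, a similar check can be written in a few lines. The sketch below is independent of gpu.py and is not its contents: it asks Tensorflow which devices it can see through device_lib.list_local_devices from the tensorflow.python.client module. The helper name visible_gpus is hypothetical, and the function returns None rather than failing when Tensorflow cannot be imported.

```python
def visible_gpus():
    """Return the GPU device names Tensorflow reports.

    Returns None if Tensorflow cannot be imported, and an empty list if it
    imports but sees no GPU (for example a CPU-only build).
    """
    try:
        from tensorflow.python.client import device_lib
    except Exception:  # ImportError, or a native library that fails to load
        return None
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == "GPU"]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("Tensorflow could not be imported in this environment")
    elif not gpus:
        print("Tensorflow imported, but no GPU device is visible to it")
    else:
        print("GPU devices: %s" % ", ".join(gpus))
```

An empty list with a working NVIDIA driver often points to the CPU-only tensorflow package shadowing tensorflow-gpu, which matches the duplicate installation described in the question.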

    source /opt/DL/bazel/bin/bazel-activate
    source /opt/DL/tensorflow/bin/tensorflow-activate
    

    You can also check by importing tensorflow in Python, e.g. import tensorflow as tf

    Find a hello-world example on their site; if it gives errors, Tensorflow has not been installed properly.
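A hello-world style check that also answers the question of whether the GPU was really used could look like the sketch below. It assumes the Tensorflow 1.x graph API (tf.Session, tf.ConfigProto, tf.random_normal); the function name matmul_check is hypothetical. With log_device_placement=True the session logs which device each operation was assigned to, and allow_soft_placement=True lets the run fall back to the CPU instead of failing when the requested device is missing.

```python
def matmul_check(device="/gpu:0", size=20):
    """Multiply two random size x size matrices on the requested device.

    Returns the shape of the result, or None when Tensorflow is missing or
    does not expose the 1.x API this sketch relies on.
    """
    try:
        import tensorflow as tf
    except Exception as exc:
        print("Tensorflow import failed: %r" % (exc,))
        return None
    if not hasattr(tf, "Session"):
        print("This sketch targets the 1.x API; found version %s" % tf.__version__)
        return None
    with tf.Graph().as_default():
        with tf.device(device):
            a = tf.random_normal([size, size])
            b = tf.random_normal([size, size])
            product = tf.matmul(a, b)
        # Log the device chosen for every op; fall back to CPU if needed.
        config = tf.ConfigProto(log_device_placement=True,
                                allow_soft_placement=True)
        with tf.Session(config=config) as sess:
            return sess.run(product).shape


if __name__ == "__main__":
    print("result shape: %s" % (matmul_check(),))
```

If the placement log shows the MatMul operation on a GPU device, the card was used. Note that a 20x20 product is far too small to amortize kernel launch and host-to-device copy overhead, so the CPU being faster at that size is not by itself a sign of a broken installation.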

    Step 7. Run the ImageNet training. You can skip this step if you have skipped Step 5.

    bazel-bin/inception/imagenet_train --num_gpus=1 --batch_size=256 --train_dir=/tmp --data_dir=/root/dataset/ --max_steps=100