I started trying Theano today and wanted to use the GPU (NVIDIA GeForce GT 750M, 2048 MB) on my Mac. Here are brief instructions on how to use the GPU on a Mac, largely following the official guide at http://deeplearning.net/software/theano/install.html#mac-os.
Install Theano:
$ pip install Theano
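To confirm the install worked, a quick sanity check is to import the package and print its version (the exact version string will depend on what pip installed):

$ python -c 'import theano; print(theano.__version__)'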
Download and install CUDA: https://developer.nvidia.com/cuda-downloads
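Once the installer finishes, you can verify the toolkit with nvcc. I use the full path here since PATH is not set up yet; this assumes the default CUDA 7.5 install location:

$ /Developer/NVIDIA/CUDA-7.5/bin/nvcc --version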
Put the following lines into your ~/.bash_profile:
# Theano and CUDA
PATH="/Developer/NVIDIA/CUDA-7.5/bin/:$PATH"
export LD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-7.5/lib/
export CUDA_ROOT=/Developer/NVIDIA/CUDA-7.5/
export THEANO_FLAGS='mode=FAST_RUN,device=gpu,floatX=float32'
Note that the PATH line is necessary. Otherwise you may see the following message:
ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.
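After editing the profile, reload it in the current shell and confirm that nvcc can now be found (the path below follows from the PATH line above):

$ source ~/.bash_profile
$ which nvcc
/Developer/NVIDIA/CUDA-7.5/bin/nvcc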
Configure Theano:
$ cat .theanorc
[gcc]
cxxflags = -L/usr/local/lib -L/Developer/NVIDIA/CUDA-7.5/lib/
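If you would rather keep everything in one place instead of exporting THEANO_FLAGS, the same flags can also go into .theanorc under a [global] section; a minimal sketch:

[global]
mode = FAST_RUN
device = gpu
floatX = float32

[gcc]
cxxflags = -L/usr/local/lib -L/Developer/NVIDIA/CUDA-7.5/lib/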
Test whether the GPU is used:
$ cat check.py
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
# a CPU graph contains Elemwise ops; on the GPU they are replaced by GpuElemwise
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 time python check.py
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 1.743682 seconds
Result is [ 1.23178029 1.61879337 1.52278066 ..., 2.20771813 2.29967761 1.62323284]
Used the cpu
2.47 real 2.19 user 0.27 sys

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 time python check.py
Using gpu device 0: GeForce GT 750M
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 1.186971 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761 1.62323296]
Used the gpu
2.09 real 1.59 user 0.41 sys
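For a quicker check than timing a script, you can also ask Theano directly which device it picked up; assuming the THEANO_FLAGS export above is active, this should print gpu (after the usual startup line):

$ python -c 'import theano; print(theano.config.device)'
Using gpu device 0: GeForce GT 750M
gpu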
A more realistic example:
$ cat lr.py
import numpy
import theano
import theano.tensor as T

rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
     rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000

# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))            # Probability of having a one
prediction = p_1 > 0.5                             # The prediction that is done: 0 or 1
xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)  # Cross-entropy
cost = xent.mean() + 0.01 * (w ** 2).sum()         # The cost to optimize
gw, gb = T.grad(cost, [w, b])

# Compile expressions to functions
train = theano.function(
    inputs=[x, y],
    outputs=[prediction, xent],
    updates=[(w, w - 0.01 * gw), (b, b - 0.01 * gb)],
    name="train")
predict = theano.function(inputs=[x], outputs=prediction, name="predict")

if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm']
        for x in train.maker.fgraph.toposort()]):
    print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv']
          for x in train.maker.fgraph.toposort()]):
    print('Used the gpu')
else:
    print('ERROR, not able to tell if theano used the cpu or the gpu')
    print(train.maker.fgraph.toposort())

for i in range(training_steps):
    pred, err = train(D[0], D[1])

print("target values for D")
print(D[1])
print("prediction on D")
print(predict(D[0]))

$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 time python lr.py
Used the cpu
target values for D
[ 1. 1. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 1. 1. 1. 0. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 1. 1. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1.]
prediction on D
[1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 0 0 1 1 0 1 1 1 1 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 1 1 0 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 1 1 1 1 0 1 0 0 0 0 1 0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 0 0 0 1 0 1 1 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1 1 1 0 0 1 1 1 1 0 0 0 1 1 0 0 1 0 0 0 0 1 1 1 0 1 1 1 0 1 0 0 0 0 0 1 1 1 1 1 1 0 0 1 1 1 0 1 0 1 0 1 1 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 1 0 0 1 1 0 0 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 1 1 1 0 1 1 1 1 1 0 0 1 0 0 0 0 1 1 0 0 0 0 0 1 1 1 0 1 1 1 0 0 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 1]
8.92 real 8.24 user 1.14 sys

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 time python lr.py
Using gpu device 0: GeForce GT 750M
Used the gpu
target values for D
[ 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 0. 1. 1. 1. 0. 1. 1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 0. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 0. 1. 0.]
prediction on D
[1 0 0 0 0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 1 1 0 1 1 1 0 0 1 1 1 1 1 1 0 0 1 0 0 1 1 0 0 1 1 0 1 0 1 1 0 1 1 1 0 1 1 0 0 0 1 1 1 1 1 0 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 0 0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1 0 1 1 1 0 0 0 1 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 1 1 1 0 0 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 0 1 0 1 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 1 0 1 1 1 0 1 1 1 0 0 0 0 1 0 1 0 0 1 0 0 1 1 0 1 0 1 1 1 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 0 1 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0 1 1 1 0 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 0 1 1 0 1 1 1 0 0 1 1 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 0]
19.78 real 17.61 user 1.24 sys
So it seems this GPU does not outperform the CPU. Well, the GT 750M may not be the best GPU you can get… Someone else here has had a similar experience.
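If you want to see where the time actually goes, Theano has a built-in profiler that can be switched on from the same flags (profile=True); it prints per-op timings when the script exits, which should show whether the run is dominated by GPU kernels or by host/GPU transfers:

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32,profile=True python lr.py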
Hi Daoyuan,
That is a nice tutorial.
What is your computer setup?
2017 Macs do not use NVIDIA cards, and CUDA won't install on the card they have chosen (Radeon).
Without CUDA, Theano will not be able to use the GPU. However, it can still run on the CPU alone, although at a much reduced speed.
There is currently no easy solution for Mac users, and that is why we are switching to Linux environments on PC desktops.
Gilles
Hi Gilles,
My Mac is a late-2014 model with an NVIDIA GeForce GT 750M. Since CUDA is designed specifically for NVIDIA GPUs, it will not work with AMD GPUs…
Thank you for your tutorial, it works!