I started trying Theano today and wanted to use the GPU (NVIDIA GeForce GT 750M, 2048 MB) on my Mac. Here are brief instructions on how to use the GPU on a Mac, largely following the official guide at http://deeplearning.net/software/theano/install.html#mac-os.
Install Theano:
$ pip install Theano
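To confirm the install worked, a quick sanity check is to import the package and print its version (the exact version string will depend on what pip installed):

$ python -c 'import theano; print(theano.__version__)'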
Download and install CUDA: https://developer.nvidia.com/cuda-downloads
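Once the installer finishes, you can verify the toolkit with nvcc. I use the full path here since PATH is not set up yet; this assumes the default CUDA 7.5 install location:

$ /Developer/NVIDIA/CUDA-7.5/bin/nvcc --version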
Put the following lines into your ~/.bash_profile:
# Theano and CUDA
PATH="/Developer/NVIDIA/CUDA-7.5/bin/:$PATH"
export LD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-7.5/lib/
export CUDA_ROOT=/Developer/NVIDIA/CUDA-7.5/
export THEANO_FLAGS='mode=FAST_RUN,device=gpu,floatX=float32'
Note that the PATH line is necessary. Otherwise you may see the following message:
ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.
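After editing the profile, reload it in the current shell and confirm that nvcc can now be found (the path below follows from the PATH line above):

$ source ~/.bash_profile
$ which nvcc
/Developer/NVIDIA/CUDA-7.5/bin/nvcc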
Configure Theano:
$ cat .theanorc
[gcc]
cxxflags = -L/usr/local/lib -L/Developer/NVIDIA/CUDA-7.5/lib/
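If you would rather keep everything in one place instead of exporting THEANO_FLAGS, the same flags can also go into .theanorc under a [global] section; a minimal sketch:

[global]
mode = FAST_RUN
device = gpu
floatX = float32

[gcc]
cxxflags = -L/usr/local/lib -L/Developer/NVIDIA/CUDA-7.5/lib/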
Test whether the GPU is used:
$ cat check.py
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
# a CPU graph contains Elemwise ops; on the GPU they are replaced by GpuElemwise
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 time python check.py
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 1.743682 seconds
Result is [ 1.23178029 1.61879337 1.52278066 ..., 2.20771813 2.29967761 1.62323284]
Used the cpu
2.47 real 2.19 user 0.27 sys

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 time python check.py
Using gpu device 0: GeForce GT 750M
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 1.186971 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761 1.62323296]
Used the gpu
2.09 real 1.59 user 0.41 sys
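For a quicker check than timing a script, you can also ask Theano directly which device it picked up; assuming the THEANO_FLAGS export above is active, this should print gpu (after the usual startup line):

$ python -c 'import theano; print(theano.config.device)'
Using gpu device 0: GeForce GT 750M
gpu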
A more realistic example:
$ cat lr.py
import numpy
import theano
import theano.tensor as T

rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
     rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000

# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))            # Probability of having a one
prediction = p_1 > 0.5                             # The prediction that is done: 0 or 1
xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)  # Cross-entropy
cost = xent.mean() + 0.01 * (w ** 2).sum()         # The cost to optimize
gw, gb = T.grad(cost, [w, b])

# Compile expressions to functions
train = theano.function(
    inputs=[x, y],
    outputs=[prediction, xent],
    updates=[(w, w - 0.01 * gw), (b, b - 0.01 * gb)],
    name="train")
predict = theano.function(inputs=[x], outputs=prediction, name="predict")

if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm']
        for x in train.maker.fgraph.toposort()]):
    print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv']
          for x in train.maker.fgraph.toposort()]):
    print('Used the gpu')
else:
    print('ERROR, not able to tell if theano used the cpu or the gpu')
    print(train.maker.fgraph.toposort())

for i in range(training_steps):
    pred, err = train(D[0], D[1])

print("target values for D")
print(D[1])
print("prediction on D")
print(predict(D[0]))

$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 time python lr.py
Used the cpu
target values for D
[ 1. 1. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 1. 1. 1. 0. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 1. 1. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1.]
prediction on D
[1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 0 0 1 1 0 1 1 1 1 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 1 1 0 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 1 1 1 1 0 1 0 0 0 0 1 0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 0 0 0 1 0 1 1 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1 1 1 0 0 1 1 1 1 0 0 0 1 1 0 0 1 0 0 0 0 1 1 1 0 1 1 1 0 1 0 0 0 0 0 1 1 1 1 1 1 0 0 1 1 1 0 1 0 1 0 1 1 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 1 0 0 1 1 0 0 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 1 1 1 0 1 1 1 1 1 0 0 1 0 0 0 0 1 1 0 0 0 0 0 1 1 1 0 1 1 1 0 0 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 1]
8.92 real 8.24 user 1.14 sys

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 time python lr.py
Using gpu device 0: GeForce GT 750M
Used the gpu
target values for D
[ 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 0. 1. 1. 1. 0. 1. 1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 0. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 0. 1. 0.]
prediction on D
[1 0 0 0 0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 1 1 0 1 1 1 0 0 1 1 1 1 1 1 0 0 1 0 0 1 1 0 0 1 1 0 1 0 1 1 0 1 1 1 0 1 1 0 0 0 1 1 1 1 1 0 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 0 0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1 0 1 1 1 0 0 0 1 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 1 1 1 0 0 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 0 1 0 1 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 1 0 1 1 1 0 1 1 1 0 0 0 0 1 0 1 0 0 1 0 0 1 1 0 1 0 1 1 1 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 0 1 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0 1 1 1 0 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 0 1 1 0 1 1 1 0 0 1 1 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 0]
19.78 real 17.61 user 1.24 sys
So it seems this GPU does not outperform the CPU. Well, the GT 750M may not be the best GPU you can get… Someone else here has had a similar experience.
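If you want to see where the time actually goes, Theano has a built-in profiler that can be switched on from the same flags (profile=True); it prints per-op timings when the script exits, which should show whether the run is dominated by GPU kernels or by host/GPU transfers:

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32,profile=True python lr.py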
Hi Daoyuan,
That is a nice tutorial.
What is your computer setup?
2017 Macs do not use NVIDIA cards, and CUDA won't install on the card they have chosen (Radeon).
Without CUDA, Theano will not be able to use the GPU. However, it can still run on the CPU alone, although at a much reduced speed.
There is currently no easy solution for Mac users, and that is why we are switching to Linux environments on PC desktops.
Gilles
Hi Gilles,
My Mac is a late-2014 model with an NVIDIA GeForce GT 750M. Since CUDA is designed specifically for NVIDIA GPUs, it will not work with AMD GPUs…
Thank you for your tutorial, it works!