Parallel and GPU computing
Matlab makes it easy to use parallel and GPU computing. For parallel computing, the most direct way to get started is the "parfor" command.
Parallel Computing
To initialize the set of parallel "workers" that will be used, use the "parpool" command.
parpool
Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers.

ans =

  Pool with properties:

        Connected: true
       NumWorkers: 4
          Cluster: local
    AttachedFiles: {}
      IdleTimeout: 30 minute(s) (30 minutes remaining)
      SpmdEnabled: true
This uses the default cluster profile, which works fine on most personal computers. On other systems you can pass arguments to "parpool" to configure the pool of workers yourself.
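For example, you can request a specific number of workers, or name a cluster profile explicitly. The profile name and worker counts below are purely illustrative; use whatever matches your machine or cluster (the lines are commented out so they do not conflict with the pool already started above).

% parpool(2)            % default profile with an explicit number of workers
% parpool('local', 4)   % a named profile ('local' here) and a worker count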
If you are running a "for" loop over a range of consecutive integers and the iterations are independent of one another, "parfor" will most likely be faster. Consider the following loop.
n = 500;
ranksSingle = zeros(1,n);
tic
for ind = 1:n
    ranksSingle(ind) = rank(magic(ind));
end
toc
Elapsed time is 13.223955 seconds.
Now consider:
ranks = zeros(1,n);
tic
parfor ind = 1:n
    ranks(ind) = rank(magic(ind));
end
toc
Elapsed time is 5.658752 seconds.
Depending on the speed of your processor and the number of cores in your computer, this will be somewhat to substantially faster than the serial "for" loop. In most cases this is all the parallelization you need to see a significant speed-up in your code.
If you are working on a distributed cluster or using a particular Matlab toolbox, please visit: https://www.mathworks.com/help/distcomp/getting-started-with-parallel-computing-toolbox.html There you can find information and examples for the other parallel features.
% Aaron's Laptop:
% 5.88s  1 worker
% 2.783s 4 workers
close all;clear;
GPU Computing
If you have many small tasks that do not require much memory (running an ensemble of stochastic simulations is the classic example), then GPU computing can offer immense speed-ups.
The Graphics Processing Unit (GPU) is a dedicated piece of hardware primarily used to display and render graphics. It has a different architecture from the standard Central Processing Unit (CPU) on which most computing is done: it has a huge number of cores (often more than 1000) but relatively little memory, which is shared between cores. Displaying and rendering graphics requires many vector operations, and as graphics became more and more important a separate unit was developed to take that load off the CPU.
The key thing to remember when working with GPU computing is that the GPU is a separate device with its own memory, distinct from the CPU's. To create an array on the GPU you use the "gpuArray" class, which supports many of the standard array functions but stores the data on the GPU.
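As a minimal illustrative sketch (assuming you have a supported GPU and the Parallel Computing Toolbox), moving data to the GPU, computing there, and copying the result back looks like this:

A = gpuArray( rand(1000) );   % copy a 1000x1000 random matrix into GPU memory
B = A * A;                    % the matrix multiplication runs on the GPU
C = gather( B );              % "gather" copies the result back to the Matlab workspace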
Here we follow the example at: https://www.mathworks.com/help/distcomp/examples/illustrating-three-approaches-to-gpu-computing-the-mandelbrot-set.html which renders a visualization of the Mandelbrot set. First, the version without the GPU:
maxIterations = 500;
gridSize = 1000;
xlim = [-0.748766713922161, -0.748766707771757];
ylim = [ 0.123640844894862,  0.123640851045266];
Setup
t = tic();
x = linspace( xlim(1), xlim(2), gridSize );
y = linspace( ylim(1), ylim(2), gridSize );
[xGrid,yGrid] = meshgrid( x, y );
z0 = xGrid + 1i*yGrid;
count = ones( size(z0) );
Calculate
z = z0;
for n = 0:maxIterations
    z = z.*z + z0;
    inside = abs( z ) <= 2;
    count = count + inside;
end
count = log( count );
Show
cpuTime = toc( t );
figure(1)
imagesc( x, y, count );
axis off
title( sprintf( '%1.2fsecs (without GPU)', cpuTime ) );
We now begin the GPU version.
Setup
t = tic();
% We use gpuArray.linspace to create an equispaced grid IN THE GPU
% MEMORY.
x = gpuArray.linspace( xlim(1), xlim(2), gridSize );
y = gpuArray.linspace( ylim(1), ylim(2), gridSize );
[xGrid,yGrid] = meshgrid( x, y );
z0 = complex( xGrid, yGrid );
This is another way of initializing an array on the GPU.
count = ones( size(z0), 'gpuArray' );
Now we can work in the standard way with these objects.
Calculate
z = z0;
for n = 0:maxIterations
    z = z.*z + z0;
    inside = abs( z ) <= 2;
    count = count + inside;
end
count = log( count );
Since the variable "count" was initialized on the GPU, the computed data is not in CPU (and hence Matlab workspace) memory. To bring it back to the CPU we use the "gather" command.
count = gather( count ); % Fetch the data back from the GPU.
GPUTime = toc( t );
figure(2)
imagesc( x, y, count )
axis off
title( sprintf( '%1.3fsecs (GPU) = %1.1fx faster', ...
    GPUTime, cpuTime/GPUTime ) )
% Aaron's Laptop (GTX 860M)
% 19.07s CPU
% 2.256s GPU
delete(gcp('nocreate'))
Parallel pool using the 'local' profile is shutting down.