Matlab has a reputation for running slowly. Here are some pointers on how to speed computations, to an often unexpected degree. Subjects currently covered:

Matrix Coding

Implicit Multithreading on a Multicore Machine

Sparse Matrices

Sub-Block Computation to Avoid Memory Overflow

Matrix Coding - 1

Matlab documentation notes that efficient computation depends on using the matrix facilities, and that mathematically identical algorithms can have very different runtimes, but they are a bit coy about just what these differences are. A simple but telling example:

The following is the core of the GD-CLS algorithm of Berry et.al., copied from fig. 1 of Shahnaz et.al, 2006, "Document clustering using nonnegative matrix factorization':

```
for jj = 1:maxiter
A = W'*W + lambda*eye(k);
for ii = 1:n
b = W'*V(:,ii);
H(:,ii) = A \ b;
end
H = H .* (H>0);
W = W .* (V*H') ./ (W*(H*H') + 1e-9);
end
```

Replacing the columwise update of H with a matrix update gives:

```
for jj = 1:maxiter
A = W'*W + lambda*eye(k);
B = W'*V;
H = A \ B;
H = H .* (H>0);
W = W .* (V*H') ./ (W*(H*H') + 1e-9);
end
```

These were tested on an 8049 x 8660 sparse matrix bag of words V (.0083 non-zeros), with W of size 8049 x 50, H 50 x 8660, maxiter = 50, lambda = 0.1, and identical initial W. They were run consecutivly, multithreaded on an 8-processor Sun server, starting at ~7:30PM. Tic-toc timing was recorded.

Runtimes were respectivly 6586.2 and 70.5 seconds, a 93:1 difference. The maximum absolute pairwise difference between W matrix values was 6.6e-14.

Similar speedups have been consistantly observed in other cases. In one algorithm, combining matrix operations with efficient use of the sparse matrix facilities gave a 3600:1 speedup.

For speed alone, C-style iterative programming should be avoided wherever possible. In addition, when a couple lines of matrix code can substitute for an entire C-style function, program clarity is much improved.

Matrix Coding - 2

Applied to integration, the speed gains are not so great, largely due to the time taken to set up the and deal with the boundaries. The anyomous function setup time is neglegable. I demonstrate on a simple uniform step linearly interpolated 1-D integration of cos() from 0 to pi, which should yield zero:
tic;
step = .00001;
fun = @cos;
start = 0;
endit = pi;
enda = floor((endit - start)/step)*step + start;
delta = (endit - enda)/step;
intF = fun(start)/2;
intF = intF + fun(endit)*delta/2;
intF = intF + fun(enda)*(delta+1)/2;
for ii = start+step:step:enda-step
intF = intF + fun(ii);
end
intF = intF*step
toc;

intF = -2.910164109692914e-14
Elapsed time is 4.091038 seconds.

Replacing the inner summation loop with the matrix equivalent speeds things up a bit:

```
tic;
step = .00001;
fun = @cos;
start = 0;
endit = pi;
enda = floor((endit - start)/step)*step + start;
delta = (endit - enda)/step;
intF = fun(start)/2;
intF = intF + fun(endit)*delta/2;
intF = intF + fun(enda)*(delta+1)/2;
intF = intF + sum(fun(start+step:step:enda-step));
intF = intF*step
toc;
```

intF = -2.868419946011613e-14
Elapsed time is 0.141564 seconds.

The core computation takes about 70% of the full time used above, but without the boundary handling gives a potentially significant error:

```
tic; sum(fun(start:step:endit))*step, toc;
```

ans = 2.653589763815900e-06
Elapsed time is 0.099088 seconds. ----------------------------------------------------------------------------------------------------------------------

Implicit Multithreading on a Multicore Machine

Select File > Preferences > General > Multithreading. and enable multithreading. Enjoy.

See Multiprocessing in MATLAB on the help system for documentation.

Sparse Matrices

In Matlab, the sparse matrix facilities are activated by simply making a matrix sparse,as in Xs = sparse(X), and using it. Most of the matrix operations are able to make effecient use of truely sparse matrices, with considerable relief of memory demands. A simple example follows:

load ... /home/stutz/DaSH/IDU/ASAP/Runs-20080303_gdcls/gdcls_dataset.mat

X = Data0.X_T; % a terms*docs bag of word counts
nnz(X) / numel(X)
% ans = 0.0044

C = Data0.Cats; % a 0/1 valued non-exclusive category matrix
nnz(C) / numel(C)
% ans = 0.0329

The memory saved by using sparse matrices is not trivial:

Xf = full(X);
Cf = full(C);
Ct = C';
Ctf = Cf';
Xt = X';
Xtf = Xf';

whos X Xt Xf Xtf C Ct Cf Ctf

% Name Size Bytes Class Attributes
%
% C 11245x33 195536 double sparse

% Cf 11245x33 2968680 double

% Ct 33x11245 285232 double sparse

% Ctf 33x11245 2968680 double

% X 4686x11245 11649776 double sparse
% Xf 14686x11245 1321152560 double

% Xt 11245x14686 11677304 double sparse
% Xtf 11245x14686 1321152560 double

The disparity between C and Ct byte counts are due to the assymetry. Matlab's sparse representation requires a column index for non-empty columns plus a row index for each non-zero element.

The matrix multiplication timings vary significantly, but not quite as one might expect, due to the great disparity in matrix sizes:

tic; X*C; toc; % sparse*sparse
% Elapsed time is 0.100225, 0.108354, 0.115247, 0.102686 seconds.

tic; Ct*Xt; toc; % sparse*sparse
% Elapsed time is 0.153515, 0.146475, 0.145449, 0.156352 seconds.

tic; Ctf*Xt; toc; % full*sparse
% Elapsed 0.404232, 0.421417, 0.424837, 0.404256

tic; X*Cf; toc; % sparse*full
% Elapsed 0.762932, 1.148991, 0.727793, 0.717451

tic; Ctf*Xtf; toc; % full*full
% Elapsed 2.984380, 3.164782, 3.050478, 3.253079

tic; Xf*Cf; toc; % full*full
% Elapsed 3.423657, 3.503390, 3.854899, 3.555406

tic; Xf*C; toc; % full*sparse
% Elapsed 7.204459, 7.193888, 8.999477, 9.477529

tic; Ct*Xtf; toc; % sparse*full
% Elapsed 9.655192, 7.155294, 7.155073, 7.496134

The above was monitored with UNIX top, while running basic Matlab in multithreading mode on an 8-processor server. These calculations involved up to 16% of CPU, say 1.3 processors, but typically 12.5% or 1 processor, dropping to under 0.04% when idle.

Matrix normalization time differences are even greater, depite the advantage of multithreading The sparse matrix normalizations below ran with no more than 12.5% CPU. The following full matrix calculations quickly ramped up and held at 90-91.8% of CPU, say 7.3 processors, with only one significant (>0.2% CPU) competing process present.

tic; X / spdiags(sum( X.^2)', 0, 11245, 11245); toc;

% sparse/sparse(sparse)
% Elapsed32.781410, 30.889889, 30.978490, 32.140246,

ic; X / diag(sum(X.^2)); toc; % sparse/full(sparse)
% Elapsed 30.714821, 30.973805, 30.398733, 30.443698,

tic; Xf / diag(sum(Xf.^2)); toc; % full/full(full)

Killed Matlab (not responding) after 22 hours weekend clock time.

tic; Xf / spdiags(sum(Xf.^2)', 0, 11245, 11245); toc; %full/sparse(full)

## Killed Matlab (not responding) after 42 hours weekend clock time.

Sub-Block Computation to Avoid Memory Overflow

The downside of matrix computation is that intermediate values may be so large that the total memory allocation exceeds what is available. Using sparse matrices helps avoid this, but not all of our matrices are sparse. If a calculation can be done on a column by colume basis, or line by line, or page by page..., we can usually break it into blocks of computable size while retaining most of speed advantage of matrix calculation. Consider the following schematic:

function [...] = whatever(.....)
try

```
return; % straight matrix result
```

catch

```
try
while blocksize >= 1
try
for
<do calculation on a sub-block)
end
return % accumulated result
catch
blocksize = floor(blocksize / 2);
end % inner try
end % while
error('unable to complete whatever at blocksize == 1');
catch % middle try
error('unable to allocate accumulator(s)');
end % middle try
```

end % outer try

The initially tried matrix calculation both lets us deal effeciently with small matrices and provides good documentation for what we do in the more complex blocked calculation. The intermediate try allows a graceful exit when we cannot even build an accumulator, and is probably not needed when our result is of smaller dimension than the inputs. The inner try/while quickly runs our blocksize down to something that that works, so long as the initial block processed is one of the largest.