Efficient Matlab Programs

Metadata Updated: May 2, 2019

Matlab has a reputation for running slowly. Here are some pointers on how to speed computations, to an often unexpected degree. Subjects currently covered:

Matrix Coding

Implicit Multithreading on a Multicore Machine

Sparse Matrices

Sub-Block Computation to Avoid Memory Overflow


Matrix Coding - 1

Matlab documentation notes that efficient computation depends on using the matrix facilities, and that mathematically identical algorithms can have very different runtimes, but they are a bit coy about just what these differences are. A simple but telling example:

The following is the core of the GD-CLS algorithm of Berry et.al., copied from fig. 1 of Shahnaz et.al, 2006, "Document clustering using nonnegative matrix factorization':

     for jj = 1:maxiter
         A = W'*W + lambda*eye(k);
         for ii = 1:n
             b = W'*V(:,ii);
             H(:,ii) = A \ b;
         end
         H = H .* (H>0);
         W = W .* (V*H') ./ (W*(H*H') + 1e-9);
     end

Replacing the columwise update of H with a matrix update gives:

      for jj = 1:maxiter
          A = W'*W + lambda*eye(k);
          B = W'*V;
          H = A \ B;
          H = H .* (H>0);
          W = W .* (V*H') ./ (W*(H*H') + 1e-9);
      end

These were tested on an 8049 x 8660 sparse matrix bag of words V (.0083 non-zeros), with W of size 8049 x 50, H 50 x 8660, maxiter = 50, lambda = 0.1, and identical initial W. They were run consecutivly, multithreaded on an 8-processor Sun server, starting at ~7:30PM. Tic-toc timing was recorded.

Runtimes were respectivly 6586.2 and 70.5 seconds, a 93:1 difference. The maximum absolute pairwise difference between W matrix values was 6.6e-14.

Similar speedups have been consistantly observed in other cases. In one algorithm, combining matrix operations with efficient use of the sparse matrix facilities gave a 3600:1 speedup.

For speed alone, C-style iterative programming should be avoided wherever possible. In addition, when a couple lines of matrix code can substitute for an entire C-style function, program clarity is much improved.


Matrix Coding - 2

Applied to integration, the speed gains are not so great, largely due to the time taken to set up the and deal with the boundaries. The anyomous function setup time is neglegable. I demonstrate on a simple uniform step linearly interpolated 1-D integration of cos() from 0 to pi, which should yield zero: tic; step = .00001; fun = @cos; start = 0; endit = pi; enda = floor((endit - start)/step)step + start; delta = (endit - enda)/step; intF = fun(start)/2; intF = intF + fun(endit)delta/2; intF = intF + fun(enda)(delta+1)/2; for ii = start+step:step:enda-step intF = intF + fun(ii); end intF = intFstep toc;

intF = -2.910164109692914e-14 Elapsed time is 4.091038 seconds.

Replacing the inner summation loop with the matrix equivalent speeds things up a bit:

      tic;
      step = .00001;
      fun = @cos;
      start = 0;
      endit = pi;
      enda = floor((endit - start)/step)*step + start;
      delta = (endit - enda)/step;
      intF = fun(start)/2;
      intF = intF + fun(endit)*delta/2;
      intF = intF + fun(enda)*(delta+1)/2;
      intF = intF + sum(fun(start+step:step:enda-step));
      intF = intF*step
      toc;

intF = -2.868419946011613e-14 Elapsed time is 0.141564 seconds.

The core computation takes about 70% of the full time used above, but without the boundary handling gives a potentially significant error:

    tic; sum(fun(start:step:endit))*step,  toc;

ans = 2.653589763815900e-06 Elapsed time is 0.099088 seconds. ----------------------------------------------------------------------------------------------------------------------

Implicit Multithreading on a Multicore Machine

Select File > Preferences > General > Multithreading. and enable multithreading. Enjoy.

See Multiprocessing in MATLAB on the help system for documentation.


Sparse Matrices

In Matlab, the sparse matrix facilities are activated by simply making a matrix sparse,as in Xs = sparse(X), and using it. Most of the matrix operations are able to make effecient use of truely sparse matrices, with considerable relief of memory demands. A simple example follows:

load ... /home/stutz/DaSH/IDU/ASAP/Runs-20080303_gdcls/gdcls_dataset.mat

X = Data0.X_T; % a terms*docs bag of word counts nnz(X) / numel(X) % ans = 0.0044

C = Data0.Cats; % a 0/1 valued non-exclusive category matrix nnz(C) / numel(C) % ans = 0.0329

The memory saved by using sparse matrices is not trivial:

Xf = full(X); Cf = full(C); Ct = C'; Ctf = Cf'; Xt = X'; Xtf = Xf';

whos X Xt Xf Xtf C Ct Cf Ctf

% Name Size Bytes Class Attributes % % C 11245x33 195536 double sparse

% Cf 11245x33 2968680 double
% Ct 33x11245 285232 double sparse
% Ctf 33x11245 2968680 double
% X 4686x11245 11649776 double sparse % Xf 14686x11245 1321152560 double
% Xt 11245x14686 11677304 double sparse % Xtf 11245x14686 1321152560 double

The disparity between C and Ct byte counts are due to the assymetry. Matlab's sparse representation requires a column index for non-empty columns plus a row index for each non-zero element.

The matrix multiplication timings vary significantly, but not quite as one might expect, due to the great disparity in matrix sizes:

tic; XC; toc; % sparsesparse % Elapsed time is 0.100225, 0.108354, 0.115247, 0.102686 seconds.

tic; CtXt; toc; % sparsesparse % Elapsed time is 0.153515, 0.146475, 0.145449, 0.156352 seconds.

tic; CtfXt; toc; % fullsparse % Elapsed 0.404232, 0.421417, 0.424837, 0.404256

tic; XCf; toc; % sparsefull % Elapsed 0.762932, 1.148991, 0.727793, 0.717451

tic; CtfXtf; toc; % fullfull % Elapsed 2.984380, 3.164782, 3.050478, 3.253079

tic; XfCf; toc; % fullfull % Elapsed 3.423657, 3.503390, 3.854899, 3.555406

tic; XfC; toc; % fullsparse % Elapsed 7.204459, 7.193888, 8.999477, 9.477529

tic; CtXtf; toc; % sparsefull % Elapsed 9.655192, 7.155294, 7.155073, 7.496134

The above was monitored with UNIX top, while running basic Matlab in multithreading mode on an 8-processor server. These calculations involved up to 16% of CPU, say 1.3 processors, but typically 12.5% or 1 processor, dropping to under 0.04% when idle.

Matrix normalization time differences are even greater, depite the advantage of multithreading The sparse matrix normalizations below ran with no more than 12.5% CPU. The following full matrix calculations quickly ramped up and held at 90-91.8% of CPU, say 7.3 processors, with only one significant (>0.2% CPU) competing process present.

tic; X / spdiags(sum( X.^2)', 0, 11245, 11245); toc;

% sparse/sparse(sparse) % Elapsed32.781410, 30.889889, 30.978490, 32.140246,

ic; X / diag(sum(X.^2)); toc; % sparse/full(sparse) % Elapsed 30.714821, 30.973805, 30.398733, 30.443698,

tic; Xf / diag(sum(Xf.^2)); toc; % full/full(full)

Killed Matlab (not responding) after 22 hours weekend clock time.

tic; Xf / spdiags(sum(Xf.^2)', 0, 11245, 11245); toc; %full/sparse(full)

Killed Matlab (not responding) after 42 hours weekend clock time.

Sub-Block Computation to Avoid Memory Overflow

The downside of matrix computation is that intermediate values may be so large that the total memory allocation exceeds what is available. Using sparse matrices helps avoid this, but not all of our matrices are sparse. If a calculation can be done on a column by colume basis, or line by line, or page by page..., we can usually break it into blocks of computable size while retaining most of speed advantage of matrix calculation. Consider the following schematic:

function [...] = whatever(.....) try

return;  % straight matrix result

catch

try

    while blocksize >= 1
        try


            for 
                <do calculation on a sub-block)

            end
          return  % accumulated result
      catch
          blocksize = floor(blocksize / 2);
      end % inner try
  end % while
  error('unable to complete whatever at blocksize == 1');
catch % middle try
    error('unable to allocate accumulator(s)');
end % middle try

end % outer try

The initially tried matrix calculation both lets us deal effeciently with small matrices and provides good documentation for what we do in the more complex blocked calculation. The intermediate try allows a graceful exit when we cannot even build an accumulator, and is probably not needed when our result is of smaller dimension than the inputs. The inner try/while quickly runs our blocksize down to something that that works, so long as the initial block processed is one of the largest.


Access & Use Information

Public: This dataset is intended for public access and use. License: U.S. Government Work

Downloads & Resources

Dates

Metadata Created Date August 1, 2018
Metadata Updated Date May 2, 2019
Data Update Frequency irregular

Metadata Source

Harvested from NASA Data.json

Additional Metadata

Resource Type Dataset
Metadata Created Date August 1, 2018
Metadata Updated Date May 2, 2019
Publisher Dashlink
Unique Identifier DASHLINK_9
Maintainer
John Stutz
Maintainer Email
Public Access Level public
Data Update Frequency irregular
Bureau Code 026:00
Metadata Context https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld
Metadata Catalog ID https://data.nasa.gov/data.json
Schema Version https://project-open-data.cio.gov/v1.1/schema
Catalog Describedby https://project-open-data.cio.gov/v1.1/schema/catalog.json
Datagov Dedupe Retained 20190501230127
Harvest Object Id f1492fc8-8eb6-47a3-b9a2-71b5d160609f
Harvest Source Id 39e4ad2a-47ca-4507-8258-852babd0fd99
Harvest Source Title NASA Data.json
Data First Published 2010-09-09
Homepage URL https://c3.nasa.gov/dashlink/resources/9/
License http://www.usa.gov/publicdomain/label/1.0/
Data Last Modified 2018-07-19
Program Code 026:029
Source Datajson Identifier True
Source Hash bea1c5db23bf45498dba4a825f926c9c02799933
Source Schema Version 1.1

Didn't find what you're looking for? Suggest a dataset here.