\documentstyle[twoside]{article}
\oddsidemargin 0.0in 
\evensidemargin 0.0in
\pagestyle{headings}
\topmargin -0.5in
\textheight 9.0in 
\textwidth 6.5in 
\def \la{\langle}
\def \ra{\rangle}
\def \lp{\left(}
\def \rp{\right)}
\def \lb{\left[}
\def \rb{\right]}
\def \lc{\left\{}
\def \rc{\right\}}
\def \om{\omega}
\def \oom{\frac{1}{\omega}}
\def \ha{\frac{1}{2}}
\def \sA{{\bf A}}
\def \su{{\bf u}}
\def \sb{{\bf b}}
\def \sr{{\bf r}}
\def \pD{D^{(\pi)}}
\def \pC{C^{(\pi)}}
\def \pL{C_L^{(\pi)}}
\def \pU{C_U^{(\pi)}}
\begin{document}
\setcounter{page}{0}
\vspace*{1.0in}
\centerline{\bf NSPCG User's Guide}
\centerline{\bf Version 1.0}
\vspace*{0.24in}
\centerline{A Package for}
\centerline{Solving Large Sparse Linear Systems by}
\centerline{Various Iterative Methods}
\vspace*{0.24in}
\centerline{Thomas C. Oppe}
\centerline{Wayne D. Joubert}
\centerline{David R. Kincaid}
\vspace*{0.24in}
\centerline{April 1988 \hspace*{1.0in} CNA-216}
\vspace*{3.5in}
\noindent
This work was supported in part by the National Science Foundation under
Grant CCR-8518722, the Department of Energy under Grant DE-FG05-87ER25048,
and Cray Research, Inc. under Grant LTR DTD with The University of Texas 
at Austin.  Thomas C.  Oppe's participation was supported in part by the 
Control Data Corporation through its PACER Fellowship Program (1985-87) 
and by Sandia National Laboratory through Contract No. 06-4298 (1987-88).
\newpage
\setcounter{page}{0}
\title{NSPCG User's Guide 
         \thanks{ This work was supported in part by the National 
         Science Foundation under Grant CCR-8518722, the Department 
         of Energy under Grant DE-FG05-87ER25048, and Cray Research,
         Inc. under Grant LTR DTD with The University of Texas at 
         Austin.  Thomas C.  Oppe's participation was supported 
         in part by the Control Data Corporation through its PACER 
         Fellowship Program (1985-87) and by Sandia National Laboratory 
         through Contract No. 06-4298 (1987-88). } \\
         \large{Version 1.0}}
\author{A Package for \\
        Solving Large Sparse Linear Systems by \\
        Various Iterative Methods \\ \\ 
        Thomas C. Oppe \\ Wayne D. Joubert \\ David R. Kincaid \\ \\ 
        Center for Numerical Analysis \\
        The University of Texas at Austin}
\date{April, 1988}
\maketitle
\begin{abstract}

   NSPCG (for Nonsymmetric Preconditioned Conjugate Gradient) is
a computer package to solve the linear system $Au=b$ by various
iterative methods.  The coefficient matrix $A$ is assumed to be large and
sparse with real coefficients.  A wide selection of preconditioners 
and accelerators is available for both symmetric and nonsymmetric 
coefficient matrices.  Several sparse matrix data structures are
available to represent matrices whose structures range from highly
structured to completely unstructured.
\end{abstract}
\newpage
\centerline{\bf Disclaimer}
\bigskip

   The University of Texas at Austin and the Center for Numerical 
Analysis (CNA) disclaim all warranties with regard to the NSPCG 
software package, including all warranties of merchantability and 
fitness, and any stated express warranties are in lieu of all 
obligations or liability on the part of The University or the CNA 
for damages, including, but not limited to, special, indirect, or 
consequential damages arising out of or in connection with the 
use or performance of NSPCG.  In no event will The University or 
the CNA be liable for any direct, indirect, special, incidental, or 
consequential damages arising in connection with use of or 
inability to use NSPCG or the documentation.

   It should be emphasized that the NSPCG package is still in preliminary
and incomplete form.  For example, 
\begin{itemize}
 \item The use of adaptive procedures for $\om$ in the SOR and SSOR 
       methods assume a symmetric or nearly symmetric matrix.  In 
       the nonsymmetric case, the user must supply a good choice of 
       $\om$ when using SSOR with any of the nonsymmetric accelerators, 
       and this value will not change during the iterations.  
   
 \item Eigenvalue estimation procedures are available for only certain 
       of the nonsymmetric accelerators.

 \item Not all accelerators allow right-oriented or split preconditioners.

 \item Not all preconditioners are available for each of the allowed data 
       structures in NSPCG.
\end{itemize}

   A limited number of copies of the NSPCG package will be made available
on request with the understanding that it is intended as a research tool
and will undergo further development.  The interested reader should write
to the address below for additional information on obtaining the 
distribution tape of the program.  Also, reports of difficulties encountered
in using the system or comments and suggestions for improving the package 
would be welcome.
\bigskip

\begin{tabular}{l}
Center for Numerical Analysis \\
RLM Bldg. 13.150 \\
University of Texas at Austin \\
Austin, Texas \ \ 78713-8510 
\end{tabular}
\newpage
\tableofcontents
\newpage
\listoftables
\newpage
\section{Introduction}
\label{intro}
\indent
 
   NSPCG is a computer package to solve large sparse systems of linear
equations by iterative methods.  The package uses various acceleration
techniques, such as the conjugate gradient method, Chebyshev
acceleration and various generalized conjugate gradient methods for
nonsymmetric systems, in conjunction with various preconditioners
(or, basic iterative methods).  NSPCG was developed as part of the ITPACK
project of the Center for Numerical Analysis at The University of Texas
at Austin \cite{itpackv2c,itpack2c,future}.
\smallskip
 
   Some of the salient features of the package are as follows:
\begin{itemize}
\item
 Accelerators for the nonsymmetrizable case such as ORTHOMIN, 
 GCR (Generalized Conjugate Residual), Lanczos, and Paige
 and Saunders' LSQR method are provided, as well as accelerators
 for the symmetrizable case such as conjugate gradient and 
 Chebyshev acceleration.
 
\item
 Basic preconditioners such as Jacobi, Incomplete LU 
 Decomposition, Modified Incomplete LU Decomposition, 
 Successive Overrelaxation, and Symmetric Successive 
 Overrelaxation are included in the package.  Furthermore, 
 some of these preconditioners are available either as 
 left-, right-, or two-sided preconditioners.  (See
 Section~\ref{bbprecons} for additional details.)
 
\item
 The package is modular.  Any preconditioner may be used with
 any accelerator, except for SOR which is unaccelerated.  Any
 preconditioner may be used with any of the allowable data
 storage formats, except for block preconditioners which are
 available only with the diagonal data structures.
 
\item
 Several matrix storage schemes are available.  The user may
 represent the matrix in one of several sparse matrix formats:
 primary storage (the same format used in the ELLPACK package
 \cite{ELLPACK} from Purdue University),
 symmetric diagonal storage, nonsymmetric diagonal storage,
 symmetric coordinate storage, or nonsymmetric coordinate
 storage.
 
\item
 The package can be used in a matrix-free mode, in which the user
 supplies customized routines for performing matrix operations.
 
\item
 The data structures have been chosen for efficiency on vector,
 or pipelined, computers.  Many of the preconditioners have been
 vectorized to work efficiently on such computers as the
 Cyber 205, Cray 1, and Cray X-MP.  It is expected that the
 package would perform well on the Hitachi, Fujitsu and NEC 
 supercomputers.
\end{itemize}

   One of the purposes for the development of the NSPCG package
was to investigate the suitability of various basic iterative
methods for vector computers.  The degree of vectorization that
is possible for a particular iterative method often depends on the
underlying structure of the matrix, the data structure used for
the matrix representation, the ordering used for the equations,
and the vector computer being used.  NSPCG allows several sparse
matrix data structures suitable for matrices with regular or
irregular structures.  Also, various orderings can be given to
the equations to enhance vectorization of the basic iterative
method.  Finally, the package has been written so that it can
be installed on supercomputers with different vectorizing 
philosophies (e.g., memory-to-memory computations as with the 
Cyber 205 computer or register-to-register computations as with 
the Cray computers).
 
   Another purpose for the development of the NSPCG package was
to provide a common modular structure for research on iterative
methods.  In addition to providing a large number of preconditioners
and accelerators, the package has been constructed to facilitate the
addition of new preconditioners and new acceleration schemes
into the package.  Thus, the package is an experimental research
tool for evaluating certain aspects of iterative methods and can
provide a guide for the construction of an iterative algorithm
which is tailored to a specific problem.  The code is not intended
to be used in production situations, but it can provide information
on
\bigskip
\begin{itemize}
  \item the convergence or nonconvergence of an iterative method
  \item the number of iterations required for convergence
  \item the existence of a preconditioner (e.g., incomplete
         factorization preconditioners)
  \item the suitability of an approximate stopping test in 
         reference to an idealized stopping test
  \item the effectiveness of certain adaptive procedures for
         selecting iterative parameters
  \item some vectorization techniques for various iterative
         kernels using certain data structures
\end{itemize}
\bigskip
This information can then be used in designing a production
iterative code.  

\newpage
\section{Usage}
\label{usage}
\indent
 
   The calling sequence for the package is
\bigskip
 
\noindent
{\tt CALL NSPCG ($\la$~{\em precon}~$\ra$ ,
    $\la$~{\em accel}~$\ra$ ,NDIM,MDIM,N,MAXNZ,COEF,JCOEF,P,IP,U,UBAR,RHS, \\
     \hspace*{2.5in} WKSP,IWKSP,NW,INW,IPARM,RPARM,IER) }
 
\bigskip
\noindent
where
\bigskip
\begin{list}{}{\labelwidth 0.70in
               \leftmargin 1.00in \rightmargin 0.25in}
 \item[$\la$~{\em precon}~$\ra$ \hfill]
           is the name of a preconditioning routine which must
           be declared EXTERNAL in the user's calling routine.  It
           must be one of a certain number of choices given in
           Section~\ref{precons}. [name]
 
 \item[$\la$~{\em accel}~$\ra$ \hfill]
           is the name of an acceleration routine which must
           be declared EXTERNAL in the user's calling routine.  It
           must be one of a certain number of choices given in
           Section~\ref{accels}. [name]
 
 \item[NDIM \hfill]
           is the row dimension of the COEF array (See 
           Section~\ref{storage} on Storage Modes for more information).
           [integer, input]
 
 \item[MDIM \hfill]
           is the column dimension of the COEF array (See 
           Section~\ref{storage} on Storage Modes for more information).
           [integer, input]
 
 \item[N \hfill]
           is the order of the linear system.  [integer, input]
 
 \item[MAXNZ \hfill]
           is the active column width of the COEF array (See 
           Section~\ref{storage} on Storage Modes for more information).
           [integer, input/output]
 
 \item[COEF \hfill]
           is a real array containing the matrix nonzero coefficients
           in one of several possible formats to be given in
           Section~\ref{storage}. [real, input/output]
 
 \item[JCOEF \hfill]
           is an integer array containing auxiliary matrix nonzero
           information corresponding to COEF. [integer, input/output]
 
 \item[P \hfill]
           is an integer vector.  If no permutation is to be applied
           to the matrix, then P is unused and may be dimensioned
           to be of length one.
           If the matrix is to be permuted into another
           form, such as red-black, P is a coloring vector of length N
           indicating the integer coloring number of each equation.
           On output, P is then the permutation vector corresponding
           to the multi-color ordering.  (See Sections~\ref{rb} and 
           \ref{mc} for more details.) [integer, input/output]
 
 \item[IP \hfill]
           is an integer vector.  If no permutation is to be applied
           to the matrix, then IP is unused and may be dimensioned
           to be of length one.
           If the matrix is to be permuted, IP is an integer workspace
           vector of length N on input and the inverse permutation
           vector corresponding to P on output.  (See Sections~\ref{rb}
           and \ref{mc} for more details.) [integer, output]
 
 \item[U \hfill]
           is a real vector of length N containing the current
           estimate of the solution to the linear system on
           input.  The user must supply an initial guess, which
           can be all zeros if no information is known.  On
           output, U contains the solution estimate computed by
           the package. [real, input/output]
 
 \item[UBAR \hfill]
           is a real vector of length N containing the exact solution
           to the linear system, if known.  This vector is used
           only if the exact stopping test is used.  Otherwise,
           it may be dimensioned to be of length one.  (See
           Section~\ref{stop} for information on the available 
           stopping tests.) [real, input]
 
 \item[RHS \hfill]
           is a real vector of length N containing the right-hand-side
           of the matrix problem. [real, input]
 
 \item[WKSP \hfill]
           is a real vector of length NW used for workspace.
           [real, input]
 
 \item[IWKSP \hfill]
           is an integer vector of length INW used for workspace.
           [integer, input]
 
 \item[NW \hfill]
           is an integer variable indicating the length of the
           WKSP vector on input and the amount used on output.
           If insufficient real workspace is provided, NW is
           set on output to the amount of workspace needed
           at the point execution terminated. (See Section~\ref{nw}
           for suggested initial values.) [integer, input/output]
 
 \item[INW \hfill]
           is an integer variable indicating the length of the
           IWKSP vector on input and the amount used on output.
           If insufficient integer workspace is provided, INW is
           set on output to the amount of workspace needed
           at the point execution terminated. (See Section~\ref{nw}
           for suggested initial values.) [integer, input/output]
 
 \item[IPARM \hfill]
           is an integer vector of length $25$ giving various
           integer parameters which are described in Section
           ~\ref{params}. [integer, input/output]
 
 \item[RPARM \hfill]
           is a real vector of length $16$ giving various real
           parameters which are described in Section~\ref{params}.
           [real, input/output]
 
 \item[IER \hfill]
           is an error flag.  A nonzero value on output indicates
           a fatal or warning error was detected during the
           iteration.  (See Section~\ref{ier} for a list of error 
           codes.) [integer, output]
\end{list}
 
\newpage
\section{Choices for Preconditioner}
\label{precons}
\indent
 
   The naming convention used for $\la$~{\em precon}~$\ra$,
the routine specifying the preconditioner, indicates both the
preconditioner and the data storage format.  $\la$~{\em precon}~$\ra$
must be declared EXTERNAL in the user's calling program and has
the form $\la$~{\em name}~$\ra i$ where
 
\bigskip
\begin{tabular}{rl}
        $i = 1$ & for primary storage format  \\
        $  = 2$ & for symmetric diagonal storage        \\
        $  = 3$ & for nonsymmetric diagonal storage     \\
        $  = 4$ & for symmetric coordinate storage      \\
        $  = 5$ & for nonsymmetric coordinate storage   \\
        $  = 6$ & for permuted primary storage format   \\
        $  = 7$ & for permuted diagonal storage (symmetric
                    or nonsymmetric)
\end{tabular}
 
\bigskip
\noindent
(See Section~\ref{storage} for a detailed description of these
storage schemes.)

If $i = 1,2,$ or $3$, the parameters P and IP in the calling sequence
are not used and therefore can be dimensioned to be of length one.
If $i=6$ or $7$, P must contain a user-supplied coloring vector
indicating how the equations and unknowns are to be ordered prior
to permuting the system.  Preconditioners with names ending with
$4$ or $5$ may be used with or without a permutation.
 
    The available choices for
$\la$~{\em precon}~$\ra$ are:
 
\bigskip
\begin{tabular}{lll}
 RICH$i$  & Richardson's method
          & $(i=1,2,3,4,5)$                            \\
 JAC$i$   & Jacobi method
          & $(i=1,2,3,4,5)$                            \\
 LJAC$i$  & Line Jacobi method
          & $(i=2,3)$                                  \\
 LJACX$i$ & Line Jacobi method (approx. inverse)
          & $(i=2,3)$                                  \\
 SOR$i$   & Successive Overrelaxation
          & $(i=1,2,3,6,7)$                            \\
 SSOR$i$  & Symmetric SOR
          & $(i=1,2,3,6,7)$                            \\
 IC$i$    & Incomplete Cholesky
          & $(i=1,2,3,6)$                              \\
          & (Note:  IC7 = BIC7 or BICX7)
          &                                            \\
 MIC$i$   & Modified Incomplete Cholesky
          & $(i=1,2,3,6)$                              \\
          & (Note:  MIC7 = MBIC7 or MBICX7)
          &                                            \\
 LSP$i$   & Least Squares Polynomial
          & $(i=1,2,3,4,5)$                            \\
 NEU$i$   & Neumann Polynomial
          & $(i=1,2,3,4,5)$                            \\
 LSOR$i$  & Line SOR
          & $(i=2,3)$                                  \\
 LSSOR$i$ & Line SSOR
          & $(i=2,3)$                                  \\
 LLSP$i$  & Line Least Squares Polynomial
          & $(i=2,3)$                                  \\
 LNEU$i$  & Line Neumann Polynomial
          & $(i=2,3)$                                  \\
 BIC$i$   & Block Incomplete Cholesky (ver. 1)
          & $(i=2,3,7)$                                \\
 BICX$i$  & Block Incomplete Cholesky (ver. 2)
          & $(i=2,3,7)$                                \\
 MBIC$i$  & Modified Block Incomplete Cholesky (ver. 1)
          & $(i=2,3,7)$                                \\
 MBICX$i$ & Modified Block Incomplete Cholesky (ver. 2)
          & $(i=2,3,7)$                                \\
 RS$i$    & Reduced System Method
          & $(i=6,7)$
\end{tabular}
 
\bigskip
\noindent
(See Section~\ref{bbprecons} for a brief description of each of
these preconditioners.)
\newpage
\section{Choices for Accelerator}
\label{accels}
\indent
 
   A large collection of accelerators is available to handle both
symmetric and nonsymmetric matrix problems.  A list of the
available choices for $\la$~{\em accel}~$\ra$ is given below
with their corresponding common names given in parenthesis.
Any $\la$~{\em precon}~$\ra$ entry can be used with any
$\la$~{\em accel}~$\ra$ entry except as noted.  Also, any
accelerator allows left, right, or split orientation of the
preconditioner except as noted.
     
\bigskip
\begin{list}{}{\labelwidth 0.75in
               \leftmargin 1.00in \rightmargin 0.25in}
\item[CG \hfill](Conjugate Gradient acceleration):
      This is the two-term form of conjugate gradient for
      symmetric and positive definite (SPD) matrices and mildly
      nonsymmetric matrices.  Only left preconditioning is
      allowed.
 
\item[SI \hfill](Chebyshev acceleration or Semi-Iteration):
      This is the two-term form of Chebyshev acceleration for
      the same general class of matrices as CG.  Only left
      preconditioning is allowed.
 
\item[SOR \hfill](Successive Overrelaxation):
      This routine must be used as $\la$~{\em accel}~$\ra$
      for the SOR algorithm with adaptive parameter determination.  
      It is intended for SPD and mildly
      nonsymmetric matrices.  For this choice of 
      $\la$~{\em accel}~$\ra$, $\la$~{\em precon}~$\ra$
      is restricted to SOR$i$ and LSOR$i$ for allowed values 
      of $i$.  Also, these choices of $\la$~{\em precon}~$\ra$ 
      cannot be used with other accelerations since SOR cannot 
      be accelerated.  Note that Successive Overrelaxation is 
      not an acceleration but is included here since the
      function of the SOR module is analogous to that of an
      $\la$~{\em accel}~$\ra$ module.
 
\item[SRCG \hfill](SSOR CG Algorithm):
      This algorithm performs the SSOR conjugate gradient
      algorithm with an adaptive procedure for $\om$.  It is 
      intended for SPD and mildly nonsymmetric matrices. 
      $\la$~{\em precon}~$\ra$ is restricted to SSOR$i$ and 
      LSSOR$i$ for allowed values of $i$.  These selections 
      for $\la$~{\em precon}~$\ra$ can also be used with other 
      accelerators, but no adapting on $\om$ will be performed.
      Only left preconditioning is allowed.  Note that the SRCG 
      module combines both an accelerator and a preconditioner 
      but is included here since its function is analogous to 
      that of an $\la$~{\em accel}~$\ra$ module.
 
\item[SRSI \hfill](SSOR SI Algorithm):
      This algorithm performs the SSOR semi-iteration
      algorithm with an adaptive procedure for $\om$.  It is 
      intended for SPD and mildly nonsymmetric matrices.  
      $\la$~{\em precon}~$\ra$ is restricted to SSOR$i$ and 
      LSSOR$i$ for allowed values of $i$.  Only left 
      preconditioning is allowed.  Note that the SRSI module 
      combines both an accelerator and a preconditioner but is 
      included here since its function is analogous to that of 
      an $\la$~{\em accel}~$\ra$ module.
 
\item[BASIC \hfill](Basic Iterative Method):
      This algorithm runs the unaccelerated iterative method
      corresponding to the given preconditioner.  The default 
      extrapolation factor used in BASIC is $2$/(EMAX+EMIN) where 
      EMAX and EMIN are RPARM variables.  To run the unaccelerated
      method with no extrapolation, EMAX and EMIN should be
      set to $1$.  Only left preconditioning is allowed.
      Note that the BASIC module is not an accelerator
      but is included here since its function is analogous 
      to that of an $\la$~{\em accel}~$\ra$ module.
 
\item[ME \hfill](Minimal Error Algorithm):
      This is an ORTHODIR-like algorithm for symmetric systems
      with a three term recurrence which minimizes 
      $\|Q_R(u^{(n)}-A^{-1}b)\|_2$, the $2$-norm of the error, 
      at each iteration \cite{ME}.
 
\item[CGNR \hfill](Conjugate Gradient applied to the Normal Equations):
      This implementation of CG applied to the normal equations
      of the preconditioned system minimizes 
      $\|Q_L^{-1}(b-Au^{(n)})\|_2$, the $2$-norm of the residual 
      vector of the original preconditioned system, at each iteration 
      \cite{LSQR}.  Only left preconditioning is allowed.
 
\item[LSQR \hfill](Least Squares Algorithm):
      This algorithm calculates the same iterates as CGNR but in
      a slightly more stable fashion \cite{LSQR}.
      Only left preconditioning is allowed.
 
\item[ODIR \hfill](ORTHODIR):
      This is a truncated/restarted method useful for nonsymmetric
      systems of equations \cite{ORTHO}.  The user can specify
      the number of old vectors used in the truncation and
      the restarting frequency.
 
\item[OMIN \hfill](ORTHOMIN):
      This is a common truncated/restarted method used for
      nonsymmetric systems \cite{ORTHO}.  Note that this method
      includes the popular GCR method (Generalized Conjugate
      Residual or restarted ORTHOMIN) \cite{GCR}.
 
\item[ORES \hfill](ORTHORES):
      This is another truncated/restarted method for nonsymmetric
      systems \cite{ORTHO}.
 
\item[IOM \hfill](Incomplete Orthogonalization Method):
      This is a truncated method due to Saad \cite{IOM1,IOM2}
      which calculates the same iterates, in exact arithmetic, as
      ORTHORES.  In the symmetric case, it runs the SYMMLQ
      algorithm of Paige and Saunders \cite{SYMMLQ}.
      Only left preconditioning is allowed.
 
\item[GMRES \hfill](Generalized Minimal Residual Method):
      This method is a truncated/restarted method which, in the
      case of pure restarting (NS2 $\leq$ NS1 $+1$ where NS1
      and NS2 are IPARM quantities), calculates
      the same iterates as the truncated/restarted ORTHODIR algorithm
      \cite{GMRES}.  In the symmetric case, it may be used to run
      the MINRES algorithm of Paige and Saunders \cite{SYMMLQ}.
 
\item[USYMLQ \hfill](Unsymmetric LQ):
      This is a quasi-Krylov method useful for nonsymmetric
      systems \cite{Yip}.  Only left preconditioning is allowed.
 
\item[USYMQR \hfill](Unsymmetric QR):
      This is a companion algorithm to 
      USYMLQ.  It minimizes the $2$-norm of the residual over
      an appropriate quasi-Krylov space \cite{Yip}.
      Only left preconditioning is allowed.
 
\item[LANDIR \hfill](Lanczos/ORTHODIR):
      This is the first of the three variants of the Lanczos algorithm
      for nonsymmetric systems \cite{LANCZOS}.
      Only left preconditioning is allowed.
 
\item[LANMIN \hfill](Lanczos/ORTHOMIN or Biconjugate Gradient Method):
      This second variant of the Lanczos algorithm does slightly
      less work per iteration than the other variants of the 
      Lanczos method and is often referred to as the Biconjugate
      gradient method \cite{LANCZOS}.
 
\item[LANRES \hfill](Lanczos/ORTHORES or ``two-sided" Lanczos Method):
      This method is the third variant of the Lanczos algorithm
      \cite{LANCZOS}.  Only left preconditioning is allowed.
 
\item[CGCR \hfill](Constrained Generalized Conjugate Residual Method):
      This method couples truncated/restarted ORTHOMIN with
      a constrained residual technique described in 
      \cite{Wallis1,Wallis2}.
      At each iteration, the average residual over each
      two-dimensional block is forced to be zero.  The
      variable NBL2D must be appropriately set for this algorithm.

\item[BCGS \hfill](Biconjugate Gradient Squared Method):
      This method is similar to the Biconjugate Gradient Method,
      and in many cases performs better \cite{BCGS}.  Only left
      preconditioning is implemented.
\end{list}
 
\begin{table}

    The permitted $\la$~{\em precon}~$\ra$ and $\la$~{\em accel}~$\ra$ 
combinations are given in the table below.
\bigskip
\begin{center}
\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}
 \hline 
  &R&J&L&L&S&S&I&M&L&N&L&L&L&L&B&B&M&M&R \\
  &I&A&J&J&O&S&C&I&S&E&S&S&L&N&I&I&B&B&S \\
  &C&C&A&A&R&O&$i$&C&P&U&O&S&S&E&C&C&I&I&$i$ \\
  &H&$i$&C&C&$i$&R& &$i$&$i$&$i$&R&O&P&U&$i$&X&C&C& \\
  &$i$& &$i$&X& &$i$& & & & &$i$&R&$i$&$i$& &$i$&$i$&X& \\
  & & & &$i$& & & & & & & &$i$& & & & & &$i$& \\ \hline
 CG     &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\
 SI     &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\ 
 SOR    & & & & &x& & & & & &x& & & & & & & &  \\ 
 SRCG   & & & & & &x& & & & & &x& & & & & & &  \\ 
 SRSI   & & & & & &x& & & & & &x& & & & & & &  \\ 
 BASIC  &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\ 
 ME     &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\ 
 CGNR   &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\ 
 LSQR   &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\ 
 ODIR   &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\ 
 OMIN   &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\ 
 ORES   &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\ 
 IOM    &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\ 
 GMRES  &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\ 
 USYMLQ &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\ 
 USYMQR &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\ 
 LANDIR &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\ 
 LANMIN &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\ 
 LANRES &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\ 
 CGCR   &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\ 
 BCGS   &x&x&x&x& &x&x&x&x&x& &x&x&x&x&x&x&x&x \\ \hline
\end{tabular}
\caption{Permitted $\la$~{\em precon}~$\ra$ and 
         $\la$~{\em accel}~$\ra$ Combinations}
\end{center}
\end{table}

\newpage
\section{Brief Background on Preconditioners}
\label{bbprecons}
\indent
 
   The choice of a preconditioner involves the selection of a
matrix $Q$, called the {\it splitting\/} matrix, such that the
preconditioned system
\[ Q^{-1}Au = Q^{-1}b \]
is better conditioned than the original system, $Au = b$.  Clearly, 
one requirement for $Q$ is that it be easily invertible (i.e., linear 
systems having $Q$ as the coefficient matrix can be solved with much
less effort than solving the original system $Au=b$). To reduce the 
number of iterations, it is also desirable that $Q$ be ``close" to $A$ 
in the sense that the spectral radius of $I-Q^{-1}A$ be small. Thus,
in choosing a preconditioner, the user must select between methods
which usually perform a large number of ``cheap" iterations or a
small number of ``expensive" iterations.  NSPCG provides a great
number of preconditioning methods to aid the user in selecting a
preconditioner for a production code.

Left preconditioning is illustrated above.  Many nonsymmetric 
accelerators in the package allow other orientations for the 
preconditioner, such as right preconditioning,
\[ (AQ^{-1})(Qu) = b \]
or two-sided preconditioning,
\[ ( Q_L^{-1}AQ_R^{-1}) (Q_Ru) = Q_L^{-1}b \]
in the event that the preconditioner can be split into $Q = Q_LQ_R$ .
The variable IQLR [= IPARM(22)] is set by the user to determine 
the orientation of the preconditioner (left, right, or two-sided)
for those accelerators which allow orientations of the preconditioner
other than left preconditioning.  Left preconditioning is the
default and is available for all accelerators.
 
NSPCG supplies preconditioners which can be classified as ``point" 
or ``line" preconditioners.  Iterative methods are often used in 
the solution of large sparse linear systems which arise from the
discretization of partial differential equations using finite
differences or finite elements.  The solution to the partial
differential equation is approximated on a discrete set of unknowns
associated with a mesh defined on the domain of the equation.  The terms 
``point" and ``line" (or ``block") refer to whether the unknowns are 
updated one at a time, as with a single node in the mesh, or 
several at a time, as with a line of nodes in the mesh. 

Let matrix $A$ be written as
\[ \lp \begin{array}{cccc}
                 A_{11} & A_{12} & \cdots & A_{1n} \\
                 A_{21} & A_{22} & \cdots & A_{2n} \\
                 \vdots & \vdots & \ddots & \vdots \\
                 A_{n1} & A_{n2} & \cdots & A_{nn}
  \end{array} \rp \]
If matrix $A$ is viewed as a point matrix, then $n$ is the order of 
the system and each $A_{ij}$ is a scalar.  In that case, $A$ can be
written as a sum of point matrices
\[ A = D - C_L - C_U \]
where $D$ is a diagonal matrix, $C_L$ is a strictly lower triangular
matrix, and $C_U$ is a strictly upper triangular matrix.  If matrix
$A$ is viewed as a block matrix corresponding to a partitioning $\pi$
of the unknowns, then $n$ is the number of groups in the partition
and each $A_{i,j}$ is itself a matrix.  In that case, $A$ can be
written as a sum of block matrices
\[ A = \pD - \pL - \pU \]
where $\pD$ is a block diagonal matrix, $\pL$ is a
strictly lower triangular block matrix, and $\pU$ is a
strictly upper triangular block matrix.  In NSPCG, the following
restrictions apply to block matrices:
\begin{itemize}
 \item If the matrix is not permuted by NSPCG, then block preconditioners
       can only be used with symmetric or nonsymmetric diagonal
       storage format of the matrix.  In that case, each $A_{ii}$ 
       is required to be a dense banded matrix, and all submatrices 
       must be of equal size. Also, the user must specify the 
       variable KBLSZ [= IPARM(19)].  KBLSZ is the order of the 
       1-D block diagonal submatrices $A_{ii}$ which must be
       constant for all $i$.  

 \item If the matrix is permuted by NSPCG, then block preconditioners
       can be used with any storage format for the matrix.  In that case,
       each $A_{ii}$ is required to be a dense banded matrix, and the
       submatrices can be of unequal size.  In the case of primary
       storage format, each $A_{ii}$ is required to be a diagonal
       matrix.
\end{itemize}
 
In the line versions of the preconditioners, a common operation is the
solution of banded subsystems corresponding to the $A_{ii}$ diagonal
block submatrices.  NSPCG does not employ a cyclic reduction algorithm
to solve such systems, but will attempt to vectorize the solution if
such a system is actually composed of multiple independent subsystems
of the same size.  In this case, the algorithm used is a scalar
forward and back substitution for each individual subsystem, but
with each operation being done on all the subsystems in a vectorizable
way.

\bigskip
\noindent
{\bf POINT PRECONDITIONERS}
\bigskip
\indent
 
   The point preconditioners available in the package are the
following:
\bigskip
\begin{description}
\item[Richardson method (RF):]
      For this method, $Q = I$, so, in effect, this represents
      the null preconditioner.
 
\item[Jacobi method:]
      For this method, $Q = D$, the diagonal of $A$.  If the
      matrix is known to have positive diagonal elements,
      the Jacobi method may be more efficiently applied by
      requesting diagonal scaling of the matrix and applying
      the RF method to the system.
 
\item[Successive Overrelaxation method (SOR):]
      For this method,
      \[ Q = \oom D-C_L \]
      The SOR iteration parameter $\om$ can be determined by an 
      automatic adaptive procedure.  This method allows no 
      acceleration.
 
\item[Symmetric Successive Overrelaxation method (SSOR):]
      For this method,
      \[ Q = \frac{1}{2-\om}
          \lp \oom D - C_L\rp \lp \oom D \rp^{-1}\lp \oom D - C_U \rp \]
      SSOR iteration parameter $\om$ can be determined by an
      automatic adaptive procedure if SSOR is used with either
      conjugate gradient or Chebyshev acceleration.  Otherwise,
      a user-supplied fixed $\om$ is required.
 
\item[Incomplete $LU$ Decomposition method (ILU(k)):]
      This preconditioner uses an incomplete $LU$ decomposition of
      the matrix as a preconditioner.  The parameter $k$ denotes
      the level of fill-in which is to be allowed in the
      factorization.  The form of $Q$ is:
      \[
         Q = \lc \begin{array}{ll}
          (\Delta - C_L) \Delta^{-1} (\Delta - C_U)  & \mbox{if Case I} \\
          (\Delta - S) \Delta^{-1} (\Delta - T)      & \mbox{if Case II}
      \end{array} \right.
      \]
      Case I occurs when matrix $A$ has Property A (i.e., is 2-cyclic)
      and $k=0$.  Case II occurs if either condition fails.
      Here, $\Delta$ is a diagonal matrix containing the factorization
      pivots, $S$ is a strictly lower triangular matrix, and $T$ is
      a strictly upper triangular matrix.  
 
      It can be seen that if Case I is true, then a considerable
      savings in storage is possible since only a vector of pivots
      of length N has to be stored.  $Q$ can then be implicitly
      represented from just $\Delta$ and the given matrix elements,
      which are already stored.  If Case II is true, then
      it is necessary to store $T$ as well (and $S$, if $A$ is
      nonsymmetric).
 
      A fill-in level of $k=0$ indicates that no fill-in beyond
      the original matrix pattern of nonzeros is to be allowed
      in the factorization.  For $k=1$, fill-in resulting from
      the original nonzero pattern is allowed but no fill-in
      resulting from this newly-created fill-in is allowed.  In
      general, fill-in at level $k$ results from fill-in from levels
      $0,1,2, \ldots, k-1.$  As $k$ grows, the number of
      iterations should decrease but at the expense of increased
      storage and time per iteration.
 
\item[Modified Incomplete $LU$ Decomposition method (MILU(k)):]
      This factorization is similar to the \newline ILU(k) preconditioner
      except that the diagonal pivots of the factorization are
      adjusted so that $Q-A$ has zero row sums.  For many matrices,
      this requirement produces a better condition number for
      $Q^{-1}A$ than for the ILU(k) preconditioner.  Also, this
      requirement forces $Q^{-1}A$ to have at least one eigenvalue
      equal to one.  As in the previous preconditioner, a variable 
      level of fill-in is allowed.
 
\item[Neumann Polynomial method:]
      For this method, a truncated Neumann series approximation
      to $A^{-1}$ is used.  The user can specify the degree of the
      polynomial to be used.  Assuming $A$ is written as $A=D-C$,
      where $D$ is the diagonal of $A$ and $C=C_L+C_U$, 
      then $A=(I-CD^{-1})D$  so
 
\begin{eqnarray*}
        A^{-1} & = & D^{-1}(I-CD^{-1})^{-1} \\
               & = & D^{-1}[I+CD^{-1}+(CD^{-1})^2 + \cdots ]
\end{eqnarray*}
 
      A truncated form of this series is then used for $Q^{-1}$.
      $Q^{-1}$ is not explicitly computed; $Q^{-1}p$ is
      evaluated for a vector $p$ by using a sequence of matrix-vector
      multiplications.  This method will be effective if the spectral 
      radius of $CD^{-1}$ is less than one.
 
\item[Least Squares Polynomial method:]
      For this method, a least squares polynomial is used to
      approximate the inverse of $A$:
      \[ Q^{-1} = p_s(A) \approx A^{-1} \]
      Since it is desired that the spectral radius of the iteration 
      matrix $G=I-Q^{-1}A$ be small, $p_s$ is selected so that the 
      function $f(x)=1-p_s(x)x$ has minimum $2$-norm over a domain 
      which contains the spectrum of $A$.  Note that $G=f(A)$ is the 
      iteration matrix.  This preconditioner is effective only if
      $A$ is SPD (symmetric and positive definite) or nearly so,
      in which case, the domain is $[0,\| A \| _\infty]$.
 
\item[Reduced System method (RS):]
      This method first requires that the system $Au=b$
      be permuted to a red-black system:
      \[
        \lp \begin{array}{cc}
                   D_R & H \\
                   K & D_B   \end{array} \rp
        \lp \begin{array}{c}
                   u_R \\
                   u_B       \end{array} \rp =
        \lp \begin{array}{c}
                   b_R \\
                   b_B       \end{array} \rp
      \]
      where $D_R$ and $D_B$ are diagonal matrices.  This will only
      be possible if matrix $A$ has Property A \cite{Young2}.  The 
      black unknowns can be eliminated from the system to yield the
      reduced system:
      \[
         (D_R - H D_B^{-1} K) u_R =  b_R - H D_B^{-1} b_B
      \]
      which becomes the new matrix problem to be solved.  Once
      $u_R$ has converged to an answer, $u_B$ is found by
      \[
          u_B  = D_B^{-1} ( b_B - K u_R )
      \]
      The reduced system preconditioner is $D_R$.  Note that the
      reduced system is not explicitly computed with this
      method.  However, NSPCG contains a facility for explicitly
      computing the reduced system and applying any of the
      package preconditioners to it.  See Section~\ref{rb}
      for more details.
\end{description}
 
\newpage
\noindent
{\bf LINE PRECONDITIONERS}
\bigskip
\indent
 
   The line preconditioners available in the package are the
following:
\bigskip
\begin{description}
\item[Line Jacobi method:]
      For this method, $Q=\pD$, the block diagonal part 
      of $A$.  For many matrices resulting from finite difference
      discretizations of partial differential equations,
      $\pD$ is a tridiagonal matrix.  The solution to the
      preconditioning equation $\pD z = r$
      is accomplished by a recursive forward and back solve.
      If, however, $\pD$ consists of multiple independent
      subsystems of size KBLSZ, NSPCG will perform each step
      of the factorization and solution process across all
      the subsystems in an independent manner.  This method
      will vectorize on computers allowing constant stride
      vector operations.
 
\item[Line Jacobi (approximate inverse) method:]
      This method uses a banded approximation to the inverse of
      $\pD$.  The solution of $\pD z=r$ is accomplished by
      \[
                  z = \lb \lp \pD\rp^{-1}\rb r
      \]
      where the $[\cdot ]$ operator indicates a truncation of the
      matrix to a banded system.  The variable LTRUNC [= IPARM(17)]
      is used to determine the half-bandwidth to be used
      in the truncation.  If $\pD$ is diagonally dominant, then
      the diagonals of $\lp \pD \rp ^{-1}$ decay at an exponential 
      rate the further the diagonal is away from the main diagonal.
      Hence for diagonally dominant $\pD$, a banded approximation
      to the inverse of $\pD$ will suffice.  Thus, the 
      preconditioning step has been changed from a solve to a 
      matrix-vector multiply, a vectorizable operation.
 
\item[Line SOR (LSOR):]
      For this method, 
      \[  Q=\oom \pD-\pL \]
 
\item[Line SSOR (LSSOR):]
      For this method,
      \[ Q = \frac{1}{2-\om}
          \lp \oom \pD - \pL \rp \lp \oom \pD \rp ^{-1}
          \lp \oom \pD - \pU \rp \]
 
\item[Incomplete Block $LU$ Decomposition, Version 1:]
      For this method, $Q$ takes the form:
      \[ Q = \lc \begin{array}{ll}
          \lp M-\pL \rp M^{-1} \lp M-\pU \rp & \mbox{if Case I} \\
          \lp M-S   \rp M^{-1} \lp M-T   \rp & \mbox{if Case II}
         \end{array} \right.  \]
      Case I occurs when matrix $A$ has block Property A and no
      fill-in of blocks is allowed.  Case II occurs otherwise.
      $M$ here is a block diagonal matrix.  Truncated inverses of
      diagonal pivot block matrices are used in the construction
      of the factorization.  We illustrate the factorization
      process in the case that $A$ is block tridiagonal:
      \[
         A = \lp \begin{array}{ccccc}
           D_1 & U_1 &        &         &         \\
           L_1 & D_2 & U_2    &         &         \\
               & L_2 & \ddots & \ddots  &         \\
               &     & \ddots & D_{n-1} & U_{n-1} \\
               &     &        & L_{n-1} & D_n     \end{array} \rp
      \]
      Then $A$ has block Property A, so Case I is used.  Thus we seek
      a factorization of the form \newline
      \mbox{$\lp I-\pL M^{-1} \rp\lp M-\pU \rp$} or
      \[ \lp \begin{array}{ccccc}
          I           &             &        &                    & \\
          L_1M_1^{-1} & I           &        &                    & \\
                      & L_2M_2^{-1} & I      &                    & \\
                      &             & \ddots & \ddots             & \\   
                      &             &        & L_{n-1}M_{n-1}^{-1}& I
          \end{array} \rp \lp \begin{array}{ccccc}
          M_1 & U_1 &     &        &         \\
              & M_2 & U_2 &        &         \\
              &     & M_3 & \ddots &         \\
              &     &     & \ddots & U_{n-1} \\
              &     &     &        & M_n 
          \end{array} \rp \]
      Thus an exact block factorization satisfying $A=\lp I-\pL M^{-1}\rp
      \lp M-\pU\rp$ results in the following recursion formula for
      $M_i$: 
       \[ M_i = D_i - L_{i-1} M_{i-1}^{-1} U_{i-1}  
           \hspace*{1.0in} (1 \leq i \leq n) \]
      Since $M_i^{-1}$ is a full block matrix, truncation to a
      banded form is used, resulting in an incomplete block 
      factorization.  Thus,
       \[ M_i = D_i - [ L_{i-1} [M_{i-1}^{-1}] U_{i-1} ] 
           \hspace*{1.0in} (1 \leq i \leq n) \]
 
      where $[\cdot ]$ is a truncation operator to indicate truncation
      to a banded form.  LTRUNC [= IPARM(17)] is used to determine the
      half-bandwidth to be used in the truncations.
 
\item[Incomplete Block $LU$ Decomposition, Version 2:]
      For this method, $Q$ takes the form:
      \[
          Q = \lc \begin{array}{ll}
           \lp M^{-1}-\pL \rp M \lp M^{-1}-\pU \rp   & \mbox{if Case I} \\
           \lp M^{-1}-S   \rp M \lp M^{-1}-T   \rp   & \mbox{if Case II}
         \end{array} \right.
      \]
      $M$ is a block diagonal matrix and Cases I and II are defined
      as above.  Truncated inverses of diagonal pivot block matrices 
      are used in the construction of the factorization.  We illustrate 
      the factorization for a block tridiagonal matrix,
      as given above.   Thus, Case I applies and a factorization of
      the form $\lp I-\pL M \rp\lp M^{-1}-\pU \rp$ is used.
      \[ \lp \begin{array}{ccccc}
          I      &        &        &                & \\
          L_1M_1 & I      &        &                & \\
                 & L_2M_2 & I      &                & \\
                 &        & \ddots & \ddots         & \\   
                 &        &        & L_{n-1}M_{n-1} & I
          \end{array} \rp \lp \begin{array}{ccccc}
          M_1^{-1} & U_1      &          &        &         \\
                   & M_2^{-1} & U_2      &        &         \\
                   &          & M_3^{-1} & \ddots &         \\
                   &          &          & \ddots & U_{n-1} \\
                   &          &          &        & M_n^{-1} 
          \end{array} \rp \]
      Thus an exact block factorization satisfying $A=\lp I-\pL M \rp
      \lp M^{-1}-\pU \rp$ results in the following recursion formula for
      $M_i$: 
      \[ M_i = \lp D_i - L_{i-1} M_{i-1} U_{i-1} \rp^{-1}  
           \hspace*{1.0in} (1 \leq i \leq n) \]
      Since $M_i$ is a full block matrix, truncation to a
      banded form is used, resulting in an incomplete block 
      factorization.  Thus,
 
      \[ M_i = [(D_i - [ L_{i-1} M_{i-1} U_{i-1} ])^{-1} ]   
           \hspace*{1.0in} (1 \leq i \leq n) \]
 
       where $[\cdot]$ is the truncation operator.
 
\item[Modified Incomplete Block $LU$ Decomposition, Version 1:]
      For this method, the pivot blocks are modified during the
      factorization so that $Q-A$ has zero row sums.  The general
      form for $Q$ is the same as for the unmodified version.
 
      To force $(Q-A)e=0$ where $e^T=(1,1,\ldots,1)$ (i.e., $Q-A$ has
      zero row sums), we set
      \[ Q=\lp M-\pL \rp M^{-1}\lp M-\pU \rp \]
      where
      \[ M_i=D_i-[L_{i-1}[M_{i-1}^{-1}]U_{i-1}]+\Lambda_i 
           \hspace*{1.0in} (1 \leq i \leq n) \]
      and 
      \[ \Lambda_i = \mbox{diag} \lc [L_{i-1}[M_{i-1}^{-1}]U_{i-1}]e
                  -  L_{i-1}M_{i-1}^{-1}U_{i-1}e \rc \]
 
\item[Modified Incomplete Block $LU$ Decomposition, Version 2:]
      For this method, the pivot blocks are modified during the
      factorization so that $Q-A$ has zero row sums.  The general
      form for $Q$ is the same as for the unmodified version 2.
 
      To force $(Q-A)e = 0$, we set
      \[ Q=\lp M^{-1}-\pL \rp M \lp M^{-1}-\pU \rp \]
      where
      \[ M_i = \tilde{M}_i + N_i \]
      \[ \tilde{M}_i = [( D_i-[L_{i-1} M_{i-1} U_{i-1}] )^{-1} ] 
           \hspace*{1.0in} (1 \leq i \leq n) \]
       and $N_{i}$ is a diagonal matrix determined by the equation
      \[ N_i c_i  = e - \tilde{M}_i c_i \]
      where
      \[ c_i = D_i e - L_{i-1} M_{i-1} U_{i-1} e \]
 
\item[Line Neumann Polynomial method:]
      This is a line polynomial preconditioners which attempts to
      approximate $A^{-1}$ as a polynomial in $\pC\lp\pD\rp^{-1}$ 
      where $\pC=\pL+\pU$ and $A=\pD-\pC$.  Then
      \[ A = \lb I - \pC (\pD)^{-1}\rb \pD \]
      so
 
      \begin{eqnarray*}
        A^{-1} & = & (\pD)^{-1} 
                    \lc I - \pC(\pD)^{-1}\rc^{-1}    \\
          & = & (\pD)^{-1} \lc I + \pC(\pD)^{-1} + 
              \lb \pC(\pD)^{-1}\rb^2  + \cdots \rc
      \end{eqnarray*}
 
      A truncation of this series is then used as the Neumann
      polynomial approximation for the inverse of $A$.
 
\item[Line Least Square Polynomial method:]
      For this method,
      \[ Q^{-1} = p_s \lb(\pD)^{-1}A \rb \approx A^{-1} \]
      for some polynomial of degree $s$.  Since it is desired that 
      the spectral radius of the iteration matrix $G=I-Q^{-1}A$
      be small, $p_s$ is selected so that the function $f(x)=1-p_s(x)x$
      has minimum $2$-norm over a domain which contains the
      spectrum of $(\pD)^{-1}A$.
 
\item[Line Reduced System method:]
      This method requires that matrix $A$ have line Property A
      and be given a line red-black ordering.  Otherwise,
      the details are the same as for point reduced system methods.
      Thus, the system $Au=b$ is permuted to
      \[ \lp \begin{array}{cc}
              \pD_R     & H^{(\pi)} \\
              K^{(\pi)} & \pD_B     \end{array} \rp
         \lp \begin{array}{c}
              u^{(\pi)}_R \\
              u^{(\pi)}_B \end{array} \rp =
         \lp \begin{array}{c}
              b^{(\pi)}_R \\
              b^{(\pi)}_B \end{array} \rp \]
      where $\pD_R$  and $\pD_B$ are block diagonal matrices.
\end{description}
 
\newpage
\section{Brief Background on Accelerators}
\label{bbaccels}
\indent
 
    Each accelerator in the package applies itself to the 
preconditioned system
\[ (Q_L^{-1}AQ_R^{-1})(Q_R u) = (Q_L^{-1}b) \] 
which we will represent abstractly as
\[ \sA \su = \sb \] 

    The package contains two classes of accelerators.  The first class,
comprising the accelerators CG, SI, SRCG, SRSI and SOR, is best suited 
for the symmetrizable case, that is, when the matrices $A$ and
$Q\equiv Q_LQ_R$ are symmetric positive definite (SPD).  These symmetric 
methods allow only left preconditioning, and they are designed to take 
full advantage of the symmetry of $A$ and $Q$.

    The second class comprises those accelerators designed to work
with more general problems than the symmetrizable case.  These
accelerators in many cases allow for right and two-sided as well
as left preconditioning.  They are best understood and will be
discussed here in terms of the system involving the abstract matrix
$\sA$.

    Before giving practical usage considerations for the accelerators, 
we will give a brief description of the solution technique of each 
method.  Each acceleration procedure is basically defined by two 
criteria:
\begin{enumerate}
 \item The solution is to be in some space, typically a shifted 
       Krylov space, and 
 \item The solution is selected from this space by some condition, 
       typically an orthogonality condition.
\end{enumerate}
Here the Krylov space is given by 
\[ K_n(v,A)={\rm span}\{ A^i v\} _{i=0}^{n-1}. \]

    For certain conditions on $A$, $Q_L$ and $Q_R$, the orthogonality 
condition is equivalent to a minimization property over the space.  
That is, $u^{(n)}$ is chosen to minimize the quantity
\[ \la u^{(n)}-\bar u,ZQ^{-1}A(u^{(n)}-\bar u)\ra \]
where $\bar u$ denotes the true solution to the original problem $Au=b$, 
and $\la\cdot ,\cdot \ra$ denotes the standard inner product.  The 
matrix $Z$ depends on the method.

\smallskip
    The following table gives a summary of the two criteria for
each accelerator.  For simplicity, the effects of truncation and
restarting are neglected.

\medskip
\begin{center}
\begin{tabular}{|c|c|c|c|} \hline
Accelerator & Space & Orthogonality Condition & $Z$ \\ \hline
CG, SI & $u^{(n)}\in u^{(0)}+K_n(Q^{-1}r^{(0)},Q^{-1}A)$
       & $r^{(n)}\bot K_n(Q^{-1}r^{(0)},Q^{-1}A)$
       & $Q$ \\ \hline
ME     & $\su^{(n)}\in \su^{(0)}+K_n(\sA\sr^{(0)},\sA)$
       & $\sr^{(n)}\bot K_n(\sr^{(0)},\sA)$ ($\sA$ symm.)
       & $Q_R^TQ_R A^{-1}Q$ \\ \hline
CGNR,LSQR & $\su^{(n)}\in \su^{(0)}+K_n(\sA^T\sr^{(0)}, \sA^T\sA)$
          & $\sA^T\sr^{(n)}\bot K_n(\sA^T\sr^{(0)}, \sA^T\sA)$
          & $A^TQ_L^{-T}Q_R$ \\ \hline
ORES,IOM  & $\su^{(n)}\in \su^{(0)}+K_n(\sr^{(0)},\sA)$
          & $\sr^{(n)}\bot K_n(\sr^{(0)},\sA)$
          & $Q_R^TQ_R$ \\ \hline
ODIR,OMIN,GMRES 
          & $\su^{(n)}\in \su^{(0)}+K_n(\sr^{(0)},\sA)$
          & $\sr^{(n)}\bot \sA K_n(\sr^{(0)},\sA)$
          & $A^TQ_L^{-T}Q_R$ \\ \hline
USYMLQ    & $\su^{(n)}\in \su^{(0)}+\widetilde K_n(\sr^{(0)},\sA^T)$
          & $\sr^{(n)}\bot \widetilde K_n(\sr^{(0)},\sA)$
          & $Q_R^TQ_R$ \\ \hline
USYMQR    & $\su^{(n)}\in \su^{(0)}+\widetilde K_n(\sr^{(0)},\sA^T)$
          & $\sr^{(n)}\bot \sA\widetilde K_n(\sr^{(0)},\sA^T)$
          & $A^TQ_L^{-T}Q_R$ \\ \hline
LANDIR/MIN/RES 
          & $\su^{(n)}\in \su^{(0)}+K_n(\sr^{(0)},\sA)$
          & $\sr^{(n)}\bot K_n(\sr^{(0)},\sA^T)$
          & $Q_R^TQ_R$ \\ \hline
\end{tabular}
\end{center}
\bigskip

\noindent Here the $n$-dimensional quasi-Krylov space is defined by
\[ \widetilde K_n(v,A) = {\rm span} \{ v,Av,AA^Tv,AA^TAv,\ldots \} \]
The residual of the abstract system is given by 
$\sr^{(n)}=\sb-\sA\su^{(n)}$, whereas $r^{(n)}=b-Au^{(n)}$ is the residual
of the original system.

A few comments are in order regarding the nonsymmetric accelerators.  
ORTHODIR and ORTHOMIN always have a minimization property, for any $A$, 
$Q_L$ and $Q_R$.  ORTHORES, however, may not.  In the case of $\sA$ 
symmetric, ORTHODIR and ORTHOMIN reduce to the conjugate residual method 
for $\sA$, and ORTHORES reduces to the 3-term conjugate gradient
method applied to $\sA$.

The Lanczos methods all utilize the same ``auxiliary vector'' 
$\tilde\sr^{(0)} = \sr^{(0)}$.  Thus when $\sA$ is
SPD they all give the same iterates as the conjugate gradient method
applied to $\sA$.

Now we present guidelines for determining which accelerator to use in
practical situations.  If both $A$ and $Q$ are SPD, the
symmetric accelerators listed above may be used.  Otherwise we proceed
as follows.

    We consider four classes of matrix problem arising from the preconditioned
matrix $\sA$:

\begin{enumerate}
  \item $\sA$ is SPD.
  \item $\sA$ is symmetric indefinite.
  \item $\sA$ is definite (i.e., the symmetric part, $(\sA+\sA^T)/2$, 
          or its negative, is SPD.
  \item The general case.
\end{enumerate}

Unfortunately, when $\sA$ is not symmetric (cases 3. and 4. above)
the choice of best accelerator is not at all clear.  However, we will give
some general guidelines below.  
 
\bigskip
\noindent
{\bf The SPD Case:}
\bigskip

    This case often arises from $A$ and $Q$ being SPD, in which case 
the symmetric accelerators such as CG may be used. Otherwise, one may 
use IOM (minimizing the $\sA^{-1/2}$-norm of $\sr^{(n)}$) or ORTHOMIN 
or GMRES (minimizing the $2$-norm of $\sr^{(n)}$).  These should be used
in their truncated forms, with variable NS2 \mbox{[= IPARM(10)]} set to
ITMAX [= IPARM(2)], and NS1 [= IPARM(9)] set to 1 (ORTHOMIN) or 2 
(IOM, GMRES).  
 
\bigskip
\noindent
{\bf The Symmetric Indefinite Case:}
\bigskip

    In the symmetric indefinite case, the methods SYMMLQ and MINRES
may be used.  SYMMLQ may be run by using truncated IOM with NS1$=2$
and NS2=ITMAX.  MINRES may be run by using truncated
GMRES with NS1 and NS2 set the same way.  Of the two, the MINRES 
algorithm is to be preferred, as it minimizes the $2$-norm of the residual
$\sr^{(n)}$ at each iteration.
 
\bigskip
\noindent
{\bf The Definite Case:}
\bigskip

    If it is known that $\sA$ is definite, ORTHOMIN and GMRES
are guaranteed to converge. These methods minimize the $2$-norm of 
the residual $\sr^{(n)}$. The implementation of ORTHOMIN in the 
package runs the algorithm ORTHOMIN(NS1), the truncated method, 
and restarts it every NS2 iterations.  The best settings for NS1 and
NS2 are as follows:

\begin{description} 

\item[{\sl The Mildly Nonsymmetric Case.}]
For a mildly nonsymmetric matrix, NS2 should be set to ITMAX, so that 
the pure truncated method is run.  In this case, NS1$=0$ is the 
steepest descent method, NS1$=1$ is the classical $2$-term conjugate 
residual method, and progressively larger choices of NS1 handle 
nonsymmetry better, but at the expense of greater computation time 
and storage.  A value of NS1 $\geq 1$ is preferred here.

\item[{\sl The Highly Nonsymmetric Case.}]
If the matrix is highly nonsymmetric, NS1 should be set to ITMAX, 
so that the pure restarted method is run.  The variable NS2 denotes 
the restart frequency, and, again, larger values of NS2 perform better 
but require more time and workspace.  If a purely restarted method is 
used (i.e., NS1 $\geq$ NS2), the GMRES accelerator should be preferred 
to ORTHOMIN, as it requires slightly less storage and CPU time and is 
slightly more stable numerically.
\end{description}
 
\bigskip
\noindent
{\bf The General Case:}
\bigskip

For the general nonsymmetric case, the ORTHOMIN and GMRES accelerators
may work but have the potential of breaking down.  The package contains
other methods which are sometimes better.  These include methods
based on the normal equations, Lanczos methods, and the 
Biconjugate Gradient Squared method.  Roughly speaking, if the methods
were listed in terms of increasing reliability (and decreasing overall
efficiency), they would be listed in the order: Lanczos-type methods
(including Biconjugate Gradient Squared), ORTHOMIN-type methods,
and normal equations methods.

\begin{description}

\item[{\sl Normal Equations:}]
For the case of a general matrix, methods based on the normal equations 
corresponding to $\sA$ (i.e., solving $\sA^T\sA \su=\sA^T\sb$) may
be used.  These methods are guaranteed to converge in exact arithmetic 
for any nonsingular system whatsoever; however, in practice convergence 
is usually slow, and roundoff error may preclude convergence altogether.  
The package contains two implementations of conjugate gradients applied
to the normal equations, CGNR and LSQR, the latter of which is more 
stable numerically.  These both minimize the norm of the residual
$\sr^{(n)}$ at each step.

\item[{\sl Lanczos Methods.}]
Another useful accelerator is the Lanczos (Biconjugate
Gradient) method.  This accelerator, making use of the transpose
operator, has a short recurrence, and will converge within N
iterations if the iteration process does not break down.  However, the
iterates are not bounded over all choices of initial residual,
and in fact the method may break down
for nonsymmetric matrices.  In fact, little is known about the
convergence properties of the Lanczos method.  In spite of this,
for some classes of problems the Lanczos method works quite
efficiently.  There are three Lanczos variants in the package,
of which the LANMIN variant is to be preferred.

\item[{\sl The Biconjugate Gradient Squared Method.}]
In many cases, the Biconjugate Gradient Squared (BCGS) \newline method 
may converge faster than LANMIN.  The BCGS method requires two 
matrix multiplies per step and does not require the transpose.  In 
exact arithmetic, it breaks down for roughly the same matrix problems 
as LANMIN does.  Its faster convergence arises from the fact that it 
generates the square of the error-reducing polynomial utilized by 
LANMIN, thus in some sense converging twice as fast as LANMIN.
\end{description}
 
\newpage
\section{Parameter Arrays IPARM and RPARM}
\label{params}
\indent
 
   The user must supply default values for the parameters in IPARM
and RPARM by inserting the line
\begin{verbatim}
       CALL DFAULT (IPARM,RPARM)
\end{verbatim}
in the program before the call to NSPCG.  The user may then assign
nondefault values to selected quantities in IPARM and RPARM by
inserting the appropriate assignment statements before the call to
the iterative routine.
 
   Important variables in this package which may change adaptively
are EMAX and EMIN (eigenvalue estimates of $Q^{-1}A$), OMEGA
(overrelaxation
parameter for the SOR and SSOR methods), ALPHAB and BETAB (SSOR
parameters), and SPECR (estimate of the spectral radius of the SOR
iteration matrix).
 
   The integer vector IPARM and the real vector RPARM allow the user
to control certain parameters which affect the performance of the
iterative algorithms.  Furthermore, these vectors allow the updated
parameters from the automatic adaptive procedures to be communicated
back to the user.  The IPARM and RPARM parameters are described
below.
 
\bigskip
\noindent
{\bf Description of IPARM parameters}
\bigskip
 
\begin{list}{}{\labelwidth 0.9in
               \leftmargin 1.00in \rightmargin 0.25in}
 \item[IPARM(1) \hfill](NTEST):
         The stopping test number, in the range $1$ to $10$, indicates
         which stopping test should be used to terminate the
         iteration.  (See Section~\ref{stop} for a description
         of the available stopping tests.)  The SOR accelerator
         uses a specialized stopping test, so this parameter is
         ignored when SOR is called unless $\mbox{NTEST}=6$, in which
         case the exact stopping test is used.  \mbox{[Default: $2$]}
 
 \item[IPARM(2) \hfill](ITMAX):
         On input, ITMAX is the maximum number of iterations
         allowed. On output, ITMAX is the number of iterations
         performed.  \mbox{[Default: $100$]}
 
\item[IPARM(3) \hfill](LEVEL):
         LEVEL is the output level control switch.  Each higher
         value provides additional information.  \mbox{[Default: $0$]}
 
         \begin{tabular}{ll}
          $< 0$ & no output \\
          $= 0$ & fatal error messages only (default) \\
          $= 1$ & warning messages and minimum output \\
          $= 2$ & reasonable summary \\
          $= 3$ & parameter values and informative comments \\
          $= 4$ & approximate solution after every iteration \\
         \end{tabular}
 
\item[IPARM(4) \hfill](NOUT):
         NOUT is the Fortran unit number for output.  \mbox{[Default: $6$]}
 
\item[IPARM(5) \hfill](IDGTS):
         IDGTS is the error analysis switch.  An analysis of the
         final computed solution is made to determine the
         accuracy.  \mbox{[Default: $0$]}
 
         \begin{tabular}{ll}
          $< 0$ & skip error analysis \\
          $= 0$ & compute DIGIT1 and DIGIT2 and store in RPARM 
                  (default) \\
          $= 1$ & print DIGIT1 and DIGIT2 \\
          $= 2$ & print final approximate solution vector \\
          $= 3$ & print final approximate residual vector \\
          $= 4$ & print both solution and residual vectors
         \end{tabular}
 
         If LEVEL is less than $1$, no printing is done.  See
         explanation of DIGIT1 [= RPARM(7)] and DIGIT2 [= RPARM(8)]
         for more details.
 
\item[IPARM(6) \hfill](MAXADP):
         This parameter is the adaptive procedure switch for EMAX.
         \mbox{[Default: $1$]}
 
         \begin{tabular}{ll}
          $= 0$ & no adapting on EMAX \\
          $= 1$ & adapting on EMAX (default)
         \end{tabular}
 
\item[IPARM(7) \hfill](MINADP):
         This parameter is the adaptive procedure switch for EMIN.
         \mbox{[Default: $1$]}
 
         \begin{tabular}{ll}
          $= 0$ & no adapting on EMIN \\
          $= 1$ & adapting on EMIN (default)
         \end{tabular}
 
\item[IPARM(8) \hfill](IOMGAD):
         This parameter is the adaptive procedure switch for OMEGA.
         \mbox{[Default: $1$]}
 
         \begin{tabular}{ll}
          $= 0$ & no adapting on OMEGA \\
          $= 1$ & adapting on OMEGA (default)
         \end{tabular}
 
\item[IPARM(9) \hfill](NS1):
         NS1 is the number of old vectors to be saved for the
         truncated acceleration methods.  \mbox{[Default: $5$]}
 
\item[IPARM(10) \hfill](NS2):
         NS2 is the frequency of restarting for the restarted
         acceleration methods.  By default, NS2 is set to a large
         value so that restarting is not done.  \mbox{[Default: $100000$]}
 
\item[IPARM(11) \hfill](NS3):
         Used only in ORTHOMIN, NS3 denotes the size of the largest 
         Hessenberg matrix to be used to estimate the eigenvalues; 
         it should be set to some value such as $40$.  \mbox{[Default: $0$]}
 
\item[IPARM(12) \hfill](NSTORE):
         NSTORE indicates the storage mode used. See 
         Section~\ref{storage} for a description of the storage 
         modes.  \mbox{[Default: $2$]}
 
         \begin{tabular}{ll}
          $= 1$ & Primary format \\
          $= 2$ & Symmetric diagonal format (default) \\
          $= 3$ & Nonsymmetric diagonal format \\
          $= 4$ & Symmetric coordinate format \\
          $= 5$ & Nonsymmetric coordinate format
         \end{tabular}
 
\item[IPARM(13) \hfill](ISCALE):
         ISCALE is a switch indicating whether or not the matrix
         should be scaled before iterating and unscaled after
         iterating.  \mbox{[Default: $0$]}
 
         \begin{tabular}{ll}
          $= 0$ & no scaling (default) \\
          $= 1$ & scaling
         \end{tabular}
 
         If scaling is selected, the matrix is scaled to have a unit
         diagonal, and  $u$ and $b$ are scaled accordingly.  If
         NTEST $= 6$, $\bar{u}$ is also scaled.
         If $Au=b$ is the system to be scaled, and $D$ is the diagonal
         of $A$, the scaled system is
         \[
              (D^{-\frac{1}{2}}AD^{-\frac{1}{2}})
                   (D^{\frac{1}{2}}u)=D^{-\frac{1}{2}}b
         \]
         The diagonal elements of the matrix must be positive if
         scaling is requested.  Scaling of the system causes an
         extra N elements of WKSP to be used, so scaling is not
         recommended if storage is a consideration.
 
\item[IPARM(14) \hfill](IPERM):
         IPERM is a switch indicating whether or not the matrix
         should be permuted before iterating and unpermuted after
         iterating.  \mbox{[Default: $0$]}
 
         \begin{tabular}{ll}
          $= 0$ & no permuting (default) \\
          $= 1$ & permuting
         \end{tabular}
 
         If a permutation of the matrix is desired, IPERM must be set
         to $1$, and the vector P of the argument list of NSPCG must
         contain a coloring vector.  See Sections~\ref{rb} and 
         \ref{mc} for more details on
         coloring vectors.  If IPERM $= 0$, the vector P is ignored and
         can be dimensioned to be of length one.  If IPERM $= 1$, a
         permutation vector is generated from P and replaces it, while
         IP contains the inverse permutation vector on output.
         If $Au=b$ is the system to be permuted and P is the
         permutation vector, the permuted system is
         \[
           (PAP^T) (Pu) = Pb
         \]
         If NTEST $= 6$, $\bar{u}$ is permuted in addition to
         $u$ and $b$.
 
\item[IPARM(15) \hfill](IFACT):
         IFACT indicates whether a factorization associated with a
         particular preconditioner should be computed for the
         current NSPCG call or whether a factorization from a
         previous NSPCG call should be used.  For some preconditioners,
         the factorization time can be a significant percentage of
         the iteration time.  If one is making a series of calls to
         NSPCG (as in a time-dependent or nonlinear application) and
         the coefficient matrix is not changing much, it may be
         reasonable to amortize the cost of factorization over several
         NSPCG calls by using a factor from a previous call for the
         current call.  The user must call the same preconditioner
         for all calls, and the structure of $A$ as indicated by JCOEF
         cannot be changing.  \mbox{[Default: $1$]}
 
         \begin{tabular}{ll}
            $1$ & matrix $A$ is to be factored for the current call 
                  (default) \\
            $0$ & matrix $A$ is not to be factored for the current
                  call (previous factorization used)
         \end{tabular}
 
 
\item[IPARM(16) \hfill](LVFILL):
         LVFILL is the level of point or block fill-in to be allowed
         during the factorization.  It affects the preconditioners
 
         \begin{tabular}{ll}
            IC$i$       &   $(i = 1,2,3)$  \\
            MIC$i$      &   $(i = 1,2,3)$  \\
            BIC$i$      &   $(i = 2,3)$    \\
            BICX$i$     &   $(i = 2,3)$    \\
            MBIC$i$     &   $(i = 2,3)$    \\
            MBICX$i$    &   $(i = 2,3)$
         \end{tabular}
 
         If LVFILL $= 0$, no fill-in
         is allowed beyond the original matrix nonzero pattern.  If
         LVFILL $= 1$, fill-in is allowed caused by the original
         nonzero pattern, but no further fill-in caused by these just
         filled-in elements is allowed.  If LVFILL $= 2$, fill-in is
         allowed if it is due to the original pattern or LVFILL $= 1$
         filled-in elements.
 
         As an example of point fill-in, suppose a symmetric
         matrix has diagonals at distances of $0$, $1$, and $s$.
         Then the diagonals of the factorization are:
         \[
         \begin{array}{cl}
          \mbox{LVFILL} & \mbox{Diagonals in factorization} \\
            0 & 0,1,s                           \\
            1 & 0,1,s-1,s                       \\
            2 & 0,1,s-2,s-1,s                   \\
            3 & 0,1,2,s-3,s-2,s-1,s             \\
            4 & 0,1,2,3,s-5,s-4,s-3,s-2,s-1,s
         \end{array}
         \]
         In general, if at LVFILL $= k$, the positive diagonal numbers
         are
         \[
              p_1,p_2, \ldots ,p_m
         \]
         and the negative diagonal numbers are
         \[
              q_1,q_2, \ldots ,q_n
         \]
         then the diagonal numbers at LVFILL $= k+1$ are
         \[
              p_i + q_j  \hspace*{0.5in} (i=1,2, \ldots ,m
                   \mbox{\hspace*{0.1in} and \hspace*{0.1in}}
                             j=1,2, \ldots ,n)
         \]
         In the example above for LVFILL $=0$, then $p_1=1$, 
         $p_2=s$, $q_1=-1$, and $q_2=-s$.  Hence for LVFILL $=1$,
         the diagonal numbers are $0,1,-1,(s-1),-(s-1),s,-s$.
 
         For block factorization methods, LVFILL is the level of
         block fill-in.  For example, if a 7-point finite difference
         stencil is used to discretize a partial differential equation
         on a 3-dimensional box domain, the resulting matrix can be
         regarded as a block pentadiagonal matrix with 1-D blocks
         corresponding to the mesh lines.  The block band is sparse
         and will tend to fill in during a block factorization.  If
         LVFILL is positive, line blocks which are zero in the original
         matrix will be allowed to fill in with a bandwidth equal
         the bandwidth of the diagonal pivot blocks, which in turn
         are determined by LTRUNC.
 
         In general, an increase in LVFILL results in a more accurate
         factorization (and fewer iterations) but at the expense of
         increased storage requirements and possibly more total time.
         \mbox{[Default: $0$]}
 
\item[IPARM(17) \hfill](LTRUNC):
         LTRUNC determines the truncation bandwidth to be used when
         approximating the inverses of matrices with dense banded
         matrices.  It affects the preconditioners LJACX$i$, BIC$i$,
         BICX$i$, MBIC$i$, MBICX$i$ for $i=2,3$.  If the band matrix 
         whose inverse is being approximated has a half-bandwidth of 
         $s$, $s+\mbox{LTRUNC}$ will be the half-bandwidth
         of the approximating inverse.  Thus, LTRUNC is the increase
         in bandwidth to be used for the inverse approximation over
         the bandwidth of the original matrix.  In general, an
         increase in LTRUNC means a more accurate factorization at
         the expense of increased storage.  \mbox{[Default: $0$]}
 
\item[IPARM(18) \hfill](IPROPA):
         IPROPA is a flag to indicate whether or not matrix $A$ has
         Property A.  If a matrix has Property A, it is two-cyclic
         and can be permuted to a red-black matrix.  Also, a
         considerable savings in storage is possible if a factorization
         preconditioner is called.  IPROPA affects the following
         preconditioners:
 
         \begin{tabular}{ll}
            IC$i$    & $(i=1,2,3,6)$ \\
            MIC$i$   & $(i=1,2,3,6)$ \\
            BIC$i$   & $(i=2,3,7)$ \\
            BICX$i$  & $(i=2,3,7)$ \\
            MBIC$i$  & $(i=2,3,7)$ \\
            MBICX$i$ & $(i=2,3,7)$
         \end{tabular}
 
         For the first two preconditioners, IPROPA refers to point
         Property A.  For the last four preconditioners, IPROPA
         refers to block Property A (i.e., whether the matrix
         considered as a block matrix has Property A).  IPROPA
         can assume the following values on input:
 
         \begin{tabular}{lp{3.5in}}
           $= 0$ & if matrix $A$ does not have Property A \\
           $= 1$ & if matrix $A$ has Property A           \\
           $= 2$ & if it is not known whether or not matrix $A$ has \\
                 & \hspace{0.1in} Property A; compute if needed (default)
         \end{tabular}
 
         If IPROPA $= 2$ and LVFILL $= 0$ on input, and one of the
         six methods above is used, it is determined whether or
         not the matrix has property A, and IPROPA is reset to
         $0$ or $1$ accordingly.
 
         Determining if matrix $A$ has Property A requires $2$N
         workspace from IWKSP, so it is advantageous for the
         user to inform NSPCG if it is known beforehand whether
         or not the matrix has Property A.  In general, finite
         element matrices do not have point Property A.  If a
         $5$-point central finite difference stencil is used on
         a two-dimensional self-adjoint PDE, or if a $7$-point
         central finite difference stencil is used on a three-
         dimensional self-adjoint PDE, the resulting matrix has
         Property A.  \mbox{[Default: $2$]}
 
\item[IPARM(19) \hfill](KBLSZ):
         KBLSZ is the $1$-D block size.  It is used in the
         line preconditioners.  KBLSZ is the largest integer such
         that, if matrix $A$ is considered as a block matrix,
         the diagonal blocks have dense bands.  \mbox{[Default: $-1$]}
 
\item[IPARM(20) \hfill](NBL2D):
         NBL2D is the $2$-D block size.  It is used only for the
         CGCR acceleration, which is applied to $3$-D problems on
         a box domain.  \mbox{[Default: $-1$]}
 
\item[IPARM(21) \hfill](IFCTV):
         IFCTV is a switch for indicating whether a scalar or a
         vectorized routine is to be used for the incomplete
         factorization of a matrix stored in symmetric or
         nonsymmetric diagonal storage mode.  The vectorized
         routine should perform better for matrix factorization
         patterns which have Property A.  \mbox{[Default: $1$]}
 
         \begin{tabular}{ll}
           $0$ & use scalar routine \\
           $1$ & use vectorized routine (default) \\
         \end{tabular}
 
\item[IPARM(22) \hfill](IQLR):
         IQLR specifies the orientation of the basic preconditioner.
         The value of IQLR can be in the range $0$ to $3$.
         \mbox{[Default: $1$]}
 
         \begin{tabular}{ll}
            $0$ & no basic preconditioner \\
            $1$ & left preconditioning (default) \\
            $2$ & right preconditioning   \\
            $3$ & split preconditioning
         \end{tabular}
 
\item[IPARM(23) \hfill](ISYMM):
         ISYMM is a symmetry switch for the matrix.  It is used
         only for the primary format.  If the matrix is symmetric,
         a considerable savings in storage is possible if a
         factorization preconditioner is called.  ISYMM can assume
         the following values on input:
 
         \begin{tabular}{lp{3.5in}}
           $0$ & matrix is symmetric \\
           $1$ & matrix is nonsymmetric \\
           $2$ & it is unknown if the matrix is symmetric; NSPCG
                 should determine if the matrix is symmetric or not
                 (default)
         \end{tabular}
 
         If ISYMM $= 2$ and NSTORE $= 1$ on input, ISYMM is set to
         $0$ or $1$ on output.  \mbox{[Default: $2$]}
 
\item[IPARM(24) \hfill](IELIM):
         IELIM is a switch for effectively removing rows and columns
         when the diagonal entry is extremely large compared to the
         nonzero off-diagonal entries in that row.  See the description
         for TOL [= RPARM(15)] for additional details.
         \mbox{[Default: $0$]}
 
         \begin{tabular}{ll}
           $0$ & test not done \\
           $1$ & test done for removal of rows and columns
         \end{tabular}
 
\item[IPARM(25) \hfill](NDEG):
         NDEG specifies the degree of the polynomial preconditioner.
         \mbox{[Default: $1$]}

\end{list}
 
\newpage
\bigskip
\noindent
{\bf Description of RPARM parameters}
\bigskip
 
\begin{list}{}{\labelwidth 0.9in
               \leftmargin 1.00in \rightmargin 0.25in}
\item[RPARM(1) \hfill](ZETA):
         ZETA is the stopping test value or approximate relative
         accuracy desired in the final computer solution.  Iteration
         terminates when the stopping test is less than ZETA.  If
         the method does not converge in ITMAX iterations, ZETA is
         reset to an estimate of the relative accuracy achieved.
         \mbox{[Default:  $10^{-6}$]}
 
\item[RPARM(2) \hfill](EMAX):
         EMAX is an eigenvalue estimate of the preconditioned matrix
         $Q^{-1}A$.  In the SPD case, EMAX is an estimate of the largest 
         eigenvalue of $Q^{-1}A$.  In the nonsymmetric case, EMAX is an 
         estimate of the $2$-norm of $Q^{-1}A$.  EMAX contains on output 
         a final adapted value if MAXADP~$= 1$ and the acceleration allows 
         estimation of eigenvalues. \mbox{[Default: $2.0$]}
 
\item[RPARM(3) \hfill](EMIN):
         EMIN is an eigenvalue estimate of the preconditioned matrix
         $Q^{-1}A$.  In the SPD case, EMIN is an estimate of the smallest
         eigenvalue of $Q^{-1}A$.  In the nonsymmetric case, EMIN is an 
         estimate of the $2$-norm of the inverse of $Q^{-1}A$. 
         EMIN contains on output a final adapted value if 
         MINADP~$= 1$ and the acceleration allows estimation of 
         eigenvalues. \mbox{[Default: $1.0$]}
 
\item[RPARM(4) \hfill](FF):
         FF is an adaptive procedure damping factor for the estimation
         of OMEGA for the SSOR methods.  Its values lie in the interval
         $(0,1]$ with $1.0$ causing the most frequent parameter
         changes when IOMGAD $= 1$ is specified. \mbox{[Default: $0.75$]}
 
\item[RPARM(5) \hfill](FFF):
         FFF is an adaptive procedure damping factor for changing
         EMAX and EMIN in the Chebyshev accelerations (SI and SRSI).
         Its values lie in the interval $(0,1]$ with $1.0$ causing the
         most frequent parameter changes when MAXADP~$= 1$ and
         MINADP~$= 1$ are specified. \mbox{[Default: $0.75$]}
 
\item[RPARM(6) \hfill](TIMIT):
         TIMIT on output is the iteration time in seconds of the
         NSPCG call.  The iteration time includes the time to perform
         all the iterations, including the time to perform the
         stopping test.  \mbox{[Default: $0.0$]}
 
\item[RPARM(7) \hfill](DIGIT1):
         DIGIT1 is one measure of the approximate number of digits
         of accuracy of the solution.  DIGIT1 is computed as the
         negative of the logarithm base $10$ of the final value
         of the stopping test. \mbox{[Default: $0.0$]}
 
\item[RPARM(8) \hfill](DIGIT2):
         DIGIT2 is the approximate number of digits of accuracy
         using the estimated relative residual with the final
         approximate solution.  DIGIT2 is computed as the negative
         of the logarithm base $10$ of the ratio of the $2$-norm
         of the residual vector and the $2$-norm of the right-hand-side
         vector.  This estimate is unrelated to the condition
         number of the original system and therefore it will not
         be accurate if the system is ill-conditioned. 
         \mbox{[Default: $0.0$]}
 
         Note:  DIGIT1 is determined from the actual stopping test
         computed on the final iteration, whereas DIGIT2 is based
         on the computed residual vector using the final approximate
         solution after the algorithm has terminated.  If these
         values differ greatly, then either the stopping test has
         not worked successfully or the original system is
         ill-conditioned.
 
\item[RPARM(9) \hfill](OMEGA):
         OMEGA serves two purposes:
         \begin{enumerate}
          \item
          It is the overrelaxation parameter $\om$ for the SOR
          and SSOR methods.  OMEGA contains on output a final
          adapted value if IOMGAD $= 1$ is specified and the
          acceleration allows estimation of $\om$ (SOR, SRCG, and
          SRSI only).  Otherwise, OMEGA is not changed and the
          fixed value of OMEGA is used throughout the iterations.
          \item
          It can be used in the modified incomplete factorization
          methods (both point and block) to specify a degree of
          modification.  In the unmodified incomplete
          factorization method, a factor element is discarded
          if it results in fill-in outside the prespecified
          fill-in region (determined by LVFILL).  In the
          modified incomplete factorization method, that factor
          element is added to the diagonal element of the row
          in which it would have caused the fill-in.  OMEGA
          can be used to indicate that $\om *$(factor element)
          should be added to the diagonal element instead, where
          $\om$ lies in the interval $[0,1]$.  Thus, $\om=1$
          corresponds
          to full modification and $\om=0$ corresponds to no
          modification.  This facility is useful, for example,
          when the IC (ILU) factorization of a matrix exists,
          but the MIC (MILU) factorization does not.  Then a
          value of $\om$ between $0$ and $1$ can be chosen
          with one
          of the preconditioners MIC, MBIC, or MBICX to get
          a stronger factorization than IC, BIC, or BICX,
          respectively.
         \end{enumerate}
         \mbox{[Default: $1.0$]}
 
\item[RPARM(10) \hfill](ALPHAB):
         ALPHAB is an estimate of the minimum eigenvalue of
         $-D^{-1}(C_L+C_U)$ where $A=D-C_L-C_U$ for the
         SSOR methods.  ALPHAB contains on output a final estimated
         value if IOMGAD $= 1$ is specified and the acceleration allows
         estimation of ALPHAB  (SRCG and SRSI only).  ALPHAB only
         affects the SSOR and LSSOR preconditioners, and is used in
         the adaptive procedure for $\om$. \mbox{[Default: $0.0$]}
 
\item[RPARM(11) \hfill](BETAB):
         BETAB is an estimate of the maximum eigenvalue of
         $D^{-1}C_LD^{-1}C_U$ where $A=D-C_L-C_U$ for the SSOR
         methods.  BETAB contains on output a final estimated value
         if IOMGAD $= 1$ is specified and the acceleration allows
         estimation of BETAB  (SRCG and SRSI only).  BETAB only
         affects the SSOR and LSSOR preconditioners, and is used in
         the $\om$ adaptive procedure.  \mbox{[Default: $0.25$]}
 
\item[RPARM(12) \hfill](SPECR):
         SPECR is an estimate of the spectral radius of the SOR
         iteration matrix.  SPECR contains on output a final
         estimated value if IOMGAD $= 1$ is specified and the
         acceleration allows estimation of SPECR (SOR only).
         SPECR only affects the SOR and LSOR preconditioners, and
         is used in the $\om$ adaptive procedure. \mbox{[Default: $0.0$]}
 
\item[RPARM(13) \hfill](TIMFAC):
         TIMFAC on output is the factorization time in seconds
         required in the NSPCG call. \mbox{[Default: $0.0$]}
 
\item[RPARM(14) \hfill](TIMTOT):
         TIMTOT on output is the total time in seconds for the
         NSPCG call.  TIMTOT $=$ TIMFAC $+$ TIMIT $+$ other where
         ``other" includes scaling and permuting, if requested.
         \mbox{[Default: $0.0$]}
 
\item[RPARM(15) \hfill](TOL):
         TOL is a tolerance factor used for eliminating certain
         equations when IELIM $= 1$ is selected.  In that case,
         rows are eliminated for which the ratio of the sum of
         the absolute values of the off-diagonal elements to the
         absolute value of the diagonal element is small (less
         than TOL).  This is done by dividing the right-hand-side
         entry for that equation by the diagonal entry, setting
         the diagonal entry equal to one, and setting the
         off-diagonal entries of that row to zero.  The off-diagonal
         entries of the corresponding column are also set to zero
         after correcting the right-hand-side vector.  This
         procedure is useful for linear systems arising from
         finite element discretizations of partial differential
         equations in which Dirichlet boundary conditions are
         handled by penalty methods (giving the diagonal values
         of the corresponding equations extremely large values).
         (The installer of this package should set the value of
         SRELPR.  See comments in the Installation Guide in
         Section~\ref{install} for additional details.)
         \mbox{[Default: $500*$SRELPR]}
 
\item[RPARM(16) \hfill](AINF):
         AINF is the infinity norm of the matrix $A$ if the LSP
         preconditioner is used and the infinity norm of 
         $\lp\pD\rp^{-1}A$ if the LLSP preconditioner is used.  
         These preconditioners are only effective if the matrix 
         is SPD or nearly so.  If the user does not overwrite 
         the default value, zero, the program attempts to calculate 
         a value for this quantity. \mbox{[Default: $0.0$]}
 
\end{list}

\newpage
\begin{table}
   The default values for the IPARM and RPARM variables are
 given in the tables below.
\begin{center}
\begin{tabular}{|l|l|c|}                    \hline
   Position   &   Name   &   Default \\     \hline
   IPARM(1)   &   NTEST  &     $2$   \\     \hline
   IPARM(2)   &   ITMAX  &    $100$  \\     \hline
   IPARM(3)   &   LEVEL  &     $0$   \\     \hline
   IPARM(4)   &   NOUT   &     $6$   \\     \hline
   IPARM(5)   &   IDGTS  &     $0$   \\     \hline
   IPARM(6)   &   MAXADP &     $1$   \\     \hline
   IPARM(7)   &   MINADP &     $1$   \\     \hline
   IPARM(8)   &   IOMGAD &     $1$   \\     \hline
   IPARM(9)   &   NS1    &     $5$   \\     \hline
   IPARM(10)  &   NS2    &  $100000$ \\     \hline
   IPARM(11)  &   NS3    &     $0$   \\     \hline
   IPARM(12)  &   NSTORE &     $2$   \\     \hline
   IPARM(13)  &   ISCALE &     $0$   \\     \hline
   IPARM(14)  &   IPERM  &     $0$   \\     \hline
   IPARM(15)  &   IFACT  &     $1$   \\     \hline
   IPARM(16)  &   LVFILL &     $0$   \\     \hline
   IPARM(17)  &   LTRUNC &     $0$   \\     \hline
   IPARM(18)  &   IPROPA &     $2$   \\     \hline
   IPARM(19)  &   KBLSZ  &    $-1$   \\     \hline
   IPARM(20)  &   NBL2D  &    $-1$   \\     \hline
   IPARM(21)  &   IFCTV  &     $1$   \\     \hline
   IPARM(22)  &   IQLR   &     $1$   \\     \hline
   IPARM(23)  &   ISYMM  &     $2$   \\     \hline
   IPARM(24)  &   IELIM  &     $0$   \\     \hline
   IPARM(25)  &   NDEG   &     $1$   \\     \hline
\end{tabular}
\caption{Default Values for IPARM Variables}
\end{center}
\end{table}
\begin{table}
\begin{center}
\begin{tabular}{|l|l|c|}                       \hline
   Position   &   Name   &   Default     \\    \hline
   RPARM(1)   &   ZETA   &   $10^{-6}$   \\    \hline
   RPARM(2)   &   EMAX   &    $2.0$      \\    \hline
   RPARM(3)   &   EMIN   &    $1.0$      \\    \hline
   RPARM(4)   &   FF     &    $0.75$     \\    \hline
   RPARM(5)   &   FFF    &    $0.75$     \\    \hline
   RPARM(6)   &   TIMIT  &    $0.0$      \\    \hline
   RPARM(7)   &   DIGIT1 &    $0.0$      \\    \hline
   RPARM(8)   &   DIGIT2 &    $0.0$      \\    \hline
   RPARM(9)   &   OMEGA  &    $1.0$      \\    \hline
   RPARM(10)  &   ALPHAB &    $0.0$      \\    \hline
   RPARM(11)  &   BETAB  &    $0.25$     \\    \hline
   RPARM(12)  &   SPECR  &    $0.0$      \\    \hline
   RPARM(13)  &   TIMFAC &    $0.0$      \\    \hline
   RPARM(14)  &   TIMTOT &    $0.0$      \\    \hline
   RPARM(15)  &   TOL    &  $500*\mbox{SRELPR}$  \\    \hline
   RPARM(16)  &   AINF   &    $0.0$      \\    \hline
\end{tabular}
\caption{Default Values for RPARM Variables}
\end{center}
\end{table}
 
\clearpage
\section{Storage Modes}
\label{storage}
\indent
 
Many factors must be taken into account when choosing the operator 
representation for $A$ in an iterative solution method.  An iterative 
solver is typically one of many components in an application, and 
the application may suggest an appropriate iterative solution method 
and representation of the operator.  For example, in many finite 
element applications, it is convenient to represent the global stiffness 
matrix $A$ in an element-based data structure rather than
assembling it explicitly.  A suitable iterative method in that case
might be an acceleration without preconditioning since it is then
only necessary to compute matrix-vector products which can be computed
using the element stiffness matrices in an element-by-element approach.  
Also, resource limitations of the computer may affect the method of 
representing the operator $A$.  Storage demands may require that 
the matrix be regenerated for every iteration or that the effect 
of an application of the operator $A$ on a vector be represented 
implicitly.  For details in using NSPCG in matrix format-free
mode, see Section~\ref{mffm}. 

The properties of the operator may also have a strong bearing 
on the choice of operator representation. The most efficient 
iterative solution codes exploit knowledge of any special 
characteristics of the matrix such as nonzero pattern, symmetry 
in structure or coefficients, and the presence or absence of constant
coefficients.  For purposes of computational and storage
efficiency, a sparse matrix data structure should normally 
be chosen which matches the sparsity pattern of the operator 
$A$.  In particular, the vectorization and parallelization of
preconditioners depends strongly upon the exploitation of
the sparsity pattern.  If, on the other hand, a sparse 
matrix data structure is not compatible with the matrix sparsity 
pattern, the potential of storing and doing computations 
with too many zeros arises.  Symmetry in matrix structure or 
coefficients may allow economies in storage costs.  Finally, 
if the operator has constant coefficients, it would be wasteful 
in a production code to use a data structure which stores all 
the nonzero coefficients of the matrix. 

   NSPCG currently allows several storage modes for representing
the coefficient matrix.  A short description of each follows.
In considering these storage schemes, the user should select
the scheme most convenient for the application.  If the matrix has
a diagonal or block structure, then storage by diagonals can be a
very efficient way to represent the matrix.  Also, block
preconditioners can be selected with this storage mode, many of which
have excellent convergence properties.  In addition, many
computations vectorize with this storage mode, making the package
more efficient on pipelined computers.  On the other hand, if the
matrix is unstructured, then storage by diagonals will result in a
great many zeros being stored and computed upon.  In this case,
the primary storage format may be more desirable.  The primary
format will be efficient if there is a relatively constant number
of nonzeros per equation.  If one equation has many more nonzeros
than the rest, then MAXNZ will have to be large enough to
accommodate it, resulting in many zeros being stored for the other
equations.  If this is the case, coordinate storage format can be
used.
 
 
\begin{description}
\newpage
\item[Primary  Format]:
  In this storage format, there are two rectangular arrays COEF and
  JCOEF,
  one real and one integer, which are used to store a representation
  of the coefficient matrix.  Both arrays are dimensioned NDIM by
  MDIM, where NDIM is at least as large as N, the size of the system,
  and MDIM is at least as large as MAXNZ, the maximum number of
  nonzeros per row in the matrix, with the maximum being taken over
  all rows.  Each row in COEF will contain the nonzeros of the
  corresponding row in the full matrix $A$, and the corresponding
  row in JCOEF will contain its respective column numbers.
 
     For example, the matrix
  \[
     A = \lp \begin{array}{rrrrr}
                 11 &  0 &  0 & 14 & 15 \\
                  0 & 22 &  0 &  0 &  0 \\
                  0 &  0 & 33 &  0 &  0 \\
                 14 &  0 &  0 & 44 & 45 \\
                 15 &  0 &  0 & 45 & 55
        \end{array} \rp
  \]
  would be represented in the COEF and JCOEF arrays as
  \[
     \mbox{COEF} = \lp \begin{array}{rrr}
            11 & 14 & 15 \\
            22 &  0 &  0 \\
            33 &  0 &  0 \\
            44 & 14 & 45 \\
            55 & 15 & 45
      \end{array} \rp
  \]
  \[
     \mbox{JCOEF} = \lp \begin{array}{rrr}
             1 & 4 & 5 \\
             2 & 0 & 0 \\
             3 & 0 & 0 \\
             4 & 1 & 5 \\
             5 & 1 & 4
      \end{array} \rp
  \]
 
  There are several remarks which should be made ---
  \begin{enumerate}
  \item
       If a row in matrix $A$ has fewer than MAXNZ nonzeros,
       the corresponding rows in COEF and JCOEF should be
       padded with zeros.
  \item
       The nonzero entries in a given row of COEF may appear
       in any order.  However, if the diagonal element is not
       in column one, then NSPCG will place it there without
       returning it to its original position on exiting.
  \item
       Several splittings will attempt to rearrange COEF and
       JCOEF in order to make the preconditioning solution
       process more efficient and vectorizable.  They may
       even attempt to expand the matrix within the limit
       set by MDIM, increasing MAXNZ in the process.  However,
       the matrix returned in COEF and JCOEF at output will be
       equivalent to the one at input to within roundoff
       error.  Slight roundoff error in the coefficient
       arrays and RHS will result if the user requests that
       scaling of the matrix be done.
  \item
       For this data format, all nonzero matrix entries must be
       present even if the matrix is symmetric.  It does not
       suffice to store just the upper or lower triangle of
       the matrix $A$.
  \end{enumerate}
 
\newpage
\item[Symmetric Diagonal Format]:
  This storage format uses a real rectangular array and an
  integer vector to store a representation of the matrix.  COEF
  is dimensioned NDIM by MDIM and JCOEF is dimensioned to length
  MDIM.  NDIM is at least as large as N, the size of the linear
  system, and similarly MDIM is at least as large as MAXNZ, the
  number of diagonals to be stored.  Each column of COEF contains
  a diagonal of $A$ and JCOEF contains a nonnegative integer for
  each diagonal indicating its distance from the main diagonal.
  For example, the main diagonal has a distance of $0$, the first
  superdiagonal has a distance of $1$, etc.  This storage format may
  be used only if matrix $A$ is symmetric.  Only the main diagonal
  and nonzero diagonals appearing in the upper triangular part of
  $A$ are stored.
 
  For example, the matrix
  \[
     A = \lp \begin{array}{rrrrr}
                 11 & 12 &  0 & 14 &  0 \\
                 12 & 22 & 23 &  0 & 25 \\
                  0 & 23 & 33 & 34 &  0 \\
                 14 &  0 & 34 & 44 & 45 \\
                  0 & 25 &  0 & 45 & 55
        \end{array} \rp
  \]
  would be represented in the COEF and JCOEF arrays as
  \[
     \mbox{COEF} = \lp \begin{array}{rrr}
            11 & 12 & 14 \\
            22 & 23 & 25 \\
            33 & 34 &  0 \\
            44 & 45 &  0 \\
            55 &  0 &  0
      \end{array} \rp
  \]
  \[
     \mbox{JCOEF} = \lp \begin{array}{r}
             0 \\
             1 \\
             3
      \end{array} \rp
  \]
 
  There are several remarks which should be made ---
  \begin{enumerate}
  \item
       Element $a_{ij}$ of the matrix always appears in row $i$
       of the COEF array.  Short diagonals must be padded with
       zeros at the bottom.  In other words, the diagonals are
       top-justified.
  \item
       The diagonals of $A$ may be stored in any order.  However,
       if the main diagonal of $A$ does not appear in column one
       of COEF, then NSPCG will place it there without returning
       it to its original position on exiting.
  \item
       As with the previous format, many preconditioners will
       attempt to rearrange COEF and JCOEF and possibly expand
       COEF within the limit set by MDIM, changing MAXNZ in the
       process.
  \end{enumerate}
 
\newpage
\item[Nonsymmetric Diagonal Format]:
  This storage format is similar to the one described above
  except that all nonzero diagonals must be stored.  If a diagonal
  of $A$ is below the main diagonal, it's corresponding entry in JCOEF
  is negative.  Thus, the first sub-diagonal has a JCOEF entry of
  $-1$, the second sub-diagonal has a JCOEF entry of $-2$, etc.  This
  storage format is intended for nonsymmetric diagonally structured
  matrices.
 
  For example, the matrix
  \[
     A = \lp \begin{array}{rrrrr}
                 11 & 10 &  0 & 14 &  0 \\
                 12 & 22 & 21 &  0 & 25 \\
                  0 & 23 & 33 & 32 &  0 \\
                 30 &  0 & 34 & 44 & 43 \\
                  0 & 25 &  0 & 45 & 55
        \end{array} \rp
  \]
  would be represented in the COEF and JCOEF arrays as
  \[
     \mbox{COEF} = \lp \begin{array}{rrrrr}
            11 & 14 & 10 &  0 &  0 \\
            22 & 25 & 21 & 12 &  0 \\
            33 &  0 & 32 & 23 &  0 \\
            44 &  0 & 43 & 34 & 30 \\
            55 &  0 &  0 & 45 & 25
      \end{array} \rp
  \]
  \[
     \mbox{JCOEF} = \lp \begin{array}{r}
             0 \\
             3 \\
             1 \\
            -1 \\
            -3
      \end{array} \rp
  \]
 
  There are several remarks which should be made ---
  \begin{enumerate}
  \item
       Element $a_{ij}$ of the matrix always appears in row $i$ of
       the COEF array.  Short diagonals should be padded at the
       bottom with zeros if they are superdiagonals and should
       be padded at the top with zeros if they are subdiagonals.
       Thus, superdiagonals are top-justified and subdiagonals
       are bottom-justified.
  \item
       The diagonals of $A$ may be stored in COEF in any order.
       However, if the main diagonal of $A$ does not appear in
       column one of COEF, then NSPCG will place it there
       without returning it to its original position on exiting.
  \item
       As with the previous format, many preconditioners may
       attempt to rearrange COEF and JCOEF and possibly expand
       COEF within the limit set by MDIM, changing MAXNZ in the
       process.
  \end{enumerate}
 
\newpage
\item[Symmetric Coordinate Format]:
  This storage format uses a real vector and an integer array
  to store a representation of the matrix.  COEF is dimensioned
  to length NDIM and JCOEF is dimensioned NDIM by $2$.
  COEF contains the nonzero elements of the matrix and JCOEF
  contains the corresponding $i$ values in column $1$ and the
  corresponding $j$ values in column $2$.  Thus if
  COEF(k) $= a_{i,j}$, then JCOEF(k,1) $=i$ and JCOEF(k,2) $=j$.
  NDIM is at least as large as MAXNZ, the number of nonzeros
  stored, and MDIM can be set to anything.
  This storage format may be used only if matrix $A$ is symmetric.
  Only the main diagonal and nonzero coefficients appearing in the
  upper triangular part of $A$ are stored.
 
  For example, the matrix
  \[
     A = \lp \begin{array}{rrrrr}
                 11 & 12 &  0 & 14 &  0 \\
                 12 & 22 & 23 &  0 & 25 \\
                  0 & 23 & 33 & 34 &  0 \\
                 14 &  0 & 34 & 44 & 45 \\
                  0 & 25 &  0 & 45 & 55
        \end{array} \rp
  \]
  would be represented in the COEF and JCOEF arrays as
  \[
     \mbox{COEF} = (11,22,33,44,55,12,23,34,45,14,25)
  \]
  \[
     \mbox{JCOEF} = \lp \begin{array}{cc}
             1 & 1 \\
             2 & 2 \\
             3 & 3 \\
             4 & 4 \\
             5 & 5 \\
             1 & 2 \\
             2 & 3 \\
             3 & 4 \\
             4 & 5 \\
             1 & 4 \\
             2 & 5
      \end{array} \rp
  \]
 
  For this example, N $=5$, MAXNZ $=11$, NDIM $=11$, and MDIM can
  be set to anything.
 
  There are several remarks which should be made ---
  \begin{enumerate}
  \item
       All elements of the main diagonal must be stored, even if
       some are zeros.
 
  \item
       Duplicate entries (entries having the same $i$ and $j$
       coordinates) are allowed.  NSPCG removes duplicates by
       adding their corresponding coefficient values.  Thus,
       MAXNZ may be reduced upon output.
 
  \item
       The nonzeros of $A$ may be stored in any order.  However,
       NSPCG will rearrange the elements of the COEF and JCOEF
       arrays into a convenient order without returning
       it to its original position on exiting.  In particular,
       the diagonal elements of the matrix will be placed into
       the first N locations of COEF.
  \end{enumerate}
 
\newpage
\item[Nonsymmetric Coordinate Format]:
  This storage format is similar to the one described above
  except that all nonzero coefficients must be stored.
  This storage format is intended for nonsymmetric matrices.
 
  For example, the matrix
  \[
     A = \lp \begin{array}{rrrrr}
                 11 & 10 &  0 & 14 &  0 \\
                 12 & 22 & 21 &  0 & 25 \\
                  0 & 23 & 33 & 32 &  0 \\
                 30 &  0 & 34 & 44 & 43 \\
                  0 & 25 &  0 & 45 & 55
        \end{array} \rp
  \]
  would be represented in the COEF and JCOEF arrays as
  \[
     \mbox{COEF} = (11,22,33,44,55,14,25,10,21, \\
                    32,43,12,23,34,45,30,25)
  \]
  \[
     \mbox{JCOEF} = \lp \begin{array}{cc}
             1 & 1 \\
             2 & 2 \\
             3 & 3 \\
             4 & 4 \\
             5 & 5 \\
             1 & 4 \\
             2 & 5 \\
             1 & 2 \\
             2 & 3 \\
             3 & 4 \\
             4 & 5 \\
             2 & 1 \\
             3 & 2 \\
             4 & 3 \\
             5 & 4 \\
             4 & 1 \\
             5 & 2
      \end{array} \rp
  \]
 
  For this example, N $=5$, MAXNZ $=17$, NDIM $=17$, and MDIM can
  be set to anything.
 
  The remarks regarding symmetric coordinate storage format apply
  for the nonsymmetric format also.
 
\end{description}
 
\newpage
\section{Workspace Requirements}
\label{nw}
\indent

  In this section, estimated maximum requirements in NSPCG for real 
and integer workspace are given.  These values should be used initially
for the NW and INW parameters in the NSPCG calling sequence, and 
parameters WKSP and IWKSP should be dimensioned to at least NW and
INW, respectively.  If insufficient real or integer workspace is
provided, a fatal error occurs with IER set to $-2$ or $-3$ and NW or
INW set to the amount of workspace needed at the point execution
terminated.  Thus, an easy way to determine storage requirements is
to run the package repeatedly using the values of NW and INW suggested
by NSPCG from the previous run until enough workspace is supplied.  
On the other hand, if sufficient real and integer workspace
is provided, NW and INW are set on output to the amount of real and
integer workspace actually required.  Thus, workspace requirements can
be reduced to these levels when rerunning the same problem.

  Workspace requirements can often be reduced if the user chooses
certain IPARM quantities carefully.  For the symmetric accelerators,
the choice of stopping test, as determined by NTEST [= IPARM(1)],
has no effect on workspace requirements.  For the nonsymmetric 
accelerators, some stopping tests can increase storage demands by
up to 3N.  Thus a stopping test should be chosen which is efficient 
in time and memory.  See Section~\ref{stop} for information regarding 
the selection of stopping tests in the nonsymmetric case.   
Also,
2N integer workspace can be saved if the user informs NSPCG whether
or not matrix $A$ has Property A if this property affects a
preconditioner.  This is done by setting IPROPA [= IPARM(18)]
appropriately.

   In the following formulas, let

\bigskip
\begin{tabular}{ll}
N      & be the problem size \\
MAXNZ  & be the active column width of the COEF array \\
ITMAX  & be IPARM(2) \\
NS1    & be IPARM(9) \\
NS2    & be IPARM(10) \\
LVFILL & be IPARM(16) \\
LTRUNC & be IPARM(17) \\
IPROPA & be IPARM(18) \\
KBLSZ  & be IPARM(19) \\
NBL2D  & be IPARM(20) \\
ISYMM  & be IPARM(23) \\
NV     & be $\max (1,\min ({\rm NS1}, {\rm NS2}-1))$ \\
NH     & be $2 + \min ({\rm ITMAX}, {\rm NS2})$ \\ 
NBLK   & be N/NBL2D \\
NFACTI & be the number of diagonals in the factorization \\
NCMAX  & be the maximum number of nodes of the same color in a \\
       &  multicolor preconditioner \\
KEYGS  & be the gather/scatter switch set in routine DFAULT \\
       & (see Section~\ref{install} for details on this switch) \\
DB     & be the full bandwidth of $\pD$ where $A=\pD+\pL+\pU$ \\
       & in block notation \\
DHB    & be the half bandwidth of $\pD$ \\
ABAND  & be the full bandwidth of matrix $A$, in 2-D blocks \\
       & (e.g., ABAND = 3 for a 3-D 7-point operator on a box)
\end{tabular}

\bigskip
\noindent
Then
\begin{eqnarray*}
{\rm NW}  & = & {\rm NWA}_{\la {\em accel} \ra} 
              + {\rm NWS}_{\la {\em accel} \ra} 
              + {\rm NWP}_{\la {\em precon} \ra} 
              + {\rm NWF}_{\la {\em precon} \ra} \\
{\rm INW} & = & {\rm INWP}_{\la {\em precon} \ra} 
              + {\rm INWF}_{\la {\em precon} \ra} 
\end{eqnarray*}

   As noted above, ${\rm NWS}_{\la {\em accel} \ra}=0$ if 
$\la$~{\em accel}~$\ra$ 
is CG, SI, SOR, SRCG, or SRSI, and can be up to a maximum of 
$3{\rm N}$ otherwise.  A table giving the real workspace required for 
each accelerator, ${\rm NWA}_{\la {\em accel} \ra}$, is given below, followed
by a table giving the real and integer workspace required for each 
preconditioner, ${\rm NWP}_{\la {\em precon} \ra}$ and 
${\rm INWP}_{\la {\em precon} \ra}$.  Finally, a table is shown giving 
real and integer workspace requirements for factorizations for certain 
preconditioners, ${\rm NWF}_{\la {\em precon} \ra}$ and 
${\rm INWF}_{\la {\em precon} \ra}$.
\clearpage

\begin{table}
\begin{center}
\begin{tabular}{|l|l|}
\hline
$\la$~{\em accel}~$\ra$ & ${\rm NWA}_{\la {\em accel} \ra}$ \\
\hline
CG     & $3{\rm N}+2*{\rm ITMAX}$ \\ \hline
SI     & $4{\rm N}$ \\ \hline
SRCG   & $3{\rm N}+2*{\rm ITMAX}$ \\ \hline
SRSI   & $4{\rm N}$ \\ \hline
SOR    & $2{\rm N}$ \\ \hline
BASIC  & $2{\rm N}$ \\ \hline
ME     & $9{\rm N}$ \\ \hline
CGNR   & $4{\rm N}+2*{\rm ITMAX}$ \\ \hline
LSQR   & $5{\rm N}$ \\ \hline
ODIR   & $(2*{\rm NV}+5)*{\rm N}+{\rm NV}$ \\ \hline
OMIN   & $3*{\rm NV}+2+(2*{\rm NV}+6)*{\rm N}+{\rm NH}*
         ({\rm NV}+2)+{\rm NH}^2 + 2*{\rm NH}$ \\ \hline
ORES   & $(2*{\rm NV}+4)*{\rm N}+{\rm NV}+1$ \\ \hline
IOM    & $(2*{\rm NS1}+2)*{\rm N}+5*{\rm NS1}+1$ \\ \hline
GMRES  & ${\rm NH}*({\rm NV}+3)+{\rm N}*({\rm NV}+3)+2*({\rm NV}+2)^2
          +7*{\rm NV}+17+{\rm NH}^2+2*{\rm NH}$ \\
       & Add ${\rm N}*({\rm NV}+1)$ if IQLR $=2$ or $3$ \\
       & Add ${\rm N}*({\rm NV}+3)+{\rm NV}+1$ if ${\rm NS1}<{\rm NS2}-1$
          \\ \hline 
USYMLQ & $8{\rm N}+11$ \\ \hline
USYMQR & $8{\rm N}+14$ \\ \hline
LDIR   & $8{\rm N}$ \\ \hline
LMIN   & $8{\rm N}$ \\ \hline
LRES   & $7{\rm N}$ \\ \hline
BCGS   & $9{\rm N}$ \\ \hline
CGCR   & OMIN workspace plus ${\rm N}+{\rm NBLK}+{\rm NBLK}*{\rm ABAND}$
          \\ \hline
\end{tabular}
\caption{Values for ${\rm NWA}_{\la {\em accel} \ra}$ }
\end{center}
\end{table}

\begin{table}
\begin{center}
\begin{tabular}{|l|l|l|} \hline
$\la$~{\em precon}~$\ra$ & ${\rm NWP}_{\la {\em precon} \ra}$ 
                       & ${\rm INWP}_{\la {\em precon} \ra}$ \\ \hline
RICH1  & 0, N if KEYGS = 1                     & 0 \\ \hline
JAC1   & 0, N if KEYGS = 1                     & 0 \\ \hline
SOR1   & N                                     & 0 \\ \hline
SSOR1  & N, $2{\rm N}$ if ISYMM = 1            & 0 \\ \hline
IC1    & N                                     & 0 \\ \hline
MIC1   & N                                     & 0 \\ \hline
LSP1   & $2{\rm N}$, $3{\rm N}$ if KEYGS = 1   & 0 \\ \hline
NEU1   & N, $2{\rm N}$ if KEYGS = 1            & 0 \\ \hline
RICH2  & 0                                     & 0 \\ \hline
JAC2   & 0                                     & 0 \\ \hline
LJAC2  & 0                                     & 0 \\ \hline
LJACX2 & 0                                     & 0 \\ \hline
SOR2   & 0                                     & MAXNZ \\ \hline
SSOR2  & 0                                     & MAXNZ \\ \hline
IC2    & 0                                     & NFACTI \\ \hline
MIC2   & 0                                     & NFACTI \\ \hline
LSP2   & $2{\rm N}$                            & 0 \\ \hline
NEU2   & N                                     & 0 \\ \hline
\end{tabular}
\caption{Values for ${\rm NWP}_{\la {\em precon} \ra}$ and
                   ${\rm INWP}_{\la {\em precon} \ra}$}
\end{center}
\end{table}

\begin{table}
\begin{center}
\begin{tabular}{|l|l|l|} \hline
$\la$~{\em precon}~$\ra$ & ${\rm NWP}_{\la {\em precon} \ra}$ 
                       & ${\rm INWP}_{\la {\em precon} \ra}$ \\ \hline
LSOR2  & 0                                     & 0 \\ \hline
LSSOR2 & N                                     & 0 \\ \hline
BIC2   & KBLSZ                                 & 0 \\ \hline
MBIC2  & KBLSZ                                 & 0 \\ \hline
BICX2  & KBLSZ                                 & 0 \\ \hline
MBICX2 & KBLSZ                                 & 0 \\ \hline
LLSP2  & $2{\rm N}$                            & 0 \\ \hline
LNEU2  & $2{\rm N}$                            & 0 \\ \hline
RICH3  & 0                                     & 0 \\ \hline
JAC3   & 0                                     & 0 \\ \hline
LJAC3  & 0                                     & 0 \\ \hline
LJACX3 & 0                                     & 0 \\ \hline
SOR3   & 0                                     & MAXNZ \\ \hline
SSOR3  & N                                     & MAXNZ \\ \hline
IC3    & 0                                     & NFACTI \\ \hline
MIC3   & 0                                     & NFACTI \\ \hline
LSP3   & $2{\rm N}$                            & 0 \\ \hline
NEU3   & N                                     & 0 \\ \hline
LSOR3  & 0                                     & 0 \\ \hline
LSSOR3 & N                                     & 0 \\ \hline
BIC3   & $2*{\rm KBLSZ}$                       & 0 \\ \hline
MBIC3  & $2*{\rm KBLSZ}$                       & 0 \\ \hline
BICX3  & $2*{\rm KBLSZ}$                       & 0 \\ \hline
MBICX3 & $2*{\rm KBLSZ}$                       & 0 \\ \hline
LLSP3  & $2{\rm N}$                            & 0 \\ \hline
LNEU3  & $2{\rm N}$                            & 0 \\ \hline
RICH4  & 0, $2{\rm N}$ if KEYGS = 1            & 0 \\ \hline
JAC4   & 0, $2{\rm N}$ if KEYGS = 1            & 0 \\ \hline
LSP4   & $2{\rm N}$, $4{\rm N}$ if KEYGS = 1   & 0 \\ \hline
NEU4   & N, $3{\rm N}$ if KEYGS = 1            & 0 \\ \hline
RICH5  & 0, $2{\rm N}$ if KEYGS = 1            & 0 \\ \hline
JAC5   & 0, $2{\rm N}$ if KEYGS = 1            & 0 \\ \hline
LSP5   & $2{\rm N}$, $4{\rm N}$ if KEYGS = 1   & 0 \\ \hline
NEU5   & N, $3{\rm N}$ if KEYGS = 1            & 0 \\ \hline
SOR6   & N                                     & 0 \\ \hline
SSOR6  & ${\rm N}+{\rm NCMAX}$                 & 0 \\ \hline
IC6    & N                                     & 0 \\ \hline
MIC6   & N                                     & 0 \\ \hline
RS6    & $2{\rm N}$                            & 0 \\ \hline
SOR7   & 0                                     & 0 \\ \hline
SSOR7  & N                                     & 0 \\ \hline
BIC7   & $2*{\rm NCMAX}$                       & 0 \\ \hline
MBIC7  & $2*{\rm NCMAX}$                       & 0 \\ \hline
BICX7  & $2*{\rm NCMAX}$                       & 0 \\ \hline
MBICX7 & $2*{\rm NCMAX}$                       & 0 \\ \hline
RS7    & N                                     & 0 \\ \hline
\end{tabular}
\caption{Values for ${\rm NWP}_{\la {\em precon} \ra}$ and
                   ${\rm INWP}_{\la {\em precon} \ra}$, cont.}
\end{center}
\end{table}

\newpage
\begin{table}
\begin{center}
\begin{tabular}{|l|l|l|l|} \hline
$\la$~{\em precon}~$\ra$ & Case & ${\rm NWF}_{\la {\em precon} \ra}$ 
                       & ${\rm INWF}_{\la {\em precon} \ra}$ \\ \hline
IC1,   & IPROPA = 1, LVFILL = 0 & N & 0 \\
MIC1   & IPROPA = 0, LVFILL = 0, ISYMM = 0  
       & $\ha{\rm N}*{\rm MAXNZ}$ & 0 \\ 
       & IPROPA = 0, LVFILL = 0, ISYMM = 1 
       & ${\rm N}*{\rm MAXNZ}$ & 0 \\ 
       & LVFILL $>$ 0, ISYMM = 0
       & $>\ha{\rm N}*{\rm MAXNZ}$ & $>\ha{\rm N}*{\rm MAXNZ}$ \\ 
       & LVFILL $>$ 0, ISYMM = 1
       & $>{\rm N}*{\rm MAXNZ}$ & $>{\rm N}*{\rm MAXNZ}$ \\  
       & IPROPA = 2 & 0 & $\max (2{\rm N},{\rm above})$ \\ \hline
LJAC2  & all cases & ${\rm N}*{\rm DHB}$ & 0 \\ \hline
LJACX2 & all cases & ${\rm N}*({\rm DHB}+{\rm LTRUNC})$ & 0 \\ \hline
IC2,   & IPROPA = 1, LVFILL = 0 & N & ${\rm MAXNZ}+{\rm MAXNZ}^2$ \\
MIC2   & IPROPA = 0, LVFILL = 0  
       & ${\rm N}*{\rm MAXNZ}$ & ${\rm MAXNZ}+{\rm MAXNZ}^2$ \\
       & LVFILL $>$ 0
       & $>{\rm N}*{\rm MAXNZ}$ & $>{\rm MAXNZ}^2$ \\ 
       & IPROPA = 2 & 0 & $\max (2{\rm N},{\rm above})$ \\ \hline
LSOR2  & all cases & ${\rm N}*{\rm DHB}$ & $3({\rm MAXNZ}+1)$ \\ \hline
LSSOR2 & all cases & ${\rm N}*{\rm DHB}$ & $3({\rm MAXNZ}+1)$ \\ \hline
BIC2,  & IPROPA = 1, LVFILL = 0 & ${\rm N}*({\rm DHB}+{\rm LTRUNC})$
                                & $3({\rm MAXNZ}+1)$ \\
MBIC2, & IPROPA = 0, LVFILL = 0, LTRUNC = 0  
       & ${\rm N}*{\rm MAXNZ}$ & $3({\rm MAXNZ}+1)$ \\
BICX2, & otherwise
       & $>{\rm N}*{\rm MAXNZ}$ & $3({\rm MAXNZ}+1)$ \\ 
MBICX2 & IPROPA = 2 & 0 & $\max (2\frac{{\rm N}}{{\rm KBLSZ}},{\rm above})$ 
                     \\ \hline
LLSP2  & all cases & ${\rm N}*{\rm DHB}$ & 0 \\ \hline
LNEU2  & all cases & ${\rm N}*{\rm DHB}$ & 0 \\ \hline
LJAC3  & all cases & ${\rm N}*{\rm DB}$  & 0 \\ \hline
LJACX3 & all cases & ${\rm N}*({\rm DB}+2{\rm LTRUNC})$ & 0 \\ \hline
IC3,   & IPROPA = 1, LVFILL = 0 & N & ${\rm MAXNZ}+{\rm MAXNZ}^2$ \\
MIC3   & IPROPA = 0, LVFILL = 0  
       & ${\rm N}*{\rm MAXNZ}$ & ${\rm MAXNZ}+{\rm MAXNZ}^2$ \\
       & LVFILL $>$ 0
       & $>{\rm N}*{\rm MAXNZ}$ & $>{\rm MAXNZ}^2$ \\ 
       & IPROPA = 2 & 0 & $\max (2{\rm N},{\rm above})$ \\ \hline
LSOR3  & all cases & ${\rm N}*{\rm DB}$ & $3({\rm MAXNZ}+1)$ \\ \hline
LSSOR3 & all cases & ${\rm N}*{\rm DB}$ & $3({\rm MAXNZ}+1)$ \\ \hline
BIC3,  & IPROPA = 1, LVFILL = 0 & ${\rm N}*({\rm DB}+2{\rm LTRUNC})$
                                & $3({\rm MAXNZ}+1)$ \\
MBIC3, & IPROPA = 0, LVFILL = 0, LTRUNC = 0  
       & ${\rm N}*{\rm MAXNZ}$ & $3({\rm MAXNZ}+1)$ \\
BICX3, & otherwise
       & $>{\rm N}*{\rm MAXNZ}$ & $3({\rm MAXNZ}+1)$ \\ 
MBICX3 & IPROPA = 2 & 0 & $\max (2\frac{{\rm N}}{{\rm KBLSZ}},{\rm above})$ 
                     \\ \hline
LLSP3  & all cases & ${\rm N}*{\rm DB}$ & 0 \\ \hline
LNEU3  & all cases & ${\rm N}*{\rm DB}$ & 0 \\ \hline
IC6,   & IPROPA = 1, LVFILL = 0 & N & 0 \\
MIC6   & IPROPA = 0, LVFILL = 0 
       & ${\rm N}*{\rm MAXNZ}$ & 0 \\ 
       & LVFILL $>$ 0 not allowed & &  \\ 
       & IPROPA = 2 & 0 & $\max (2{\rm N},{\rm above})$ \\ \hline
SOR7   & all cases & ${\rm N}*{\rm DB}$ &  \\ \hline
SSOR7  & all cases & ${\rm N}*{\rm DB}$ &  \\ \hline
BIC7,  & IPROPA = 1 & ${\rm N}*({\rm DB}+2{\rm LTRUNC})$ &  \\
MBIC7, & IPROPA = 0, LTRUNC = 0 & ${\rm N}*{\rm MAXNZ}$ &  \\
BICX7, & else & $> {\rm N}*{\rm MAXNZ}$ &  \\
MBICX7 &  &  &  \\ \hline
RS7    & all cases & ${\rm N}*{\rm DB}$ &  \\ \hline
others & all cases & 0                  & 0 \\ \hline
\end{tabular}
\caption{Values for ${\rm NWF}_{\la {\em precon} \ra}$ and
                   ${\rm INWF}_{\la {\em precon} \ra}$}
\end{center}
\end{table}
\clearpage

\newpage
\section{Stopping Tests}
\label{stop}
\indent

    The following stopping tests are permitted for the iterative
methods.  In each case, the iteration process stops whenever the
given quantity falls below ZETA [= RPARM(1)].  The number given in
parentheses is the corresponding value of NTEST [= IPARM(1)].
A large number of stopping tests is provided for experimentation
by the user.  See \cite{NSPCG} for comments regarding the motivation 
in using these tests.
 
\begin{eqnarray}
   \frac{\mbox{EMAX}}{\mbox{EMIN}} \lb \frac
       {\la r^{(n)},\tilde z^{(n)}\ra }
       {\la b,Q^{-1}b \ra} \rb ^\ha
    & < & \zeta                      \\
   \frac{1}{\mbox{EMIN}} \lb \frac
       {\la \tilde z^{(n)},\tilde z^{(n)}\ra }
       {\la u^{(n)},u^{(n)}\ra } \rb ^\ha
    & < & \zeta                      \\
   \frac{\mbox{EMAX}}{\mbox{EMIN}} \lb \frac
       {\la \tilde z^{(n)},\tilde z^{(n)}\ra}
       {\la Q^{-1}b,Q^{-1}b\ra } \rb ^\ha
    & < & \zeta                      \\
   \lb \frac{\la \tilde z^{(n)},\tilde z^{(n)}\ra }
        {\la Q^{-1}b,Q^{-1}b\ra } \rb ^\ha
    & < & \zeta                      \\
   \lb \frac{\la r^{(n)},r^{(n)}\ra}
     {\la b,b \ra } \rb ^\ha
    & < & \zeta                      \\
   \lb \frac{\la u^{(n)}-\bar{u},u^{(n)}-\bar{u} \ra }
     {\la \bar{u},\bar{u} \ra } \rb^\ha
    & < & \zeta                      \\
   \frac{\mbox{EMAX}}{\mbox{EMIN}} \lb \frac
       {\la r^{(n)},z^{(n)}\ra }
       {\la b,Q_L^{-1}b \ra} \rb ^\ha
    & < & \zeta                      \\
   \frac{1}{\mbox{EMIN}} \lb \frac
       {\la z^{(n)},z^{(n)}\ra }
       {\la u^{(n)},u^{(n)}\ra } \rb ^\ha
    & < & \zeta                      \\
   \frac{\mbox{EMAX}}{\mbox{EMIN}} \lb \frac
       {\la z^{(n)},z^{(n)}\ra}
       {\la Q_L^{-1}b,Q_L^{-1}b\ra } \rb ^\ha
    & < & \zeta                      \\
   \lb \frac{\la z^{(n)},z^{(n)}\ra }
        {\la Q_L^{-1}b,Q_L^{-1}b\ra } \rb ^\ha
    & < & \zeta
\end{eqnarray}
 
\bigskip
Here, EMAX [= RPARM(2)] and EMIN [= RPARM(3)] are estimates of 
the $2$-norm of the preconditioned matrix and its inverse.  In the 
symmetric case, EMAX and EMIN are estimates of the maximum and 
minimum eigenvalues of $Q^{-1}A$, respectively.  Also,
\bigskip
 
\begin{tabular}{ll}
                & \\
      $Q$       & is the preconditioning matrix, \\
                & \\
      $u^{(n)}$ & is the current iterate, \\
                & \\
      $r^{(n)}=b-Au^{(n)}$ & is the current residual, \\
          & \\
      $z^{(n)}=Q_L^{-1}r^{(n)}$ & is the current
                               left-preconditioned residual, \\
          & \\
      $\tilde z^{(n)}=Q^{-1}r^{(n)}=Q_R^{-1}Q_L^{-1}r^{(n)}$ & is the current
                               pseudo-residual, and \\
          & \\
      $\bar{u}=A^{-1}b$ & is the true solution. \\
          &
\end{tabular}
 
\bigskip

    Various accelerators calculate certain vectors or inner products
automatically, allowing the stopping test to be performed very
cheaply.  A judicious choice of stopping test is necessary in order
to permit the accelerator to run most efficiently.

    Stopping test (6) requires that UBAR contain the solution to
$Au=b$.  This stopping test may be used for testing purposes whenever 
the exact solution UBAR is known.

\bigskip
\noindent
{\bf Tests for the Symmetric Accelerators}
\bigskip
\indent
 
    The symmetric accelerators CG, SI, SRCG, and SRSI can use any of
the stopping tests efficiently and can adaptively obtain the
EMAX and EMIN estimates if MAXADP [= IPARM(6)] and MINADP [= IPARM(7)]
are set to one, respectively.  Hence, stopping tests (1) through (3) 
are good choices.  Also, the inner product $\la r,\tilde{z} \ra$ 
is already computed for these accelerators, making stopping test 
(1) very cheap to compute.  If the user has estimates of EMAX or
EMIN and wishes that these estimates be used in the stopping test,
then RPARM(2) or RPARM(3) should be set to these estimates 
before calling NSPCG, and MAXADP or MINADP should be set to zero.
The SOR routine uses a specialized stopping test:
 
\begin{eqnarray*}
      \frac{1}{1-{\rm SPECR}} \lb \frac
            {\la u^{(n)}-u^{(n-1)},u^{(n)}-u^{(n-1)}\ra }
              {\la u^{(n)},u^{(n)}\ra } \rb
                          ^\ha & < & \zeta
\end{eqnarray*}
 
\noindent
where SPECR is an estimate of the spectral radius of the SOR matrix.
SPECR is adaptively computed by the SOR accelerator.
This stopping test and stopping test (6) are the only stopping tests
available for the SOR algorithm.  Also, the stopping criterion for 
SOR is initially set to a large value so that at least one Gauss-Seidel 
iteration is performed.

\bigskip
\noindent
{\bf Tests for the Nonsymmetric Accelerators}
\bigskip
\indent
 
    The safest and most economical stopping test for the nonsymmetric
accelerators
is stopping test (10), based on the norm of the left-preconditioned residual.
This calculation can usually be done cheaply.  For most accelerators it
requires at most one extra dot product per iteration, and in some cases
(e.g. IOM, GMRES) the quantity is a by-product of the iterative algorithm and
is available at no extra cost.

    In some cases it is desirable to use a stopping criterion
which bounds the relative error of the solution to the original problem.
Stopping test (3) is designed to do this.  It requires that the
accelerator be able to estimate the extremal eigenvalues of the iteration
matrix; this is presently available for the nonsymmetric accelerators CGNR,
OMIN and CGCR.  In the case of OMIN, the variable NS3 is used to denote
the size of the largest Hessenberg matrix to be used to estimate
the eigenvalues; a typical useful value is NS3$=40$.

    However, if IQLR = 2 or 3 and a stopping test involving the
right-preconditioned residual is employed (e.g. NTEST=3), the accelerator may
have to do significant extra work to perform the stopping test.  In this
case the accelerators OMIN and LANMIN, for instance, require only one inner
product per iteration.  However, GMRES takes significant extra work per
iteration if a right preconditioning matrix is present.
 
\newpage
\section{Using Reduced System Methods}
\label{rb}
\indent
 
     If matrix $A$ has either point or line property A, it can be
permuted to a red-black system:
\[
   \lp \begin{array}{cc}
              D_R & H \\
              K   & D_B \end{array} \rp
   \lp \begin{array}{c}
              u_R \\
              u_B       \end{array} \rp =
   \lp \begin{array}{c}
              b_R \\
              b_B       \end{array} \rp
\]
where $D_R$ and $D_B$ are scalar (block) diagonal matrices if a
point (line) red-black ordering was used.  The reduced system
corresponding to this matrix is:
\[
   (D_R - H D_B^{-1} K) u_R =  b_R - H D_B^{-1} b_B
\]
NSPCG allows two means of using reduced system iterative methods.
\begin{enumerate}
       \item
       The user can apply the RS preconditioners.  With these
       preconditioners, the reduced system is not explicitly
       computed, and the matrix-vector products using the reduced
       system matrix are implicitly calculated using the COEF
       matrix.  The user must first fill the integer vector P
       with a coloring vector for either point or line
       red-black ordering.  For point red-black ordering, the
       REDBLK subroutine can be used to generate the coloring
       vector:
       \begin{verbatim}
       CALL REDBLK (NDIM,N,MAXNZ,COEF,JCOEF,P,IP,NSTORE,IWKSP,IER)  
       \end{verbatim}
       The variable NSTORE takes on the same values as the IPARM
       parameter, namely
 
       \begin{tabular}{rl}
            NSTORE = $1$ & for primary format \\
                   = $2$ & for symmetric diagonal format \\
                   = $3$ & for nonsymmetric diagonal format \\
                   = $4$ & for symmetric coordinate format \\
                   = $5$ & for nonsymmetric coordinate format
       \end{tabular}
 
       P, IP, and IWKSP are integer workspace vectors of length
       N upon input.  Upon output, P contains the coloring
       vector for red-black ordering if such an ordering exists.
       IER on output will be either $0$, in which case a red-
       black coloring vector was successfully found, or $-8$,
       in which case matrix $A$ does not have point Property A.
 
       For line red-black ordering, the user can use the COLOR
       subroutine described in Section~\ref{mc} for generating the
       coloring vector P.  COLOR can also be used to generate
       the coloring vector for point red-black ordering.
 
       \item
       If matrix $A$ has point Property A, the user can explicitly
       compute the reduced system matrix and use any of the
       preconditioners in NSPCG to solve the reduced system.
       Storage demands are greater with this method since the
       reduced system matrix must be stored.  To invoke this
       method, the user simply calls the subroutine RSNSP with
       the same argument list as NSPCG.  The vector P must contain
       a coloring vector for point red-black ordering and IPERM
       must be set to $1$.  The interpretation of the parameters
       in the argument list is otherwise the same as for NSPCG.
       Note that since P and IP are used to permute the original
       system to red-black form, the user cannot apply a further
       permutation to the reduced system matrix.
\end{enumerate}
 
\newpage
\section{Using Multicolor Orderings}
\label{mc}
\indent
 
   For many iterative methods, the unknowns are updated in an order
other than the natural (or lexicographical) ordering.  The order
in which the unknowns are updated can be specified by imposing a
coloring on the mesh so that all the unknowns corresponding to nodes 
of a particular color are updated before the unknowns corresponding 
to nodes having a different color.  Typically, a coloring is chosen
so that unknowns (nodes) of the same color are decoupled, thus
allowing these unknowns to be updated simultaneously.  This
property makes multicolor iterative methods especially attractive
on vector or parallel computers.

   NSPCG allows the matrix $A$ to be reordered and permuted
according to a coloring.  The user can request that the matrix $A$ 
be permuted before iterating by filling vector P in the NSPCG calling
sequence with a coloring vector and setting IPERM [= IPARM(14)] to 
one.  A coloring is defined as the assignment to each equation of 
an integer signifying the number of the color.  If the mesh is to be 
colored with $p$ colors, vector P contains for each equation 
an integer between $1$ and $p$, inclusive.  NSPCG will number those 
equations labeled with $1$ first, followed by those equations labeled 
with $2$, and so on.  Equations labeled with $p$ will be numbered 
last.  NSPCG then permutes the matrix according to the new ordering
of the equations before the iteration begins.  After iteration has
terminated, the matrix is permuted to its original form.

   If a matrix whose mesh is colored with $p$ colors is permuted
according to the colors, the resulting matrix can be viewed as a 
$p$ by $p$ block matrix:
 
\[ \lp \begin{array}{cccc}
                 A_{11} & A_{12} & \cdots & A_{1p} \\
                 A_{21} & A_{22} & \cdots & A_{2p} \\
                 \vdots & \vdots & \ddots & \vdots \\
                 A_{p1} & A_{p2} & \cdots & A_{pp}
  \end{array} \rp \]
 
There are some restrictions on the use of permutations:

\begin{itemize}

  \item If a permutation is requested with the primary storage
        format, the allowable preconditioners are SOR6, SSOR6,
        IC6, MIC6, and RS6.  In this case, the coloring vector P 
        must be chosen so that the diagonal block submatrices 
        $A_{ii}$ are diagonal.  Thus, equations having the same 
        color cannot be dependent.  This requirement is made so 
        that vectorization of these methods is possible.  Otherwise, 
        complicated block solves would be necessary for each 
        diagonal block instead of a simple vector operation,
        and the primary storage format is not informative enough 
        to vectorize such operations.
 
  \item If a permutation is requested with a diagonal storage
        format, the allowable preconditioners are SOR7, SSOR7,
        BIC7, BICX7, MBIC7, MBICX7, and RS7.  In this case,
        the coloring vector P can be chosen so that the $A_{ii}$ 
        submatrices are banded.  If they are banded, their
        bands must be dense.
 
  \item If a permutation is requested with a diagonal storage 
        format, extra diagonals are often created.  NSPCG will
        abort unless the column dimensions of the COEF and JCOEF 
        arrays are set large enough to store the permuted matrix
        by diagonals.
     
  \item If a permutation is requested with a coordinate storage
        format, the allowable preconditioners are RICH5, JAC5,
        NEU5, and LSP5.  There are no restrictions for permutations 
        used with coordinate storage format, but permutations
        are not effective with these preconditioners.
\end{itemize}

When using the NSPCG package, the differences between ``block" and 
``multicolor" methods are as follows:
\begin{enumerate}
 \item  The name of a block preconditioner must end with a $2$ or
        a $3$, while the name of a multicolor preconditioner must
        end with a $6$ or a $7$.

 \item  A multicolor preconditioner can only be applied after a
        permutation of the matrix (i.e., vector P must be set to a
        coloring vector and IPERM [= IPARM(14)] must be set to one).
        A block preconditioner can only be applied to a matrix
        which is not permuted by NSPCG.

 \item  With a block method, the diagonal block matrices 
        $A_{ii}$ must be of equal size, which the user must
        specify with KBLSZ [= IPARM(19)].  With a multicolor
        method, the $A_{ii}$ are not required to be of equal
        size.
\end{enumerate}

As an example of a coloring, suppose the user requests that a $3$ 
by $3$ mesh with the natural ordering be given a red-black ordering: 
\[
\begin{array}{cc}
    \begin{array}{ccc}
      7 & 8 & 9 \\
      4 & 5 & 6 \\
      1 & 2 & 3
    \end{array} &
    \begin{tabular}{ccc}
      R & B & R \\
      B & R & B \\
      R & B & R
    \end{tabular} \\
       & \\
    \mbox{Natural} & \mbox{Red-Black}
\end{array}
\]
If the red unknowns are to be labeled first, then P would be
chosen as 
\[
    P = ( 1, 2, 1, 2, 1, 2, 1, 2, 1)
\]
resulting in the ordering:
 
\[
\begin{array}{ccc}
    4 & 9 & 5 \\
    7 & 3 & 8 \\
    1 & 6 & 2
\end{array}
\]
If the user wished the black unknowns to be labeled first, P would
be chosen as 
\[
    P = ( 2, 1, 2, 1, 2, 1, 2, 1, 2)
\]
resulting in the ordering:
 
\[
\begin{array}{ccc}
    8 & 4 & 9 \\
    2 & 7 & 3 