GENERATE and LINEAR - MS-DOS utilities for processing of experimental
series with systematic errors

E.B. Rudnyi
Department of Chemistry
Moscow State University
119899 Moscow, Russia
e-mail RUDNYI@MCH.CHEM.MSU.SU

(C) 1994, All rights reserved


Purpose:      estimation of unknown parameters of the models

              yij = a + b xij + eij
              yij = a + eij
              yij = b + eij

              using values from several experiments that contain
              systematic errors

Feature:      estimation of unknown variance components by the
              maximum likelihood method

Requirements: IBM-compatible computer with MS-DOS (version 3.3 or
              higher). The code itself needs about 80 Kb of memory;
              additional memory is needed for your data. The better
              the computer, the faster LINEAR works, although it
              should work even in the worst configuration.

Files in the archives:
LINEAR.EXE     - the utility to process data
GENERATE.EXE   - the utility to generate pseudo-experimental values
                 to test LINEAR
ONEWAY.CFG     - configuration file of GENERATE to imitate one-way
                 classification results
LINE.CFG       - configuration file of GENERATE to imitate linear
                 regression results
ONEWAY1.DAT    - examples of data files for one-way classification.
ONEWAY2.DAT      They have been used to draw fig. 1 in ref. [1]
ONEWAY3.DAT
ONEWAY4.DAT
LINE1.DAT      - examples of data files for linear regression.
LINE2.DAT        They have been used to draw fig. 2 in ref. [1]
LINE3.DAT
LINE4.DAT
README.TXT     - this file

[1] Rudnyi, E.B. "Combined processing of experimental series with
    systematic errors - non-linear physico-chemical model with linear
    error model". Presented at InCINC'94, The First International
    Chemometrics InterNet Conference, 1994. See file EBRDOC.PS
    (postscript format) or EBRDOC.TXT (plain ASCII format). If you
    cannot find them, contact me and I will tell you the site from
    which they are available by anonymous ftp.



LICENSE

This program is freeware (public domain). Feel free to use and
distribute it, provided no charge is taken. I will be glad if you
like this program. Let me know if you find any bugs. I would also
appreciate your comments.

Disclaimer of warranty:

This program is supplied as is. I disclaim all warranties, express or
implied, including, without limitation, the warranties of
merchantability and of fitness of this program for any purpose. I
assume no liability for damages direct or consequential, which may
result from the use of this program.



CONTENTS

1. Introduction
2. Reference for the utility GENERATE
3. Reference for the utility LINEAR
     3.1. Command line
     3.2. Data file
     3.3. Configuration file
     3.4. Output file
4. Limits within the realization


LIST OF SYMBOLS

Unless stated otherwise, greek means a small greek letter.

Sumj = capital_greek_sigma       summation over j
                          j

sqrt                             square root

(+/-)                            plus-minus

yij = y                          experimental observation
       ij

i = 1, ...,  M                   index enumerating series

M                                number of all the series

j = 1, ..., Ni                   index enumerating points in the i-th
                                 series

Ni = N                           number of points in the i-th series
      i

SS                               generalized sum of squared deviations


L                                function, the maximum of which
                                 coincides with the maximum of the
                                 likelihood function

eij = greek_epsilon              difference between the experimental
                   ij            and calculated values

er_ij = greek_epsilon            reproducibility error in the yij
                     r,ij

ea_i = greek_epsilon             shift systematic error in the i-th
                    a,i          series

eb_i = greek_epsilon             tilt systematic error in the i-th
                    b,i          series

                   2
sr_i2 = greek_sigma              variance of reproducibility errors in
                   r,i           the i-th series

                   2
sa_i2 = greek_sigma              variance of shift systematic error in
                   a,i           the i-th series

                   2
sb_i2 = greek_sigma              variance of tilt systematic error in
                   b,i           the i-th series

ga_i = greek_gamma    = sa_i2/sr_i2
                  a,i

gb_i = greek_gamma    = sb_i2/sr_i2
                  b,i

Qi = Q                           set of series for which sr_i2 = sr_2
      i

Qa = Q                           set of series for which ga_i = ga
      a

Qb = Q                           set of series for which gb_i = gb
      b

xij = x
       ij

xi = x  = (Sumj xij)/Ni          mean of x in the i-th series
      i

                         2
Pi = P  = Sumj (xij - xi)
      i



1. INTRODUCTION

This software was written to accompany my paper presented at
InCINC'94 (see ref. 1 above). Its purpose is to demonstrate the
possibilities of the approach described in the paper. However, it is
not merely demonstration software; it can be used to solve real
problems.

It is best to start by reading the paper [1] (at least sections 1
and 2), which introduces the notation and the main idea.

The utility LINEAR is designed to estimate unknown parameters in
three models

     yij = a + b xij + eij
     yij = a + eij
     yij = b + eij

from experimental values of several experiments (note that different
experiments can be treated under different models). It is assumed
that the behaviour of the errors can be described as

     eij = er_ij + ea_i + eb_i (xij - xi)

which I call the linear error model. Such an error model simulates not
only the reproducibility errors er_ij but also the shift ea_i and tilt
eb_i systematic errors. Details can be found in ref. 1.
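
As an illustration only, the linear error model can be imitated in a
few lines of Python. This is a minimal sketch and not the code of
GENERATE or LINEAR; the function name simulate_series and its
arguments are hypothetical. It assumes normally distributed errors and
uses the definitions ga_i = sa_i2/sr_i2 and gb_i = sb_i2/sr_i2 from
the list of symbols, so that sa_i = sqrt(ga_i) sr_i and
sb_i = sqrt(gb_i) sr_i.

```python
import random


def simulate_series(a, b, xs, sr, sga=0.0, sgb=0.0, rng=None):
    """Return pseudo-experimental y values for one series (model 'line').

    sr  - standard deviation of the reproducibility errors
    sga - sqrt(ga), so the shift error has standard deviation sga * sr
    sgb - sqrt(gb), so the tilt error has standard deviation sgb * sr
    """
    rng = rng or random.Random()
    x_mean = sum(xs) / len(xs)        # xi, the mean of x in the series
    ea = rng.gauss(0.0, sga * sr)     # shift error: one value per series
    eb = rng.gauss(0.0, sgb * sr)     # tilt error: one value per series
    ys = []
    for x in xs:
        er = rng.gauss(0.0, sr)       # reproducibility error: one per point
        # yij = a + b xij + er_ij + ea_i + eb_i (xij - xi)
        ys.append(a + b * x + er + ea + eb * (x - x_mean))
    return ys
```

For example, simulate_series(1.0, 2.0, [0, 1, 2], sr=0.1, sga=3.0)
gives three values near the line y = 1 + 2 x, scattered with sr = 0.1
and shifted as a whole by a random amount with standard deviation 0.3.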

You need nothing but LINEAR.EXE to process your data. However, to
make things more fun, the utility GENERATE was also created. It uses
a random number generator to obtain pseudo-experimental values of
several experiments with systematic errors. These values are written
to a file which can be processed by LINEAR. Thus, it allows you to
check the approach by applying it to a case where the answer is known
(this is how fig. 1 and 2 in ref. 1 were created).

Enjoy and have fun!



2. REFERENCE FOR THE UTILITY GENERATE

The utility generates pseudo-experimental values of several
experiments containing systematic errors, with the structure of a
one-way classification or a linear regression. The values are
scattered according to the normal distribution.

The format of a command line is

GENERATE conf_file [out_file sga sgb]

  conf_file - the name of the file (default extension .CFG) that
              contains the description of the task. This is a
              plain-ascii file and it can be created and edited
              with any editor (save results as text only). The files
              ONEWAY.CFG and LINE.CFG show what such a file should
              look like.

  out_file -  the name of the file (the extension .DAT is assigned,
              default name NUL) for the results of the generation.
              This is a plain-ascii file, written in a format that
              can be processed by LINEAR.

  sga -       a value of sqrt(ga) (default zero)

  sgb -       a value of sqrt(gb) (default zero)

A value from the file GENERATE.RND is used as the seed for the
random number generator. It changes after each call of GENERATE. If
this file does not exist, the BIOS timer is used to make the seed
value for the first time.

GENERATE also writes output to the screen: first the instances of
the systematic errors, and then the pseudo-experimental values
rounded to integers.



3. REFERENCE FOR THE UTILITY LINEAR

The utility works in batch mode only - it takes experimental
values from a data file and creates another file with the results.


3.1. Command line

The format of a command line is

LINEAR [-fl1 -fl2 ...] data_file [conf_file]

The flags are as follows
    -o    to set the name of the output file,
    -h    to obtain short help,
    -l    to obtain license information,
    -p    output file contains final results only (default),
    -p1   output file also contains results of main iterations,
    -p2   output file contains results of all the iterations
          (requires free space on the disk),
    -s    automatic choice for displaying parameters (default),
    -s1   output only free parameters,
    -s2   output all the parameters.

By default the data file (data_file) has the extension .DAT. If no
configuration file is named (conf_file is absent), LINEAR tries to
find a configuration file with the name of the data file and the
extension .SET.

If the flag -o is absent, LINEAR takes the name of the configuration
file or the data file, changes the extension to .LST and uses the
result as the name of the output file.
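
The naming rules can be restated as a short Python sketch. It is not
part of LINEAR; the function default_names is hypothetical. It assumes
that a configuration file named on the command line takes precedence
over the data file when the output name is built, which the text above
does not state explicitly.

```python
from pathlib import PureWindowsPath


def default_names(data_file, conf_file=None, out_file=None):
    """Return (data, conf, out) file names following the rules above."""
    data = PureWindowsPath(data_file)
    if not data.suffix:
        data = data.with_suffix(".DAT")      # default data extension
    if conf_file:
        conf = PureWindowsPath(conf_file)
    else:
        conf = data.with_suffix(".SET")      # same name, extension .SET
    if out_file:
        out = PureWindowsPath(out_file)      # name given with the flag -o
    else:
        # Assumption: a named configuration file wins over the data file.
        base = PureWindowsPath(conf_file) if conf_file else data
        out = base.with_suffix(".LST")
    return str(data), str(conf), str(out)
```

With these assumptions, default_names("LINE1") returns the names
LINE1.DAT, LINE1.SET and LINE1.LST.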


3.2. Data file

A data file is written in free format. White space is recognised
as the word delimiter. The file consists of series separated by
semicolons. Each series comprises the following fields separated by
commas

series_name,
equation_name,
variables_names,
point1,
point2,
...;

     series_name -     any identifier (a sequence of characters
                       without spaces, commas or semicolons).

     equation_name -   one of three words - line, justa, justb -
                       describing the following models

                       yij = a + b xij + eij
                       yij = a + eij
                       yij = b + eij

     variables_names - one or a few identifiers separated by spaces.
                       Their number determines the total number of
                       values in each point. The identifiers
                       themselves are used only in the output.

     point1 -          one or a few values separated by spaces. The
                       number of values to read is equal to the
                       number of names of variables. For the equation
                       line, first two values from the point are
                       used, the first as y, the second as x. For the
                       equations justa and justb, the first value is
                       taken from each point.

If there are several words in the field series_name or equation_name,
only the first is taken and all the others are ignored. If there are
more values in the field pointN than there are names of variables,
the extra values are ignored. If there are fewer values, the missing
ones are initialized to zero. These rules permit you to write
comments in the fields series_name, equation_name and pointN.

While reading the data file, LINEAR sorts the series alphabetically.

You can easily exclude either a series or a point from processing
by placing the symbol * in the data file. To exclude a series, put
the symbol * before the name of the series; to exclude a point, put
the symbol * at the beginning of the field pointN. You can also
exclude a series in the configuration file.

Although points and series marked with * do not take part in the
calculations, they are shown in the output file, and their deviates
and variance components are estimated as well.

The files ONEWAY?.DAT and LINE?.DAT are examples of data files to
be processed by LINEAR. Each output file of the utility GENERATE
can also serve as such an example.
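
The reading rules of this section can be sketched in Python as
follows. This is only an illustration of the format, not the parser
used by LINEAR; the names parse_data_file and _leading_numbers are
hypothetical. The sketch assumes that the symbol * may either be
attached to the following word or be separated from it by white space,
and that comment words follow the numeric values.

```python
def _leading_numbers(words, count):
    """Take up to `count` leading numbers; pad missing values with zeros."""
    values = []
    for word in words[:count]:
        try:
            values.append(float(word))
        except ValueError:
            break                      # a comment word ends the values
    return values + [0.0] * (count - len(values))


def parse_data_file(text):
    """Parse data-file text into a list of series, sorted by name."""
    series = []
    for chunk in text.split(";"):                  # series end with ';'
        fields = [f.split() for f in chunk.split(",")]
        if len(fields) < 3 or not fields[0] or not fields[1]:
            continue                               # e.g. text after last ';'
        excluded = fields[0][0].startswith("*")
        name_words = " ".join(fields[0]).lstrip("*").split()
        var_names = fields[2]
        points = []
        for words in fields[3:]:
            if not words:
                continue
            skip = words[0].startswith("*")        # excluded point
            words = " ".join(words).lstrip("*").split()
            points.append({
                "values": _leading_numbers(words, len(var_names)),
                "excluded": skip,
            })
        series.append({
            "name": name_words[0] if name_words else "",  # first word only
            "equation": fields[1][0],                     # first word only
            "variables": var_names,
            "excluded": excluded,
            "points": points,
        })
    return sorted(series, key=lambda s: s["name"])
```

Words after the first in series_name and equation_name, and words after
the expected number of values in a point, are simply dropped, which is
what makes comments possible in those fields.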


3.3. Configuration file

The configuration file is optional. It is intended for experienced
users; you should read ref. [1] before you start creating your own
configuration file.

If the configuration file is absent, LINEAR makes one calculation
with the default hypotheses - all the series are assumed to have the
same reproducibility variance sr_i2 = sr_2 and the same quantities
ga_i = ga and gb_i = gb. This is a good start for many applications.

The configuration file is written in free format and contains the
descriptions of the series and the description START. Each
description must end with a semicolon.

LINEAR reads a series description, modifies the hypotheses
accordingly and continues reading with the next series. When the
description START appears, the utility starts the calculation. After
that, reading of the configuration file continues.

The format of the description START is

START [[*] par_name init_val [, ...]];

  * -        this symbol, if present, means that the parameter will
             be kept constant during the maximisation of the
             likelihood in this and subsequent calculations (until
             redefined).

  par_name - a name of the parameter (a or b only).

  init_val - a value to be used as the initial one. If absent, the
             value from the previous calculation is taken, or zero in
             the first calculation.

All the symbols after the initial value up to the comma (semicolon)
are ignored. You can put a comment there.

The format of a series description is

[*] ser_name, hyp_fl sri, hyp_fl sga, hyp_fl sgb;

  * -        this symbol, if present, means that the series will be
             ignored in this and subsequent calculations (until
             redefined).

  ser_name - the name of the series. Again, all the words after the
             first are ignored up to the comma.

  hyp_fl -   a hypothesis flag - one of three characters #, % or *.
             The character # means that this variance component
             belongs to the set with the same variance (the default
             hypothesis), the character % means that this variance
             component is allowed to drift apart from that set, and
             the character * makes the variance component constant
             (it won't change in the maximization procedure).

  sri -      an initial value of the standard deviation of
             reproducibility.

  sga -      an initial value of sqrt(ga_i).

  sgb -      an initial value of sqrt(gb_i).

If a hypothesis flag is absent, LINEAR takes the one from the
previous calculation (in the first calculation - the default
hypothesis). If an initial value (sri, sga or sgb) is absent,
LINEAR takes the one from the previous calculation (in the first
calculation - the default value sri = 1, sga = 0 or sgb = 0).
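
The two description formats can be assembled by a small Python
sketch. This only illustrates the syntax given above; the helper names
series_description and start_description are hypothetical, and the
sketch assumes that a hypothesis flag and its initial value are
separated by a space, as in the format line.

```python
def series_description(name, sri=None, sga=None, sgb=None,
                       flags=("#", "#", "#"), ignore=False):
    """Build '[*] ser_name, hyp_fl sri, hyp_fl sga, hyp_fl sgb;'."""
    parts = []
    for flag, value in zip(flags, (sri, sga, sgb)):
        # An absent initial value is simply left out after the flag.
        parts.append(flag if value is None else f"{flag} {value:g}")
    prefix = "* " if ignore else ""
    return f"{prefix}{name}, " + ", ".join(parts) + ";"


def start_description(params=()):
    """Build 'START [[*] par_name init_val [, ...]];'.

    params is a sequence of (par_name, init_val_or_None, fixed) tuples.
    """
    items = []
    for par_name, init_val, fixed in params:
        item = ("* " if fixed else "") + par_name
        if init_val is not None:
            item += f" {init_val:g}"
        items.append(item)
    return ("START " + ", ".join(items)).rstrip() + ";"
```

For instance, series_description("S1", sri=1, sga=0.5,
flags=("#", "%", "*")) gives the line "S1, # 1, % 0.5, *;" and
start_description([("a", 10, False), ("b", None, True)]) gives
"START a 10, * b;".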


3.4. Output file

If the flag -p1 or -p2 is given on the command line, the output file
starts with intermediate iteration results. Please read ref. 1 to
understand them.

The final results are divided into sections.

a) The convergence condition.
  The number of big iterations.
  difL - the relative difference between the values of L in the last
         two iterations.
  difv - the maximum relative difference of variance components in
         the last two iterations.
  L    - the value of the function L.
  SS   - the value of the generalized sum of squares.

b) The parameter estimates and their dispersion matrix.
  After (+/-) the standard deviation is given. Standard deviations
  and the dispersion matrix are obtained for free parameters only.

c) The estimates of the variance components.
  ID - the series name.
  eq - the equation name.
  sr - the standard deviation of reproducibility.
  sga - a value of sqrt(ga_i).
  sgb - a value of sqrt(gb_i).

  Before values of sr, sga and sgb, there are symbols displaying the
  hypotheses used (see above).

  If a series did not take part in the processing, a symbol
  * is put before its name.

d) The analysis of deviates (see section 4.1 in ref. 1)

                         2
  av_dev = sqrt{(Sumj eij )/Ni}
     the average total deviation.

                        2
  sri = sqrt{(Sumj er_ij )   /Ni}
                          min
     the average reproducibility deviation.

  err_a = ea_i
     the shift over fitting equation.

  err_b = eb_i
     the tilt over fitting equation at mean value of x.

  err1_b = eb_i sqrt(Pi/Ni).

e) Some series values.

  Ntot - the total number of the points.
  Ns - the number of points that took part in the processing.
  xav - the mean value of x.
  Ps = sqrt(Pi).

f) Experimental values by series and their deviates.

  err_full - the total deviation from the fitting equation.

  err_a -    the deviation from the equation shifted by ea_i over the
             fitting equation.

  err_ab -   the deviation from the equation shifted and tilted over the
             fitting equation.

If a point did not take part in the processing, a symbol * is put
before it.
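
The summary quantities av_dev, err1_b, xav and Ps described above
follow directly from their formulas, as this Python sketch shows. It is
not the code of LINEAR; the function deviate_summary is hypothetical,
and it assumes that Ni counts exactly the points passed to it and that
the total deviations eij and the tilt error eb_i are already known.

```python
import math


def deviate_summary(xs, e_total, eb):
    """Compute xav, Ps, av_dev and err1_b for one series.

    xs      - the values xij of the series
    e_total - the total deviations eij (same length as xs)
    eb      - the tilt systematic error eb_i of the series
    """
    n = len(xs)                                   # Ni
    x_mean = sum(xs) / n                          # xi = (Sumj xij)/Ni
    p = sum((x - x_mean) ** 2 for x in xs)        # Pi = Sumj (xij - xi)^2
    av_dev = math.sqrt(sum(e * e for e in e_total) / n)
    err1_b = eb * math.sqrt(p / n)                # tilt scaled by x spread
    return {"xav": x_mean, "Ps": math.sqrt(p),
            "av_dev": av_dev, "err1_b": err1_b}
```

Since sqrt(Pi/Ni) is the root-mean-square spread of x within the
series, err1_b expresses the tilt error in the units of y, which makes
it directly comparable with err_a and sri.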



4. LIMITS WITHIN THE REALIZATION

1) The models cannot be changed. If you like the approach and want to
apply it to non-linear models, try to contact me.

2) An upper limit is set for the values of ga and gb: ga cannot be
more than 10000, and gb cannot be more than 10000000.

3) The convergence conditions cannot be changed.


