Fionn Murtagh's Multivariate Data Analysis Software and Resources Page
Contents:
- Multivariate Data Analysis Software
- Where Is This Software Used?
- Various Other Programs
- Other (Very!) Important Collections of Software
- Complementary Reading in Connection with These Programs
- Multivariate Data Analysis - Background Reading
1. Multivariate Data Analysis Software
This is a collection of stand-alone routines, mostly in Fortran, with some in C.
Sample data sets are available, and indications are given on how to compile,
link and run. Download the programs and run them on your own system. Many of
these programs were originally used on a VAX/VMS system, and were also used on
Unix (SunOS, Solaris) systems.
Please notify the author of any problems (although the programs are
provided "as is" and there are many evident improvements which could be made).
These programs are used in conjunction with the DEA course on "Analyse
multivariée en astronomie" taught by Fionn Murtagh at Strasbourg
Astronomical Observatory. The example of compiling, linking and running
given for principal components analysis (in Fortran) is similar to what is
required for the other Fortran programs here.
Depending on your setup, these programs may have their texts run together.
Save as "source" to have the programs properly formatted.
- Principal components analysis (Fortran)
pcat.f, driver program
pca.f, routines used
spectr.dat, small sample dataset, originally
related to stellar spectra.
To compile and link: f77 pcat.f pca.f -o pcat
To run: pcat (output to screen, which may be directed to a file).
- Principal components analysis (C)
pca.c, program
To compile and link: cc pca.c -lm -o pcac
To run: pcac spectr.dat 36 8 R
Or: pcac spectr.dat 36 8 r
- Partitioning (Fortran)
partt.f, driver program
part.f, routines used
To compile and link: f77 partt.f part.f -o partt
Set up to run on spectr.dat (hard-wired in partt.f).
- Hierarchical clustering using stored dissimilarities (Fortran)
hct.f, driver program
hc.f, routines
hcass.f, cluster assignments routine
hcden.f, draw part of dendrogram
To compile and link: f77 hct.f hc.f hcass.f hcden.f -o hct
To run on spectr.dat (hard-wired): hct
- Hierarchical clustering without storage of dissimilarities (Fortran)
hcon2t.f, driver program
hcon2.f, routines
To compile and link: f77 hcon2t.f hcon2.f hcass.f hcden.f -o hcon2t
To run on spectr.dat (hard-wired): hcon2t
- Linear discriminant analysis (Fortran)
ldat.f, driver program
lda.f, routine
spectr2.dat, input dataset
- Multiple discriminant analysis (Fortran)
mdalum.f, driver program
mda.f, routine
luminosity.dat, input dataset
Additional diagnostics provided for this 3-class
supergiant - giant - dwarf dataset.
- K-nearest neighbors discriminant analysis - 2-class case (Fortran)
knn.f, driver program and routines used
ngc.dat, data - class number, followed by 11
variables, for 500 objects.
The data is a set of 500 objects derived from an HST WF/PC image, in
the context of a study in star/cosmic ray hit discrimination.
To use: copy ngc.dat to ngc2.dat, to provide a second dataset. We
check assignments among the latter, relative to the assignments of objects
in ngc.dat. Hence, evidently for k=1 we must get exactly the correct
assignments in all cases. But for k=2 and higher, we may find
inconsistencies, which point to incorrect original class assignments.
Note: for even k, we allow for no assignment decision, i.e. the case where
there is no clear majority.
Both identical and proportional prior situations are considered (i.e.
ignoring, or taking into account, the relative numbers of class 1 and 2
memberships).
Also we run through all cases from k = 1, 2, ..., 15, which takes some
time.
To compile and link: f77 knn.f -o knn
To run: knn (which takes ngc.dat and ngc2.dat as inputs, and produces
output on screen).
- Correspondence analysis (Fortran)
cat.f, driver program
ca.f, routines
uno.dat, input data (UNO voting)
- Classical or metric multidimensional scaling (Fortran)
cmdst.f, driver program
cmds.f, routines
cities.dat, input dataset
Currently this program generates its own random input data.
- Sammon mapping (Fortran)
sammon.f, program
iris.dat, input dataset (Fisher's iris data)
- Kohonen self-organizing feature map (Fortran)
koh.f, program
iris_norm.dat, input dataset (normalized
Fisher's iris data)
iris.f, for information, program to normalize data
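The voting rule described above for the 2-class k-nearest neighbors program
can be illustrated with a short C sketch. This is not the code in knn.f: the
function name knn2_classify, its argument layout, and the return value 0 for
"no decision" are assumptions made for illustration only. The sketch uses a
plain majority vote with squared Euclidean distances, and leaves out the
proportional-prior weighting.

```c
#include <stddef.h>
#include <stdlib.h>

/* Squared Euclidean distance between two vectors of length nvar. */
static double sq_dist(const double *a, const double *b, size_t nvar)
{
    double s = 0.0;
    for (size_t j = 0; j < nvar; j++) {
        double d = a[j] - b[j];
        s += d * d;
    }
    return s;
}

/*
 * Hypothetical helper (not taken from knn.f).
 * Classify object x against n reference objects stored row by row in
 * train (n x nvar), with class labels 1 or 2 in label.
 * Returns 1 or 2 for a clear majority among the k nearest neighbors,
 * 0 when the vote is tied (only possible for even k),
 * and -1 for invalid k or allocation failure.
 */
int knn2_classify(const double *train, const int *label, size_t n,
                  size_t nvar, const double *x, size_t k)
{
    if (k == 0 || k > n)
        return -1;

    /* The k smallest distances found so far, kept in ascending order. */
    double *best = malloc(k * sizeof *best);
    int *bestlab = malloc(k * sizeof *bestlab);
    if (best == NULL || bestlab == NULL) {
        free(best);
        free(bestlab);
        return -1;
    }

    size_t found = 0;
    for (size_t i = 0; i < n; i++) {
        double d = sq_dist(train + i * nvar, x, nvar);
        if (found == k && d >= best[k - 1])
            continue;               /* not among the k nearest */

        /* Insert into the sorted list, dropping the largest if full. */
        size_t pos = (found < k) ? found : k - 1;
        while (pos > 0 && best[pos - 1] > d) {
            best[pos] = best[pos - 1];
            bestlab[pos] = bestlab[pos - 1];
            pos--;
        }
        best[pos] = d;
        bestlab[pos] = label[i];
        if (found < k)
            found++;
    }

    size_t votes1 = 0;
    for (size_t i = 0; i < k; i++)
        if (bestlab[i] == 1)
            votes1++;
    size_t votes2 = k - votes1;

    free(best);
    free(bestlab);

    if (votes1 > votes2)
        return 1;
    if (votes2 > votes1)
        return 2;
    return 0;                       /* tie: no assignment decision */
}
```

When the object x is itself one of the reference objects, as in the ngc.dat and
ngc2.dat setup, k = 1 returns the object's own label, while larger k can
disagree with it or, for even k, return 0.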
1.5. Where Is This Software Used?
2. Various Other Programs
- Sort routine (Fortran)
sort.f
- Weighted Levenshtein or edit distance between character strings (Fortran)
levenshtein.f
- Errors-in-variables regression, 2-dimensional (Fortran)
leiv1.f, York (Can. J. Phys. 44, 1079-1086, 1966)
leiv2.f, Fasano and Vio
leiv3.f, Ripley
All these programs are set up with sample driver routines and sample data.
They should perform exactly the same task.
- Minimal spanning tree, efficient algorithm in 2 dimensions (Fortran)
mst2d.f, program due to F.J. Rohlf, SUNY
- GMDH - group method for data handling, or Ivakhnenko polynomial
method (Fortran)
gmdh.f, program
gmhd.dat, sample data
See book by Farlow on GMDH.
- Point pattern matching approaches, software, data sets. Preliminary web
page.
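For readers unfamiliar with the weighted Levenshtein or edit distance listed
above, the following C sketch shows the usual dynamic programming recurrence.
It is not the code in levenshtein.f: the function name weighted_levenshtein
and the three cost parameters w_ins, w_del and w_sub are assumptions made for
illustration, and the weighting scheme used in the Fortran routine may differ.

```c
#include <stdlib.h>
#include <string.h>

/* Smallest of three values. */
static double min3(double a, double b, double c)
{
    double m = (a < b) ? a : b;
    return (m < c) ? m : c;
}

/*
 * Hypothetical helper (not taken from levenshtein.f).
 * Minimum total cost of turning string s into string t, where inserting
 * a character costs w_ins, deleting one costs w_del, and substituting
 * one character for a different one costs w_sub.
 * Uses a single row of the dynamic programming table.
 * Returns -1.0 on allocation failure.
 */
double weighted_levenshtein(const char *s, const char *t,
                            double w_ins, double w_del, double w_sub)
{
    size_t n = strlen(s);
    size_t m = strlen(t);

    double *row = malloc((m + 1) * sizeof *row);
    if (row == NULL)
        return -1.0;

    /* Row 0: build the first j characters of t by insertions only. */
    for (size_t j = 0; j <= m; j++)
        row[j] = (double)j * w_ins;

    for (size_t i = 1; i <= n; i++) {
        double diag = row[0];           /* value at (i-1, 0) */
        row[0] = (double)i * w_del;     /* delete first i characters of s */
        for (size_t j = 1; j <= m; j++) {
            double up = row[j];         /* value at (i-1, j) */
            double sub = diag + ((s[i - 1] == t[j - 1]) ? 0.0 : w_sub);
            row[j] = min3(up + w_del,           /* delete s[i-1] */
                          row[j - 1] + w_ins,   /* insert t[j-1] */
                          sub);                 /* match or substitute */
            diag = up;
        }
    }

    double result = row[m];
    free(row);
    return result;
}
```

With all three weights set to 1 this reduces to the ordinary edit distance.
Setting w_sub above w_ins + w_del makes a deletion followed by an insertion
cheaper than a substitution.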
3. Other (Very!) Important Collections of Software
- Statistical codes for astronomy, Penn State University (Eric D Feigelson,
edf@astro.psu.edu).
- Pointers to, and addresses of, lots of
multivariate data analysis code, a text file collected or summarized
by F. Murtagh from mail messages or digest announcements. Look here for
whereabouts of code for: decision trees, clustering (C code),
multidimensional scaling and lots of other psychometric mapping methods,
Voronoi diagrams, etc.
- Statlib, major repository of
statistical software, datasets, and information such as email lists and
organisational addresses, at Carnegie Mellon University (Mike Meyer,
mikem@stat.cmu.edu).
4. Complementary Reading in Connection with These Programs
This is a selection of books and papers by F. Murtagh. Co-authors are
indicated. These are mostly, but not exclusively, astronomy-related texts.
Most of them are also related to the methods implemented in the above programs.
- Multidimensional Clustering Algorithms, Physica-Verlag, Heidelberg,
1985 (ISBN 3 7051 0008 4).
- (With A. Heck) Multivariate Data Analysis, Kluwer, Dordrecht, 1987
(ISBN 90 277 2425 3, ISBN 90 277 2426 1).
- (With C. Jaschek) Eds., Errors, Bias, and
Uncertainties in Astronomy, Cambridge University Press,
1990 (ISBN 0-521-39300-0).
- "La carte auto-organisatrice de Kohonen: extensions et
applications", Proc. Ecole Modulad-ASU, Statistique et Méthodes
Neuronales, Y. Lechevallier et al., Eds., December 1995, in press.
- "Unsupervised catalog classification", in D. Shaw,
H. Payne and J. Hayes, Eds., Astronomical Data Analysis Software
and Systems IV, Astronomical Society of the Pacific, 264-267, 1995.
- (With M. Hernández-Pajares), "The Kohonen self-organizing map
method: an assessment", Journal of Classification, 12, 165-190, 1995.
- (With M. Hernández-Pajares), "How tracer objects can improve
competitive learning algorithms in astronomy", Vistas in Astronomy, 38,
317-330, 1994.
- (With H.-M. Adorf) "Detecting cosmic ray hits on HST
WF/PC images using neural networks and other discriminant analysis
approaches", in V. Di Gesù, L. Scarsi, R. Buccheri, P. Crane,
M.C. Maccarone and H.U. Zimmermann, Eds., Data Analysis in Astronomy IV,
Plenum Press, New York, 103-111, 1992.
- "Cosmic ray discrimination on HST WF/PC images: object
recognition-by-example", in D.M. Worrall, C. Biemesderfer and J. Barnes,
Eds., Astronomical Data Analysis Software and Systems I, 265-273, 1992.
- "Contiguity-constrained clustering for image analysis", Pattern
Recognition Letters, 13, 677-683, 1992.
- "Multivariate analysis and classification of large astronomical
databases (with discussion)", in G.J. Babu and E.D. Feigelson, Eds.,
Statistical Challenges in Modern Astronomy, Springer-Verlag, New York,
449-474, 1992.
- "Multivariate analysis and pattern recognition methods:
a short review and some current directions", in J.D. Barrow,
A.B. Henriques, M.T.V.T. Lago and M.S. Longair, Eds.,
The Physical Universe: The Interface Between Cosmology, Astrophysics
and Particle Physics, Springer-Verlag, 253-264, 1991.
- "Multivariate methods for data analysis", in Aa. Sandqvist
and T.P. Ray, Eds., Central Activity in Galaxies, Springer-Verlag,
Berlin, 209-235, 1993.
- "Multilayer perceptrons for classification and regression", International
Journal of Neurocomputing, 2, 183-197, 1990/1991.
- (With Ph. Nobelis) "Statistical software",
in: C. Jaschek and F. Murtagh, Eds., Errors, Bias, and Uncertainties in
Astronomy, Cambridge University Press, 245-260, 1990.
- "Multivariate analysis methods: background and example",
in: W.C. Seitter, H.W. Duerbeck and M. Tacke, Eds., Large Scale
Structures in the Universe: Observational and Analytical Methods,
Springer-Verlag, Berlin, 308-314, 1988.
- "Image analysis problems in astronomy", in:
V. Cantoni, V. Di Gesù and S. Levialdi, Eds., Image
Analysis and Processing II, Plenum Press, New York, 81-94, 1988.
- "Classification problems in astronomy", in: H.H. Bock
Ed., Classification and Related Methods of Data Analysis,
North-Holland, Amsterdam, 23-32, 1988.
- (With D. Ponz) "Image processing, databases, and
statistical software: the common interface", Statistical Software
Newsletter, 12, 129-132, 1986.
- (With A. Lauberts) "A curve matching problem in
astronomy", Pattern Recognition Letters, 4, 465-469, 1986.
5. Multivariate Data Analysis - Background Reading
The following contains various web resources. Some are really very good.
- Multivariate data analysis and statistical linear models in psychology
research - a course outline. Useful are the following:
- Annotated list of books for multivariate analysis in psychology and social
science. About 19 books. (Some of the best books are characterized as
'not standard', 'dry', 'skimpy' - these are really very good! -
e.g. Gnanadesikan, 1977; Morrison, 1990; Overall and Klett, 1972; Tatsuoka,
1988.)
- Regression references, psychology and social science orientation. Includes
structural equations and path analysis, logistic regression, applications.
- Annotated factor analysis bibliography, over 130 references.
- Multidimensional scaling bibliography, a good list of essential
references.
- Online textbook, accessible from
this page, on: canonical analysis, cluster analysis, classification trees,
correspondence analysis, data mining techniques, discriminant analysis,
factor analysis, general linear models, stepwise regression,
graphical techniques, linear regression, multidimensional scaling, neural
networks, partial least squares, survival analysis, time series forecasting,
and lots more. Wow! Succinct description, large set of references.
(Often a plug for the package Statistica from StatSoft - but software
manuals are often a very good source of background information.)
- Matrix reference manual. Reference information about linear algebra and
the properties of matrices. Mike Brookes, Imperial College of Science,
Technology and Medicine, Electrical Engineering Department, Exhibition
Road, London SW7 2BT, England. Quite a good handy reference for various
types of matrices, and linear algebra operations.
- Correspondence analysis - an online text.
Analyse des correspondances - the same text in French. Comprehensive,
but uses an awful typeface.
- Ordination methods for ecologists. Ordination methods are what are called
scaling methods in the social sciences, or dimensionality reduction methods.
A bibliography: An annotated bibliography of canonical correspondence analysis
and related constrained ordination methods 1986-1996.
- Mixture modeling page. Mixture modeling is cluster analysis, assuming the
clusters consist of a possibly superimposed mixture of Gaussian or other
parametric forms. The job then is to estimate the parameters of the
distribution. This page contains various other links also. Very good stuff.
6. Datasets
Datasets can be revealing... They can illustrate particular classes of
method.
Email author: fmurtagh@astro.u-strasbg.fr.
Last update: 2001-May-24.
The address of this page is: http://astro.u-strasbg.fr/~fmurtagh/mda-sw