First, though, we discuss past work, which leads to a range of conclusions on how the type of data and problem domain need to be matched with an appropriate pattern recognition or signal processing analysis method.
The first problem was to forecast ambient temperature, so that the telescope enclosure could be kept at a constant temperature by feedback to heating or, more likely, cooling systems. Our results, based on simple nearest neighbor prediction together with an exogenous variable (pressure, related to changes in meteorological frontal systems), were very good.
The second problem was exceptionally difficult. We sought to nowcast astronomical "seeing" (observational quality, measured as the size of the blur produced by a distant star with no diffraction spikes) using a range of ambient environmental measures. The difficulty of the problem arose, firstly, from data often being missing, due to instrumental down-times. Appropriate time and spatial resolution scales for the analysis were proposed by domain experts (Marc Sarazin's role is to test all aspects of candidate sites for astronomical instruments). A much greater difficulty was whether there was any dependence at all between seeing and ambient conditions. In fact, we also carried out in-depth analysis of radiosonde balloon-borne ascent data, to no avail. The radiosonde data suffered enormously from two problems: the balloon is driven by wind, and so does not trace out a vertical column; and the height reached before the balloon bursts and the data feed stops varies. (My own view is that the only data really worth having, in order to analyze vertical air columns, would come from atmospheric tomography based, for instance, on GPS satellites and ground stations. But this won't work, because astronomical observatories are located in parts of the world with minimal GPS ground station coverage.) At any rate, the overall objective was to predict seeing so that, if more mediocre conditions were expected, lower quality observing tasks could be scheduled and appropriate instrumentation set up. The abysmally bad quality of the data meant that we restricted most of our work to nowcasting.
Motivated by our use of nearest neighbor prediction in the temperature studies, in the seeing work we used a more sophisticated methodology to seek patterns and dependencies in the past. We used correspondence analysis (see, e.g., Murtagh and Heck, 1985, and many other works), using fuzzy coding of data, and developed a possibility theory framework for the interpretation of the results. The motivation for a possibility theory framework was the following: even if the dependencies were patently very weak overall, could we at least find some general dependencies, in particular past time periods, in which good seeing was associated with simultaneously low relative humidity, high pressure, and low temperature (for example)? The results were of interest - we found potentially useful patterns in the historic data - but the study design did not allow for any thoughts of operational use.
When univariate or multivariate time series could be assembled, we used a dynamic (i.e. unfolding in time) and recurrent (for very high order modeling) trainable neural network (DRNN) for modeling and prediction. The big advantage of the DRNN lies in the possibility of finding extremely parsimonious connectionist models. Thus, a network model with 4 nodes could potentially replace a far larger multilayer perceptron, along with all the headaches associated with the latter's architecture and training algorithms. A by-product of a highly parsimonious network is the possibility of modeling chaotic behavior simply (and chaotic behavior was surmised for astronomical seeing), i.e. finding a simple chaotic relationship.
Work on this theme is described in Aussem et al. (1994, 1995, 1996), Murtagh et al. (1995), and Murtagh and Sarazin (1997).
While predictions were carried out in these JCIF papers, the major interest was in feature selection. Preprocessing using wavelet transforms was used for this. The reason for choosing wavelet transforms is simply the following: environmental and financial phenomena are taken as the superimposition of a range of (scale- and frequency-dependent) components. Many different wavelet transforms can be used, or for that matter non-wavelet multiscale transforms. In the following sections, we will look at why we use particular transforms in preference to others. Our choice of methods differs from the work of others (e.g. the limited options supported by Cornice Research), and many alternatives were also investigated.
Our approaches to feature selection have also been innovative, with a range of approaches appraised in Aussem et al. (1998) and Zheng et al. (1999). Noise modeling of signals of widely varying origin is something which MR/1 and MR/2 handle in a unique way. The entropy-based framework for noise-finding and removal, described in Starck et al. (1998) and Starck and Murtagh (1999, 2000), is also highly innovative, and we will describe the most salient aspects of this work below.
Our requirement for handling time-varying data signals is as follows. Consider a financial index (for example), with 1000 values. We wish to know one or more subsequent values. So to help our extrapolation into the future, we carry out a wavelet transform on our values x(1) to x(1000), and keep the last values of the wavelet coefficients, i.e. those at time-point t=1000. These are the most critical values for our prediction.
At time-point t=1001, do the same, keeping just the last set of wavelet coefficients corresponding to this time step, t=1001. Continue doing this up to t=2000. Good. This is precisely how we must carry out the wavelet transform in real life. We have data up to time t=1000, so we can work on this. Time t=1001 comes, and so we can now work on that. And so on.
But is this a wavelet transform? Alas, no, for most wavelet functions. To see this, compare what we have obtained with a wavelet transform carried out on the full data set x(1) to x(2000): the two are not the same. So our sliding time window approach, which is our only option when facing a real data stream, produces a nice-looking but irrelevant result.
Figure 1: Our problem: shown is the last smoothed version of the data (Dow Jones Industrial Averages) with an à trous wavelet transform, and the corresponding curve from the "sliding time window" approach starting from the 500th data value onwards.
Figure 2: Again the same problem: shown is the 5th scale, from a 6-scale analysis, using the à trous wavelet transform on the DJIA data, full data set and "sliding time window" starting from the 500th data value onwards.
In Figs. 1 and 2, the scaling (or low pass, or smoothing) function used is the B3 spline. In the à trous algorithm, this yields a wavelet function which is quite similar to a Mexican hat function or the Morlet wavelet function.
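For concreteness, here is a minimal IDL sketch of this transform (our own illustration with a hypothetical function name; it is not the MR/1 mr1d_trans code). The B3 spline gives the low pass kernel (1/16, 1/4, 3/8, 1/4, 1/16), applied with gaps ("holes") of 2^j at scale j; each wavelet scale is the difference between successive smoothings, so the scales plus the last smoothed version sum back to the input. The mirror-reflection boundary handling is an assumption of this sketch.

; b3_atrous: B3 spline a trous wavelet transform (illustrative sketch only).
; Returns an (n, Nscale+1) array: Nscale wavelet scales followed by the last
; smoothed version; the columns sum back to the input signal.
; Boundaries handled by mirror reflection; assumes 2^Nscale < n.
function b3_atrous, x, Nscale
   n = n_elements(x)
   w = fltarr(n, Nscale+1)
   c = float(x)
   h = [1., 4., 6., 4., 1.] / 16.          ; B3 spline low pass kernel
   for j = 0, Nscale-1 do begin
      step = 2L^j                           ; the "holes" at scale j+1
      cnext = fltarr(n)
      for k = 0L, n-1 do begin
         km2 = abs(k - 2*step) & if km2 GT n-1 then km2 = 2*(n-1) - km2
         km1 = abs(k - step)   & if km1 GT n-1 then km1 = 2*(n-1) - km1
         kp1 = k + step        & if kp1 GT n-1 then kp1 = 2*(n-1) - kp1
         kp2 = k + 2*step      & if kp2 GT n-1 then kp2 = 2*(n-1) - kp2
         cnext(k) = h(0)*c(km2) + h(1)*c(km1) + h(2)*c(k) + h(3)*c(kp1) + h(4)*c(kp2)
      endfor
      w(*, j) = c - cnext                   ; wavelet scale j+1
      c = cnext
   endfor
   w(*, Nscale) = c                         ; last smoothed version
   return, w
end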
Using the full data set leads to an unhistorical output (which is fine if we are not interested in the time evolution of our signal). Using the real-time analysis approach, the "sliding time window" leads to... who knows what?
In Zheng et al. (1999) we developed a new transform termed the "Haar à trous" wavelet transform. The name reflects, firstly, the fact that the wavelet function and scaling function used are just those of the Haar wavelet transform (respectively, a step function and a box function), which goes back to work by Haar in the early part of the 20th century. Secondly, the new transform is redundant, and is implemented using the à trous algorithm. It does a fine job for us in respecting the all-important end of the signal: a real-time implementation leads to the same result as directly applying the transform to all of the data.
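The recursion is simple enough to sketch here (again with a hypothetical function name; this is our illustration, not the MR/1 code). The smoothing at scale j averages the current value with the value 2^j steps in the past, and the wavelet coefficient is the difference between successive smoothings; since only present and past values enter, the coefficients at the end of the signal never depend on future values.

; haar_atrous: redundant Haar ("Haar a trous") wavelet transform, sketch only.
; Returns an (n, Nscale+1) array: Nscale wavelet scales followed by the last
; smoothed version; the columns sum back to the input signal.
function haar_atrous, x, Nscale
   n = n_elements(x)
   w = fltarr(n, Nscale+1)
   c = float(x)
   for j = 0, Nscale-1 do begin
      step = 2L^j
      cnext = fltarr(n)
      for k = 0L, n-1 do begin
         kp = k - step
         if kp LT 0 then kp = 0             ; pad at the start of the signal
         cnext(k) = 0.5 * (c(k) + c(kp))    ; causal smoothing: present and past only
      endfor
      w(*, j) = c - cnext                   ; wavelet scale j+1
      c = cnext
   endfor
   w(*, Nscale) = c                         ; last smoothed version
   return, w
end

With this sketch, the coefficients obtained at time-point t from x(1) to x(t) are identical to those obtained at the same time-point from the full data set, which is the property exploited above.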
Figure 3: Analogous to Fig. 1, only this time using the Haar à trous wavelet transform. Beyond time-point 500, the curves are the same.
Figure 4: Analogous to Fig. 2, only this time using the Haar à trous wavelet transform. Beyond time-point 500, the curves are the same.
Figure 5: DJIA data used.
The set of resolution levels produced by the Haar à trous wavelet transform is shown in Fig. 6, and the corresponding resolution scales for the B3 spline à trous wavelet transform are shown in Fig. 7. As we have seen, the former (Fig. 6) is consistent with a real-time implementation. The latter (Fig. 7) is useful for an unhistoric analysis, since it is symmetric, unlike the more jagged Haar à trous transform.
Figure 6: Haar à trous wavelet transform resolution levels.
Figure 7: B3 spline à trous wavelet transform resolution levels.

Let us look at whether there is a lag bias introduced by the wavelet transform. Fig. 8 shows the last smoothed scale overplotted on the input data, in the case of the Haar à trous wavelet transform. Fig. 9 shows the corresponding last smoothed scale overplotted on the input data in the case of the ("unhistorical") B3 spline à trous wavelet transform.
Figure 8: Original data, with Haar à trous wavelet transform last smoothed scale overplotted.
Figure 9: Original data, with B3 spline à trous ("unhistoric") wavelet transform last smoothed scale overplotted.
Fig. 8 looks like it tracks the input data less well than the symmetric wavelet used in Fig. 9. This is confirmed by calculating the totaled squared discrepancy between the two curves in both cases. But wait! If we look at more wavelet scales, which together provide a better approximation of our original data, then we find a better result. Fig. 10 shows this.
Figure 10: Original data, with Haar à trous wavelet transform fit overplotted. The latter consists of the last smoothed scale, plus the next most smooth wavelet scale (the 5th), plus the next (the 4th).
From this we conclude that a good fit to the data is possible, on condition that only the highest frequency wavelet scales are removed. This, therefore, is one approach to filtering.
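Using the haar_atrous sketch given earlier, and the DJIA data file already referred to below, the fit of Fig. 10 and the discrepancy measure used to compare Figs. 8 and 9 can be obtained along the following lines (an illustration only, not the exact procedure used for the figures).

x = readfits('djindus.fits')
w = haar_atrous(x, 5)            ; 5 wavelet scales + last smoothed version (6 bands)
fit = w(*,5) + w(*,4) + w(*,3)   ; smooth + 5th + 4th wavelet scales, as in Fig. 10
plot, x
oplot, fit
print, total((x - fit)^2)        ; totaled squared discrepancy between the curves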
As a parting remark, we note also that if smoothing (implying excellent fit to the data) is what the user wants, then attention should be paid to the millions of person-years of work in this area in the statistical literature.
Traditionally, information and entropy are determined from events and the probability of their occurrence. Our innovation is to claim that in real life we must always take signal (the substance of what we are dealing with) and noise into account. Signal and noise are basic building-blocks of the observed data. Instead of the probability of an event, we are led to consider the probabilities of our data being either signal or noise. We therefore proposed (Starck et al., 1998; Starck and Murtagh, 1999; Starck et al., 2000) a new definition of entropy, satisfying a number of desirable properties.
For Gaussian noise we continue in this direction, using Gaussian probability distributions, and find that the entropy is H = the sum over all scales j and all positions k of w_j(k)^2 / (2 sigma_j^2), i.e. each wavelet coefficient squared, divided by twice the variance of the noise at its scale. Here sigma_j, the standard deviation at scale j, is the (Gaussian) measure of the noise. We see that the information is proportional to the energy of the wavelet coefficients. The higher a wavelet coefficient, the lower its probability, and the higher the information furnished by this coefficient.
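As a rough illustration (using the haar_atrous sketch above, and a crude per-scale noise estimate based on the median absolute coefficient; the MR/1 noise modeling is considerably more careful than this), the entropy can be computed as follows.

x = readfits('djindus.fits')
w = haar_atrous(x, 5)
H = 0.
for j = 0, 4 do begin                     ; wavelet scales only, not the smooth scale
   sigj = median(abs(w(*,j))) / 0.6745    ; crude Gaussian noise estimate for scale j+1
   H = H + total(w(*,j)^2) / (2. * sigj^2)
endfor
print, 'multiscale entropy H = ', H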
Our entropy definition is completely dependent on the noise modeling. If we consider a signal S, and assume that the noise is Gaussian with standard deviation sigma, we will not measure the same information as when we assume that the noise has a different standard deviation, or follows a different distribution.
Returning to our example of a signal of substantive value, and a scrambled version of it, we can plot an information versus scale curve (e.g. log(entropy) at each scale, using the above definition, versus the multiresolution scale). For the scrambled signal, the curve is flat. For the original signal, it increases with scale. We can use such an entropy versus scale plot to investigate differences between encrypted and unencrypted signals, to study typical versus atypical cases, and to pick out atypical or interesting signals.
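A sketch of such a plot, for the data and for a randomly permuted (scrambled) copy of it, again using the hypothetical haar_atrous function and the same crude noise estimate as above:

x  = readfits('djindus.fits')
xs = x(sort(randomu(seed, n_elements(x))))    ; scrambled copy of the data
w  = haar_atrous(x, 5)
ws = haar_atrous(xs, 5)
Hj = fltarr(5)  &  Hjs = fltarr(5)
for j = 0, 4 do begin
   Hj(j)  = total(w(*,j)^2)  / (2. * (median(abs(w(*,j)))  / 0.6745)^2)
   Hjs(j) = total(ws(*,j)^2) / (2. * (median(abs(ws(*,j))) / 0.6745)^2)
endfor
plot,  indgen(5)+1, alog(Hj),  xtitle='Scale', ytitle='log(entropy)'
oplot, indgen(5)+1, alog(Hjs), linestyle=2    ; scrambled signal: near-flat curve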
In filtering, two aspects are perhaps paramount: firstly, a very good fit between input data and filtered data, and secondly the removal of as much noise as possible. The first criterion can be expressed as minimizing the entropy of the difference of input data and filtered data. The second criterion can be expressed as minimizing the entropy of the filtered data. (Why? Because noise-free data has low entropy.) So far so good. The next step makes our job less easy. We may well want to trade off quality in the first of these two criteria relative to the second, or vice versa. The correct balance between the criteria is not clear cut.
In Starck and Murtagh (1999) a whole range of novel approaches to handling this trade-off was discussed. One of the most esthetic is the following. Let us vary our trade-off parameter such that the final value of criterion 1 (remember: the entropy of the difference between input data and filtered data) corresponds faithfully to the noise in our data. By ensuring this consistency, which amounts to a constraint on the optimization problem we have to solve, we formulate the problem as a "closed" one, without recourse to any user-specified threshold.
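In symbols (our paraphrase of the preceding two paragraphs, not a quotation of the functional in the cited work): with $s$ the input data, $\tilde{s}$ the filtered data, and $\alpha > 0$ the trade-off parameter,

\min_{\tilde{s}} \; J(\tilde{s}) \;=\; H(s - \tilde{s}) \;+\; \alpha \, H(\tilde{s}),

where $\alpha$ is then adjusted until the achieved value of $H(s - \tilde{s})$ matches the entropy expected from the noise alone.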
How do we know what the noise is in our data? We may know our data well enough to answer this. (This will certainly apply in the case of instrumentation in the physical sciences.) Starck and Murtagh (1998) provide answers in regard to automatic procedures for noise estimation. Finally, we can visually and otherwise validate our results (e.g., to begin with, by plotting the difference between filtered data and original data).
As mentioned, to filter noise we really must take multiple resolution levels into account. But precisely how we do this is not critical: we want some good decomposition of the data, and we want to be able to recreate the data following noise removal. We can be as "unhistoric" as we wish. The symmetric B3 spline à trous transform performs very well here. Any asymmetry in the wavelet function will serve to emphasize discontinuities, which we believe is not desirable in this instance. We use a noise model which takes the noise as non-stationary (hence a heteroskedastic model) and additive. Fig. 11 shows, left, from top to bottom, the input data, the filtered data, and both overlaid. Fig. 11, right, shows 100 values close-up.
Figure 11: Left, from top to bottom - original, filtered and both superimposed. Right, close-up of time-points 1000 to 1100.
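A much simplified stand-in for this filtering step is sketched below: hard 3-sigma thresholding of the B3 spline à trous coefficients, with the same crude stationary per-scale noise estimate as before. This is not the multiscale entropy filtering performed by mw1d_filter, nor its non-stationary noise model; it only illustrates the multiresolution mechanics.

x  = readfits('djindus.fits')
w  = b3_atrous(x, 5)
xf = w(*,5)                                  ; start from the last smoothed version
for j = 0, 4 do begin
   sigj = median(abs(w(*,j))) / 0.6745       ; crude noise estimate for scale j+1
   wj   = w(*,j)
   keep = where(abs(wj) GE 3.*sigj, nkeep)   ; retain only significant coefficients
   wjf  = fltarr(n_elements(x))
   if nkeep GT 0 then wjf(keep) = wj(keep)
   xf = xf + wjf
endfor
plot, x  &  oplot, xf                        ; compare with Fig. 11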
[To be continued.]
The programs mr1d_trans and mw1d_filter were used.
For filtering, we ran:
mw1d_filter -m 5 djindus.fits djiaout.fits
Plots were produced in IDL.
For the "shifting window" analysis, we ran
mr1d_transrepeatedly in IDL:
; .run mr1d_trans
; .run delete
; a0 = readfits('djindus.fits')
; .run run_mr1d_trans_seq
; run_mr1d_trans_seq, a0
; a0run = fltarr(1317, 6)
; run_mr1d_trans_seq, a0, a0run

PRO run_mr1d_trans_seq, a0, a0run
   n = n_elements(a0)
   n0 = 500
   for i = n0-1, n-1 do begin
      mr1d_trans, a0(0:i), a0out, OPT="-t 5 -n 6"
      for j = 0, 5 do begin
         a0run(i,j) = a0out.coef(i,j)
      endfor
   endfor
   return
end
p = mr1d_predict(a(0:999), npredict=20, ncoef=2, LoScale=3)
plot, a(1000:1019)
oplot, p

Note: use ncoef = 2 or 3, and LoScale = 3, or better 4 or 5 (for NScale = 5).
;+
; NAME:
;       MR1D_PREDICT
;
; PURPOSE:
;       Considering a temporal signal S with n measures (S[0..n-1]),
;       MR1D_PREDICT estimates (or predicts) the next values S[n .. n+dt].
;       A multiscale transformation is first applied, followed by filtering
;       in the wavelet space. At each scale, a predictor is then applied,
;       and the predicted signal is obtained from the reconstruction of
;       the predicted signal.
;
; CALLING:
;       Result = MR1D_PREDICT(Signal, wave=wave, PredWave=PredWave,
;                             NPredict=NPredict, Nscale=Nscale,
;                             OPT=OPT, NCoef=NCoef, LoScale=LoScale)
;
; INPUT:
;       Signal: 1D array; input signal
;
; KEYWORDS:
;       wave: 2D array; filtered wavelet coefficients
;
;       PredWave: 2D array; predicted wavelet coefficients
;
;       NPredict: number of values to predict. Default is one.
;
;       Nscale: number of scales used for the wavelet transform
;
;       NCoef: number of coefficients used for the prediction. Default is 3.
;
;       OPT: string which contains the different options accepted by the
;            mw1d_filter C++ program.
;
;       LoScale: lowest scale (i.e. highest frequency) to start from
;
; OUTPUTS:
;       Result: IDL 1D array of size equal to NPredict. Predicted values.
;
; EXTERNAL CALLS:
;       mw1d_filter (C++ program)
;
; EXAMPLE:
;       > p = mr1d_predict(S)
;       > print, p
;       calculate the first predicted value following S
;
;       > p = mr1d_predict(S, NPREDICT=3, NCOEF=5)
;       calculate the first three values following S, using 5 wavelet
;       coefficients at each scale for the prediction.
;
;       > p = mr1d_predict(s(0:499), NPredict=100, Ncoef=4, Nscale=6, LoScale=5)
;       predict 100 values (not a good idea with such a polynomial fitting
;       method!), based on 4 wavelet coefficients, and using only the
;       last (smoothest) wavelet scale and the smoothest (residual) scale
;       (scales 5 and 6, indicated by LoScale=5). Hardly surprisingly
;       for polynomial fitting, we find that the polynomial order (Ncoef-1)
;       should be small, and LoScale should be high - thereby using only
;       a highly smoothed version of the data.
;
; HISTORY:
;       Written: Jean-Luc Starck 1998.
;       November, 1998 File creation
;       Updated, November 1999, to include LoScale, F Murtagh
;-
function mr1d_predict, Signal, wave=wave, PredWave=PredWave, $
                       NPredict=NPredict, Nscale=Nscale, OPT=OPT, $
                       NCoef=NCoef, LoScale=LoScale

Rec = -1
if N_PARAMS() LT 1 then begin
   print, 'CALL SEQUENCE: Pred = mr1d_predict(Signal, NPredict=NPredict, OPT=Opt, NCoef=NCoef, LoScale=LoScale)'
   goto, DONE
end

if not keyword_set(NPredict) then NPredict = 1
if not keyword_set(NCoef) then NCoef = 3
if not keyword_set(Nscale) then Nscale = 5
if not keyword_set(LoScale) then LoScale = 1
if LoScale GT Nscale then begin
   print, 'LoScale cannot be greater than the number of scales'
   goto, DONE
end

; Nx = (size(Signal))[1]
sss = size(Signal)
Nx = sss(1)
print, 'npix = ', nx

OPT_SCALE = '-n ' + strcompress(string(Nscale), /REMOVE_ALL)
OPT_TRANS = '-t 5 ' + OPT_SCALE
NameSig = 'xx_signal.fits'
NameOut = 'xx_out.fits'
NameMRFILTER = 'xx_mrfilter.fits'
writefits, NameSig, Signal

if not keyword_set(OPT) then OPT = " "
OPT = OPT + OPT_SCALE + " -w " + NameMRFILTER + " "
com = 'mw1d_filter ' + OPT + NameSig + ' ' + NameOut
; print, com
spawn, com

Filter = readfits(NameOut, /silent)
Wave = readfits(NameMRFILTER, /silent)
; help, wave

ResultPred = dblarr(NPredict)
ResultPredErr = dblarr(NPredict)
TabWavePred = dblarr(NPredict, Nscale)
TabWavePredError = dblarr(NPredict, Nscale)
TabScaleCoefX = double(indgen(NCoef))
TabScaleCoefY = dblarr(NCoef)
TabScalePredX = double(indgen(NPredict) + NCoef)
; print, TabScaleCoefX
; print, TabScalePredX
TabScalePredY = dblarr(NPredict)
Step = 1

for i = LoScale-1, Nscale-1 do BEGIN
   for j = 0, NCoef-1 do BEGIN
      Pos = nx-1-j*Step
      if (Pos LT 0) then Pos = 0
      TabScaleCoefY(NCoef-j-1) = Wave(Pos, i)
   END
   for j = 0, NPredict-1 do BEGIN
      ; interpolation of the data:
      ; TabScaleCoefX = positions of the known data
      ; TabScaleCoefY = values of the coefficients
      ; TabScalePredX = positions of the values to predict
      ; TabScalePredY = interpolate(TabScaleCoefX, TabScaleCoefY, TabScalePredX)
      Tx = TabScaleCoefX
      Ty = TabScaleCoefY
      Px = TabScalePredX
      polint, TabScaleCoefX, TabScaleCoefY, TabScalePredX(j), p, pe
      ; print, 'Pos = ', TabScaleCoefX
      ; print, 'Val = ', TabScaleCoefY
      ; print, 'P = ', p
      TabWavePred(j,i) = p
      TabWavePredError(j,i) = pe
      ; print, ' Scale, Predicted seqno., pred. error: ', i, j, pe
   END
   Step = 2*Step
END

; reconstruct the predicted signal
mr1d_paverec, TabWavePred, ResultPred
PredWave = TabWavePred

; reconstruct the predicted error
mr1d_paverec, TabWavePredError, ResultPredErr

; Rec = fltarr(Nx + NPredict)
; Rec(0:nx-1) = Filter
; Rec(nx:nx+NPredict-1) = ResultPred
Rec = ResultPred

DONE: return, rec
end