Step 1: Model Construction

Of Predictors and Predictands

To construct a model using any of the regression tools available in CPT (PCR, CCA or MLR), CPT requires a set of training data. This training data comprises two types of data: the predictor (the variable used to predict) and the predictand (the variable being predicted). For example, a predictor can be the sea-surface temperature (SST) at a particular place or region, while a predictand can be the rainfall rate at a particular station or grid. So, assuming we have a 30-year historical training data set of predictors and predictands, we can use the regression techniques to construct a model that forecasts the future rainfall rate from the expected value of future SST, for example.

Simple Linear Regression

Consider the simple case where we have only one predictor for a single predictand. To simplify this discussion, assume that the rainfall over a grid covering Singapore (call this the 'y' predictand) is affected only by the sea-surface temperature at a point somewhere in the South China Sea (call this the 'x' predictor), and that everything else (wind, moisture, terrain) has a negligible effect on the rainfall rate (of course, we know that this is very far from reality).

So we have a historical data set comprising the average SSTs and the corresponding average daily rainfall rates for the month of June from 1970 up to 2009, which comes to a total of 40 SST-rainfall data pairs. We use this data, together with the simple linear regression technique, to come up with a line-of-best-fit model of the form:

y = ax + b

where 'a' is the coefficient that multiplies 'x' (the slope of the line) and 'b' is a constant (the intercept); both are determined from the training data. Once 'a' and 'b' have been determined, forecasting the average rainfall for that grid over Singapore for June 2011 is easy if you have the expected value of SST for June 2011 (the future SST could be obtained from dynamical model output, for example). When the predictor values used to develop the model are themselves taken from dynamical model output rather than from observations, the technique is called Model Output Statistics (MOS); when the model is developed from observed predictors and only then applied to model output, it is called Perfect Prognosis. The diagram below is an example scatter plot of 200 hypothetical data points, showing how the line of best fit (in the form y = ax + b) might look for this particular training data set.

(Source: http://en.wikipedia.org/wiki/File:Linear_regression.png)
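
The fitting and forecasting steps described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how CPT itself is implemented: the SST and rainfall values are randomly generated stand-ins for real observations, and the names sst, rain and forecast_rainfall are made up for this example.

```python
import numpy as np

# Hypothetical training data (made-up numbers, not real observations):
# June-mean SST in deg C (predictor x) and June-mean rainfall in mm/day (predictand y),
# one pair per year for 40 years.
rng = np.random.default_rng(0)
sst = rng.uniform(28.0, 30.5, size=40)
rain = 2.0 * sst - 50.0 + rng.normal(0.0, 0.5, size=40)

# Least-squares fit of y = a*x + b.
# np.polyfit returns the coefficients with the highest degree first: (a, b).
a, b = np.polyfit(sst, rain, deg=1)

def forecast_rainfall(expected_sst):
    """Apply the fitted line to an expected future SST value."""
    return a * expected_sst + b

print(f"a = {a:.2f}, b = {b:.2f}")
print(f"forecast for an expected SST of 29.5: {forecast_rainfall(29.5):.2f} mm/day")
```

Because the synthetic data were generated around the line y = 2x - 50, the fitted 'a' and 'b' should come out close to those values, with the difference caused by the added noise.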

Complex Multiple Linear Regression

If rainfall were that simple, needing only one predictor, the simple linear regression model alone would be enough to make a reasonable forecast of it. But it isn't. Often, the weather is determined by a number of elements, such as winds, moisture and temperature at different levels, and by the interactions between them. It would therefore be more realistic to consider a set of predictors instead of just one. So instead of considering, say, just the SST at one grid point, we could consider SSTs at many grid points over the South China Sea. We are assuming that these SSTs could together influence, to different degrees, the rainfall over Singapore. The linear regression model with multiple predictors would then look like this:

y = a1x1 + a2x2 + a3x3 + ... + anxn + b

To find the "line-of-best-fit" for this type of problem requires the use of the multiple linear regression technique, which is simply a mathematical extension of the simple linear regression technique to many variables. But simply finding the line-of-best-fit for a problem as complex as rainfall prediction may cause instability due to possible collinearities (strong correlations) between the many predictors, for example between SSTs at neighbouring grid points. If collinearities exist, the models can be very sensitive to noise, and forecast quality may drop as a result.
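
The sensitivity to noise can be illustrated with a small sketch. It assumes three hypothetical SST predictors, two of which are almost identical, and fits the multiple linear regression with NumPy's least-squares solver. All data are synthetic and the helper name fit_mlr is made up for this example; it is not part of CPT.

```python
import numpy as np

rng = np.random.default_rng(1)
n_years = 40

# Hypothetical predictors: SST at three grid points.
# x2 is almost a copy of x1, which is the collinearity situation described above.
x1 = rng.normal(29.0, 0.5, n_years)
x2 = x1 + rng.normal(0.0, 0.01, n_years)
x3 = rng.normal(28.5, 0.5, n_years)
X = np.column_stack([x1, x2, x3])
y = 1.0 * x1 + 1.0 * x2 + 0.5 * x3 - 60.0 + rng.normal(0.0, 0.3, n_years)

def fit_mlr(X, y):
    """Least-squares estimate of (a1, ..., an, b) in y = a1*x1 + ... + an*xn + b."""
    design = np.column_stack([X, np.ones(len(y))])  # last column carries the constant b
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs

coeffs = fit_mlr(X, y)
# Refit after adding a little extra noise to the predictand
# to see how stable the individual coefficients are.
coeffs_perturbed = fit_mlr(X, y + rng.normal(0.0, 0.05, n_years))

print("a1, a2, a3, b           :", np.round(coeffs, 2))
print("after small perturbation:", np.round(coeffs_perturbed, 2))
print("correlation(x1, x2)     :", round(np.corrcoef(x1, x2)[0, 1], 4))
```

With predictors this strongly correlated, the individual coefficients a1 and a2 can shift noticeably between the two fits, while their sum stays comparatively stable. The data cannot tell the two predictors apart, so the split between them is largely decided by noise.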

Principal Components Regression

A type of multiple linear regression technique that takes care of the problem of collinearity is Principal Components Regression (PCR). This technique makes use of Principal Components Analysis, which selects a new set of axes (components) for the data. The components are selected in decreasing order of variance, and they are perpendicular to one another. The "perpendicular" condition ensures that the components selected are uncorrelated, and the "decreasing order of variance" condition ensures that only the components that matter most are selected, as the rest could just be contributing noise. The image below shows another set of hypothetical data points that consist of predictor-predictand pairs. The longer arrow shows the direction of the component with the largest variability, while the shorter arrow shows the direction of the component with the next largest variability. Thus, if these 2 components are enough to explain most of the variability in the spread of the data, there is no need to consider subsequent components, as there is a good chance that the remaining variability just represents noise.

(Source: http://en.wikipedia.org/wiki/File:GaussianScatterPCA.png)
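
A minimal sketch of principal components regression, as the technique is commonly formulated, is given here: the principal components are computed from the predictors only, the predictand is regressed on the leading components, and the result is converted back to coefficients for the original predictors. The function name fit_pcr and the synthetic data are assumptions made for illustration; CPT's own implementation may differ in its details.

```python
import numpy as np

def fit_pcr(X, y, n_components):
    """PCA on the predictors, then least squares on the leading component scores."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:n_components].T              # shape (n_predictors, n_components)
    scores = Xc @ V_k                      # mutually uncorrelated component series
    gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    a = V_k @ gamma                        # coefficients for the original predictors
    b = y_mean - x_mean @ a                # constant term
    explained = s**2 / np.sum(s**2)        # fraction of variance per component
    return a, b, explained

# Synthetic predictors: x2 is nearly a copy of x1 (collinear), x3 is independent.
rng = np.random.default_rng(1)
n_years = 40
x1 = rng.normal(29.0, 0.5, n_years)
x2 = x1 + rng.normal(0.0, 0.01, n_years)
x3 = rng.normal(28.5, 0.5, n_years)
X = np.column_stack([x1, x2, x3])
y = 1.0 * x1 + 1.0 * x2 + 0.5 * x3 - 60.0 + rng.normal(0.0, 0.3, n_years)

# Keep the two leading components; the third carries almost no variance.
a, b, explained = fit_pcr(X, y, n_components=2)
print("explained variance fractions:", np.round(explained, 4))
print("a1, a2, a3:", np.round(a, 2), " b:", round(b, 2))
```

Dropping the low-variance component removes the direction along which x1 and x2 differ, so the fit no longer tries to separate two nearly identical predictors, and a1 and a2 come out almost equal. How many components to keep is a modelling choice; two is used here only because the synthetic data were built that way.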


©2007 National Environment Agency