The linear model y = Xb + f is often used in chemometrics. Unfortunately, we have put too much emphasis on the regression coefficient profile b, and have mistakenly assumed that it should be a good estimate of the "pure profile". This is not true, and I will argue why in later editorials. This time, however, I want to address an earlier question that was posed about the regression coefficient profile.
The regression coefficient profile b (see the equation above) changes with each PLS component added to the model. Is it then possible to use that fact when selecting the number of PLS components in the model? This question was posed by Yang Kui.
Fact: The regression coefficient vector b predicts y from the modelled variation in the X matrix.
1.) First and foremost, to predict y from X successfully, all systematic variation in X related to y needs to be captured in the PLS model (otherwise predictive ability is lost).
2.) BUT the coefficient vector also needs to "cancel out" (i.e. be orthogonal to) the y-orthogonal variation in X that has been captured by the PLS components; otherwise prediction will not succeed. In other words, the regression coefficient vector is contravariant (an often forgotten property).
Property 2.) is the main reason for the "erratic" behaviour of the regression coefficient profile. When components are added to the PLS model, more variation in X is captured that is not relevant for y (y-orthogonal), and this in turn creates a more "erratic" coefficient profile, because the profile is modified (rotated to "cancel out" the y-orthogonal variation) in order to maintain predictive ability.
Now, can one use the "erratic" behaviour of the coefficient profile to determine the number of PLS components? Not really, unless you have good knowledge of the system you are modelling. The reason is that the coefficient profile is rotated and made orthogonal not only to the "random noise" but also to all variation in your PLS model that is y-orthogonal. That latter part will always be present in your PLS model (e.g. baseline in NIR spectroscopy, unknown analytes, etc.), unless you have designed your experiments well. The resulting shape of the coefficient profile is not easily understood and hence cannot simply be used for component selection.
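To make the rotation concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a reference implementation: the data are hypothetical and noise-free, built as X = c k' + d g' with an assumed analyte profile k, an assumed flat baseline g, and baseline amounts d chosen orthogonal to y = c. The helper `pls1_coefficients` is a name invented for this example; it runs a textbook NIPALS PLS1 without centering or scaling.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def pls1_coefficients(X, y, n_components):
    """Regression vector b of a PLS1 (NIPALS) model, no centering or scaling."""
    n_vars = len(X[0])
    E = [row[:] for row in X]      # X residual, deflated after each component
    f = list(y)                    # y residual
    R, P = [], []                  # rotated weights r_a and X loadings p_a
    b = [0.0] * n_vars
    for _ in range(n_components):
        # Weight vector w = E'f, scaled to unit length.
        w = [dot([row[j] for row in E], f) for j in range(n_vars)]
        norm = dot(w, w) ** 0.5
        w = [wj / norm for wj in w]
        t = [dot(row, w) for row in E]                       # scores
        tt = dot(t, t)
        p = [dot([row[j] for row in E], t) / tt for j in range(n_vars)]
        q = dot(f, t) / tt                                   # y loading
        # r = w - sum_j (p_j'w) r_j, so that t = X r for the undeflated X.
        r = w[:]
        for r_prev, p_prev in zip(R, P):
            coef = dot(p_prev, w)
            r = [ri - coef * rp for ri, rp in zip(r, r_prev)]
        R.append(r)
        P.append(p)
        b = [bj + q * rj for bj, rj in zip(b, r)]
        # Deflate X and y before extracting the next component.
        E = [[e - ti * pj for e, pj in zip(row, p)] for row, ti in zip(E, t)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
    return b


# Hypothetical noise-free data (assumptions for illustration only).
k = [0.0, 1.0, 2.0, 1.0, 0.0]    # assumed "pure profile" of the analyte
g = [1.0, 1.0, 1.0, 1.0, 1.0]    # assumed flat baseline profile
c = [1.0, 2.0, 3.0, 4.0]         # analyte amounts, also used as y
d = [2.0, 1.0, 0.0, -1.0]        # baseline amounts, orthogonal to c
X = [[ci * kj + di * gj for kj, gj in zip(k, g)] for ci, di in zip(c, d)]
y = c

b1 = pls1_coefficients(X, y, 1)
b2 = pls1_coefficients(X, y, 2)

for name, b in (("1 component ", b1), ("2 components", b2)):
    print(name, "b =", [round(v, 4) for v in b],
          "| b.k =", round(dot(b, k), 4), "| b.g =", round(dot(b, g), 4))
```

In this toy case the one-component b is proportional to k, but b.g is not zero, so the baseline leaks into the prediction. With two components b.k is 1 and b.g is 0 to rounding error, and b takes negative values in channels where k is zero. The profile has been rotated away from the pure profile purely to cancel the baseline, even though there is no noise in the data at all.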
My advice is instead to look at each PLS component loading vector individually and select the number of components based on its shape and profile. One problem with this is that PLS mixes y-related and y-orthogonal variation together in each PLS component, which may make it more difficult to determine the number of PLS components. There is a recent method called OPLS that may help you here [Trygg J, Wold S. Orthogonal projections to latent structures, OPLS. J. Chemometr., 2002; 16: 119-128]. OPLS is a modification of PLS that separates the systematic variation in your model of X into two parts: one part that is related to y and another part that is not (y-orthogonal). This means that each of the y-orthogonal component loadings will show only the shape and profile of y-orthogonal variation in X (and not the y-related variation), and based on that you can decide when less structured variation and more random noise is being included in your model.
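As a rough illustration of what a y-orthogonal loading looks like, the sketch below extracts a single y-orthogonal component in the spirit of the paper cited above. It is a simplification and not the full published algorithm: one component only, no centering or scaling, and the same kind of hypothetical noise-free data (assumed analyte profile k, assumed flat baseline g). The helper name `first_orthogonal_component` is invented for this example.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale a vector to unit length."""
    n = dot(v, v) ** 0.5
    return [x / n for x in v]


def xt_times(X, t):
    """Return X't for X given as a list of rows."""
    return [dot([row[j] for row in X], t) for j in range(len(X[0]))]


def first_orthogonal_component(X, y):
    """Scores and loading of one y-orthogonal component (simplified sketch)."""
    w = unit(xt_times(X, y))                  # y-related weight vector
    t = [dot(row, w) for row in X]            # y-related scores
    tt = dot(t, t)
    p = [v / tt for v in xt_times(X, t)]      # ordinary PLS loading
    # The part of the loading p that is not along w is y-orthogonal.
    wp = dot(w, p)
    w_ortho = unit([pj - wp * wj for pj, wj in zip(p, w)])
    t_ortho = [dot(row, w_ortho) for row in X]
    tt_ortho = dot(t_ortho, t_ortho)
    p_ortho = [v / tt_ortho for v in xt_times(X, t_ortho)]
    return t_ortho, p_ortho


# Hypothetical noise-free data (assumptions for illustration only).
k = [0.0, 1.0, 2.0, 1.0, 0.0]    # assumed "pure profile" of the analyte
g = [1.0, 1.0, 1.0, 1.0, 1.0]    # assumed flat baseline profile
c = [1.0, 2.0, 3.0, 4.0]         # analyte amounts, also used as y
d = [2.0, 1.0, 0.0, -1.0]        # baseline amounts, orthogonal to c
X = [[ci * kj + di * gj for kj, gj in zip(k, g)] for ci, di in zip(c, d)]
y = c

t_ortho, p_ortho = first_orthogonal_component(X, y)
print("t_ortho . y =", round(dot(t_ortho, y), 6))
print("p_ortho     =", [round(v, 4) for v in p_ortho])
```

For these data the orthogonal scores are uncorrelated with y and the orthogonal loading is flat across all channels, i.e. it has the shape of the baseline g and contains nothing of the analyte profile k. With real, noisy spectra the separation will not be this clean, but the orthogonal loadings are still the place to look when judging whether a component carries structure or mostly noise.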
best regards,
Johan Trygg