Umetrics Suite Blog

Discover hidden details in your data with OPLS

January 15, 2018

In this blog post we will take a closer look at OPLS*, or Orthogonal PLS, a method to model process data. The advantage of OPLS compared to PLS is that you can uncover hidden details and get a more precise understanding of your data – all of which will help you build better predictive models of your processes.

uncover-data-opls-511282-edited.jpeg

OPLS can help you uncover the hidden details in your data.

In the broader picture, the reason we delve into this method is that the better the model, the better the predictions and the more value you get from your data, whether that means shorter production time, fewer defective products, lower raw material use, or an improvement in any other aspect of your processes. But first, a short introduction to the different methods you can use for modeling process data.

Modeling process data with PCA, PLS, and OPLS

When you are modeling process data there are different methods that you can use. PCA, principal component analysis, applies when you have one type of data, process input data, which is often denoted with an “X” in statistical data tables. Using PCA, you build a model of good process behavior, which you then validate to see whether it can distinguish between good and bad process conditions. The model can also be used for online monitoring, to detect deviations in your process, interpret them, and find assignable causes.
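To make the monitoring idea concrete, here is a minimal sketch in Python with NumPy. It is not how SIMCA computes its diagnostics. The synthetic data, the choice of two components, the squared prediction error (SPE) statistic, and the 99th-percentile limit are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "good" process data (made up for illustration): 5 variables
# driven by 2 underlying factors a and b, plus a little noise.
# Variables 1-2 follow a, variables 4-5 follow b, variable 3 follows a + b.
mixing = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0, 1.0]])
latent = rng.normal(size=(200, 2))
X_good = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# Center and scale using statistics of the good data only.
mean = X_good.mean(axis=0)
std = X_good.std(axis=0, ddof=1)
Z = (X_good - mean) / std

# PCA via singular value decomposition; keep 2 components.
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
P = Vt[:2].T  # loadings, shape (5, 2)

def spe(x_new):
    """Squared distance of each row to the PCA model plane."""
    z = (np.atleast_2d(x_new) - mean) / std
    residual = z - (z @ P) @ P.T
    return (residual ** 2).sum(axis=1)

# Empirical limit: 99th percentile of SPE among the good observations.
limit = np.percentile(spe(X_good), 99)

# One observation that follows the normal correlation pattern (a=1, b=-0.5),
# and one where variable 1 has drifted away from variable 2.
x_normal = np.array([[1.0, 1.0, 0.5, -0.5, -0.5]])
x_faulty = x_normal + np.array([[5.0, 0.0, 0.0, 0.0, 0.0]])

print("limit:", limit)
print("SPE normal:", spe(x_normal)[0])
print("SPE faulty:", spe(x_faulty)[0])
```

The faulty observation breaks the correlation pattern the model learned from good data, so its distance to the model is far above the limit. Inspecting which variables contribute most to that distance is one way to start looking for an assignable cause.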

When you have access to a second block of data, containing one or more process outputs, often denoted by “Y”, you use regression extensions of the PCA method. The methods you use are called PLS, Partial Least Squares, and OPLS, Orthogonal PLS. Using these tools you look at the process input information (X) and how it can be used to model and predict future process outputs (Y).

The regression extension is also used for batch processes. When modeling batch processes you link the initial conditions and batch evolution measurements to the different process outputs, as is illustrated in the picture.

OPLS-methods-for-modeling-process-data.png

The different methods for modeling process data.

Finding the information overlap between input and output variables

With PLS and OPLS you try to find relationships between the data tables X and Y, i.e., the information overlap in the process input data (X) and the process output data (Y). If you can establish such a model you can measure process inputs for new observations (time points), insert that information into the model and get predicted output values for the new observations (time points).
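The fit-then-predict workflow described above can be sketched with a small single-y PLS implementation (the NIPALS-style PLS1 algorithm) in Python with NumPy. This is an illustrative sketch, not SIMCA's implementation. The function names `pls1_fit` and `pls1_predict` and all the data are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-y PLS model; returns centering values and coefficients."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f                  # direction in X that covaries with y
        w = w / np.linalg.norm(w)
        t = E @ w                    # scores
        tt = t @ t
        p = E.T @ t / tt             # X loadings
        c = f @ t / tt               # y loading
        E = E - np.outer(t, p)       # deflate X
        f = f - c * t                # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)  # regression coefficients
    return x_mean, y_mean, coef

def pls1_predict(model, X_new):
    x_mean, y_mean, coef = model
    return y_mean + (X_new - x_mean) @ coef

# Synthetic training data: one hidden factor drives both X and y.
rng = np.random.default_rng(1)
loadings = np.array([1.0, 0.8, -0.5, 0.3, 0.0, 0.0])
t_train = rng.normal(size=50)
X_train = np.outer(t_train, loadings) + 0.05 * rng.normal(size=(50, 6))
y_train = 3.0 * t_train + 0.05 * rng.normal(size=50)

model = pls1_fit(X_train, y_train, n_components=1)

# New observations (time points): measure X, insert into the model, get y.
t_new = np.array([-1.0, 0.0, 1.5])
X_new = np.outer(t_new, loadings)
y_pred = pls1_predict(model, X_new)
print(y_pred)  # should be close to 3 * t_new
```

Only the process inputs are needed for the new observations. The output values come from the model.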

If you compare models that have only one Y variable and the same total number of components, PLS and OPLS have exactly the same predictive ability. That is, in the single-y case, the methods are equivalent in terms of predictive power. The benefit of using OPLS over PLS is much improved interpretability.

The PLS and OPLS models are known as projection techniques. This means that you try to find directions among the points of the X space that overlap with the points in the Y space. That is, PLS and OPLS will try to find information in the X space that is of relevance for predicting the Y space. This overlap, the correlation between X and Y, is illustrated by the colored dots in the picture below. As long as the colored structure dominates, PLS will work very well.

Sometimes, there can be other sources of systematic variation between the samples that are not correlated to Y. That additional systematic variation is illustrated by the gray dots. When the gray structure is larger than the colored structure, PLS will not work so well. This is where OPLS comes into use, because OPLS is able to filter away the uncorrelated information and still find the colored structure.

project-based-regression-modeling-OPLS.png

OPLS works better than PLS when the uncorrelated information, represented by the gray dots, dominates.
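The filtering step can be sketched for the single-y case in Python with NumPy. This follows the commonly published single-y OPLS steps with one orthogonal component. It is a sketch under that assumption, not SIMCA's code, and the data are synthetic: a small "colored" structure related to y and a five times larger "gray" structure that is not.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
t_y = rng.normal(size=n)  # variation related to y (the colored structure)
t_g = rng.normal(size=n)  # variation unrelated to y (the gray structure)
p_y = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0]) / 2.0
p_g = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 1.0]) / 2.0
X = np.outer(t_y, p_y) + 5.0 * np.outer(t_g, p_g) + 0.05 * rng.normal(size=(n, 6))
y = t_y + 0.05 * rng.normal(size=n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Ordinary first PLS component: weight, score, and loading.
w = Xc.T @ yc
w = w / np.linalg.norm(w)
t_pls = Xc @ w
p = Xc.T @ t_pls / (t_pls @ t_pls)

# The part of the loading that is orthogonal to the weight describes
# systematic variation in X that is unrelated to y.
w_orth = p - (w @ p) * w
w_orth = w_orth / np.linalg.norm(w_orth)
t_orth = Xc @ w_orth
p_orth = Xc.T @ t_orth / (t_orth @ t_orth)

# Filter the Y-orthogonal component away from X.
X_filtered = Xc - np.outer(t_orth, p_orth)

# Predictive score after filtering. The weight is unchanged because
# t_orth is orthogonal to y, so X_filtered.T @ yc equals Xc.T @ yc.
t_pred = X_filtered @ w

def corr(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

r_pls = corr(t_pls, yc)
r_opls = corr(t_pred, yc)
print("orthogonal score . y :", t_orth @ yc)
print("corr(first PLS score, y) :", r_pls)
print("corr(OPLS predictive score, y):", r_opls)
```

The orthogonal score has zero covariance with y by construction, so removing it cannot hurt the relationship to y. What remains is a predictive score that lines up more closely with y than the first PLS score does, which is the gain in interpretability.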

We are now going to look at two examples that illustrate the difference between PLS and OPLS and the extra bit of information that you can gain by using OPLS. The purpose of the examples is to give you a general idea of OPLS and how it can help you build more precise and effective models. Most of the details have been left out but they will be clarified and explained in our webinar on OPLS for process modeling.

First example

The difference between PLS and OPLS in the single-y case and how to filter away Y-orthogonal information in spectral data

Here we will look at a real case example comparing PLS and OPLS in the single-y case, and we will use OPLS to filter away Y-orthogonal information from spectral data. This is a case with a powder mix of lactose and salicylic acid where the salicylic acid concentration varies between 45% and 55%. We have 12 batches in total: 6 went into the training set, to build the model, and 6 into the test set, to validate the model. For each batch, 40 NIR spectra were taken, so for all 12 batches we have 480 spectra. The PLS and OPLS methods are used to see if there is any information in the NIR spectra that is correlated to changing levels of salicylic acid among the training set samples.

The picture below shows the first two score vectors of the PLS and the OPLS model. All samples are colored according to the percentage of salicylic acid, from 45% (the blue dots) to 55% (the red dots). The PLS model is seen to the left in the picture below, where we are plotting the first two components. The horizontal dimension represents the first component, and the vertical dimension the second component. In the PLS model, both components are needed to fully account for the changing levels of salicylic acid.

On the right, we see the OPLS model for the same set of data. Here, only the first dimension of the model is needed. You could say that OPLS is a rotation or transformation of the PLS solution. When we go from left to right along the horizontal dimension we have increasing salicylic acid concentrations, that is, between-batch variation. The vertical dimension represents within-batch variation.

OPLS-model-split-of-batch-variations.png

In the OPLS model, we can identify a certain split of the within-batch variations.

Using OPLS, it is easier to identify between-batch variations. In this example we can see an evident split of the red dots into two groups in the OPLS model. We can then try to interpret the reason for this split and filter away that information in SIMCA (SIMCA has support for several spectral filters that can be combined in different orders). In this way, we can usually decrease the complexity of the model.

Second example

The difference between PLS and OPLS using the Batch Evolution Model

In the second example, we will look at a chemical reaction run in batch mode, and in this case we will use the so-called Batch Evolution Model. In the SIMCA software, we have two types of batch models: the Batch Evolution Model, BEM, and the Batch Level Model, BLM.

These two types of batch models complement each other and are used for different types of unfolding of the data or arrangements of the data. In the BEM, we look at a batch from start to end and make measurements over time. With this configuration of the data we are interested in monitoring evolving batches to see how they compare with good operating conditions. In the BLM model, we make a batch-wise unfolding, so all data from a batch make up one row in the data table. With this configuration of the data, we compare completed batches (see illustration in the picture below).

BEM-BLM-models-in-SIMCA-OPLS-Batch-data-over-time.png

The BEM and BLM models in SIMCA are different ways of arranging batch data.
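The two arrangements can be sketched in Python with NumPy. The array sizes are made up, and the sketch assumes every batch has the same number of time points, which real batches often do not have without some alignment first.

```python
import numpy as np

# Hypothetical batch data: 4 batches, 5 time points each, 3 variables.
n_batches, n_times, n_vars = 4, 5, 3
data = np.arange(n_batches * n_times * n_vars, dtype=float)
data = data.reshape(n_batches, n_times, n_vars)

# BEM-style arrangement: each time point of each batch is one row,
# so a batch can be followed from start to end.
X_bem = data.reshape(n_batches * n_times, n_vars)

# BLM-style arrangement (batch-wise unfolding): all measurements from
# one batch are laid out side by side in a single row.
X_blm = data.reshape(n_batches, n_times * n_vars)

print(X_bem.shape)  # (20, 3)
print(X_blm.shape)  # (4, 15)
```

The same numbers appear in both tables. Only the definition of an observation changes: a time point in the first arrangement, a completed batch in the second.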

In this second example, we will look at a hydrogenation reaction where nitrobenzene is converted to aniline. To reflect the progression of the batches, a number of process variables were measured: 80 spectral variables and 7 more conventional process variables such as reactor temperature, reactor pressure, gas feed, and stirrer speed.

In this case, we have 6 centerpoint batches for model training and 5 batches for model testing. The 6 centerpoint batches, the normal operating condition batches, are used to develop a BEM, and then the PLS and OPLS methods are used to make the predictive models.

In this example we will look at control charts where we visualize normal operating conditions. In the picture below we have batch start (time 0) and batch end (at 45 minutes) along the horizontal axis. We have lower and upper control limits (the red lines), corresponding to +/- 3 standard deviations, and a green line that is the average trajectory across the 6 normal batches.

Here we also see one of the test set batches (the black line), a batch with both a high temperature and a high initial concentration. As we can see in both the PLS and the OPLS model, this is a fast batch, producing a high amount of product. PLS will initially find the progression of this batch to be in line with normal operating conditions, up to about 10 minutes, whereas OPLS will directly show that this is a very different batch compared to normal operating conditions. In this case, the OPLS model gives us a more precise prediction than the PLS model. The reason for these differences is that PLS mixes up predictive and orthogonal variability, whereas OPLS is able to separate the two.

OPLS_model-fast-batch.png

The OPLS model (above) shows directly that we have a fast batch.
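The control chart construction described above, an average trajectory with limits at +/- 3 standard deviations, can be sketched in Python with NumPy. The trajectories below are synthetic stand-ins for whatever quantity the chart tracks (for example a model score), not the data from this example.

```python
import numpy as np

rng = np.random.default_rng(7)

# Time axis from batch start (0) to batch end (45 minutes).
time = np.linspace(0.0, 45.0, 10)

# Six synthetic "normal" batches scattered around a nominal trajectory.
nominal = 1.0 - np.exp(-time / 15.0)
good = nominal + 0.02 * rng.normal(size=(6, time.size))

# Average trajectory and control limits at each time point.
avg = good.mean(axis=0)
sd = good.std(axis=0, ddof=1)
upper = avg + 3.0 * sd
lower = avg - 3.0 * sd

# A synthetic test batch that progresses faster than normal.
fast = 1.0 - np.exp(-time / 7.0)

# Flag the time points where the test batch leaves the control limits.
outside = (fast > upper) | (fast < lower)
first_alarm = time[outside][0] if outside.any() else None
print("first time outside limits (min):", first_alarm)
```

How early `first_alarm` occurs depends on how well the charted quantity isolates the variation of interest, which is the point of the PLS versus OPLS comparison in this example.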

What these two examples illustrate is that OPLS uncovers details in your data that are not evident using PLS. OPLS can therefore help you better understand your processes and help you build better models. And as stated before, the better the model, the better the understanding and the more value you will get from your data.

For an in-depth understanding of OPLS, watch our webinar on OPLS process modeling.

Get the presentation

Want to know more? Download the full presentation here.



*OPLS is a trademark of Sartorius Stedim Data Analytics AB.

 

Topics: SIMCA


Written by Lennart Eriksson

Sr Lecturer and Principal Data Scientist at Sartorius Stedim Data Analytics