When to apply OPLS-DA vs PCA for metabolomics and other omics data analysis
Do you know when to use OPLS-DA and when to use PCA/SIMCA data analysis techniques? Find out how to uncover the differences in your data with these classification and discriminant analysis methods.
In many fields of life science today, data analysis involves defining the differences between groups of data or interpreting group differences in meaningful ways. Finding meaning in -omics (e.g. genomic, proteomic or metabolomic) datasets often requires the use of multivariate data analysis (MVDA) techniques. Depending on the objective, you may choose different techniques. The basic methods are:
- principal component analysis (PCA) for data summary/overview
- partial least squares (PLS) and orthogonal PLS (OPLS) for regression analysis, or O2PLS for data fusion
The SIMCA method, based on disjoint principal component analysis (PCA), draws on elements of each and lets you target either a classification or a discriminant analysis objective. Choose the method that fits your goal.
So when should you use each data analysis technique?
To extract useful information from -omics data, you may want to use the SIMCA method (soft independent modeling of class analogy). SIMCA is suitable when you have many groups (‘classes’), some of which may overlap, and where the number of samples in the groups (‘classes’) may vary considerably.
With this approach your classes can be very strictly defined and visualized in the plots of the models. But it doesn’t really tell you why the classes are different. Instead, it tells you how to group the similar subsets of samples/observations together and where to draw the borders of the different class models.
SIMCA is based on PCA
- Useful for many groups
- Class borders are strictly defined
- Does not show why classes are different
Discriminant analysis methods involve a completely different mind-set. Here you want to know why the classes are different. The models are easiest to interpret when limited to a two-group comparison problem because you have fewer components in these models. It’s preferable that the groups are tight and contain a fairly similar number of members; this is not a requirement, but it will make your discriminant modeling more reliable.
- Based on PLS/OPLS
- Useful for knowing why 2 groups differ
- Works best when both groups are “tight”
With SIMCA classification, you have samples that belong to different classes, and for each class you calculate a local principal component analysis with a full set of PCA model parameters. So for every class you build a local model, and then you can classify observations in the prediction set by their proximity to each class model.
Conceptually what you have are class models for different types of samples and you have defined an envelope around the data points – a size of the model or border of the model (usually defined by the critical distance of the DModX parameter).
And then in the classification you want to know whether a new sample in the prediction set is similar to this class or that class, or maybe doesn’t fit any class at all. So the major goal here is to define borders around the classes and infer class membership for future samples in the prediction set. You are not focusing so much on why the classes are different.
In PCA/SIMCA analysis you are defining the size or border of the model (usually defined by the critical distance of the DModX parameter), not looking at what makes the classes different.
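The classification idea described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the algorithm used by the SIMCA software: the function names, the synthetic data and the fixed cutoff `d_crit` are all assumptions made for the example. In particular, the real critical distance for DModX is derived from an F-distribution, whereas the sketch uses a hypothetical constant threshold on a normalized residual distance.

```python
import numpy as np

def fit_class_model(X, n_components=1):
    """Fit a local PCA model to the training samples of a single class."""
    n, k = X.shape
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions of the centered class data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                  # loadings, shape (k, n_components)
    E = Xc - Xc @ P @ P.T                    # residuals left after the projection
    # Pooled residual standard deviation of the class (degrees-of-freedom corrected).
    s0 = np.sqrt((E ** 2).sum() / ((n - n_components - 1) * (k - n_components)))
    return {"mean": mean, "P": P, "s0": s0}

def distance_to_model(model, x):
    """Normalized residual distance of one sample to a class model (DModX-like)."""
    xc = x - model["mean"]
    e = xc - model["P"] @ (model["P"].T @ xc)
    k, a = model["P"].shape
    return np.sqrt((e ** 2).sum() / (k - a)) / model["s0"]

def classify(models, x, d_crit=3.0):
    """Return every class whose envelope contains x; the list may be empty.

    d_crit is a hypothetical fixed cutoff standing in for the F-based
    critical distance.
    """
    return [name for name, m in models.items()
            if distance_to_model(m, x) <= d_crit]

# Synthetic training data: two classes lying along the same direction,
# shifted apart in the third variable.
rng = np.random.default_rng(0)
direction = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
offset_b = np.array([0.0, 0.0, 5.0, 0.0, 0.0])

def make_class(offset, n=30):
    t = rng.normal(scale=2.0, size=(n, 1))
    return offset + t * direction + rng.normal(scale=0.1, size=(n, 5))

models = {"A": fit_class_model(make_class(np.zeros(5))),
          "B": fit_class_model(make_class(offset_b))}

fits_a = classify(models, 1.5 * direction)               # lies inside class A
fits_b = classify(models, offset_b - 1.0 * direction)    # lies inside class B
fits_none = classify(models, 0.5 * offset_b)             # halfway between both
print(fits_a, fits_b, fits_none)
```

Each class gets its own model and its own border, so a sample can be assigned to one class, to several overlapping classes, or to none. Nothing in the sketch reports which variables separate A from B, which is the limitation the text describes.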
However, with discriminant analysis you are asking this question: What is the difference? Here you are targeting the variables. Which variables are driving the separation between the two groups? And if you have a two-group problem, the resulting OPLS-discriminant model will be very easy to interpret because you will have only one predictive component to interpret. This component is rendered as the x-axis in the score scatter plot arising from the OPLS-DA model.
Thus, the horizontal direction of the score scatter plot will capture variation between the groups. What are the systematic differences between the group to the left and group to the right? And then the vertical dimension and any higher component of the so-called orthogonal type will capture variation within the groups.
The horizontal component of the OPLS-DA score scatter plot will capture variation between the groups and the vertical dimension will capture variation within the groups.
SIMCA (PCA) vs. OPLS-DA
The principal component analysis (PCA) model underpinning the SIMCA classification approach is a maximum variance method.
OPLS-DA also relies on a projection of the X data, as PCA does, but here the question shifts from finding the direction of maximum variance to finding the direction of maximum separation. This separation is possible because the projection of the X data is guided by known class information. By supplying class information to the OPLS algorithm, we move from an unsupervised modeling approach (PCA) to a supervised modeling approach (OPLS).
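To make the contrast concrete, the sketch below compares the first principal component (maximum variance) with the first PLS weight vector (maximum covariance with the class variable) on a small synthetic dataset. The data layout is an assumption chosen for illustration: variable 0 carries the group difference, variable 1 carries a large spread shared by both groups, and variable 2 is noise.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_group = 20
y = np.repeat([0.0, 1.0], n_per_group)            # known class labels
within = np.linspace(-5.0, 5.0, n_per_group)      # spread shared by both groups

X = np.column_stack([
    np.where(y == 1, 1.0, -1.0) + rng.normal(scale=0.3, size=y.size),  # group difference
    np.tile(within, 2),                                                 # within-group spread
    rng.normal(scale=0.3, size=y.size),                                 # noise only
])
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Unsupervised: the first principal component is the direction of maximum variance.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]

# Supervised: the first PLS weight vector is the direction of maximum covariance with y.
w = Xc.T @ yc
w /= np.linalg.norm(w)

print("PC1 loadings:", np.round(pc1, 2))
print("PLS weights: ", np.round(w, 2))
```

In this setup the unsupervised direction is dominated by variable 1, because that is where most of the variance lies, while the supervised direction is dominated by variable 0, because that is the only variable that changes with class membership.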
The models that arise are easiest to interpret if we only have two classes in the OPLS-DA model, but of course the approach is extendable to more classes.
So how does OPLS-DA work?
With OPLS regression, Y variables are normally continuous. In OPLS-DA, observations are divided into two classes and Y is a binary (0/1) variable that represents class membership. Predicted values are continuous: they typically fall close to 0 or 1 depending on class membership, although they are not strictly confined to that range. To go beyond two classes, the same scheme is extended by using one binary Y variable per class.
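As a rough illustration of the two-class case, here is a minimal single-Y OPLS sketch in NumPy with one predictive component and one orthogonal component. It follows the commonly published NIPALS-style OPLS steps and is not the SIMCA software's implementation. The function names, the synthetic data and the 0.5 decision threshold are assumptions made for the example.

```python
import numpy as np

def opls_da_fit(X, y, n_orth=1):
    """Single-y OPLS: one predictive component plus n_orth orthogonal ones."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E = X - x_mean
    yc = y - y_mean
    # Predictive weight: direction in X with maximum covariance with y.
    w = E.T @ yc
    w /= np.linalg.norm(w)
    W_o, P_o, T_o = [], [], []
    for _ in range(n_orth):
        t = E @ w
        p = E.T @ t / (t @ t)
        # Part of the loading that is orthogonal to the predictive weight.
        w_o = p - (w @ p) * w
        w_o /= np.linalg.norm(w_o)
        t_o = E @ w_o
        p_o = E.T @ t_o / (t_o @ t_o)
        E = E - np.outer(t_o, p_o)        # remove Y-orthogonal variation from X
        W_o.append(w_o)
        P_o.append(p_o)
        T_o.append(t_o)
    t = E @ w                             # predictive scores
    c = (yc @ t) / (t @ t)                # regression of y on the predictive scores
    return {"x_mean": x_mean, "y_mean": y_mean, "w": w, "c": c,
            "W_o": W_o, "P_o": P_o, "t": t, "T_o": np.array(T_o).T}

def opls_da_predict(model, X_new):
    """Predict the continuous Y value for new observations."""
    E = X_new - model["x_mean"]
    for w_o, p_o in zip(model["W_o"], model["P_o"]):
        E = E - np.outer(E @ w_o, p_o)    # apply the same orthogonal filtering
    return model["y_mean"] + model["c"] * (E @ model["w"])

# Synthetic two-group data (assumed for illustration): the group effect acts on
# variables 0 and 1, a within-group factor acts on variables 0 and 2.
rng = np.random.default_rng(0)
n_per_group = 20
y = np.repeat([0.0, 1.0], n_per_group)                    # binary class membership
group_effect = np.where(y == 1, 1.5, -1.5)
within = np.tile(np.linspace(-3.0, 3.0, n_per_group), 2)
a = np.array([1.0, 1.0, 0.0, 0.0])
b = np.array([1.0, 0.0, 1.0, 0.0])
X = (np.outer(group_effect, a) + np.outer(within, b)
     + rng.normal(scale=0.2, size=(y.size, 4)))

model = opls_da_fit(X, y, n_orth=1)
y_hat = opls_da_predict(model, X)
predicted_class = (y_hat > 0.5).astype(int)               # assumed 0.5 cutoff
print(np.round(y_hat[:3], 2), np.round(y_hat[-3:], 2))
```

Plotting `model["t"]` against `model["T_o"][:, 0]` gives the kind of score scatter plot described earlier: the groups separate along the predictive score, while the shared within-group factor ends up in the orthogonal score. For more than two classes, Y would become a matrix with one 0/1 column per class, which this single-y sketch does not cover.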
When is OPLS-discriminant analysis suitable?
OPLS-DA is suitable for diagnosing differences between two groups or systems. It will tell us which variables have the largest discriminatory power. It will tell us how the variables are correlated. And, it will also quantify how much of the variation in the X block is actually relevant to the analysis question.
- How much of the variance in the X block is related to the class information?
- How much of the systematic variance in the X block is NOT related to the group separation?
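These two questions can be written as a variance decomposition. The notation below is an assumption introduced for illustration: X_c is the centered (and scaled) X block, t and p are the predictive score and loading vectors, T_o and P_o are the orthogonal score and loading matrices, and E is the residual.

```latex
X_c = t\,p^{\top} + T_o P_o^{\top} + E

R^2X_{\mathrm{pred}} = \frac{\lVert t\,p^{\top} \rVert_F^2}{\lVert X_c \rVert_F^2},
\qquad
R^2X_{\mathrm{orth}} = \frac{\lVert T_o P_o^{\top} \rVert_F^2}{\lVert X_c \rVert_F^2}
```

The first ratio is the share of X variance related to the class information, and the second is the share of systematic X variance that is unrelated to the group separation. When the scores are mutually orthogonal, the two ratios plus the residual share add up to 1.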
OPLS-DA analyzes pre-defined groups. It is preferable, but not a must, that the groups have similar shapes and number of members. It’s widely used in -omics applications to identify potential biomarkers.
OPLS-DA is an excellent tool to find “What’s the difference” between two groups (such as Good and Bad product). The OPLS-DA model will indicate which are the driving forces among the variables and we can then make score plots to visualize the differences if they exist. We can use the loading plot to indicate the variables that express this difference. We can then drill down in the 2D or 3D variable plots to the underlying data to verify that what we see in the multivariate models also has anchors in the original and underlying data. This can, of course, be scripted for repetitive work.
SIMCA software uses a PCA-based method for classification known as the SIMCA method. It also contains an OPLS based method for discriminant analysis known as OPLS-DA.
The SIMCA software (in particular when used with graphical user interface software such as OMICS skin) can simplify MVDA analysis of -omics data and identification of biomarkers. The crucial step is to find the biological explanation for the potential biomarkers identified by OPLS-DA.