What do we mean by pre-processing of data, and why is it needed? Let's take a look at some data pre-processing methods and how they help create better models when using Principal Component Analysis (PCA) and other methods of data analytics.
Using a data pre-processing method such as unit variance scaling makes it possible to analyze variables with markedly different numerical ranges, such as the height and weight of soccer players, and to identify outliers or patterns.
Advanced data analytics software like SIMCA uses a number of workhorse methods, such as PCA, PLS, OPLS and O2PLS. These form the basis of multivariate data analysis. Their most important use is to represent one or more multivariate data tables as a smaller list of summary indices (latent variables) in order to observe trends and outliers. This overview may uncover the relationships between observations (rows of the data tables) and variables (columns of the data tables), and among the data tables themselves.
However, what few people are aware of is the crucial importance of standardizing and regularizing (or "pre-processing") the data prior to starting any data analytics modeling. Before PCA, PLS, OPLS and O2PLS can be performed, the original data must be transformed into a form suitable for analysis. That means reshaping the data in order to fulfill important assumptions. Pre-processing the data can make the difference between a useful model and no model at all.
Some methods of pre-processing data include:
- Advanced scaling
- Data correction and compression
Scaling (together with mean-centering) is described in this post. Data correction and compression will be addressed in a future post.
Scaling of data
Variables often have substantially different numerical ranges. A variable with a large range has a large initial variance, whereas a variable with a small range has a small initial variance. Since PCA, PLS, OPLS and O2PLS are maximum variance or co-variance projections, it follows that a variable with a large initial variance is more likely to be expressed in the modeling than a low-variance variable.
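This dominance of high-variance variables is easy to demonstrate. The sketch below uses hypothetical height/weight data (not from the article's own example) and computes PCA loadings via a plain NumPy singular value decomposition, rather than SIMCA; the variable names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: height in metres (small variance) and
# weight in kg (large variance), loosely correlated.
height = rng.normal(1.80, 0.07, 50)          # spread ~0.07 m
weight = 60 * height + rng.normal(0, 5, 50)  # spread ~5 kg

X = np.column_stack([height, weight])
Xc = X - X.mean(axis=0)                      # mean-centre only, no scaling

# PCA via SVD: rows of Vt are the principal-component loadings.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
p1 = Vt[0]

# Without scaling, the first component points almost entirely
# along the high-variance weight axis.
print(np.abs(p1))  # weight loading near 1, height loading near 0
```

Because weight's variance is orders of magnitude larger than height's, the first component effectively ignores height, which is exactly the behavior scaling is meant to correct.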
For example, let’s consider data from a pre-season friendly game of football (soccer), in which the trainers of both teams measured the body weight (in kg) of their players. The trainers also recorded the body height (in m) of each player. The data are plotted below.
(above) Left: scatter plot of body weight versus body height for 23 individuals. The data pattern is dominated by the influence of body weight because both axes use the same scale. Right: the variables are given equal importance by displaying them with the same spread. An outlier, a deviating individual, is now discernible.
When the two variables are plotted in a scatter plot where each axis has the same scale – the x and y axes both extend over 30 units – we can see that the data points only spread in the vertical direction (left chart). This is because body weight has a much larger numerical range than body height. Should we analyze these data with PCA and the other projection methods available in SIMCA, without any pre-processing, the results would only reflect the variation in body weight.
However, this dataset actually contains an atypical observation (individual), and without appropriate scaling we would miss it. The atypical observation is much easier to see when the two variables are scaled more appropriately (right chart above). In effect, the right graph has “zoomed in” on the important information. In this case, it shows us that a height/weight relationship exists among the soccer players such that, in general, the taller the player, the heavier the player. Only one person breaks this height/weight relationship: the atypical individual, who happened to be the referee of the soccer game.
In order to give both variables – body weight and body height – equal importance in the data analysis, we standardize them. Such standardization is also known as “scaling” or “weighting” and means that the length of each coordinate axis in the variable space is regulated according to a pre-determined criterion. The first time you analyze a dataset, however, you should consider making the length of each variable the same.
Unit variance scaling
There are many ways to scale the data, but the most common technique is unit variance (UV) scaling. UV-scaling is the default in SIMCA, but how is it done? For each variable (column), you calculate the standard deviation and take its inverse as the scaling weight. Each column of X is then multiplied by its scaling weight. Every scaled variable then has the same variance: unit variance.
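The column-wise procedure just described can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the data matrix X is a plain observations-by-variables array; it is not SIMCA's internal implementation, and the example numbers are made up.

```python
import numpy as np

def uv_scale(X):
    """Unit variance scaling: weight each column by its inverse std dev."""
    sd = X.std(axis=0, ddof=1)   # standard deviation per variable (column)
    w = 1.0 / sd                 # scaling weight = inverse standard deviation
    return X * w                 # multiply each column of X by its weight

# Illustrative data: weight (kg) and height (m) for three players.
X = np.array([[70.0, 1.70],
              [80.0, 1.80],
              [90.0, 1.90]])

Xs = uv_scale(X)
print(Xs.std(axis=0, ddof=1))   # every scaled column has std dev 1
```

Note that UV-scaling alone changes only the spread of each variable; the column means are still different, which is why mean-centering (below) is usually applied as well.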
We can show a simple geometrical representation of UV-scaling based on the equivalence between the length of a vector and its standard deviation (square root of variance). Hence the initial variance of the variable is interpretable as the squared “size” or “length” of that variable. This means that with UV-scaling, we accomplish a shrinking of “long” variables and a stretching of “short” ones (see figure below). By putting all variables on a comparable footing, no variable is allowed to dominate over another because of its length.
(above) The effect of unit variance scaling is shown. The vertical axis represents the “length” of the variables and their numerical values. Each bar corresponds to one variable, and the short horizontal line inside each bar represents the mean value. Prior to any pre-processing, the variables have different variances and mean values. After scaling to unit variance, the “length” of each variable is identical. The mean values, however, remain different.
Like any projection method, PCA is sensitive to scaling. By modifying the variance of the variables, it is possible to attribute different importance to them, which makes it possible to down-weight irrelevant or noisy variables. However, you must be careful to scale objectively, not subjectively to achieve the model you want.
Generally, UV-scaling is the most objective approach, and is recommended if there is no prior information about the data. Sometimes no scaling at all would be appropriate, especially with data where all the variables are expressed in the same unit, for instance, with spectroscopic data.
Mean-centering is the second part of the standard procedure for pre-processing data. With mean-centering, the average value of each variable is calculated and then subtracted from the data. This improves the interpretability of the model. A graphic interpretation of mean-centering is shown below.
(above) After mean-centering and unit variance scaling, all variables will have equal “length” and mean value zero. Another name for this scaling method is “auto-scaling.”
Skins enable different scaling and mean-centering defaults
Mean-centering and UV-scaling are applied by default when using SIMCA software. However, it's possible to enable other default settings by using the special omics or spectroscopy "skins" (custom views available in SIMCA). For example, when running the spectroscopy skin, mean-centering but not scaling is the default. When running the omics skin, depending on which data are imported, either mean-centering only or Pareto scaling can be chosen. Being able to modify the default settings and select different plotting configurations is one of the reasons to use these skins.
Watch this video to learn more about the omics skin
Watch this video to learn more about the spectroscopy skin
Want to know more?
Get the textbook “Multi- and Megavariate Data Analysis: Basic Principles and Applications” (3rd edition) to learn more about the principles behind data analytics.