How Multivariate Data Analysis Can Separate the Players from the Gorillas
We have more data than ever before coming at us from many sources, both in our personal lives and in business. Data is everywhere: from the production flow of a manufacturing floor to the sales results in a grocery store to the number of shares a page gets on Facebook. How do you sort it all out in a way that makes sense? Which data should you worry about and which should you ignore?
Using data analysis tools or software, such as the Umetrics Suite of Data Analytics Solutions, you can turn the jungle of numeric data into meaningful observations that you can use to improve your product, process or situation. Top data analysis software such as SIMCA combines reliable statistical methods, processes and tools that help you perform multivariate data analysis, look for deviations and understand the relationship between different data points.
Whether you are using historical data that currently resides in your database, or generating new data for future use, understanding the basics of multivariate data analysis will help you achieve better analytics.
What is multivariate data analysis?
Multivariate data analysis (MVDA) is a set of statistical techniques used to analyze data that involves more than one variable. MVDA helps you estimate summary variables. What are those? Well, one good example is a stock index.
In the chart below, you can see how the Dow Jones stock index changes over a 48-hour period in August 2013. Investors monitor the Dow Jones, and depending on how this summary index moves up and down over time, they may decide on different actions: they may sell, buy, or hold.
The Dow Jones index is simply a summary of the stocks of the individual companies weighted together using a specific algorithmic function (as you can see in the formula written below).
Process modeling (analyzing the processes used for manufacturing, for example) is conceptually very similar. As with the stock index, we want to calculate summary indexes for how the process is performing. This summary can apply to a continuous process, a batch process, a hybrid of the two, or any other kind of data table.
What a process summary index has in common with the Dow Jones index is that it presents a trend over time. You monitor it to see how it goes up or down. If you are a process engineer or manager, you may take various actions or refrain from actions based on how this summary index looks over time. You can see how it looks in the chart below.
One difference when a chart like this shows process data instead of a stock index is that you read the trajectory of the summary index against lower and upper limits. These limits are usually placed at plus or minus two or three standard deviations and may serve, for example, as warning or action limits (as indicated by the dotted lines in red and yellow).
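As a minimal sketch of how such limits could be computed, the Python snippet below derives warning limits at two standard deviations and action limits at three from a reference period, then checks new values against them. The reference values, the function names, and the choice of two versus three standard deviations for each limit type are illustrative assumptions, not output from any real process.

```python
import statistics

def control_limits(reference, k):
    """Return (lower, upper) limits as mean -/+ k sample standard deviations."""
    centre = statistics.mean(reference)
    spread = statistics.stdev(reference)
    return centre - k * spread, centre + k * spread

# Hypothetical summary-index values from a period of known good operation.
reference = [0.1, -0.3, 0.2, 0.0, -0.1, 0.4, -0.2, 0.1, -0.4, 0.2]

warn_lo, warn_hi = control_limits(reference, k=2.0)  # assumed warning limits
act_lo, act_hi = control_limits(reference, k=3.0)    # assumed action limits

def status(value):
    """Classify one new summary-index value against the limits."""
    if value < act_lo or value > act_hi:
        return "action"
    if value < warn_lo or value > warn_hi:
        return "warning"
    return "ok"

# Hypothetical new observations of the summary index.
new_values = [0.05, 0.6, 1.2]
statuses = [status(v) for v in new_values]
print(statuses)
```

In practice the reference period should contain only data from conditions you consider good, since the limits inherit whatever variation that period contains.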
So what we are striving to achieve with multivariate data analysis is to calculate summary indexes covering the most essential information in our process measurements.
An example using soccer players
To see another example of how this works, we can take soccer players and analyze their height and weight to look for a pattern. In the chart below, the green dots represent the body height and weight of 200 elite soccer players who played in the 2014 World Cup in Brazil. What we can do with multivariate data analysis is create a summary index for how weight and height change among these elite soccer players. We are looking at the relationship between the two variables (height and weight) across all the players.
The summary index is shown by the red dashed arrow. The chart shows a clear relationship between body weight and body height among these top-level soccer players. Generally, as height increases, body weight increases as well. Using a plot like this, we can see how much the players typically vary, and from that we can define a variation limit, or border, for the model. The border of the model is marked in the chart below with an ellipse.
Typically, when we are doing multivariate data analysis and calculating these summary variables, we will position the border at the 95 percent level. So if you have fairly normally distributed data or uniformly distributed data, you would expect that about 95 points out of 100 would be situated inside the model border, and about five points would be outside the border. So if you have around 200 soccer players, roughly 10 players should be outside the model border.
Using MVDA, you can not only define the summary indexes that summarize the essential information in your data, but you can also define an envelope around the data points that marks the typical data distribution, the normal or “good” behavior.
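The idea of a summary direction and a 95 percent border can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the method used by any particular software: the 200 data points are synthetic stand-ins (not the World Cup measurements), the summary direction is taken as the main axis of the covariance matrix of the raw values, and the border is the ellipse where the squared Mahalanobis distance equals the 95 percent chi-square quantile for two variables.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in data (assumed means and spreads), not real players.
heights = [rng.gauss(182.0, 7.0) for _ in range(200)]
weights = [77.0 + 0.9 * (h - 182.0) + rng.gauss(0.0, 5.0) for h in heights]

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

mh, mw = mean(heights), mean(weights)
s_hh = covariance(heights, heights)
s_ww = covariance(weights, weights)
s_hw = covariance(heights, weights)

# Direction of the main axis of the 2x2 covariance matrix (the summary index).
angle = 0.5 * math.atan2(2.0 * s_hw, s_hh - s_ww)
direction = (math.cos(angle), math.sin(angle))

def mahalanobis_sq(h, w):
    """Squared distance from the centre, scaled by the covariance matrix."""
    dh, dw = h - mh, w - mw
    det = s_hh * s_ww - s_hw ** 2
    return (s_ww * dh * dh - 2.0 * s_hw * dh * dw + s_hh * dw * dw) / det

# 95% quantile of the chi-square distribution with 2 degrees of freedom.
limit = -2.0 * math.log(0.05)

outside = sum(1 for h, w in zip(heights, weights) if mahalanobis_sq(h, w) > limit)
print("players outside the 95% border:", outside)
```

With 200 roughly normally distributed points, the count printed should land somewhere near 10, although it will vary from sample to sample.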
Changing the scope of data
The interesting part comes when you apply this data analytics model to future samples. Will the new data conform to our model, i.e. the expected variability among the samples so far analyzed? For example, what happens if we add the data or measurements of an elite basketball player to our soccer player data?
Several things happen when we add this player to the data.
First of all, the orientation of the summary index changes a bit. The original index is yellow and the updated one, taking the basketball player into account, is red. You can see there is a slight rotation in the direction of the summary variable following the introduction of the data points for the basketball player.
Second, although it’s obvious that the basketball player is very different from the soccer players, the weight-height relationship for the basketball player conforms to that of the soccer players. Looking at this data, we can see that the weight-height relationship among the variables is not changing when we add the basketball player to the data set. So while there is no doubt that the basketball player is very different from the soccer players, he still conforms to the model.
This is a typical behavior in data analytics. When you apply your model to future samples, you may have a deviation, but it doesn’t necessarily modify what we call the correlation structure.
Adding a third set of data
We can take a look at the effect of another deviation on the model. For example, if we add in the body height and weight of a sumo wrestler, we will see a completely different behavior for the summary index. In the graph below, the yellow line represents the original summary index and the red line shows the influence of the new data on the summary index, which is rotating up. Clearly the sumo wrestler has a stronger influence on the model, or has a higher “leverage” as we say in statistics.
If we take a closer look at the deviation compared to the typical soccer player (for example, using a diagnostic tool found in SIMCA called the contribution plot), we can see the reason, or the pattern, in how the sumo wrestler is deviating from the model of the elite soccer players. The graph below shows that the sumo wrestler is slightly taller than the typical soccer player (not by much) but on the other hand he is much heavier than the typical soccer player.
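To make the rotation concrete, here is a small Python sketch that refits the summary direction after adding one extra point and reports how far it turns. Everything in it is an assumption for illustration: the ten reference players, the basketball player at 210 cm and 110 kg, and the sumo wrestler at 185 cm and 150 kg are made-up values, and the direction is computed as the main axis of the covariance matrix of the raw measurements.

```python
import math

# Hypothetical (height_cm, weight_kg) pairs standing in for soccer players.
soccer = [(170, 66), (174, 70), (176, 71), (178, 74), (180, 75),
          (182, 77), (184, 79), (186, 81), (188, 84), (192, 88)]

def summary_angle(points):
    """Angle in degrees of the main axis of the height-weight covariance."""
    n = len(points)
    mh = sum(h for h, _ in points) / n
    mw = sum(w for _, w in points) / n
    s_hh = sum((h - mh) ** 2 for h, _ in points) / (n - 1)
    s_ww = sum((w - mw) ** 2 for _, w in points) / (n - 1)
    s_hw = sum((h - mh) * (w - mw) for h, w in points) / (n - 1)
    return math.degrees(0.5 * math.atan2(2.0 * s_hw, s_hh - s_ww))

basketball_player = (210, 110)  # assumed values: extreme, but proportionate
sumo_wrestler = (185, 150)      # assumed values: average height, very heavy

base = summary_angle(soccer)
rotation_basketball = summary_angle(soccer + [basketball_player]) - base
rotation_sumo = summary_angle(soccer + [sumo_wrestler]) - base

print(f"rotation after adding basketball player: {rotation_basketball:.1f} degrees")
print(f"rotation after adding sumo wrestler: {rotation_sumo:.1f} degrees")
```

With only ten reference points a single newcomer has far more pull than it would among 200 players, so the rotations here are exaggerated, but the contrast between the two newcomers is the point.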
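A simplified stand-in for this kind of diagnostic is to express each measurement of the new sample as a number of standard deviations away from the typical soccer player. The Python sketch below does only that; it is not SIMCA's contribution plot calculation, and the centre, spread, and sumo wrestler values are hypothetical.

```python
# Hypothetical centre and spread of the soccer-player model.
centre = {"height_cm": 182.0, "weight_kg": 77.0}
spread = {"height_cm": 7.0, "weight_kg": 8.0}

# Assumed measurements for the sumo wrestler.
sumo_wrestler = {"height_cm": 185.0, "weight_kg": 150.0}

def contributions(sample):
    """Deviation of each variable from the centre, in standard deviations."""
    return {name: (sample[name] - centre[name]) / spread[name] for name in sample}

sumo_contributions = contributions(sumo_wrestler)
for name, value in sumo_contributions.items():
    print(f"{name}: {value:+.2f} standard deviations")
```

Read this way, the height bar is small and the weight bar is very large, which is the pattern described above.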
An unexpected deviation
Sometimes, your data will present a deviation that is completely different than you expected. So how can you interpret that? In the example below, we have a point that is completely off the model. In this case, we have added in the data for height and weight of a gorilla.
So according to the contribution plot, the gorilla is much shorter than the typical soccer player, but much heavier. There is no way on earth we can get the model for the soccer players to include the data for the gorilla. If we want to have a good model for gorillas, we must add in the measurements for more individuals, and calculate a local model for the gorillas.
This is also something you may come across when you’re modeling process data. In this case, you will need to have more than one model. You will need to have a local model for one type of production condition and another local model for a second type. One of the critical issues going forward then will be which model is applicable when we add in our next set of data.
Aligning Data to Tell a Story
As you can see, the objective of multivariate data analysis is to organize our data so that it can tell a useful story. How do we do that? In summary, we must:
- Calculate the relationship between the variables
- Define a ‘normal’ region within which most of the data points lie
- Use that information to diagnose future samples (data points)
Applying this to production or manufacturing, we would use the same principles. If the future samples end up inside the model border (the ellipse), then we know that the new production conditions are in line with what were ‘good’ conditions previously.
But if we have a deviation, we may have one of several types:
- A deviation that is extreme, but still conforms to the relationship among all our measured variables (Basketball Player)
- A deviation that is influencing the direction of our summary indexes, which we may or may not be able to add into our model (Sumo Wrestler)
- A data point that is fundamentally different and completely off the model (Gorilla)
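One way to tell these cases apart in code is to look at two numbers for each new sample: its position along the summary direction and its distance away from that direction. The Python sketch below does this for two standardized variables. The model centre, spread, correlation, three-standard-deviation limits, and all sample measurements are hypothetical, and the sketch deliberately does not try to separate a sumo-like point from a gorilla-like one, since that call depends on whether the point belongs in the model at all.

```python
import math

# Hypothetical soccer-player model (illustrative numbers, not fitted to real data).
CENTRE = {"height_cm": 182.0, "weight_kg": 77.0}
SPREAD = {"height_cm": 7.0, "weight_kg": 8.0}
CORRELATION = 0.8  # assumed height-weight correlation
K = 3.0            # assumed limits at three standard deviations

# For two standardized, positively correlated variables the summary direction
# is the diagonal; the variance along it is 1 + r and across it is 1 - r.
SCORE_LIMIT = K * math.sqrt(1.0 + CORRELATION)
RESIDUAL_LIMIT = K * math.sqrt(1.0 - CORRELATION)

def diagnose(height_cm, weight_kg):
    """Label a new sample by its score and its distance from the model."""
    z_h = (height_cm - CENTRE["height_cm"]) / SPREAD["height_cm"]
    z_w = (weight_kg - CENTRE["weight_kg"]) / SPREAD["weight_kg"]
    score = (z_h + z_w) / math.sqrt(2.0)     # position along the summary index
    residual = (z_w - z_h) / math.sqrt(2.0)  # distance away from it
    if abs(residual) > RESIDUAL_LIMIT:
        return "breaks the height-weight relationship"
    if abs(score) > SCORE_LIMIT:
        return "extreme, but conforms to the relationship"
    return "within the normal region"

# Hypothetical measurements for each kind of newcomer.
samples = {
    "soccer player": (180.0, 75.0),
    "basketball player": (210.0, 110.0),
    "sumo wrestler": (185.0, 150.0),
    "gorilla": (160.0, 160.0),
}
labels = {name: diagnose(h, w) for name, (h, w) in samples.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

The order of the checks matters: a sample that breaks the relationship is reported as such even if its score is also extreme, because a score only means something for samples that fit the model.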
So using this analogy, the key is to get the gorillas out of your data, decide whether the sumo wrestlers are worth keeping, and, if you want the model to account for the basketball players, adjust your process settings to cover the gap in the data.