Umetrics Suite Blog

What tools make DOE data analysis faster and more accurate?

April 19, 2018

In life science, biopharma and other areas of research, development and production, design of experiments (DOE) provides a systematic method to determine cause and effect relationships between factors and responses affecting a process, product or analytical system. But the key to understanding your results is effective analysis of your experimental data.

In life science, pharma, biopharma and other areas of research, development and production, effective data analysis of experimental designs can make the difference between discovery and failure.

Efficient and accurate design and analysis of experiments can make the difference between successfully optimizing your testing process and wasting time redoing experiments or, worse, failing to uncover important cause and effect relationships in your tests.

Using the right data analytics tools and statistical analysis of experimental data can help improve your process and give you a level of confidence in your methods, as well as your results. Let’s take a look at some of the methods and tools used for experimental design data analysis, as well as a couple of design of experiments examples.

DOE data analysis wizard

Data analysis of an experimental design can be tricky for the newcomer, as well as for experienced researchers. Tools that help you verify the statistical accuracy of your results, cull the data, and ensure your models are viable will go a long way toward giving you a successful design of experiments output.

The DOE data analysis wizard in MODDE software provides a guided step-by-step approach to verifying, visualizing, and analyzing your experimental data. The wizard includes the following tools:

  • Replicate plot
  • Histogram plot
  • Coefficient plot
  • Summary of fit plot
  • Residual plot
  • Observed vs. predicted plot
  • Summary page at the end

Replicate plot

replicate plot

The first tool, or plot, presented by the data analysis wizard is the replicate plot, which displays the measured response values for all experimental runs. It allows you to see the variation in the results for all experiments in a quick raw data inspection.

The wizard runs several checks in the background for you here. First, you have a variability test that checks to see if the numerical spread in replicated points is large or small compared to the numerical spread across all experiments. Repeated experiments appear in different colors connected by a line (as seen in the figure above).

Low variability in the replicates implies that your measurement reproducibility is good. When replicated experiments show too much variability, a warning will appear. Replicated experiments deviating significantly from the other replicates should be checked.
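To make the idea behind the variability test concrete, here is a minimal Python sketch that compares the spread of the replicated runs with the spread across all runs. It is not the MODDE implementation: the use of the range (max minus min) as the measure of spread, the function name `spread_ratio` and the response values are all assumptions made for illustration.

```python
def spread_ratio(responses, replicate_idx):
    """Range of the replicated runs divided by the range of all runs."""
    replicates = [responses[i] for i in replicate_idx]
    replicate_range = max(replicates) - min(replicates)
    overall_range = max(responses) - min(responses)
    return replicate_range / overall_range

# Hypothetical response values for 11 runs; the last three are replicates.
responses = [62.0, 71.5, 55.2, 80.1, 66.3, 74.8, 58.9, 77.4, 68.0, 68.6, 67.5]
ratio = spread_ratio(responses, replicate_idx=[8, 9, 10])
print(f"replicate range / overall range = {ratio:.3f}")
```

A ratio close to zero means the replicates sit tightly together relative to the rest of the design, which is the situation described above as good reproducibility.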

Another check that is done in the background is the interquartile range test. This checks for outliers in the measured data (not just among the replicated experiments). If there is a problem, the green indicator will change to a yellow warning. For data that have a skewed distribution, some observations may be flagged as outliers. These can be handled with a suitable transformation, which is the next step in the analysis wizard.
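An interquartile range check can be sketched in a few lines of Python. The version below uses the common rule of flagging values more than 1.5 times the interquartile range outside the quartiles. The factor 1.5, the quartile method and the data are assumptions; the exact rule used by the wizard may differ.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)   # quartiles, default 'exclusive' method
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical responses with one suspiciously large measurement.
responses = [62.0, 71.5, 55.2, 80.1, 66.3, 74.8, 58.9, 77.4, 68.0, 68.6, 140.0]
print(iqr_outliers(responses))
```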

Histogram plot 

histogram-plot

The second component of the analysis wizard is the histogram plot. It shows the shape of the response distribution and uses a skewness test in the background to determine if a transformation is needed. The desired distribution is a “bell shaped” normal distribution.

A proper estimate of the distribution requires a minimum of 20-25 observations. The auto transform tool can be used when the distribution of the data isn’t “normal.” In general, normally distributed responses will give the best model estimates and statistics.

(You can find out more about the equations used for the skewness test and statistics involved in the user guide of the MODDE software).
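As a rough illustration of what a skewness check does, the Python sketch below computes the standard moment-based skewness coefficient before and after a log transformation. This is an assumption for illustration only: the statistic and threshold that MODDE actually uses are described in its user guide, and the data here are made up.

```python
import math

def skewness(values):
    """Moment-based sample skewness: third central moment / variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed response with 20 observations.
raw = [1.1, 1.2, 1.4, 1.5, 1.7, 1.9, 2.1, 2.4, 2.7, 3.1,
       3.5, 4.0, 4.7, 5.6, 6.8, 8.3, 10.4, 13.0, 17.5, 25.0]
logged = [math.log10(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness before: {skew_raw:.2f}, after log10: {skew_log:.2f}")
```

A value near zero indicates a roughly symmetric distribution, so a transformation that moves the skewness toward zero brings the response closer to the desired bell shape.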

Coefficient plot

coefficient plot 1

The next tool in the analysis wizard is the coefficient plot. This displays the impact of the various model terms on the response variable you are currently modeling. Always watch out for coefficients whose confidence interval overlaps zero; these are not statistically significant and can be removed. Sometimes a small model term survives even though its confidence interval marks it as not significant, because it makes a beneficial contribution toward increasing the Q2.

coefficient plot 2

In this view the coefficients overlapping zero have been removed, but one survived.
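The "confidence interval overlapping zero" rule can be sketched in Python for the simplest case, a balanced two-level design with factors coded as -1 and +1. Under that assumption each coefficient is just the average of x times y, and all coefficients share the same standard error. The design, the response values and the function name are hypothetical; the critical value 2.776 is the two-sided 95% Student's t value for 4 degrees of freedom, taken from a standard table.

```python
import math
from itertools import product

def coefficient_intervals(design, y, t_crit):
    """Coefficients and shared CI half-width for an orthogonal +/-1 coded design."""
    n = len(y)
    # Constant column followed by one column per factor.
    columns = [[1.0] * n] + [list(col) for col in zip(*design)]
    coefs = [sum(x * yi for x, yi in zip(col, y)) / n for col in columns]
    fitted = [sum(b * col[i] for b, col in zip(coefs, columns)) for i in range(n)]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    dof = n - len(columns)                       # residual degrees of freedom
    half_width = t_crit * math.sqrt(rss / dof / n)
    return coefs, half_width

# Hypothetical 2^3 full factorial (8 runs) with made-up responses.
design = list(product([-1.0, 1.0], repeat=3))
y = [52.1, 54.0, 60.3, 61.8, 51.5, 54.9, 59.6, 62.4]
coefs, half_width = coefficient_intervals(design, y, t_crit=2.776)

for name, b in zip(["constant", "x1", "x2", "x3"], coefs):
    verdict = "overlaps zero" if abs(b) <= half_width else "significant"
    print(f"{name}: {b:.3f} +/- {half_width:.3f} ({verdict})")
```

In this made-up data set the x1 coefficient is far smaller than the interval half-width, so it would be a candidate for removal, while x2 and x3 would stay.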

At this point, two additional tools that can be useful are the square test and interaction test. Both are available in the MODDE data analysis wizard.

  • Square test - This test automatically checks for square terms. It adds the square terms one by one to the linear model and investigates whether each term contributes beneficially to the model quality.
  • Interaction test - This test investigates whether there are significant interaction terms that may help improve the linear model.

These tests can examine which model terms contribute to the increase in the Q2. These are typically used for a screening DOE that uses a linear model. When you are doing an optimization DOE, you have square terms in your model by default, so then these tests don’t apply.
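The "add one term at a time and see whether Q2 goes up" idea can be sketched as follows for an interaction test. The sketch assumes that Q2 is computed as 1 - PRESS/SS with leave-one-out cross-validation, which may differ from the cross-validation scheme in MODDE. It also assumes a balanced two-level design coded as -1 and +1, where every run has the same leverage (number of model terms divided by number of runs), so the leave-one-out residuals can be obtained without refitting. All names and data are hypothetical.

```python
from itertools import combinations, product

def q2_orthogonal(columns, y):
    """Leave-one-out Q2 for a least-squares fit on orthogonal +/-1 columns."""
    n, p = len(y), len(columns)
    coefs = [sum(x * yi for x, yi in zip(col, y)) / n for col in columns]
    fitted = [sum(b * col[i] for b, col in zip(coefs, columns)) for i in range(n)]
    # Each run has leverage p/n, so its leave-one-out residual is e / (1 - p/n).
    press = sum(((yi - fi) / (1 - p / n)) ** 2 for yi, fi in zip(y, fitted))
    y_mean = sum(y) / n
    ss_total = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - press / ss_total

# Hypothetical 2^3 full factorial with a strong x1*x2 interaction in the data.
design = list(product([-1.0, 1.0], repeat=3))
factors = {name: list(col) for name, col in zip(["x1", "x2", "x3"], zip(*design))}
y = [48.8, 49.3, 44.1, 45.6, 46.7, 47.2, 58.9, 59.4]

linear_model = [[1.0] * len(y)] + list(factors.values())
q2_linear = q2_orthogonal(linear_model, y)

improved = []
for a, b in combinations(factors, 2):
    interaction = [u * v for u, v in zip(factors[a], factors[b])]
    q2_new = q2_orthogonal(linear_model + [interaction], y)
    if q2_new > q2_linear:
        improved.append(f"{a}*{b}")
    print(f"{a}*{b}: Q2 {q2_linear:.2f} -> {q2_new:.2f}")
```

A square test would follow the same pattern with squared columns, but it needs a design with more than two levels per factor, so it is not shown here.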

Summary of fit plot

summary of fit plot

The fourth stage of the analysis wizard is the “summary of fit” plot. This displays four model performance indicators.

The green bar (shown above) is R2 (model fit), which shows how well the model fits the measured data. An R2 of 0.5 indicates a model with rather low significance.

The blue bar (Q2) shows an estimate of the future prediction precision. Q2 should be greater than 0.1 for a significant model, and greater than 0.5 for a good model.

The yellow bar is model validity, a test of whether the right type of model is fitted to the measured data (linear model vs interaction model vs quadratic model). A value less than 0.25 indicates statistically significant model problems, such as the presence of outliers, an incorrect model type or a transformation problem.

The last bar, in light blue (cyan), indicates the reproducibility of your data. Reproducibility is the variation of the replicates compared to overall variability. When replicates have a very small numerical spread, the reproducibility bar is high, which supports a strong model. A value greater than 0.5 will deliver the best results.
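Two of these indicators are easy to sketch in Python. R2 is computed in the usual way as one minus the residual sum of squares over the total sum of squares. For reproducibility, the sketch assumes the form "one minus the replicate variance over the overall variance", which matches the description above but is not confirmed here as the exact MODDE formula. Q2 and model validity are left out, and all data are hypothetical.

```python
from statistics import mean, variance

def r_squared(observed, fitted):
    """R2 = 1 - SS_residual / SS_total (total corrected for the mean)."""
    grand_mean = mean(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - grand_mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def reproducibility(observed, replicate_idx):
    """Assumed form: 1 - variance of the replicates / variance of all runs."""
    replicates = [observed[i] for i in replicate_idx]
    return 1 - variance(replicates) / variance(observed)

# Hypothetical measured and model-fitted values; the last three runs are replicates.
observed = [62.0, 71.5, 55.2, 80.1, 66.3, 74.8, 68.0, 68.6, 67.5]
fitted = [61.2, 72.4, 56.0, 79.0, 67.1, 73.9, 68.3, 68.3, 68.3]

r2 = r_squared(observed, fitted)
repro = reproducibility(observed, replicate_idx=[6, 7, 8])
print(f"R2 = {r2:.3f}, reproducibility = {repro:.3f}")
```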

Residuals normal probability plot

residuals-n-plot

Next in the analysis wizard is the residuals N-plot. This indicates the outliers in the data set. Outliers are experimental points which have an unusually large difference between the measured response value and the value estimated by the model.

This test is based on the critical assumption behind multiple linear regression (MLR) that the fitted model will have normally distributed residuals with a mean value of 0 (zero). If the residuals are normally distributed, the points will line up neatly along a diagonal line, like the example above.

Here, the diagonal line is anchored at a probability of 0.5 (or 50%) and a residual value of 0. You can also see the action limits at -4 and +4 standard deviations. As long as all points are inside that interval, the result is OK and suggests no outliers in the data set. However, as soon as a point falls outside plus or minus 4 standard deviations, a yellow warning triangle will appear in the information part of the tool.
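The construction of a normal probability plot and the plus or minus 4 standard deviation check can be sketched in Python as follows. The sketch scales each residual by the residual standard deviation and uses the plotting positions (i + 0.5)/n. Both choices, as well as the function names and the data, are assumptions for illustration; the wizard may standardize the residuals differently.

```python
import math
from statistics import NormalDist

def standardize(residuals, n_params):
    """Scale residuals by the residual standard deviation (RSS / dof) ** 0.5."""
    dof = len(residuals) - n_params
    rsd = math.sqrt(sum(r * r for r in residuals) / dof)
    return [r / rsd for r in residuals]

def n_plot_points(residuals, n_params):
    """(standardized residual, cumulative probability, normal quantile) per point."""
    ordered = sorted(standardize(residuals, n_params))
    n = len(ordered)
    probs = [(i + 0.5) / n for i in range(n)]          # assumed plotting positions
    return [(r, p, NormalDist().inv_cdf(p)) for r, p in zip(ordered, probs)]

def flag_outliers(residuals, n_params, limit=4.0):
    """Indices of runs whose standardized residual lies outside +/- limit."""
    scaled = standardize(residuals, n_params)
    return [i for i, r in enumerate(scaled) if abs(r) > limit]

# Hypothetical residuals from 24 runs and a 3-term model; the last run is far off.
residuals = [0.1, -0.1] * 11 + [0.1, 3.0]
points = n_plot_points(residuals, n_params=3)
outliers = flag_outliers(residuals, n_params=3)
print("flagged runs:", outliers)
```

With this simple scaling a residual can never exceed the square root of the residual degrees of freedom, so the limit of 4 can only be crossed when there are more than 16 of them. That is why the example uses 24 runs.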

Observed vs. predicted plot

observed vs predicted plot

The final plot of the wizard visualizes the relationship between measured and calculated data. For a good model, the points should be close to a straight line. The closer the points are scattered around the line, the better the model and the higher the R2 (recall the green bar in the Summary of fit plot). A curved structure might indicate non-linearity.

One-click logics

At each stage of the analysis wizard there is a One-Click option. It is there to help the novice perform reliable data analysis and avoid mistakes. The One-Click feature applies some “smartness” to the data evaluation and model tuning. Unless it encounters a situation where a user action or decision is needed, it automatically steps through all stages of the analysis wizard and reports the final results on a summary page. The summary page (not illustrated) is the last stage before exiting the analysis wizard.

DOE analysis examples

Would you like to see some examples of how these data analytics tools can be applied to design of experiments data? Watch our webinar video called “Lean and clean DOE using One-Click analysis” which includes two case examples:

  • Screening example – Tablet excipients (Pharmaceutical)
  • Optimization example – Truck engine (Manufacturing)
  • Plus, a demo of the MODDE data analysis wizard.

Want to try it?

Get a free trial of MODDE software, which includes a Design of Experiments data analysis wizard.

Topics: Multivariate Data Analysis, Design of Experiments (DOE), Data Interpretation & Analysis

Written by Lennart Eriksson

Sr Lecturer and Principal Data Scientist at Sartorius Stedim Data Analytics