Umetrics Suite Blog

How Principal Component Analysis Helps Optimize Chemical Selection for Solvents, Reagents and Molecules

June 12, 2019

In agrochemical, pharmaceutical and other industries that manufacture complex chemicals, finding ways to reduce waste and improve efficiency often hinges on selecting the right chemical compounds. Data analytics can help manufacturers find alternative compounds that meet complex requirements, decrease raw material usage or enable more cost-effective, sustainable processes.

But selecting the best chemical compounds is no simple matter. Chemists often face conflicting considerations when evaluating compounds such as solvents, reagents, catalysts and chromatographic materials. Beyond chemical functionality and physical properties, a whole range of factors, from regulatory concerns to robustness to safety, can play a role.


The data analytics method of principal component analysis provides a way to select reagents, columns, solvents and other chemical compounds based on empirical data.

Rather than relying on trial and error or choosing compounds based on descriptions in publications, chemists can apply principal component analysis to their own process data to visualize key performance indicators (KPIs), the factors that matter most.

Common KPIs in organic process chemistry include process mass intensity (PMI), cost per kg of final product, overall yield, step count, longest linear sequence, volume-time output, solubility and similarity of other chemical properties. These can be calculated from data found in batch records or representative laboratory experiments. Soft factors such as process robustness, which are typically hard to quantify but important for judging the quality of a chemical process, can be evaluated with data analytics as well.

Read more: Case Story about molecule selection

Uncovering chemical similarity

Data analytics is particularly useful for evaluating the chemical properties of compounds. Principal component analysis (PCA) provides a means of selecting subsets of compounds based on specific characteristics. It does this by finding the major sources of variation in proprietary data using a mathematical technique called projection. This projection finds the main trends, or “latent variables”, that characterize a data set. The results can be displayed graphically for ease of interpretation.
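In matrix terms, the projection is usually written as the decomposition below. The notation is an assumption introduced here for illustration: X is the table of N compounds by K properties (typically centred and scaled), the columns t_a of T are the scores, the columns p_a of P are the loadings, A is the number of components kept, and E holds whatever variation the A components leave unexplained.

```latex
X = T P^{\top} + E = \sum_{a=1}^{A} t_a \, p_a^{\top} + E
```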

PCA provides a visual display of chemical data in two main plots:

  • scores plot: a summary of the observations
  • loadings plot: a summary of the variables

The two plots are typically used in combination to gain an understanding of the data at a glance. Chemicals that lie close together in the scores plot have similar properties, while those lying far apart have different properties.
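As a rough illustration of where the two plots come from, here is a minimal sketch that computes scores and loadings with a singular value decomposition in NumPy. It is not how SIMCA or any particular tool implements PCA; the function name `pca`, the compound names and the property values are all made up for the example.

```python
import numpy as np

def pca(X, n_components=2):
    """Return (scores, loadings, explained variance ratio) for a compounds-by-properties table."""
    X = np.asarray(X, dtype=float)
    # Autoscale: centre each property and scale it to unit variance so that
    # properties measured in large units do not dominate the projection.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]  # one row per compound
    loadings = Vt[:n_components].T                   # one row per property
    explained = (s**2 / np.sum(s**2))[:n_components]
    return scores, loadings, explained

# Made-up property table: 4 hypothetical compounds x 3 hypothetical properties.
names = ["A", "B", "C", "D"]
X = [[1.0, 10.0, 0.5],
     [1.1, 11.0, 0.6],
     [5.0, 50.0, 3.0],
     [5.2, 48.0, 3.1]]

scores, loadings, explained = pca(X)
for name, (t1, t2) in zip(names, scores):
    print(f"{name}: t1={t1:+.2f}, t2={t2:+.2f}")
```

Plotting the first column of `scores` against the second gives the scores plot, and doing the same with `loadings` gives the loadings plot. In this toy table, A and B land close together and far from C and D.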

This approach can be used for selecting solvents, visualizing data from HPLC columns, selecting reagents and surfactants, and mapping molecules.

Selecting solvents using data analysis

The selection of an appropriate solvent for a chemical synthesis or chromatographic separation is often driven by experience, habit or even by what is available. Many less well-known solvents may never be tried because they are unfamiliar. A simple way of comparing solvents is to conduct a PCA on a table of readily available property data. This yields a chemical map of solvents from which diverse or similar solvents can be selected with ease.


Solvents with similar properties will have the same color on the plot, and those in other categories will have different colors.
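One way to read such a map numerically is to rank every solvent by its distance from a reference solvent in the scores space: the nearest ones are candidate substitutes and the farthest are the most different. The sketch below assumes NumPy. The function name `rank_by_similarity`, the solvent labels S1 to S5 and all property values are hypothetical, not measured data.

```python
import numpy as np

def rank_by_similarity(names, properties, reference, n_components=2):
    """Order solvents from most similar to most different relative to `reference`."""
    X = np.asarray(properties, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscale each property
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]    # position of each solvent on the map
    ref = names.index(reference)
    dist = np.linalg.norm(scores - scores[ref], axis=1)
    order = np.argsort(dist)
    return [(names[i], float(dist[i])) for i in order if i != ref]

# Made-up table: 5 hypothetical solvents x 3 properties
# (think polarity-type and volatility-type descriptors; values are invented).
solvents = ["S1", "S2", "S3", "S4", "S5"]
table = [[80.0, 1.8, 100.0],
         [78.0, 1.7, 97.0],
         [2.0, 0.1, 69.0],
         [2.4, 0.4, 110.0],
         [33.0, 1.7, 65.0]]

ranking = rank_by_similarity(solvents, table, reference="S1")
for name, d in ranking:
    print(f"{name}: distance {d:.2f}")
```

The first entries of `ranking` are the closest neighbours of S1 on the map and the last entries are the most dissimilar solvents.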


Visualizing data from HPLC columns

There are thousands of HPLC columns on the market, and building a library of useful columns may require years of experience. A more rational approach is to compile a table of relevant chromatographic properties measured on a series of columns and use PCA to visualize the data. From the resulting map, a small, diverse library of columns can be assembled, or a column from an alternative supplier with similar properties can be selected.


Using PCA to evaluate the data from HPLC columns makes it easy to see which columns have similar or different properties. A higher number of theoretical plates indicates a better separation efficiency of the column.
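Plate count is one property that could go into such a table. A common estimate, assuming a roughly Gaussian peak, is the half-height formula N = 5.54 (t_R / w_½)², where t_R is the retention time and w_½ the peak width at half height. The sketch below applies it. The function name and the measurements are hypothetical.

```python
import math

def plate_count(retention_time, width_half_height):
    """Theoretical plates N from the half-height peak width (same time units for both)."""
    if retention_time <= 0 or width_half_height <= 0:
        raise ValueError("retention time and peak width must be positive")
    # 8 * ln(2) is the constant usually rounded to 5.54.
    return 8 * math.log(2) * (retention_time / width_half_height) ** 2

# Hypothetical measurements for one test analyte on two columns (made-up numbers).
n_a = plate_count(retention_time=6.0, width_half_height=0.10)
n_b = plate_count(retention_time=6.0, width_half_height=0.15)
print(f"column A: N = {n_a:.0f}, column B: N = {n_b:.0f}")
```

At the same retention time the narrower peak gives the higher plate count, so column A would be the more efficient of the two in this made-up comparison.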

Selecting reagents and surfactants

Properties of the building blocks used to make products such as drugs, surfactants and agrochemicals have a profound effect on the efficacy of the final product. PCA is an ideal way of investigating the properties of potential substituents before synthesis is undertaken.

The principles of Design of Experiments (DOE) may be applied to select a small set of diverse and representative compounds from a PCA projection for testing. Having tested the subset of compounds, the properties of the untested compounds may be predicted using a multivariate QSAR model based on Partial Least Squares Regression (PLS). In this way, there is huge potential for reducing testing costs.


The illustration shows a representative subset of heterocyclic bases selected from a PCA of calculated properties using a D-Optimal Design.
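To give a feel for design-based selection, the sketch below picks a subset of candidates from PCA scores with a simple greedy rule that adds, at each step, the compound that most increases the determinant of the information matrix for a linear model in the scores. This is a simplified stand-in, not the D-Optimal algorithm used to produce the illustration, and it leaves out the PLS modeling step. The function name `greedy_d_optimal`, the `ridge` parameter and the candidate scores are assumptions made for the example.

```python
import numpy as np

def greedy_d_optimal(scores, n_select, ridge=1e-6):
    """Greedily choose row indices of `scores` that maximize det(F'F) for a linear model."""
    scores = np.asarray(scores, dtype=float)
    # Model matrix: intercept plus one column per principal component.
    F = np.column_stack([np.ones(len(scores)), scores])
    M = ridge * np.eye(F.shape[1])        # small ridge keeps M invertible at the start
    available = np.ones(len(F), dtype=bool)
    chosen = []
    for _ in range(n_select):
        # Adding candidate f multiplies det(M) by (1 + f' M^-1 f), so take the largest gain.
        gain = np.einsum("ij,jk,ik->i", F, np.linalg.inv(M), F)
        gain[~available] = -np.inf        # never pick the same compound twice
        best = int(np.argmax(gain))
        chosen.append(best)
        available[best] = False
        M += np.outer(F[best], F[best])
    return chosen

# Hypothetical candidate scores: four extreme compounds plus a dense cluster near the centre.
rng = np.random.default_rng(0)
corners = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
cluster = rng.uniform(-0.2, 0.2, size=(20, 2))
candidate_scores = np.vstack([corners, cluster])

subset = greedy_d_optimal(candidate_scores, n_select=4)
print(sorted(subset))
```

With these made-up candidates the rule favours the four extreme compounds over the central cluster, which is the behaviour expected of a design that spans the property space. A real design would normally also include centre points and use a proper exchange algorithm.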

Amino acid mapping for molecule selection

Measured and calculated properties of amino acids may be encoded using PCA to produce new property scales, which may subsequently be used in quantitative sequence activity modeling, peptide QSAR or structural proteomics. These scales, derived from a PCA of the properties, reflect lipophilic, size and charge characteristics of the amino acids.


The plot on the left shows the scores map of naturally occurring amino acids and the plot on the right displays the variables that characterize them. The first component mainly reflects lipophilicity and the second molecular size. The third (not shown) reflects electronic properties.
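Once such scales exist, a peptide can be turned into a numeric descriptor by replacing each residue with its scale values and concatenating the results, which is the usual starting point for sequence-activity modeling. The sketch below shows that encoding step only. The function name `encode_peptide` is hypothetical and the scale values are placeholders, not published numbers.

```python
def encode_peptide(sequence, scales):
    """Concatenate the per-residue scale values into one descriptor vector."""
    descriptor = []
    for position, residue in enumerate(sequence, start=1):
        if residue not in scales:
            raise KeyError(f"no scale values for residue {residue!r} at position {position}")
        descriptor.extend(scales[residue])
    return descriptor

# Placeholder scale values for three residues: NOT published numbers, illustration only.
# Each tuple stands for (component 1, component 2, component 3) of a PCA-derived scale.
toy_scales = {
    "A": (0.1, -1.5, 0.2),
    "G": (2.0, -4.0, 0.5),
    "L": (-4.0, -1.0, -1.0),
}

x = encode_peptide("GAL", toy_scales)
print(x)
```

A tripeptide described by three scales becomes a vector of nine numbers, and a set of such vectors forms the descriptor table for a regression model such as PLS.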


Principal component analysis is an effective way of displaying chemical data to inform decisions about compound selection. Chemical maps provide a way to select reagents, columns, solvents and other chemical compounds based on empirical data rather than through guesswork or trial and error.

Data analytics tools such as SIMCA that use principal component analysis to visualize data with color-coded maps make it easier for chemists, researchers and manufacturers to select chemical compounds more effectively. This method can be applied across a broad range of chemical applications, as some of the examples above show.

Read an example of molecule selection

Find out how a leading producer of specialty chemicals used the Umetrics Suite to develop a better way to screen molecules and save resources. The company created a Virtual Lab that uses the SIMCA multivariate data analytics engine to visualize different molecules in relation to each other.



Topics: Data Analytics, Data Visualization, Principal component analysis (PCA)


Written by Stefan Langner

Stefan Langner is Market Manager for Process Industries (Food & Beverage and Chemical) markets at Sartorius Stedim Data Analytics.
