Umetrics Suite Blog

Data analytics techniques help researchers identify promising breast cancer biomarkers

March 12, 2019

Breast cancer is the most commonly diagnosed cancer among women worldwide and a leading cause of cancer-related deaths among women. It’s the second most common type of cancer overall. According to the International Agency for Research on Cancer, there were more than 2 million new cases in 2018.


Using data analytics methods such as PCA and OPLS-DA can help researchers find potential breast cancer biomarkers from analyzed lipid samples.

The survival rate of breast cancer is highly dependent on early diagnosis. But one of the problems with breast cancer today is that it’s often diagnosed at a late stage. The main methods of early detection are breast self-examination and mammography, and detection by these methods tends to happen at later stages.

Mammography not optimal for early detection

Mammography is currently the most widely used method for detection of breast cancer. However, it has a high false positive rate. The rate of over-diagnosis of breast cancer in screening mammography ranges from 0% to more than 30%. After 10 yearly mammograms, the chance of having a false positive is about 50-60 percent.
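As a rough illustration of how a modest per-screen false positive rate can accumulate to the 50-60 percent figure, the sketch below assumes that each yearly screen is an independent event with the same false positive probability. That independence assumption and the function names are ours for illustration and do not come from the screening studies.

```python
def cumulative_false_positive(per_screen_rate: float, n_screens: int) -> float:
    """Chance of at least one false positive across n independent screens."""
    return 1.0 - (1.0 - per_screen_rate) ** n_screens


def implied_per_screen_rate(cumulative: float, n_screens: int) -> float:
    """Per-screen rate that would produce the given cumulative chance."""
    return 1.0 - (1.0 - cumulative) ** (1.0 / n_screens)


if __name__ == "__main__":
    for cumulative in (0.50, 0.60):
        rate = implied_per_screen_rate(cumulative, 10)
        print(f"{cumulative:.0%} over 10 screens implies about {rate:.1%} per screen")
```

Under this simplified model, a cumulative chance of 50-60 percent after 10 screens corresponds to roughly 7 to 9 percent per individual screen.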

In addition, recent large-scale retrospective studies of Norwegian, European and North American women indicated that routine mammography had little or no impact on breast cancer mortality. In light of this, there is a need for an improved early detection method, preferably one that is not invasive.

In search of breast cancer biomarkers

One option for a minimally invasive diagnostic tool for early detection of malignant breast tumors would be a biomarker that can be detected through a blood test. For breast cancer, the potential of metabolomics and lipidomics for cancer diagnosis has been well documented.

However, treatment for breast cancer is dependent on the subtype, and identifying the subtype isn’t always easy. There are five molecular subtypes: Luminal A, Luminal B, HER2+, Basal-like or triple negative, and unclassified.

Breast cancer treatment is dependent on the molecular subtype, such as Luminal A, Luminal B, HER2+, Basal-like or other. (Image from: Sandhu et al., LabMed 2010 41(6): 364-372.)

Both qualitative and quantitative analysis of lipids can be done using data from liquid chromatography and mass spectrometry. Univariate and multivariate statistics, including principal components analysis (PCA) and regression analysis, can then be used to identify variations in individual lipid species and/or to reduce the dimensionality of the data for visualization and grouping of cases and controls. From this, possible (single or multi-parameter) markers can be identified.

An example using six breast cancer cell lines

In a recent study, two researchers in Iceland examined six breast cancer cell lines for potential lipid biomarkers: Dr Finnur Freyr Eiríksson of ArcticMass Ltd., Reykjavík, Iceland, and Prof Margrét Thorsteinsdóttir of the Faculty of Pharmaceutical Sciences, University of Iceland.

The aim of the project was to investigate the lipidome of breast cancer cell lines and to:

  • relate the lipidome to subtypes
  • search for novel biomarkers for the subtypes (to distinguish between the subtypes)

The overall goal was to discover a new, improved and non-invasive diagnostic tool for breast cancer.

The researchers investigated the lipidome of six breast cancer cell lines using an ultra-performance liquid chromatography quadrupole time-of-flight mass spectrometry (UPLC-QTOF-MS) platform. PCA and OPLS-DA revealed candidate lipid biomarkers for specific breast cancer subtypes, which may help in the early diagnosis of cancer and give the opportunity of subtype-specific treatment.

After lipids were extracted and injected onto the UPLC-QTOF-MS, the data was processed using PROGENESIS™ QI and then imported into SIMCA for multivariate data analysis (MVDA). The researchers looked for changes in the lipidome between the breast cancer cell lines and compared them to the reference cell line.

The first step in the data analysis was to use PCA (Principal Component Analysis) to evaluate the differences between the breast cancer cell lines. PCA is a quick method to visualize the trends and get an overview of the data. The graph (below) is based on 439 ion features that were identified in the Progenesis QI software.

We can see clearly that the samples from each cell line cluster together. Each cell line is represented by three biological replicates, each run three times, giving nine dots per cell line. It was easy to see that the CAMA-1 cell line was closer to the reference cell line than the other luminal A cell lines.

The first step is to look for trends using PCA.

The conclusion the researchers could draw from this graph is that the breast cancer cell lines can be distinguished based on the lipidome, and that the subtypes can be identified based on specific lipids.
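The PCA overview step can be sketched with NumPy on synthetic data standing in for the study's feature table. The seven groups (six cancer cell lines plus the reference), the unit-variance scaling and all variable names are assumptions for illustration; the post does not state how SIMCA was configured, and none of the numbers below are study results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed layout: 6 cancer cell lines + 1 reference, 9 measurements each,
# 439 ion features (the feature count quoted in the post).
n_lines, n_reps, n_features = 7, 9, 439

# Synthetic stand-in data: each cell line has its own mean profile and the
# replicates scatter around it with smaller noise.
line_means = rng.normal(0.0, 1.0, size=(n_lines, n_features))
labels = np.repeat(np.arange(n_lines), n_reps)
X = line_means[labels] + rng.normal(0.0, 0.3, size=(labels.size, n_features))


def pca_scores(X, n_components=2):
    """Return PCA scores and explained-variance ratios via SVD."""
    Xc = X - X.mean(axis=0)              # mean-center each feature
    Xc = Xc / Xc.std(axis=0, ddof=1)     # unit-variance scaling (assumed)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


scores, explained = pca_scores(X)
for line in range(n_lines):
    centroid = scores[labels == line].mean(axis=0)
    print(f"cell line {line}: centroid on PC1/PC2 = {centroid.round(2)}")
print("explained variance (PC1, PC2):", explained.round(3))
```

Plotting the two score columns colored by `labels` would give the kind of score plot described above, with replicates from the same cell line falling close together.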

Then, using orthogonal partial least squares discriminant analysis (OPLS-DA), the researchers were able to quantify the features that are most different between the reference cell line and each of the six breast cancer cell lines. When they compiled the features, they found 242 that substantially overlapped between the cell lines and were able to positively identify 77 that belonged to several known lipid classes.


The red dots show the features that are elevated in the subtype compared to the reference cell line. The blue dots show the features that are down-regulated in the subtype compared to the reference cell line. This was completed for each of the cancer cell lines.
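To make the idea of a two-class OPLS-DA comparison more concrete, here is a NumPy sketch of single-response OPLS (after the procedure described by Trygg and Wold) applied to synthetic data with planted up- and down-regulated features. It is not SIMCA's implementation; the data, the function name `opls_single_y`, and the use of the predictive loading's sign to assign red or blue are all assumptions for illustration.

```python
import numpy as np


def opls_single_y(X, y, n_ortho=1):
    """One predictive component plus n_ortho y-orthogonal components."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    w = X.T @ y
    w /= np.linalg.norm(w)               # predictive weight vector
    ortho_scores = []
    for _ in range(n_ortho):
        t = X @ w
        p = X.T @ t / (t @ t)            # loading for the current predictive score
        w_o = p - (w @ p) * w            # part of the loading not aligned with w
        w_o /= np.linalg.norm(w_o)
        t_o = X @ w_o                    # score of variation unrelated to y
        p_o = X.T @ t_o / (t_o @ t_o)
        X = X - np.outer(t_o, p_o)       # remove the y-orthogonal variation
        ortho_scores.append(t_o)
    t_pred = X @ w
    p_pred = X.T @ t_pred / (t_pred @ t_pred)
    return w, t_pred, p_pred, np.array(ortho_scores)


# Synthetic two-class data: 9 reference samples and 9 samples of one cell line.
rng = np.random.default_rng(1)
n_per_class, n_features = 9, 50
y = np.repeat([0.0, 1.0], n_per_class)   # 0 = reference, 1 = cancer cell line
X = rng.normal(0.0, 0.5, size=(2 * n_per_class, n_features))
X[y == 1, :5] += 2.0                     # planted "elevated" features
X[y == 1, 5:10] -= 2.0                   # planted "down-regulated" features

w, t_pred, p_pred, t_ortho = opls_single_y(X, y)

# Rank features by the size of their predictive loading and split by sign.
top = np.argsort(np.abs(p_pred))[-10:]
up = sorted(int(j) for j in top if p_pred[j] > 0)
down = sorted(int(j) for j in top if p_pred[j] < 0)
print("elevated (red):", up)
print("down-regulated (blue):", down)
```

In this sketch, class separation is carried by the single predictive component, so the sign and size of each feature's predictive loading indicate the direction and strength of its difference from the reference class.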

Using OPLS-DA, the researchers were able to identify trends where lipids were present in specific subtypes. For example, there was a clear increase in triacylglycerols in SK-BR-3 (the HER2 cell line), but the increase was only in those with a fatty acid chain length lower than or equal to 46.

Triacylglycerols were up-regulated in identified cell subtypes.
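The chain-length cut-off mentioned above could be applied to a list of annotated features as in the following sketch. It assumes the value 46 refers to the total carbon count in the common "TG(carbons:double bonds)" shorthand; the feature names are hypothetical and are not taken from the study.

```python
import re

# Hypothetical annotations in "TG(total carbons:double bonds)" shorthand.
features = ["TG(44:1)", "TG(46:2)", "TG(48:1)", "TG(52:3)", "PC(34:1)"]

TG_PATTERN = re.compile(r"^TG\((\d+):(\d+)\)$")


def short_chain_tgs(names, max_carbons=46):
    """Keep triacylglycerols whose total carbon count is <= max_carbons."""
    selected = []
    for name in names:
        match = TG_PATTERN.match(name)
        if match and int(match.group(1)) <= max_carbons:
            selected.append(name)
    return selected


print(short_chain_tgs(features))  # ['TG(44:1)', 'TG(46:2)']
```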

The conclusion the researchers drew, using data analytics, was that triacylglycerols may be used to distinguish between breast cancer cells, and that the subtypes defined by the transcriptome are indeed reflected in the lipidome and may be used to further define the division within the breast cancer subtype.

This shows how PCA and OPLS-DA helped uncover lipid biomarkers for specific subtypes of breast cancer. These biomarkers may help in the early diagnosis of cancer and give the opportunity of subtype-specific treatment.

Find out more

To find out more about how the study was constructed and what MVDA helped to uncover, watch this recorded webinar with two guest speakers.

Lipidomics study reveals promising biomarkers for breast cancer subtypes

Register to watch the recorded webinar and download the presentation.

Topics: Multivariate Data Analysis, Spectroscopy, Pharmaceutical manufacturing, OPLS, Medicine/Health

Written by Lennart Eriksson

Sr Lecturer and Principal Data Scientist at Sartorius Stedim Data Analytics
