Clustering Analysis

Biostatistical Methods


BioXpedia is proud to offer data analysis using clustering analysis.

This data analysis focuses on using clustering methods to gain new knowledge about e.g. molecular subgroups or patterns of protein or gene expression.

The data analysis includes the following components:

  • Detailed PDF report.
  • Data handling.
  • Employment of state-of-the-art clustering methods.
  • Visualization of the clustering results.

Read below for more information on clustering analysis:

Clustering analysis aims to identify subgroups within a set of samples. The general approach for clustering analysis is to minimize the amount of difference in measurements within the same cluster and at the same time maximize the differences between clusters (Pellegrini et al., 2017).

Clustering analysis can be carried out with preexisting knowledge of the number of clusters, e.g. in case-control studies, which makes it possible to identify which variables best characterize each cluster (Pan et al., 2013).

If the number of subgroups is unknown but is of interest, clustering analysis can be used to identify the number of clusters and which samples belong to each cluster, e.g. when detecting molecular subgroups of a disease/condition or cell sub-types based on expression data (Menon, 2018).

K-Means Clustering

We perform clustering analysis in two main ways: K-Means clustering (Lloyd, 1957) and hierarchical clustering (Ward, 1963).

K-Means clustering is based on a specific number of clusters (K), which is chosen by the researcher. If no prior knowledge exists about the number of clustering groups, several methods exist that can assist the researcher in specifying the number of clustering groups.
Clusters are defined by their center (called the centroid) – the average value of the observations belonging to that cluster.

Initially, the samples are divided into K different clusters at random. Then the centroid of each cluster is calculated, and each sample is reassigned to the cluster whose centroid it is closest to. The centroids are then recalculated and the samples reassigned again. This iterative process continues until the clusters no longer change.

Since the initial clusters are random, K-Means clustering can yield different end results each time it is performed. For this reason, it is usually performed multiple times and the clusters that best divide the data are chosen. The result of each round of K-Means is evaluated by summing the distances between the samples and their centroid over all clusters. The clusters with the lowest resulting sum are chosen (Kakushadze and Yu, 2017).
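The procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in our analyses. The function names (`kmeans_once`, `kmeans`), the use of squared Euclidean distance, the re-seeding of a cluster that becomes empty, and the default of 10 restarts are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points (assumed distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def kmeans_once(samples, k, rng, max_iter=100):
    """One K-Means run that starts from a random division into k clusters."""
    # Deal shuffled samples round-robin so that every cluster starts non-empty.
    order = list(range(len(samples)))
    rng.shuffle(order)
    labels = [0] * len(samples)
    for position, index in enumerate(order):
        labels[index] = position % k

    for _ in range(max_iter):
        centers = []
        for c in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            # Assumption: a cluster that lost all members is re-seeded at random.
            centers.append(centroid(members) if members else list(rng.choice(samples)))
        # Reassign every sample to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(s, centers[c]))
            for s in samples
        ]
        if new_labels == labels:  # clusters no longer change
            break
        labels = new_labels

    # Evaluate the run: sum of distances between samples and their centroid.
    cost = sum(squared_distance(s, centers[lab]) for s, lab in zip(samples, labels))
    return labels, cost


def kmeans(samples, k, restarts=10, seed=0):
    """Repeat K-Means several times and keep the run with the lowest sum."""
    rng = random.Random(seed)
    runs = [kmeans_once(samples, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[1])


# Made-up example data: two visibly separate groups of three samples each.
samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, cost = kmeans(samples, k=2)
```

The cluster numbers themselves are arbitrary; only which samples share a number is meaningful.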

Hierarchical Clustering

Hierarchical clustering is most often done using what is referred to as a bottom-up approach. In a bottom-up approach to clustering, each observation is initially regarded as its own cluster. Thus, having 100 samples will result in 100 initial clusters. The distances between all pairs of clusters are calculated, and the two clusters with the shortest distance between them are merged. The distances between clusters are then recalculated, and again the two closest clusters are merged. This process continues until all samples have been merged into a single cluster.
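A bottom-up merge loop of this kind could look as follows. This is a sketch under stated assumptions: the distance between two clusters is taken to be the average Euclidean distance between their members (average linkage), which is only one of several possible choices, and the function names are hypothetical. The loop keeps merging until one cluster remains so that the complete merge history is recorded.

```python
import math
from itertools import combinations


def average_linkage(cluster_a, cluster_b, samples):
    """Mean Euclidean distance over all pairs with one sample from each cluster."""
    total = sum(
        math.dist(samples[i], samples[j]) for i in cluster_a for j in cluster_b
    )
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(samples):
    """Repeatedly merge the two closest clusters and record every merge."""
    clusters = [[i] for i in range(len(samples))]  # each sample starts alone
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the shortest distance between them.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                clusters[pair[0]], clusters[pair[1]], samples
            ),
        )
        height = average_linkage(clusters[a], clusters[b], samples)
        merges.append((sorted(clusters[a]), sorted(clusters[b]), height))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges


# Made-up one-dimensional example with five samples.
samples = [(0,), (1,), (10,), (11,), (50,)]
for members_a, members_b, height in agglomerate(samples):
    print(members_a, "+", members_b, "merged at distance", height)
```

Each recorded merge holds the sample indices of the two clusters and the distance at which they were joined, which is the information a dendrogram displays.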

The results of hierarchical clustering are often displayed as a dendrogram, which looks like an upside-down tree where samples are leaves. When the two closest clusters are merged, the corresponding leaves or branches are joined in the dendrogram.

The number of clusters that results from hierarchical clustering is determined by where the dendrogram is cut. The lower the cut height, the more clusters will result. Thus, setting the cut height in hierarchical clustering is analogous to setting K in K-Means clustering (Ahlqvist, 2018). One of the advantages of hierarchical clustering is the intuitive visualization of the clustering groups using dendrograms. This allows for visually estimating the number of clustering groups. The number of clustering groups can then be used as input for K-Means clustering.
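The effect of the cut height can be illustrated with a short Python sketch. It assumes a hypothetical merge record of the form (members of first cluster, members of second cluster, merge height); both this format and the example values are made up for illustration. Only merges at or below the cut height are applied, so a lower cut leaves more clusters.

```python
def cut_dendrogram(n_samples, merges, cut_height):
    """Return one cluster label per sample after cutting at cut_height."""
    parent = list(range(n_samples))  # union-find structure over sample indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps lookups short
            i = parent[i]
        return i

    for members_a, members_b, height in merges:
        if height <= cut_height:  # merges above the cut are ignored
            parent[find(members_a[0])] = find(members_b[0])

    roots = [find(i) for i in range(n_samples)]
    # Renumber clusters 0, 1, 2, ... in order of first appearance.
    relabel = {root: label for label, root in enumerate(dict.fromkeys(roots))}
    return [relabel[root] for root in roots]


# Hypothetical merge history for five samples (illustrative values only).
merges = [
    ([0], [1], 1.0),
    ([2], [3], 1.0),
    ([0, 1], [2, 3], 10.0),
    ([4], [0, 1, 2, 3], 44.5),
]
for cut in (0.5, 5, 20, 100):
    labels = cut_dendrogram(5, merges, cut)
    print("cut height", cut, "->", len(set(labels)), "clusters:", labels)
```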


PCA is generally used for visualizing the strongest trends in a dataset or between groups in a dataset. These groups can be e.g. sick or healthy, or groups generated using clustering methods like K-Means clustering. Below, an example of PCA is given in which clustering analysis has been performed using K-Means clustering.

When performing K-Means clustering, and clustering analysis in general, it is desirable to be able to display the observations and clusters in a standard 2-dimensional plot with one variable on each axis. The number of variables is, however, often much larger than two, and far too many to represent visually.

One solution to this problem is Principal Component Analysis (PCA) (Hotelling, 1933). PCA tries to define artificial variables that explain as much of the variation in the data as possible. These artificial variables are called principal components, and the first two principal components describe the most variation. Because of this, the first and second principal components are used to visualize the general trends in the dataset in a scatterplot. A principal component can summarize the information in multiple correlated variables in a single artificial variable. As a consequence, the first two principal components explain more variation than any two of the original variables and thus tend to yield more well-defined clusters on a 2-dimensional plot. For this reason, the resulting clusters from K-Means clustering are often visualized using principal components (Wang et al., 2018).
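As a rough illustration of how the first two principal components and the share of variation they explain can be obtained, consider the Python sketch below. It uses power iteration with deflation on the covariance matrix purely to stay free of external dependencies; this is an assumption of the example, and dedicated numerical libraries would normally be used instead. All function names and the example data are hypothetical.

```python
import math
import random


def center(rows):
    """Subtract each variable's mean so that every column has mean zero."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of rows that are already centered."""
    n, p = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def leading_eigenpair(matrix, rng, iterations=500):
    """Power iteration: largest eigenvalue and a matching unit eigenvector."""
    p = len(matrix)
    vec = [rng.gauss(0, 1) for _ in range(p)]  # random start direction
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0:
            break
        vec = [x / norm for x in product]
    value = sum(vec[i] * matrix[i][j] * vec[j] for i in range(p) for j in range(p))
    return value, vec


def pca_two_components(rows, seed=0):
    """Scores on the first two principal components and their variance shares."""
    rng = random.Random(seed)
    centered = center(rows)
    cov = covariance(centered)
    p = len(cov)
    total_variance = sum(cov[i][i] for i in range(p))
    components, shares = [], []
    for _ in range(2):
        value, vec = leading_eigenpair(cov, rng)
        components.append(vec)
        shares.append(value / total_variance)
        # Deflation: remove the component just found so the next one surfaces.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(p)] for i in range(p)]
    scores = [
        [sum(x * w for x, w in zip(row, comp)) for comp in components]
        for row in centered
    ]
    return scores, shares


# Made-up data: four samples measured on three variables.
rows = [[2, 0, 0], [-2, 0, 0], [0, 1, 0], [0, -1, 0]]
scores, shares = pca_two_components(rows)
```

Here `scores` holds the x- and y-coordinates of each sample for a scatterplot, and `shares` holds the fraction of the total variation explained by each of the two components. The sign of a principal component is arbitrary, so a plot may appear mirrored between runs or tools without changing its interpretation.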

How to interpret a PCA plot

A PCA plot is a lot like a scatter plot with the first two principal components on the x- and y-axis. The principal components (PCs) are, as mentioned above, artificial variables that explain as much of the variation in the data as possible. These artificial variables do not have any units, so the x- and y-values themselves are not important; what matters is how the values relate to each other.

On a PCA plot you will very often see a percentage at each axis. This indicates how much of the variation is explained by each PC. In the example below, principal component 1 (written as Dim1) on the x-axis explains 35.9% of the variation in the data. Principal component 2 (written as Dim2) on the y-axis explains 5.4% of the variation in the data.

On a PCA plot, data points belonging to different clusters will often have different colors or different shapes. For example, you might color males and females in blue and green and represent whether they are cases or controls as circles and squares, such that a female control subject would be represented as a green square. This makes it possible to identify the groups of a study on the plot and see if they form distinct clusters. In the example plot below, each group has both a unique shape and color. As seen on the plot, each group forms a distinct cluster, indicating that there are differences between the groups in the data.

The groups could just as well overlap on the plot. Small overlaps indicate that groups are slightly similar in the data and that we are not able to distinguish between groups in extreme cases. Large or complete overlaps indicate that we do not see any difference between groups in the data.

All in all, clearly separated groups on a PCA plot indicate that the data supports these groups, i.e. that there is a difference in the measured variables between groups. Groups with large overlaps indicate that there are no or only small differences in the measured variables between groups.

  1. Ahlqvist, Emma et al. “Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables.” The Lancet Diabetes & Endocrinology 6,5 (2018): 361-369.

  2. Hotelling, H. “Analysis of a complex of statistical variables into principal components.” Journal of Educational Psychology 24 (1933): 417–441, 498–520.

  3. Kakushadze, Zura, and Willie Yu. “K-means and cluster models for cancer signatures.” Biomolecular detection and quantification 13 (2017): 7-31.

  4. Lloyd, S. P. “Least squares quantization in PCM.” Technical Report RR-5497, Bell Lab, September 1957.

  5. Menon, Vilas. “Clustering single cells: a review of approaches on high-and low-depth single-cell RNA-seq data.” Briefings in functional genomics 17,4 (2018): 240-245.

  6. Pan, Wei et al. “Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty.” Journal of machine learning research : JMLR 14,7 (2013): 1865.

  7. Pellegrini, Michael et al. “Cluster analysis and subgrouping to investigate inter-individual variability to non-invasive brain stimulation: a systematic review.” Reviews in the neurosciences 29,6 (2018): 675-697.

  8. Wang, Kesheng et al. “Principal component analysis of early alcohol, drug and tobacco use with major depressive disorder in US adults.” Journal of psychiatric research 100 (2018): 113-120.

  9. Ward, Joe H. “Hierarchical Grouping to Optimize an Objective Function.” Journal of the American Statistical Association 58,301 (1963): 236–244.