 ## Predictive models

BioXpedia is proud to offer data analysis using predictive models.

This data analysis focuses on developing a biomarker signature, for example for diagnostics or prognostics.

The data analysis includes the following components:

• Detailed PDF report.
• Data handling.
• Development of biological signature.
• Employment of state-of-the-art machine learning methods.
• Evaluation and visualization of performance using ROC plots.

Predictive methods are used when the researcher is interested in being able to classify samples or predict their development based on measurements or observations.

Classification can be used to diagnose patients based on clinical or genetic variables, e.g. to tell if a patient has cancer or not, or even what type of cancer a patient has (Ram et al., 2017).

Predictive methods are also useful in prognostics to predict how a disease will develop over time, e.g. to predict whether a tumor is likely to metastasize (Kate et al., 2017; Marchese Robinson et al., 2017).

Predictive methods are a part of supervised machine learning and, as in all supervised machine learning applications, the model needs to be trained. For this reason, the data is often divided into training data and test data. The training data is used to train the model, so that it learns the relationship between the variables. Once training is done, the test data is used to evaluate how well the model performs on data it has never seen before, by comparing its predictions to the true conditions in the test data. If the accuracy of the model is satisfactory, it can then be used on completely new data where the true output is not known. Several validation studies are often necessary before a model can be used, for example, in the clinic.
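The train-and-test workflow described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in an actual analysis: the data is simulated, the single "biomarker" and its class means are assumptions made for the example, and the "model" is a deliberately simple threshold rule. All function names are hypothetical.

```python
import random

def make_samples(n, seed):
    """Simulate n samples: one hypothetical biomarker value and a 0/1 label."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Assumption: cases (label 1) have a higher mean biomarker level.
        value = rng.gauss(2.0 if label == 1 else 0.0, 1.0)
        samples.append((value, label))
    return samples

def split_train_test(samples, test_fraction, seed):
    """Shuffle the samples and hold out a fraction as test data."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def train_threshold(train):
    """'Train' a toy model: the midpoint between the two class means."""
    pos = [v for v, y in train if y == 1]
    neg = [v for v, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, data):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum((v > threshold) == (y == 1) for v, y in data)
    return correct / len(data)

samples = make_samples(200, seed=0)
train, test = split_train_test(samples, test_fraction=0.25, seed=1)
threshold = train_threshold(train)        # learned from training data only
test_accuracy = accuracy(threshold, test) # evaluated on unseen test data
print(f"threshold={threshold:.2f}, test accuracy={test_accuracy:.2f}")
```

The key point is that the threshold is estimated from the training data alone, so the test accuracy reflects performance on samples the model has not seen.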

Many types of models can be used for prediction. Below we give examples of two powerful multivariate predictive models that are often used.

## LASSO

Predictive methods include both parametric and non-parametric methods. A parametric method makes assumptions about the structure of the data, and the data must conform to these assumptions for the model to work well.

Linear regression models are one example of parametric methods. A common method for linear regression is Lasso regression (Tibshirani, 1996). Lasso is similar to ordinary linear regression but adds a penalty on the size of the regression coefficients to increase the reliability of predictions. This penalty also enables Lasso to perform variable selection, that is, to select a subset of variables from a larger set. Variable selection is useful when the number of variables is large and some of them do not improve the predictions. Lasso excludes these variables from the model by shrinking their coefficients to exactly zero, which makes the model easier to interpret.
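For readers who prefer a formula, the Lasso estimate is commonly written as the solution to a penalized least-squares problem. The notation here is introduced for illustration and does not come from the text above: $y_i$ is the outcome for sample $i$, $x_{ij}$ is the value of variable $j$ for sample $i$, $n$ is the number of samples, $p$ is the number of variables, and $\lambda \ge 0$ controls the strength of the penalty.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
\left\{
  \frac{1}{2n} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\right\}
```

With $\lambda = 0$ this reduces to ordinary least squares; larger values of $\lambda$ shrink more coefficients to exactly zero.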

As a parametric method, Lasso makes some assumptions about the data. The most obvious assumption is that the relationship between the explanatory variables and the variable we wish to predict is linear.
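The variable selection performed by Lasso can be illustrated with a small, self-contained sketch. It assumes simulated data in which only the first two of five variables influence the outcome, omits the intercept for brevity, and fits the model by coordinate descent, one of several ways to fit a Lasso. The function names and the penalty value are hypothetical choices for the example, not a recommended configuration.

```python
import random

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=50):
    """Minimize (1/(2n)) * sum of squared residuals + lam * sum(|beta_j|).

    Assumes no intercept and no all-zero columns in X.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of variable j with the partial residual
            z = 0.0    # sum of squares of variable j
            for i in range(n):
                # Prediction that leaves variable j out.
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n)
    return beta

# Simulated data: only variables 0 and 1 affect the outcome (an assumption).
rng = random.Random(0)
n, p = 200, 5
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0.0, 0.5) for row in X]

beta = lasso_coordinate_descent(X, y, lam=0.3)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("coefficients:", [round(b, 2) for b in beta])
print("selected variables:", selected)
```

Variables whose coefficients end up at exactly zero are effectively removed from the model, while the retained coefficients are shrunk somewhat toward zero by the penalty.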

## Random Forests

If the relationship between explanatory variables and the predicted variable is not linear, Random Forests can be used instead. Random Forests is a non-parametric method, and thus the model does not make any prior assumptions about the data distribution.

Random Forests is a tree-based model: to make its predictions, it constructs decision trees. A decision tree describes a branching series of questions. A small decision tree to predict the cancer risk group of an individual might first ask if the person smokes and then ask if the person is obese. A yes to both questions would put the person in a high risk group, a no to both questions would put the person in a low risk group, and one yes and one no might put the person in a medium risk group (Cheng et al., 2018).
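The small decision tree described above maps directly onto nested questions in code. The function name and the group labels are hypothetical, and the rule is the illustrative one from the text, not a clinically validated model.

```python
def cancer_risk_group(smokes: bool, obese: bool) -> str:
    """Walk the two-question decision tree from the example above."""
    if smokes:
        # Second question on the "smokes" branch.
        return "high" if obese else "medium"
    # Second question on the "does not smoke" branch.
    return "medium" if obese else "low"

for smokes in (True, False):
    for obese in (True, False):
        print(f"smokes={smokes}, obese={obese}:", cancer_risk_group(smokes, obese))
```

In practice the questions and their order are not written by hand but learned from the training data.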

The simple example above is an example of classification. Random Forests can, however, also be used as a regression model to predict a numeric variable.

The difference between Random Forests and a simple decision tree is that Random Forests tries to reduce the variance of its predictions by constructing many decision trees, each based on a random subset of the training data. The predictions from all these decision trees are then collected as “votes” and the most common “vote” is selected as the final prediction (Breiman, 2001).
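The idea of many trees voting can be sketched as follows. This is a strongly simplified illustration and not Breiman's full algorithm: each "tree" here is a single-question stump rather than a full decision tree, each stump uses one randomly chosen variable, and the tiny data set is invented for the example. All names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y, feature):
    """Find the threshold on one feature that best separates labels 0 and 1."""
    best = None
    for threshold in sorted({row[feature] for row in X}):
        for high_label in (0, 1):
            errors = sum(
                (high_label if row[feature] > threshold else 1 - high_label) != label
                for row, label in zip(X, y)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, high_label)
    _, threshold, high_label = best
    return feature, threshold, high_label

def predict_stump(stump, row):
    """Answer the stump's single question for one sample."""
    feature, threshold, high_label = stump
    return high_label if row[feature] > threshold else 1 - high_label

def fit_forest(X, y, n_trees, seed):
    """Train each stump on a random resample of the training data."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feature = rng.randrange(p)                  # pick one variable at random
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Collect one vote per stump and return the most common vote."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Invented toy data: two measurements per sample, label 0 or 1.
X = [[1.0, 2.1], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2],
     [6.0, 7.1], [6.5, 6.8], [7.0, 7.5], [6.2, 7.9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(X, y, n_trees=25, seed=0)
print(predict_forest(forest, [0.9, 1.7]))
print(predict_forest(forest, [7.5, 8.0]))
```

An odd number of trees avoids tied votes in this two-class example. For regression, the predictions of the trees would be averaged instead of counted as votes.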

1. Breiman, L. “Random Forests.” Machine Learning 45 (2001): 5–32.

2. Cheng, Li et al. “A random forest classifier predicts recurrence risk in patients with ovarian cancer.” Molecular medicine reports 18,3 (2018): 3289–3297.

   https://www.spandidos-publications.com/mmr/18/3/3289

3. Kate, Rohit J, and Ramya Nadig. “Stage-specific predictive models for breast cancer survivability.” International journal of medical informatics 97 (2017): 304–311.

   https://www.sciencedirect.com/science/article/abs/pii/S1386505616302507?via%3Dihub

4. Marchese Robinson, Richard L et al. “Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Data Sets.” Journal of chemical information and modeling 57,8 (2017): 1773–1792.

   http://eprints.whiterose.ac.uk/119210/

5. Ram, Malihe et al. “Classification and Biomarker Genes Selection for Cancer Gene Expression Data Using Random Forest.” Iranian journal of pathology 12,4 (2017): 339–347.

   https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5844678/

6. Tibshirani, Robert. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological) 58,1 (1996): 267–288.

   https://www.jstor.org/stable/2346178?seq=1