## Statistical analysis

#### Introduction

Statistics is a science derived from mathematics which can be divided into two branches: descriptive and inferential. Descriptive statistics are commonly used as the first step in data analysis. They are measures that summarize and characterize a set of data, and they answer questions like: How many people have the disease? How often did the event occur? What is the spread of test results in the population? Descriptive statistics can point to similarities and differences between groups, but on their own are not enough to confirm or refute a hypothesis. Inferential statistics allow hypothesis testing based on probability theory. They address the key question: If chance alone were at work, how likely would we be to see a difference at least as large as the one we observed between two groups?
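The distinction between the two branches can be illustrated with a minimal sketch. The measurements below are made-up values for two hypothetical groups, and the helper names `describe` and `permutation_p_value` are invented for this illustration. The permutation test is just one simple inferential method, chosen here because it needs only the Python standard library; it is not a recommendation for any particular study.

```python
import random
import statistics

# Hypothetical, made-up measurements for two groups (illustration only).
control = [52.0, 48.5, 50.1, 47.3, 55.2, 49.8, 51.0, 46.9]
treated = [56.4, 53.2, 58.1, 51.7, 57.9, 54.3, 55.5, 52.8]


def describe(values):
    """Descriptive statistics: summarize one set of data."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample standard deviation
    }


def permutation_p_value(a, b, n_permutations=5000, seed=0):
    """Inferential statistics: two-sided permutation test on the difference in means.

    Estimates how often a difference at least as large as the observed one
    appears when group labels are shuffled at random (chance alone).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_a = pooled[:len(a)]
        shuffled_b = pooled[len(a):]
        diff = abs(statistics.mean(shuffled_a) - statistics.mean(shuffled_b))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


print("control:", describe(control))
print("treated:", describe(treated))
print("permutation P-value:", permutation_p_value(control, treated))
```

The `describe` output only summarizes each group. The permutation P-value is the step that relates the observed difference to what chance alone would produce.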

Before analyzing your data you should have a Statistical Analysis Plan (see below for links to templates, guidelines and examples). This document describes what statistical tests you will perform and assigns them to a hierarchy of primary, secondary and, sometimes, exploratory analyses. Many researchers publish the statistical analysis plan, either summarized as part of a design paper or study protocol, or by making it available online. This ensures transparency and encourages rigorous scientific method.

Your primary analysis relates to your primary study outcome. It is essential to spell this out from the start. Remember that a P-value threshold of 0.05 corresponds to a one in twenty chance of a ‘positive’ result when there is no true effect, meaning that if you do twenty statistical tests, on average one of them will have a P-value < 0.05 by chance alone. So it is important to specify your primary outcome before doing the analysis so that readers know that you have not simply performed many statistical tests and chosen the one with the significant P-value. Analyses that you plan in advance, alongside the primary analysis, are called secondary analyses. Any analysis that you decide to do only after looking at the data is always considered exploratory, meaning that it can only ever provide a suggestion for further research and should never be used as proof.
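The arithmetic behind this warning can be sketched in a few lines. The sketch assumes that the tests are independent and that there is no true effect in any of them; both are simplifying assumptions, and the function names are invented for this illustration.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Average number of tests with P < alpha when no true effect exists."""
    return n_tests * alpha


def familywise_error_rate(n_tests, alpha=0.05):
    """Probability that at least one of n independent tests has P < alpha
    when no true effect exists in any of them."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n in (1, 5, 20):
    print(
        f"{n:>2} tests: expected false positives = {expected_false_positives(n):.2f}, "
        f"chance of at least one = {familywise_error_rate(n):.3f}"
    )
```

Under these assumptions, twenty tests give about a 64% chance of at least one ‘positive’ result by chance alone. This is why a single pre-specified primary analysis carries more weight than the best result picked from many.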

An important step before applying any statistical test is to identify the dependent and independent variables. In clinical trials, the dependent variables are the outcomes of the study (e.g. mortality rate; change in renal function) and the independent variables are the factors under investigation that could modify the outcome, the most important of which is the randomized treatment allocation (i.e. intervention vs. control). Other independent variables (e.g. proteinuria, blood pressure) may be used in multivariable analysis; however, this is almost always a secondary analysis in a clinical trial because randomization has been used to balance all other factors between groups.

Each variable should be classified by type (e.g. continuous, ordinal, categorical, dichotomous) and distribution (normal [also known as Gaussian, the distribution assumed by many parametric tests] or non-normal). This will help you determine the right statistical test to use. Consider this when planning the study, because the type of data collected determines which statistical tests are possible and how the results can be interpreted. For example, if CKD stage is collected then one must use a categorical data analysis, whereas if eGFR is collected then one can use a continuous data analysis (usually more powerful) or convert the values to CKD categories and use a categorical analysis. Therefore, a careful plan to adjust the study design according to the research question and the characteristics of the study variables is always desirable.
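The eGFR example can be sketched as follows. The cut-offs are the KDIGO GFR categories (G1 to G5, in mL/min/1.73 m²), which are an assumption added here rather than something stated in this text. The function name `gfr_category` is hypothetical. GFR category alone does not establish a CKD diagnosis, which also depends on other markers of kidney damage.

```python
def gfr_category(egfr):
    """Map a continuous eGFR value (mL/min/1.73 m^2) to a KDIGO GFR category.

    Assumption: KDIGO cut-offs; adapt them if your protocol defines others.
    """
    if egfr < 0:
        raise ValueError("eGFR cannot be negative")
    if egfr >= 90:
        return "G1"
    if egfr >= 60:
        return "G2"
    if egfr >= 45:
        return "G3a"
    if egfr >= 30:
        return "G3b"
    if egfr >= 15:
        return "G4"
    return "G5"


# Hypothetical eGFR values for illustration only.
egfr_values = [95.0, 72.5, 44.0, 31.0, 12.0]
for value in egfr_values:
    print(value, "->", gfr_category(value))
```

The conversion only works in one direction. Values of 31 and 44 fall into the same category, so the difference between them is lost once the data are categorized. Collecting the continuous measurement keeps both types of analysis available.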

Many in the scientific community have suggested that too much emphasis is placed on P-values. In short, they are only one factor in determining how meaningful a result actually is. Further discussion of this issue can be found in the following articles:

- Craig J. Interpreting trial results – time for confidence and magnitude and not P values please. Kidney Int. 2019;95(1):28-30. [DOI 10.1016/j.kint.2018.11.006]
- Van Rijn MHC, et al. Statistical significance versus clinical relevance. Nephrol Dial Transplant. 2017;32(Suppl 2):ii6-ii12. [DOI 10.1093/ndt/gfw385]
- Wasserstein RL & Lazar NA. The ASA's Statement on p-Values: Context, Process, and Purpose. American Statistician. 2016;70(2):129-133. [DOI 10.1080/00031305.2016.1154108]
- Greenland S, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31:337-350. [DOI 10.1007/s10654-016-0149-3]

- TDR Implementation Research Toolkit [Detailed overview in Research methods and data management/Data analysis section]
- UCLA Institute for Digital Research and Education
- Choosing a statistical test
- UK Clinical Trials Toolkit: Statistical Data Analysis
- EMA Statistical Principles for Clinical Trials (see Section V)
- GraphPad [online calculator for simple statistical procedures]
- GraphPad Statistical Guide

*Statistical analysis plan guidelines:*

- Gamble C, et al. Guidelines for the Content of Statistical Analysis Plans in Clinical Trials. JAMA. 2017;318(23):2337-2343. [DOI 10.1001/jama.2017.18556]

*Statistical analysis plan template:*

- Cambridge Clinical Trials Unit (NHS, Cambridge University Hospitals)

*Statistical analysis plan examples:*

- Pascoe EM, et al. The HONEYPOT Randomized Controlled Trial Statistical Analysis Plan. Perit Dial Int. 2013;33(4):426-435. [DOI 10.3747/pdi.2012.00310]
- Hedayati SS, et al. Effect of sertraline on depressive symptoms in patients with chronic kidney disease without dialysis dependence: the CAST randomized clinical trial. JAMA. 2017;318(19):1876-1890. [DOI 10.1001/jama.2017.17131] (see Supplement 2, Data analysis plan).