INTRODUCTION
Data, Data, Data
Data are big, data are hard, data are experimental; data are everywhere.
Converting data into useful information is a challenge, but it is a necessary step: it not only justifies data collection but also allows processes to be appraised and conclusions to be reached. This chapter sets out to help with that ambitious goal. Beyond providing analytical methods, it aims to highlight potential pitfalls and offer tips for success.
We shall start by contemplating the very essence of data analysis, discussing the theoretical concepts and the approaches relevant to a particular analysis. Given how diverse data are in complexity and nature, we will consider whether it is possible, or desirable, to separate a part from the whole: does it make sense to look only at a subset of data to draw a conclusion? The question is just as valid when separating one element of an analysis from another, for example when pulling apart a graphical representation from its statistical analysis: a figure can only illustrate a result, not quantify its significance. Above all, when designing an experiment and collecting data, one should always remember the words of R.A. Fisher, the eminent statistician and geneticist: ‘To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.’
We will then turn to the practicalities of data handling, particularly the manipulation of data into the format required for further analysis. It may not be the most glamorous subject, but it is essential for the appropriate and efficient use of the data collected. Just as with a well-designed experiment, well-organised data make collection, visualisation and analysis a smooth and enjoyable experience. Reformatting data is prone to mishaps, including inadvertent loss, erroneous characterisation, mix-ups and even outright corruption. Unfortunately, such misfortunes happen all too often, so being prepared will not only minimise their impact but also help catch any problem early, making it easier to fix. We will also emphasise that data handling should be directly linked to the intended graphical representation and analysis on the platform of choice. We will therefore discuss the optimal use of spreadsheets and of the R and Python languages (see also Chapter 18).
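As a small illustration of what defensive reformatting might look like, the sketch below reshapes a table from a 'wide' layout (one column per time point) into a 'long' layout (one row per measurement) using only the Python standard library. The data, the column names and the function name wide_to_long are hypothetical and chosen purely for illustration; the checks shown are one possible way to guard against loss and mix-ups, not a prescribed method.

```python
import csv
import io

# Hypothetical wide-format data: one row per sample, one column per time point.
# The value for sample B at t1 is deliberately missing.
WIDE_CSV = """sample,t0,t1,t2
A,1.0,1.5,2.1
B,0.9,,1.8
"""


def wide_to_long(text, id_column="sample"):
    """Reshape wide CSV text into long-format records, one per measurement."""
    reader = csv.DictReader(io.StringIO(text))
    value_columns = [name for name in reader.fieldnames if name != id_column]
    seen_ids = set()
    records = []
    for row in reader:
        identifier = row[id_column]
        # Guard against mix-ups: the same sample must not appear twice.
        if identifier in seen_ids:
            raise ValueError(f"duplicate identifier: {identifier!r}")
        seen_ids.add(identifier)
        # Guard against misaligned rows: DictReader stores surplus cells under None.
        if None in row:
            raise ValueError(f"too many cells in row for {identifier!r}")
        for column in value_columns:
            raw = (row[column] or "").strip()
            # Keep missing values explicit as None rather than dropping them silently.
            value = float(raw) if raw else None
            records.append(
                {id_column: identifier, "variable": column, "value": value}
            )
    return records


long_records = wide_to_long(WIDE_CSV)
missing = [record for record in long_records if record["value"] is None]
print(len(long_records), "records,", len(missing), "missing")
```

Keeping the missing measurement as an explicit None, rather than skipping it, means the long table still accounts for every cell of the original, so a later count of records can reveal whether anything was lost along the way.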