Everyone is talking about data science these days. The profusion of cheap and ubiquitous sensors and storage means that we have access to unprecedented volumes of raw data. This data can certainly tell us something useful and interesting, but finding out exactly what that is, and how to extract it, can be a significant challenge.
Also called data analytics or “big data”, data science is a new term for a discipline which combines existing areas, including data mining, database design, machine learning, and even psychology. It can be targeted at vast stores of user-related data, such as those held by Amazon or Google, in order to personalise the user experience by pre-emptively presenting items likely to be of high interest. It can be used to sift through a large corpus of text, such as Twitter feeds, to detect societal trends and moods. It can also be applied to engineering problems, where data has traditionally been collected and processed to diagnose faults in plant or machinery.
The modern tools in the data scientist’s toolbox mean that engineering data collected in the past for one purpose can be reused to give richer information about the plant in question. While data analysis in the past tended to look for step changes in key measurements, which may indicate that a particular fault is developing, more nuanced analysis of combinations of parameters is now much easier to do.
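To make the contrast concrete, here is a minimal sketch in Python, using only the standard library. It is an illustration under assumptions, not a description of any particular plant or tool: the temperature and vibration readings are invented, the thresholds are arbitrary, and the squared Mahalanobis distance is just one of many possible ways to score a combination of two parameters against a baseline.

```python
from statistics import mean


def step_change_alerts(values, threshold):
    """Indices where a reading differs from the previous one by more than threshold."""
    return [i for i in range(1, len(values))
            if abs(values[i] - values[i - 1]) > threshold]


def mahalanobis_sq(points, baseline):
    """Squared Mahalanobis distance of each 2-D point from the baseline data."""
    xs = [p[0] for p in baseline]
    ys = [p[1] for p in baseline]
    mx, my = mean(xs), mean(ys)
    n = len(baseline)
    # Sample variances and covariance of the baseline.
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in baseline) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^T S^-1 d, with the 2x2 inverse written out by hand.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Hypothetical (temperature, vibration) pairs recorded during healthy operation.
baseline = [(60, 1.00), (62, 1.12), (64, 1.18), (66, 1.31), (68, 1.39), (70, 1.52)]

# Hypothetical new readings: vibration drifts upwards while temperature does not.
readings = [(64, 1.20), (65, 1.25), (66, 1.28), (65, 1.42), (64, 1.50)]

# Traditional check: look for a step change in each measurement separately.
temp_alerts = step_change_alerts([r[0] for r in readings], threshold=3.0)
vib_alerts = step_change_alerts([r[1] for r in readings], threshold=0.2)

# Combined check: flag readings whose combination is unusual for the baseline.
# The cut-off of 9.0 is an arbitrary choice for this example.
scores = mahalanobis_sq(readings, baseline)
flagged = [i for i, score in enumerate(scores) if score > 9.0]

print("step changes (temperature):", temp_alerts)
print("step changes (vibration):", vib_alerts)
print("unusual combinations:", flagged)
```

With this made-up data, neither measurement changes abruptly enough to trigger a step-change alert, yet the last two readings are flagged because that level of vibration is unusual for that temperature. In practice the baseline, the choice of parameters, and the cut-off would all need to come from knowledge of the plant itself.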
For my work in data science, I most commonly use R, a statistical modelling environment which is increasingly popular with data analysts in multiple domains. R is free software distributed under the GPL-2 and GPL-3 licences, and can therefore be used for academic, individual, or commercial purposes.
There is a good online class from Coursera called Introduction to Data Science, which covers a lot of the low-level data handling aspects (SQL vs. NoSQL, comparison of the tools), along with MapReduce, machine learning, and visualisation. This comprehensive look at the data science toolkit really shows that the novelty of data analytics comes from the combination of previously separate subtopics, which together provide a full stack of techniques and approaches for analysing and reporting on data.