Statistics, data mining and machine learning in astronomy

By Zeljko Ivezic et al. Reviewed by Graham Relf
Princeton University Press, 2014. x + 540 pages
Price £65.00 (hbk). ISBN: 978-0-691-15168-7


Professional astronomy projects generate huge amounts of data, which are analysed for each project’s particular needs but are quite likely to contain other important information left undiscovered. The data are generally available on the Internet if you know where to look. This is good news for those like me who are reaching the age when lugging the ’scope down the garden to the observatory and standing around in subzero temperatures is becoming unattractive and difficult, if not downright dangerous. I think it is also the case that amateurs now stand a greater chance of discovering something important among such data than by scanning the skies themselves, because of the increasing number of automated scanning systems. Hanny’s Voorwerp is a prime example of a truly significant amateur discovery made in the Galaxy Zoo project.

The subject of this book, data mining, is quite a step up from the Galaxy Zoo type of project. Let’s be honest: it is a complete staircase up to a whole new storey. Blocking the way up are two particular ogres whom many readers will find it difficult to befriend: statistics and programming. To make meaningful discoveries you do have to apply rigorous statistical methods to the data, and that is the real meat of this book. The book accompanies a big software project, astroML, and guides readers into downloading and analysing data by programming in a relatively simple language called Python.

The authors are three professors and a postdoc at two universities in the USA.

The book explains how to access the real data out there on the Internet and get them into your Python program for analysis. It uses the Sloan Digital Sky Survey (SDSS) data for many of its detailed examples. The ‘ML’ part of astroML stands for ‘Machine Learning’. It means that the software contains sophisticated methods of analysis which you only have to call from your own much simpler program. The difficult part is knowing which techniques are valid to use. The book covers a wide range of techniques, including the latest developments. One of its strengths is that it discusses the relative merits of each in various situations.

The software is free to download and use. I found that straightforward but you do need a C++ compiler for astroML itself. AstroML sits on top of a chain of dependencies, well explained in the book: Python, NumPy, SciPy, matplotlib and optionally others. Python is not my favourite language (as a software developer I have doubts about its maintainability because it has too many clever tricks) but it has evidently been embraced by the academic science community as a step forward from good old FORTRAN. There are therefore many powerful libraries available for building upon in this style, so no-one has to reinvent the wheel.
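For readers wondering whether their own machine already has that chain in place, here is a minimal sketch that uses only the Python standard library. It is my own illustration, not something taken from the book: the helper name check_packages is hypothetical, and the list holds the usual import names of the packages mentioned above.

```python
import importlib.util

# Import names of the packages in the dependency chain described above.
# This list is an illustration, not an official requirements file.
PACKAGES = ["numpy", "scipy", "matplotlib", "astroML"]


def check_packages(names):
    """Return a dict mapping each import name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Print one line per package so missing pieces are easy to spot.
    for name, found in check_packages(PACKAGES).items():
        print(f"{name:12s} {'installed' if found else 'missing'}")
```

Anything reported as missing would need installing before the book's examples could be run; the book itself explains how to obtain each piece.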

This book makes no bones about being aimed at graduate astronomers starting out on an observational career. I am sure that if I were among them I would find the book inspirational. For the rest of us it is impressive but quite daunting.

Graham is a retired physicist who spent most of his career in software development, particularly for image processing and analysis. He is responsible for the BAA Computing Section’s website and is very keen to encourage everyone to try astrophotography.
