The history and current status of the cross-disciplinary fields of astrostatistics and astroinformatics are reviewed. Astronomers need a wide range of statistical methods for both data reduction and science analysis. With the proliferation of high-throughput telescopes, efficient large-scale computational methods are also becoming essential. However, astronomers receive only weak training in these fields during their formal education. Interest in the fields is growing rapidly, with conferences organized by scholarly societies, textbooks and tutorial workshops, and research studies pushing the frontiers of methodology. R, the premier language of statistical computing, can provide an important software environment for the incorporation of advanced statistical and computational methodology into the astronomical community.
We describe the implementation and performance results of our massively parallel MPI/OpenMP hybrid TreePM code for large-scale cosmological N-body simulations. For domain decomposition, a recursive multi-section algorithm is used, and the sizes of the domains are set automatically so that the total calculation time is the same for all processes. We developed a highly tuned gravity kernel for short-range forces and a novel communication algorithm for long-range forces. For a benchmark simulation with two trillion particles, the average performance on the full system of the K computer (82,944 nodes, 663,552 cores in total) is 5.8 Pflops, which corresponds to 55% of the peak speed.
We describe the parallels between astronomy and earth science datasets and their analyses, and the opportunities for methodology transfer from astroinformatics to geoinformatics. Using the example of hydrology, we emphasize how metadata and ontologies are crucial in such an undertaking. Using the infrastructure being designed for EarthCube, the Virtual Observatory for the earth sciences, we discuss essential steps for better transfer of tools and techniques in the future, e.g. domain adaptation. Finally, we point out that this is never a one-way process: astroinformatics has much to learn from geoinformatics as well.
Astronomy is rapidly approaching an impasse: very large datasets require remote or cloud-based parallel processing, yet many astronomers still try to download the data and develop serial code locally. Astronomers understand the need for change, but the hurdles remain high. We are developing a data archive designed from the ground up to simplify and encourage cloud-based parallel processing. While the volume of data we host remains modest by some standards, it is still large enough that download and processing times are measured in days and even weeks. We plan to implement a Python-based, notebook-like interface that automatically parallelises execution. Our goal is to provide an interface sufficiently familiar and user-friendly that it encourages astronomers to run their analysis on our system in the cloud: astroinformatics as a service. We describe how our system addresses the approaching impasse in astronomy using the SAMI Galaxy Survey as an example.
Statistical studies of active galaxies (both AGN and starburst) using large multi-wavelength data sets are presented, including new studies of Markarian galaxies, a large sample of IR galaxies, variable radio sources, and a large homogeneous sample of X-ray selected AGN. The Markarian survey (the First Byurakan Survey) was digitized and the DFBS database was created, the largest spectroscopic database by number of objects (~20 million). This database provides both 2D images and 1D spectra. We have carried out a number of projects aimed at finding active galaxies among optical, X-ray, IR and radio sources and studying them at multiple wavelengths. Thousands of X-ray sources were identified from ROSAT, including many AGN (52% of all identified sources). IRAS PSC/FSC sources with accurate positions from WISE were studied, and a large extragalactic sample was created for a further search for AGN. The fraction of active galaxies among IR-selected galaxies was estimated as 24%. Variable radio sources at 1.4 GHz were revealed by cross-correlating the NVSS and FIRST catalogues using the method we introduced for optical variability. Radio-X-ray sources were identified from NVSS and ROSAT to detect new active galaxies. We also describe how Big Data in astronomy provides new possibilities for statistical research on active galaxies and other objects.
Throughout the processing and analysis of survey data, a ubiquitous issue nowadays is that we are spoilt for choice when we need to select a methodology for some of its steps. The alternative methods usually fail and excel in different data regions, and have various advantages and drawbacks, so a combination that unites the strengths of all while suppressing the weaknesses is desirable. We propose to use a two-level hierarchy of learners. Its first level consists of training and applying the possible base methods on the first part of a known set. At the second level, we feed the output probability distributions from all base methods to a second learner trained on the remaining known objects. Using classification of variable stars and photometric redshift estimation as examples, we show that the hierarchical combination achieves a general improvement over averaging-type combination methods, corrects systematics present in all base methods, and is easy to train and apply; it is thus a promising tool in the astronomical “Big Data” era.
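The two-level scheme described in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the base learners are one-feature Gaussian classifiers, the second-level learner is a small logistic regression, and the data are synthetic two-class points.

```python
import math
import random

def fit_gaussian_base(xs, ys):
    """Base learner: 1-D two-class Gaussian model; returns x -> P(class 1 | x)."""
    params = {}
    for c in (0, 1):
        vals = [x for x, y in zip(xs, ys) if y == c]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals) + 1e-9
        params[c] = (mu, var, len(vals) / len(xs))

    def proba(x):
        dens = []
        for c in (0, 1):
            mu, var, prior = params[c]
            dens.append(prior * math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(var))
        total = dens[0] + dens[1]
        return dens[1] / total if total > 0 else 0.5
    return proba

def fit_logistic(rows, ys, lr=1.0, epochs=1000):
    """Second-level learner: logistic regression fitted by batch gradient descent."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for r, y in zip(rows, ys):
            p = 1.0 / (1.0 + math.exp(-(b + sum(wi * ri for wi, ri in zip(w, r)))))
            grad_b += p - y
            for j in range(d):
                grad_w[j] += (p - y) * r[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return lambda r: 1.0 / (1.0 + math.exp(-(b + sum(wi * ri for wi, ri in zip(w, r)))))

def train_stack(rows, labels):
    half = len(rows) // 2
    # Level 1: one base learner per feature, trained on the first part of the known set.
    bases = [fit_gaussian_base([r[j] for r in rows[:half]], labels[:half])
             for j in range(len(rows[0]))]
    # Level 2: trained on the base learners' output probabilities for the remaining known objects.
    meta_rows = [[base(r[j]) for j, base in enumerate(bases)] for r in rows[half:]]
    meta = fit_logistic(meta_rows, labels[half:])
    return lambda r: meta([base(r[j]) for j, base in enumerate(bases)])

def make_data(n, rng):
    """Synthetic stand-in data: two overlapping Gaussian classes in two features."""
    rows, labels = [], []
    for i in range(n):
        y = i % 2
        centre = 1.0 if y == 1 else -1.0
        rows.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
        labels.append(y)
    return rows, labels

rng = random.Random(0)
known_rows, known_labels = make_data(400, rng)
test_rows, test_labels = make_data(400, rng)
stacked = train_stack(known_rows, known_labels)
accuracy = sum((stacked(r) > 0.5) == (y == 1)
               for r, y in zip(test_rows, test_labels)) / len(test_rows)
print(f"stacked accuracy on held-out objects: {accuracy:.2f}")
```

The split of the known set matters: the second-level learner only sees base-method outputs for objects the base methods were not trained on, so it learns how they behave on new data rather than on data they have already fitted.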
As we enter the era of gravitational wave astronomy, we are beginning to collect observations which will enable us to explore aspects of the astrophysics of massive stellar binaries that were previously beyond reach. In this paper we describe COMPAS (Compact Object Mergers: Population Astrophysics and Statistics), a new platform that allows us to deepen our understanding of isolated binary evolution and the formation of gravitational-wave sources. We describe the computational challenges associated with their exploration, and present preliminary results on overcoming them using Gaussian process regression as a simulation emulation technique.
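As an illustration of the emulation idea only (this is not the COMPAS code), the sketch below fits a Gaussian process with a squared-exponential kernel to a handful of runs of a stand-in function and then predicts its output, with an uncertainty, at untried inputs. The function `expensive_simulation`, the kernel choice and all hyperparameters are assumptions made for the example.

```python
import math

def rbf(a, b, length=1.0, amp=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return amp ** 2 * math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L

def forward_sub(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def backward_sub(L, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def fit_gp(xs, ys, length=1.0, amp=1.0, jitter=1e-8):
    """Return predict(x) -> (mean, variance) for a zero-mean GP conditioned on (xs, ys)."""
    K = [[rbf(a, b, length, amp) + (jitter if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = backward_sub(L, forward_sub(L, ys))  # alpha = K^{-1} y

    def predict(x_new):
        k_star = [rbf(x_new, a, length, amp) for a in xs]
        mean = sum(k * a for k, a in zip(k_star, alpha))
        v = forward_sub(L, k_star)
        var = rbf(x_new, x_new, length, amp) - sum(vi * vi for vi in v)
        return mean, max(var, 0.0)
    return predict

def expensive_simulation(x):
    """Hypothetical stand-in for a costly population-synthesis run."""
    return math.sin(x)

train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [expensive_simulation(x) for x in train_x]
emulator = fit_gp(train_x, train_y)
mean, var = emulator(2.5)
print(f"emulated value at 2.5: {mean:.3f} +/- {math.sqrt(var):.3f}")
```

The predictive variance is what makes the emulator useful in practice: it is near zero at inputs that were actually simulated and grows towards the prior variance far from them, which indicates where further simulation runs would be most informative.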
We study the effects of mergers on non-parametric morphologies of galaxies. We compute the Gini index, M20, asymmetry and concentration statistics for z = 0 galaxies in the Illustris simulation and compare non-parametric morphologies of major mergers, minor mergers, close pairs, distant pairs and unperturbed galaxies. We determine the effectiveness of observational methods based on these statistics to select merging galaxies.
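The abstract does not state which definition of the Gini index is used. A form widely used for galaxy images sorts the absolute pixel fluxes and weights them by rank, which the following sketch implements; the function name and the flat list of pixel fluxes are assumptions for illustration.

```python
def gini(pixel_fluxes):
    """Gini index of a set of pixel fluxes: 0 for uniform light, 1 if one pixel holds all of it."""
    values = sorted(abs(v) for v in pixel_fluxes)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pixels")
    mean = sum(values) / n
    if mean == 0:
        return 0.0
    # Rank-weighted sum over the fluxes sorted in increasing order (ranks start at 1).
    total = sum((2 * i - n - 1) * v for i, v in enumerate(values, start=1))
    return total / (mean * n * (n - 1))

print(gini([1.0, 1.0, 1.0, 1.0]))  # uniform light distribution
print(gini([0.0, 0.0, 0.0, 5.0]))  # all light in a single pixel
```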
Excess thermal energy within a Charge-Coupled Device (CCD) results in excess electrical current that is trapped within the lattice structure of the electronics. This excess signal from the CCD itself can be present through multiple exposures, which will have an adverse effect on its science performance unless it is corrected for. The traditional way to correct for this extra charge is to take occasional long-exposure images with the camera shutter closed. These images, generally referred to as “dark” images, allow for the measurement of thermal-electron contamination at each pixel of the CCD. This so-called “dark current” can then be subtracted from the science images by re-scaling to the science exposure times. Pixels that have signal above a certain value are traditionally marked as “hot” and flagged in the data quality array. Many users will discard these pixels as being bad. However, these pixels may not be bad in the sense that they cannot be reliably dark-subtracted; if these pixels are shown to be stable over a given anneal period, the charge can be properly subtracted and the extra Poisson noise from this dark current can be taken into account and put into the error arrays.
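The rescaling and error propagation described here can be sketched as follows. This is a simplified illustration, not an instrument pipeline: pixel values are assumed to be in electrons, the dark frame is assumed to be noise-free (for example a high signal-to-noise combined dark), and the function and parameter names are hypothetical.

```python
import math

def dark_subtract(science, dark, t_science, t_dark, read_noise=0.0):
    """Subtract a dark frame rescaled to the science exposure time.

    Returns (corrected, errors), where each error includes the Poisson noise
    of both the source signal and the subtracted dark current.
    """
    scale = t_science / t_dark
    corrected, errors = [], []
    for s, d in zip(science, dark):
        dark_scaled = d * scale            # dark electrons expected in the science exposure
        source = s - dark_scaled
        # Poisson variance of source plus dark electrons, plus read noise.
        variance = max(source, 0.0) + max(dark_scaled, 0.0) + read_noise ** 2
        corrected.append(source)
        errors.append(math.sqrt(variance))
    return corrected, errors

# A "hot" pixel next to a normal one: 100 s science exposure, 400 s dark exposure.
corrected, errors = dark_subtract([150.0, 101.0], [200.0, 4.0], 100.0, 400.0)
print(corrected, errors)
```

In this picture a stable hot pixel is still usable: its value after subtraction is unbiased, and the price paid is only the larger entry in the error array.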
The structure of photospheric magnetic fields outside sunspots is investigated in three active regions using Hinode/Solar Optical Telescope (SOT) observations. We analyze the Zeeman effect in the Fe I 6301.5 and Fe I 6302.5 lines and determine the observed magnetic field value Beff for each of them. We find that the line ratio Beff(6301)/Beff(6302) is close to 1.3 in the range Beff < 0.2 kG, and close to 1.0 for 0.8 kG < Beff < 1.2 kG. We find that the observed magnetic field is formed by flux tubes with magnetic field strengths of 1.3–2.3 kG, even in places with weak observed magnetic flux. We also estimate the diameters of the smallest magnetic flux tubes to be 15–20 km.
Each CCD of LAMOST accommodates 250 spectra, while about 40 are used to observe sky background during real observations. How to estimate the unknown sky background information hidden in the observed 210 celestial spectra by using the known 40 sky spectra is the problem we solve. In order to model the sky background, usually a pre-observation is performed with all fibers observing sky background. We use the observed 250 skylight spectra as training data, where those observed by the 40 fibers are considered as a base vector set. The Locality-constrained Linear Coding (LLC) technique is utilized to represent the skylight spectra observed by the 210 fibers with the base vector set. We also segment each spectrum into small parts, and establish the local sky background model for each part. Experimental results validate the proposed method, and show the local model is better than the global model.
Compressive Sensing is an emerging technology for data compression and simultaneous data acquisition. It enables a significant reduction in data bandwidth and transmission power, and hence can greatly benefit space-flight instruments. We apply this process to detect exoplanets via gravitational microlensing. We experiment with various impact parameters that describe microlensing curves to determine the effectiveness and uncertainty caused by Compressive Sensing. Finally, we describe implications for space-flight missions.
Euclid is a Europe-led cosmology space mission dedicated to a visible and near infrared survey of the entire extra-galactic sky. Its purpose is to deepen our knowledge of the dark content of our Universe. After an overview of the Euclid mission and science, this contribution describes how the community is getting organized to face the data analysis challenges, both in software development and in operational data processing matters. It ends with a more specific account of some of the main contributions of the Swiss Science Data Center (SDC-CH).
LSST is a next-generation telescope that will produce an unprecedented data flow. The project goal is to deliver data products such as images and catalogs, thus enabling scientific analysis for a wide community of users. As a large-scale survey, LSST will provide data complementary to those of other facilities in a wide range of scientific domains, including data from ESA or ESO. European countries have invested in LSST since 2007, in the construction of the camera as well as in the computing effort. The latter will be instrumental in designing the next step: how to distribute LSST data to Europe. Astroinformatics challenges for LSST indeed include not only the analysis of LSST big data, but also the practical efficiency of the data access.
The Large Synoptic Survey Telescope (LSST), the next-generation optical imaging survey sited at Cerro Pachon in Chile, will provide an unprecedented database of astronomical measurements. The LSST design, with an 8.4m (6.7m effective) primary mirror, a 9.6 sq. deg. field of view, and a 3.2 Gigapixel camera, will allow about 10,000 sq. deg. of sky to be covered twice per night, every three to four nights on average, with typical 5-sigma depth for point sources of r=24.5 (AB). With over 800 observations in ugrizy bands over a 10-year period, these data will enable a deep stack reaching r=27.5 (about 5 magnitudes deeper than SDSS) and faint time-domain astronomy. The measured properties of newly discovered and known astrometric and photometric transients will be publicly reported within 60 sec after observation. The vast database of about 30 trillion observations of 40 billion objects will be mined for the unexpected and used for precision experiments in astrophysics. In addition to a brief introduction to LSST, we discuss a number of astro-statistical challenges that need to be overcome to extract maximum information and science results from the LSST dataset.
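As a rough consistency check on the quoted depths, the sketch below relates single-visit depth to stacked depth. It assumes idealised conditions that the abstract does not state: background-limited imaging and identical visits, so that signal-to-noise grows as the square root of the number of visits and the limiting magnitude deepens by 1.25 log10(N). The function names are hypothetical.

```python
import math

def coadd_depth(single_visit_depth, n_visits):
    """5-sigma depth of a stack of n identical, background-limited visits."""
    return single_visit_depth + 1.25 * math.log10(n_visits)

def visits_needed(depth_gain_mag):
    """Number of identical visits required to deepen the limit by depth_gain_mag."""
    return 10 ** (depth_gain_mag / 1.25)

print(coadd_depth(24.5, 100))      # depth after 100 visits in one band
print(visits_needed(27.5 - 24.5))  # visits in one band needed to gain 3 magnitudes
```

Under these assumptions, going from r=24.5 to r=27.5 takes roughly 250 visits in a single band, which is broadly in line with a programme of more than 800 visits shared between six bands.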
The tens of millions of radio sources to be detected with next-generation surveys pose new challenges, quite apart from the obvious ones of processing speed and data volumes. For example, existing algorithms are inadequate for source extraction or cross-matching radio and optical/IR sources, and a new generation of algorithms is needed using machine learning and other techniques. The large numbers of sources enable new ways of testing astrophysical models, using a variety of “large-n astronomy” techniques such as statistical redshifts. Furthermore, while unexpected discoveries account for some of the most significant discoveries in astronomy, it will be difficult to discover the unexpected in large volumes of data unless specific software is developed to mine the data for the unexpected.
The Hubble Source Catalog (HSC) is designed to enhance the science obtained from the Hubble Space Telescope by combining the tens of thousands of visit-based source lists in the Hubble Legacy Archive (HLA) across filters and detectors into a single master catalog. The catalog contains data from the major Hubble imaging instruments: Wide Field Planetary Camera 2 (WFPC2), Advanced Camera for Surveys (ACS), and Wide Field Camera 3 (WFC3). It is based on cross-matching and astrometry algorithms developed by Budavari & Lubow (2012). We recently released Version 2, which is three times the size of Version 1 and includes some new features. The catalog can be accessed through a variety of interfaces (see http://archive.stsci.edu/hst/hsc/). The HSC provides descriptions of astronomical objects involving multiple wavelengths and epochs. High relative positional accuracy of objects is achieved across the Hubble images, often with sub-pixel precision of a few milliarcseconds.
To examine the possibilities and challenges of doing science with multi-band, non-simultaneous data from upcoming surveys like LSST, the Pan-STARRS1 (PS1) 3π survey can be used as a pilot survey. This is especially important for exploring the detection and classification of variable sources within the first years of LSST’s 10-year baseline. We have explored the capabilities of PS1 3π for time-domain science in a variety of applications. We used structure function fitting as well as period fitting to search for and classify high-latitude as well as low-latitude variable sources, in particular RR Lyrae, Cepheids and QSOs.
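For readers unfamiliar with the structure function, the sketch below computes one common empirical form: the root-mean-square magnitude difference between all pairs of epochs, binned by time lag. It is an illustration only; it handles a single band, applies no correction for photometric noise, and is not necessarily the definition fitted by the authors.

```python
import math

def structure_function(times, mags, lag_bins):
    """RMS magnitude difference of epoch pairs in each (lo, hi) time-lag bin.

    Returns one value per bin, or None for bins that contain no pairs.
    """
    sums = [0.0] * len(lag_bins)
    counts = [0] * len(lag_bins)
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            lag = abs(times[j] - times[i])
            for k, (lo, hi) in enumerate(lag_bins):
                if lo <= lag < hi:
                    sums[k] += (mags[j] - mags[i]) ** 2
                    counts[k] += 1
                    break
    return [math.sqrt(s / c) if c else None for s, c in zip(sums, counts)]

# Toy light curve alternating between two brightness levels every day.
times = [0.0, 1.0, 2.0, 3.0]
mags = [0.0, 1.0, 0.0, 1.0]
print(structure_function(times, mags, [(0.5, 1.5), (1.5, 2.5)]))
```

Because it uses only pairwise differences, this statistic works on sparse, irregularly sampled light curves, where a variability amplitude as a function of time lag can still be estimated even when no period can be fitted.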
We present a simulator of alerts for the Large Synoptic Survey Telescope (LSST) developed by the Belgrade group. This simulator will be used in testing the functionality of external event brokers/Complex Event Processing (CEP) engines. It is based on the current LSST Simulation framework and allows for different classes of objects to be ‘alerted’. A Web service based on our simulator has been prototyped and can be accessed by developers of brokers/CEP engines.
An introduction is given to the use of prototype-based models in supervised machine learning. The main concept of the framework is to represent previously observed data in terms of so-called prototypes, which reflect typical properties of the data. Together with a suitable, discriminative distance or dissimilarity measure, prototypes can be used for the classification of complex, possibly high-dimensional data. We illustrate the framework in terms of the popular Learning Vector Quantization (LVQ). Most frequently, the standard Euclidean distance is employed as a distance measure. We discuss how LVQ can be equipped with more general dissimilarities. Moreover, we introduce relevance learning as a tool for the data-driven optimization of parameterized distances.
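A minimal sketch of the basic LVQ1 update with squared Euclidean distance shows how prototypes are trained and used: the prototype closest to each training sample is pulled towards it if their labels agree and pushed away otherwise. The toy data, learning rate and function names are assumptions for illustration; relevance learning and more general dissimilarities are not covered.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(x, prototypes):
    """Index of the prototype closest to x."""
    return min(range(len(prototypes)), key=lambda k: sq_dist(x, prototypes[k]))

def train_lvq1(samples, labels, prototypes, proto_labels, lr=0.1, epochs=20, seed=0):
    """LVQ1: attract the winning prototype on a label match, repel it on a mismatch."""
    rng = random.Random(seed)
    protos = [list(p) for p in prototypes]
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x = samples[i]
            w = nearest(x, protos)
            sign = 1.0 if proto_labels[w] == labels[i] else -1.0
            protos[w] = [p + sign * lr * (xi - p) for p, xi in zip(protos[w], x)]
    return protos

def classify(x, prototypes, proto_labels):
    """Assign the label of the nearest prototype."""
    return proto_labels[nearest(x, prototypes)]

# Toy two-class data with one prototype per class.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (4.0, 4.0), (4.0, 3.0), (3.0, 4.0)]
labels = [0, 0, 0, 1, 1, 1]
proto_labels = [0, 1]
protos = train_lvq1(samples, labels, [(1.0, 1.0), (3.0, 3.0)], proto_labels)
print(classify((0.2, 0.1), protos, proto_labels), classify((3.9, 4.1), protos, proto_labels))
```

After training, the prototypes themselves can be inspected as typical representatives of each class, which is the interpretability argument made for prototype-based models.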