In this paper, we propose artificial-neural-network-based (ANN-based) nonlinear algebraic models for the large-eddy simulation (LES) of compressible wall-bounded turbulence. An innovative modification is applied to the invariants and the tensor bases of the nonlinear algebraic models by using the local grid widths along each direction to normalise the corresponding gradients of the flow variables. Furthermore, the dimensionless model coefficients are determined by the ANN method. The modified ANN-based nonlinear algebraic model (MANA model) has much higher correlation coefficients and much lower relative errors than the dynamic Smagorinsky model (DSM), the Vreman model and the wall-adapting local eddy-viscosity model in the a priori test. Significantly more accurate estimations of the mean subgrid-scale (SGS) fluxes of the kinetic energy and temperature variance are also obtained by the MANA models in the a priori test. Furthermore, in the a posteriori test, the MANA model gives much more accurate predictions of the flow statistics and the mean SGS fluxes of the kinetic energy and the temperature variance than the other traditional eddy-viscosity models in compressible turbulent channel flows with untrained Reynolds numbers, Mach numbers and grid resolutions. The MANA model also performs better in predicting the flow statistics in a supersonic turbulent boundary layer. The MANA model can well predict both the direct and inverse transfer of the kinetic energy and temperature variance, overcoming the inherent shortcoming of traditional eddy-viscosity models, which cannot predict inverse energy transfer. Moreover, the MANA model is computationally more efficient than the DSM.
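For orientation, a minimal sketch of how such a model could be assembled at a single grid point is given below; the invariants, tensor bases, and layer sizes shown are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): assemble inputs for a MANA-style
# SGS model at one grid point. Velocity gradients are normalised by the local
# grid widths (dx, dy, dz) before invariants and tensor bases are formed; a
# small MLP then supplies dimensionless coefficients for the tensor bases.
import numpy as np

def normalised_gradient(dudx, dx, dy, dz):
    """Scale each column of the velocity-gradient tensor by the local grid width."""
    delta = np.array([dx, dy, dz])
    return dudx * delta[np.newaxis, :]          # g_ij = (du_i/dx_j) * Delta_j

def invariants_and_bases(g):
    s = 0.5 * (g + g.T)                         # normalised strain-rate tensor
    w = 0.5 * (g - g.T)                         # normalised rotation-rate tensor
    invariants = np.array([np.trace(s @ s), np.trace(w @ w), np.trace(s @ s @ s)])
    bases = [s,
             s @ s - np.eye(3) * np.trace(s @ s) / 3.0,
             s @ w - w @ s]
    return invariants, bases

def mlp_coefficients(invariants, weights):
    """Placeholder two-layer MLP returning one coefficient per tensor basis."""
    h = np.tanh(weights["W1"] @ invariants + weights["b1"])
    return weights["W2"] @ h + weights["b2"]

rng = np.random.default_rng(0)
weights = {"W1": rng.normal(size=(16, 3)), "b1": np.zeros(16),
           "W2": rng.normal(size=(3, 16)), "b2": np.zeros(3)}   # untrained placeholder weights

g = normalised_gradient(rng.normal(size=(3, 3)), dx=2e-3, dy=1e-3, dz=2e-3)
inv, bases = invariants_and_bases(g)
coeffs = mlp_coefficients(inv, weights)
tau_model = sum(c * b for c, b in zip(coeffs, bases))   # modelled SGS stress, up to dimensional scaling
```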
A model based on a convolutional neural network (CNN) is designed to reconstruct the three-dimensional turbulent flows beneath a free surface using surface measurements, including the surface elevation and surface velocity. Trained on datasets obtained from the direct numerical simulation of turbulent open-channel flows with a deformable free surface, the proposed model can accurately reconstruct the near-surface flow field and capture the characteristic large-scale flow structures away from the surface. The reconstruction performance of the model, measured by metrics such as the normalised mean squared reconstruction errors and scale-specific errors, is considerably better than that of the traditional linear stochastic estimation (LSE) method. We further analyse the saliency maps of the CNN model and the kernels of the LSE model and obtain insights into how the two models utilise surface features to reconstruct subsurface flows. The importance of different surface variables is analysed based on the saliency map of the CNN, which reveals knowledge about the surface–subsurface relations. The CNN is also shown to have a good generalisation capability with respect to the Froude number when a model trained for a flow with a high Froude number is applied to predict flows with lower Froude numbers. The results presented in this work indicate that the CNN is effective at detecting subsurface flow structures and, by interpreting the surface–subsurface relations underlying the reconstruction model, can be a promising tool for assisting with the physical understanding of free-surface turbulence.
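A hedged sketch of this kind of surface-to-subsurface mapping is shown below; the layer widths, number of output planes, and input channels (surface elevation plus two surface velocity components) are illustrative assumptions rather than the authors' architecture.

```python
# Hypothetical sketch of a surface-to-subsurface reconstruction network: a small
# convolutional stack maps surface fields to velocity components on several
# subsurface planes.
import torch
import torch.nn as nn

class SurfaceToSubsurfaceCNN(nn.Module):
    def __init__(self, n_planes: int = 8):
        super().__init__()
        self.n_planes = n_planes
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3 * n_planes, kernel_size=3, padding=1),
        )

    def forward(self, surface):                 # surface: (batch, 3, ny, nx)
        out = self.net(surface)                 # (batch, 3 * n_planes, ny, nx)
        b, _, ny, nx = out.shape
        return out.view(b, 3, self.n_planes, ny, nx)   # (batch, 3 velocity components, depth, ny, nx)

model = SurfaceToSubsurfaceCNN()
surface = torch.randn(4, 3, 64, 64)             # elevation, u_s, v_s on a 64x64 surface grid
subsurface = model(surface)
# Training would minimise a reconstruction loss against DNS subsurface fields:
loss = nn.functional.mse_loss(subsurface, torch.zeros_like(subsurface))
```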
Large-scale microdata on group identity are critical for studies on identity politics and violence but remain largely unavailable for developing countries. We use personal names to infer religion in South Asia—where religion is a salient social division, and yet disaggregated data on it are scarce. Existing work predicts religion using a dictionary-based method and therefore cannot classify unseen names. We provide character-based machine-learning models that can classify unseen names as well, with high accuracy. Our models are also much faster and hence scalable to large datasets. We explain the classification decisions of one of our models using the layer-wise relevance propagation technique. The character patterns learned by the classifier are rooted in the linguistic origins of names. We apply these models to infer the religion of electoral candidates using historical data on Indian elections and observe a trend of declining Muslim representation. Our approach can be used to detect identity groups across the world for whom the underlying names might have different linguistic roots.
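The following is a minimal sketch of character-based name classification in the spirit described above, using character n-gram features and a linear classifier; the placeholder names, labels, and hyperparameters are assumptions, not the authors' models or data.

```python
# Minimal sketch of character-based name classification: character n-gram
# features feed a linear classifier, so names never seen in training can
# still be scored (unlike a fixed dictionary lookup).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder toy data; the real study uses large labelled name datasets.
names = ["name one", "name two", "name three", "name four"]
labels = ["group_a", "group_a", "group_b", "group_b"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # sub-word character patterns
    LogisticRegression(max_iter=1000),
)
model.fit(names, labels)
print(model.predict(["an unseen name"]))        # generalises beyond a fixed dictionary
```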
The historical development of statistics and artificial intelligence (AI) is outlined, with machine learning (ML) emerging as the dominant branch of AI. Data science is viewed as being composed of a yin part (ML) and a yang part (statistics), and environmental data science is the intersection between data science and environmental science. Supervised learning and unsupervised learning are compared. Basic concepts of underfitting/overfitting and the curse of dimensionality are introduced.
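The underfitting/overfitting trade-off can be illustrated with a short, self-contained example (not taken from the book): polynomials of increasing degree are fitted to noisy data, and as the model grows too flexible the training error keeps shrinking while the test error stops improving.

```python
# Illustrative under/overfitting demo on synthetic data: compare training and
# test error for polynomial fits of increasing degree.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60)[:, None]
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(0, 0.2, 60)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

for degree in (1, 3, 12):                       # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(x_train)),   # training error
          mean_squared_error(y_test, model.predict(x_test)))     # test (generalisation) error
```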
Statistical and machine learning methods have many applications in the environmental sciences, including prediction and data analysis in meteorology, hydrology and oceanography; pattern recognition for satellite images from remote sensing; management of agriculture and forests; assessment of climate change; and much more. With rapid advances in machine learning in the last decade, this book provides an urgently needed, comprehensive guide to machine learning and statistics for students and researchers interested in environmental data science. It includes intuitive explanations covering the relevant background mathematics, with examples drawn from the environmental sciences. A broad range of topics is covered, including correlation, regression, classification, clustering, neural networks, random forests, boosting, kernel methods, evolutionary algorithms and deep learning, as well as the recent merging of machine learning and physics. End‑of‑chapter exercises allow readers to develop their problem-solving skills, and online datasets allow readers to practise analysis of real data.
Causal inference and machine learning are typically introduced in the social sciences separately as theoretically distinct methodological traditions. However, applications of machine learning in causal inference are increasingly prevalent. This Element provides theoretical and practical introductions to machine learning for social scientists interested in applying such methods to experimental data. We show how machine learning can be useful for conducting robust causal inference and provide a theoretical foundation researchers can use to understand and apply new methods in this rapidly developing field. We then demonstrate two specific methods – the prediction rule ensemble and the causal random forest – for characterizing treatment effect heterogeneity in survey experiments and testing the extent to which such heterogeneity is robust to out-of-sample prediction. We conclude by discussing limitations and tradeoffs of such methods, while directing readers to additional related methods available on the Comprehensive R Archive Network (CRAN).
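As a rough illustration of estimating treatment effect heterogeneity (not the Element's code, and using a simple "T-learner" with random forests as a stand-in for the causal random forest it discusses), consider the following sketch on synthetic experimental data.

```python
# T-learner sketch: fit separate outcome models for treated and control units
# from a randomised experiment, then take their difference as an estimate of
# the conditional average treatment effect (CATE).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))                    # pre-treatment covariates
T = rng.integers(0, 2, n)                      # randomised treatment assignment
tau = 0.5 * X[:, 0]                            # synthetic heterogeneous effect
Y = X[:, 1] + tau * T + rng.normal(0, 1, n)    # synthetic outcome

m1 = RandomForestRegressor(random_state=0).fit(X[T == 1], Y[T == 1])
m0 = RandomForestRegressor(random_state=0).fit(X[T == 0], Y[T == 0])
cate_hat = m1.predict(X) - m0.predict(X)       # estimated conditional effects
print(np.corrcoef(cate_hat, tau)[0, 1])        # how well the heterogeneity is recovered
```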
Research on judges and courts traditionally centers on judgments, treating each judgment as a unit of observation. However, judgments often address multiple distinct and more or less unrelated issues. Studying judicial behavior at the judgment level therefore loses potentially important details and risks drawing false conclusions from the data. We present a method to assist researchers with splitting judgments by issues using a supervised machine learning classifier. Applying our approach to splitting judgments by the Court of Justice of the European Union into issues, we show that this approach is practically feasible and provides benefits for text-based analysis of judicial behavior.
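One hedged way such a classifier could be set up, treating splitting as paragraph-level boundary detection, is sketched below; the paper's exact formulation, features, and labels may differ.

```python
# Hypothetical formulation: a supervised classifier predicts whether a paragraph
# opens a new issue; a judgment is then split at every flagged paragraph.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder paragraphs and labels: 1 = opens a new issue, 0 = continues the previous issue.
paragraphs = [
    "By its first plea, the applicant alleges an infringement of the duty to state reasons.",
    "By its second plea, the applicant alleges a breach of the principle of proportionality.",
    "That argument develops the point made in the preceding paragraph.",
]
starts_new_issue = [1, 1, 0]

boundary_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
boundary_clf.fit(paragraphs, starts_new_issue)

print(boundary_clf.predict(
    ["By its third plea, the applicant alleges a manifest error of assessment."]))
```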
The alternating direction method of multipliers (ADMM) receives much attention in the fields of optimization and computer science. The generalized ADMM (G-ADMM) proposed by Eckstein and Bertsekas incorporates an acceleration factor and is more efficient than the original ADMM. However, G-ADMM is not applicable to models in which the objective function value (or its gradient) is computationally costly or even impossible to compute. In this paper, we consider the two-block separable convex optimization problem with linear constraints, where only noisy estimates of the gradient of the objective function are accessible. Under this setting, we propose a stochastic linearized generalized ADMM (SLG-ADMM) in which the two subproblems are approximated by linearization strategies. We analyze the expected convergence rates and large-deviation properties of SLG-ADMM. In particular, we show that the worst-case expected convergence rates of SLG-ADMM are $\mathcal{O}\left( {N}^{-1/2}\right)$ and $\mathcal{O}\left({\ln N} \cdot {N}^{-1}\right)$ for solving general convex and strongly convex problems, respectively, where $N$ is the iteration number (similarly hereinafter), and that, with high probability, SLG-ADMM has $\mathcal{O}\left( \ln N \cdot N^{-1/2} \right)$ and $\mathcal{O}\left( \left( \ln N \right)^{2} \cdot N^{-1} \right)$ constraint-violation and objective-error bounds for general convex and strongly convex problems, respectively.
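For orientation, one common way to write such a stochastic linearized generalized ADMM iteration for $\min_{x,y}\{f(x)+g(y) : Ax+By=b\}$ is sketched below; the notation (penalty $\beta$, relaxation factor $\alpha$, step size $\eta_{k}$, noisy gradient $G(x^{k},\xi^{k})\approx\nabla f(x^{k})$) is ours, and the paper's exact scheme may differ in details:

$$
\begin{aligned}
x^{k+1} &= \arg\min_{x}\ \langle G(x^{k},\xi^{k}),x\rangle + \frac{\beta}{2}\bigl\|Ax+By^{k}-b-\lambda^{k}/\beta\bigr\|^{2} + \frac{1}{2\eta_{k}}\|x-x^{k}\|^{2},\\
y^{k+1} &= \arg\min_{y}\ g(y) + \frac{\beta}{2}\bigl\|\alpha Ax^{k+1}-(1-\alpha)(By^{k}-b)+By-b-\lambda^{k}/\beta\bigr\|^{2},\\
\lambda^{k+1} &= \lambda^{k}-\beta\bigl(\alpha Ax^{k+1}-(1-\alpha)(By^{k}-b)+By^{k+1}-b\bigr).
\end{aligned}
$$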
Risk of suicide-related behaviors is elevated among military personnel transitioning to civilian life. An earlier report showed that high-risk U.S. Army soldiers could be identified shortly before this transition with a machine learning model that included predictors from administrative systems, self-report surveys, and geospatial data. Based on this result, a Veterans Affairs and Army initiative was launched to evaluate a suicide-prevention intervention for high-risk transitioning soldiers. To make targeting practical, though, a streamlined model and risk calculator were needed that used only a short series of self-report survey questions.
Methods
We revised the original model in a sample of n = 8335 observations from the Study to Assess Risk and Resilience in Servicemembers-Longitudinal Study (STARRS-LS) who participated in one of three Army STARRS 2011–2014 baseline surveys while in service and in one or more subsequent panel surveys (LS1: 2016–2018, LS2: 2018–2019) after leaving service. We trained ensemble machine learning models with constrained numbers of item-level survey predictors in a 70% training sample. The outcome was self-reported post-transition suicide attempts (SA). The models were validated in the 30% test sample.
Results
Twelve-month post-transition SA prevalence was 1.0% (s.e. = 0.1). The best constrained model, with only 17 predictors, had a test sample ROC-AUC of 0.85 (s.e. = 0.03). The 10–30% of respondents with the highest predicted risk included 44.9–92.5% of 12-month SAs.
Conclusions
An accurate SA risk calculator based on a short self-report survey can target transitioning soldiers shortly before leaving service for intervention to prevent post-transition SA.
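A schematic of this kind of constrained-predictor modelling and test-sample evaluation, on synthetic data rather than STARRS-LS, might look as follows; the feature selector, learner, and sample sizes are illustrative assumptions.

```python
# Schematic pipeline: restrict the model to a small number of survey items,
# train on a 70% split, and evaluate discrimination (ROC-AUC) for a rare
# outcome on the 30% test split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(8000, 100))                      # placeholder item-level survey responses
y = (rng.random(8000) < 0.01).astype(int)             # placeholder rare outcome (~1% prevalence)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

model = make_pipeline(
    SelectKBest(mutual_info_classif, k=17),           # constrain the model to 17 survey items
    GradientBoostingClassifier(random_state=0),
)
model.fit(X_tr, y_tr)
print("test ROC-AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```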
This work deals with the investigation and modelling of wall pressure fluctuations induced by a supersonic jet over a tangential flat plate. The analysis is performed at several nozzle pressure ratios around the nozzle design Mach number, including slightly over-expanded and under-expanded conditions, and for different radial positions of the rigid plate. Pitot measurements and flow visualizations using the background-oriented schlieren technique provided a general overview of the aerodynamic interactions between the jet flow and the plate at the different regimes and configurations. Wall pressure fluctuations were measured using a pair of piezoelectric pressure transducers flush mounted on the plate surface. A spectral analysis was carried out to clarify the effect of the plate position on the single and multivariate wall pressure statistics, including the screech tone amplitude. The experimental dataset is used to assess and validate a surrogate model based on artificial neural networks. Sound pressure levels and coherence functions are modelled by means of a single fully connected network, built on the basis of a recently implemented fully deterministic topology optimization algorithm. The metamodel uncertainty is also quantified using the spatial correlation function. It is shown that the flow behaviour, as well as the screech and broadband noise signatures, is significantly influenced by the presence of the plate, and that the effects on spectral quantities are correctly reproduced by the proposed data-driven model, which provides predictions in agreement with the available data.
Converging evidence suggests that a subgroup of bipolar disorder (BD) with an early age at onset (AAO) may develop from aberrant neurodevelopment. However, the definition of early AAO remains unprecise. We thus tested which age cut-off for early AAO best corresponds to distinguishable neurodevelopmental pathways.
Methods
We analyzed data from the FondaMental Advanced Center of Expertise-Bipolar Disorder cohort, a naturalistic sample of 4421 patients. First, a supervised learning framework was applied in binary classification experiments using neurodevelopmental history to predict early AAO, defined either with Gaussian mixture models (GMM) clustering or with each of the different cut-offs in the range 14 to 25 years. Second, an unsupervised learning approach was used to find clusters based on neurodevelopmental factors and to examine the overlap between such data-driven groups and definitions of early AAO used for supervised learning.
Results
A young cut-off, i.e. 14 to 16 years, yielded higher separability [mean nested cross-validation test AUROC = 0.7327 (± 0.0169) for ⩽16 years]. Predictive performance deteriorated when the cut-off was increased or when early AAO was defined with GMM. Similarly, defining early AAO below 17 years was associated with a higher degree of overlap with the data-driven clusters (normalized mutual information = 0.41 for ⩽17 years) relative to other definitions.
Conclusions
Early AAO best captures distinctive neurodevelopmental patterns when defined as ⩽17 years. The GMM-based definition of early AAO falls short of mapping onto highly distinguishable neurodevelopmental pathways. These results should be used to improve patient stratification in future studies of BD pathophysiology and biomarkers.
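An illustrative sketch of the two labelling strategies compared above, on synthetic data rather than the FACE-BD cohort, is given below; the classifier, cut-offs, and cross-validation setup are assumptions for demonstration only.

```python
# Compare how well neurodevelopmental features predict "early onset" labels
# defined either by fixed age cut-offs or by Gaussian mixture clustering of AAO.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
aao = np.clip(rng.normal(24, 8, 2000), 10, 60)          # placeholder ages at onset
X = rng.normal(size=(2000, 12))                          # placeholder neurodevelopmental factors

# Data-driven definition of early onset via Gaussian mixture clustering of AAO.
gmm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(aao.reshape(-1, 1))

# Fixed cut-off definitions: how separable are the resulting classes?
for cutoff in (16, 21, 25):
    y = (aao <= cutoff).astype(int)
    auc = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                          cv=5, scoring="roc_auc").mean()
    print(f"cut-off <= {cutoff}: mean AUROC = {auc:.3f}")

auc_gmm = cross_val_score(RandomForestClassifier(random_state=0), X, gmm_labels,
                          cv=5, scoring="roc_auc").mean()
print(f"GMM-based labels: mean AUROC = {auc_gmm:.3f}")
```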
The coronavirus disease 2019 (COVID-19) pandemic has led us to use virtual solutions and emerging technologies such as artificial intelligence (AI). Recent studies have clearly demonstrated the role of AI in health care and medical practice; however, a comprehensive review can identify potential yet unfulfilled functionalities of such technologies in pandemics. Therefore, this scoping review aims to assess AI functionalities in the COVID-19 pandemic as of 2022.
Methods:
A systematic search was carried out in PubMed, Cochrane Library, Scopus, Science Direct, ProQuest, and Web of Science from 2019 to May 9, 2022. Researchers selected the articles according to the search keywords. Finally, the articles mentioning the functionalities of AI in the COVID-19 pandemic were evaluated. Two investigators performed this process.
Results:
The initial search resulted in 9123 articles. After reviewing the title, abstract, and full text of these articles and applying the inclusion and exclusion criteria, 4 articles were selected for the final analysis. All 4 were cross-sectional studies. Two studies (50%) were performed in the United States, 1 (25%) in Israel, and 1 (25%) in Saudi Arabia. They covered the functionalities of AI in the prediction, detection, and diagnosis of COVID-19.
Conclusions:
To the extent of the researchers’ knowledge, this study is the first scoping review to assess AI functionalities in the COVID-19 pandemic. Health-care organizations need decision support technologies and evidence-based apparatuses that can perceive, think, and reason in ways not dissimilar to human beings. Potential functionalities of such technologies include predicting mortality; detecting, screening, and tracing current and former patients; analyzing health data; prioritizing high-risk patients; and better allocating hospital resources in pandemics, and in health-care settings generally.
Current categorical classification systems of psychiatric diagnoses lead to heterogeneity of symptoms within disorders and common co-occurrence of disorders. We investigated the heterogeneous and overlapping nature of symptom endorsement in a population-based sample across three of the most common categories of psychiatric disorders: depressive disorders, anxiety disorders, and sleep–wake disorders, using unsupervised machine learning approaches.
Methods
We assessed a total of 43 symptoms in a discovery sample of 6,602 participants of the population-based Rotterdam Study between 2009 and 2013, and in a replication sample of 3,005 participants between 2016 and 2020. Symptoms were assessed using the Center for Epidemiologic Studies Depression Scale, the Hospital Anxiety and Depression Scale, and the Pittsburgh Sleep Quality Index. Hierarchical clustering analysis was applied to test items and participants to investigate common patterns of symptom co-occurrence, which were further investigated quantitatively with clustering methods to find groups that may represent similar psychiatric phenotypes.
Results
First, clustering analyses of the questionnaire items suggested a three-cluster solution representing clusters of “mixed” symptoms, “depressed affect and nervousness”, and “troubled sleep and interpersonal problems”. A highly similar clustering solution was independently established in the replication sample. Second, four groups of participants could be separated, and these groups scored differently on the item clusters.
Conclusions
We identified three clusters of psychiatric symptoms that most commonly co-occur in a population-based sample. These symptom clusters were stable across samples but cut across the topics of depression, anxiety, and poor sleep. We identified four groups of participants that share (sub)clinical symptoms and might benefit from similar prevention or treatment strategies, despite potentially diverging, or absent, diagnoses.
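A minimal sketch of the clustering workflow on synthetic questionnaire data (not the Rotterdam Study) is shown below; the distance measures, linkage methods, and cluster counts are illustrative choices.

```python
# Hierarchical clustering applied both to questionnaire items (via their
# correlation structure) and to participants (via their response profiles).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
responses = rng.integers(0, 4, size=(500, 43)).astype(float)     # participants x items (placeholder)

# Cluster items: distance = 1 - correlation between item responses.
item_dist = 1.0 - np.corrcoef(responses.T)
np.fill_diagonal(item_dist, 0.0)
item_link = linkage(squareform(item_dist, checks=False), method="average")
item_clusters = fcluster(item_link, t=3, criterion="maxclust")   # e.g. a three-cluster item solution

# Cluster participants on their item responses.
person_link = linkage(responses, method="ward")
person_groups = fcluster(person_link, t=4, criterion="maxclust") # e.g. four participant groups
```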
We propose and explore the possibility that language models can be studied as effective proxies for specific human subpopulations in social science research. Practical and research applications of artificial intelligence tools have sometimes been limited by problematic biases (such as racism or sexism), which are often treated as uniform properties of the models. We show that the “algorithmic bias” within one such tool—the GPT-3 language model—is instead both fine-grained and demographically correlated, meaning that proper conditioning will cause it to accurately emulate response distributions from a wide variety of human subgroups. We term this property algorithmic fidelity and explore its extent in GPT-3. We create “silicon samples” by conditioning the model on thousands of sociodemographic backstories from real human participants in multiple large surveys conducted in the United States. We then compare the silicon and human samples to demonstrate that the information contained in GPT-3 goes far beyond surface similarity. It is nuanced, multifaceted, and reflects the complex interplay between ideas, attitudes, and sociocultural context that characterize human attitudes. We suggest that language models with sufficient algorithmic fidelity thus constitute a novel and powerful tool to advance understanding of humans and society across a variety of disciplines.
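A hedged sketch of the "silicon sampling" idea follows; the backstory fields, prompt wording, and the placeholder generate function are assumptions, not the authors' conditioning templates or API calls.

```python
# Build a first-person conditioning prompt from a respondent's sociodemographic
# backstory and pass it to a language model; the model's completion is then
# compared with the matched human respondent's survey answer.
def backstory_prompt(profile: dict, question: str) -> str:
    backstory = (
        f"I am {profile['age']} years old. I am {profile['gender']}. "
        f"Racially, I identify as {profile['race']}. Politically, I lean {profile['ideology']}."
    )
    return f"{backstory}\nWhen asked: \"{question}\"\nI answer:"

def generate(prompt: str) -> str:
    raise NotImplementedError("placeholder for an LLM completion call")

profile = {"age": 45, "gender": "a woman", "race": "white", "ideology": "conservative"}
prompt = backstory_prompt(profile, "Do you support raising the federal minimum wage?")
# response = generate(prompt)   # compared against the matched human respondent's answer
```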
Explainability is highly desired in machine learning (ML) systems supporting high-stakes policy decisions in areas such as health, criminal justice, education, and employment. While the field of explainable ML has expanded in recent years, much of this work has not taken real-world needs into account. A majority of proposed methods are designed with generic explainability goals, without well-defined use cases or intended end users, and are evaluated on simplified tasks, benchmark problems/datasets, or with proxy users (e.g., Amazon Mechanical Turk). We argue that these simplified evaluation settings do not capture the nuances and complexities of real-world applications. As a result, the applicability and effectiveness of this large body of theoretical and methodological work in real-world applications are unclear. In this work, we take steps toward addressing this gap for the domain of public policy. First, we identify the primary use cases of explainable ML within public policy problems. For each use case, we define the end users of explanations and the specific goals the explanations have to fulfill. Finally, we map existing work in explainable ML to these use cases, identify gaps in established capabilities, and propose research directions to fill those gaps in order to have a practical societal impact through ML. The contribution is (a) a methodology for explainable ML researchers to identify use cases and develop methods targeted at them and (b) an application of that methodology to the domain of public policy, providing researchers with an example of how to develop explainable ML methods that result in real-world impact.
Automatically extracting knowledge from small datasets with a valid causal ordering is a challenge for current state-of-the-art methods in machine learning. Extracting other types of knowledge is important but challenging for multiple engineering fields where data are scarce and difficult to collect. This research aims to address this problem by presenting a machine learning-based modeling framework that leverages the knowledge available in the fundamental units of the variables recorded from data samples to develop parsimonious, explainable, and graph-based simulation models during the early design stages. The developed approach is exemplified using an engineering design case study of a spherical body moving in a fluid. For the system of interest, two types of intricate models are generated by (1) using an automated selection of variables from datasets and (2) combining the automated extraction with supplementary knowledge about functions and dimensional homogeneity associated with the variables of the system. The effect of design, data, model, and simulation specifications on model fidelity is investigated. The study discusses the interrelationships between fidelity levels, variables, functions, and the available knowledge. The research contributes to the development of a fidelity measurement theory by presenting the premises of a standardized modeling approach for transforming data into measurable levels of fidelity for the produced models. This research shows that structured model building with a focus on model fidelity can support early design reasoning and decision making using, for example, the dimensional analysis conceptual modeling (DACM) framework.
Archaeologists tend to produce slow data that is contextually rich but often difficult to generalize. An example is the analysis of lithic microdebitage, or knapping debris, that is smaller than 6.3 mm (0.25 in.). So far, scholars have relied on manual approaches that are prone to intra- and interobserver errors. In the following, we present a machine learning–based alternative together with experimental archaeology and dynamic image analysis. We use a dynamic image particle analyzer to measure each particle in experimentally produced lithic microdebitage (N = 5,299) as well as in an archaeological soil sample (N = 73,313). We developed four machine learning models based on Naïve Bayes, glmnet (generalized linear regression), random forest, and XGBoost (“Extreme Gradient Boost[ing]”) algorithms. Hyperparameter tuning optimized each model. A random forest model performed best, with a sensitivity of 83.5%; it misclassified only 28 particles, or 0.9%, of the lithic microdebitage. XGBoost models reached a sensitivity of 67.3%, whereas Naïve Bayes and glmnet models stayed below 50%. Except for the glmnet models, transparency proved to be the most critical variable for distinguishing microdebitage. Our approach objectifies and standardizes microdebitage analysis. Machine learning allows the study of much larger sample sizes. Algorithms differ, though, and a random forest model offers the best performance so far.
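A schematic random forest classification of particle-analyzer measurements, on synthetic data rather than the published dataset, might look as follows; the feature set and labels are placeholders.

```python
# Train a random forest to separate microdebitage from other soil particles
# using particle-analyzer features, and report sensitivity (recall) on a
# held-out set along with the most important feature.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 6))                 # placeholder features (e.g. size, transparency, circularity, ...)
y = (X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 1, n) > 1.0).astype(int)   # placeholder class labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
print("sensitivity:", recall_score(y_te, rf.predict(X_te)))             # recall for the positive class
print("most important feature index:", rf.feature_importances_.argmax())
```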
This study proposes a newly developed deep-learning-based method to generate turbulent inflow conditions for spatially developing turbulent boundary layer (TBL) simulations. A combination of a transformer and a multiscale-enhanced super-resolution generative adversarial network is utilised to predict velocity fields of a spatially developing TBL at various planes normal to the streamwise direction. Datasets of direct numerical simulation (DNS) of flat-plate flow spanning a momentum-thickness-based Reynolds number range of $Re_\theta = 661.5$–$1502.0$ are used to train and test the model. The model shows a remarkable ability to predict the instantaneous velocity fields with detailed fluctuations and reproduces the turbulence statistics as well as spatial and temporal spectra with commendable accuracy as compared with the DNS results. The proposed model also exhibits reasonable accuracy in predicting velocity fields at Reynolds numbers that are not used in the training process. With the aid of transfer learning, the computational cost of the proposed model is considered to be effectively low. Furthermore, applying the generated turbulent inflow conditions to an inflow–outflow simulation reveals a negligible development distance for the TBL to reach the target statistics. The results demonstrate for the first time that transformer-based models can be efficient in predicting the dynamics of turbulent flows. They also show that combining these models with generative adversarial network-based models can be useful in tackling various turbulence-related problems, including the development of efficient synthetic turbulent-inflow generators.
This chapter explores the potential for gamesmanship in technology-assisted discovery.1 Attorneys have long embraced gamesmanship strategies in analog discovery, producing reams of irrelevant documents, delaying depositions, or interpreting requests in a hyper-technical manner.2 The new question, however, is whether machine learning technologies can transform gaming strategies. By now it is well known that technologies have reinvented the practice of civil litigation and, specifically, the extensive search for relevant documents in complex cases. Many sophisticated litigants use machine learning algorithms – under the umbrella of “Technology Assisted Review” (TAR) – to simplify the identification and production of relevant documents in discovery.3 Litigants employ TAR in cases ranging from antitrust to environmental law, civil rights, and employment disputes. But as the field becomes increasingly influenced by engineers and technologists, a string of commentators has raised questions about TAR, including lawyers’ professional role, underlying incentive structures, and the dangers of new forms of gamesmanship and abuse.4