This chapter describes methods based on gradient information that achieve faster rates than basic algorithms such as those described in Chapter 3. These accelerated gradient methods, most notably the heavy-ball method and Nesterov’s optimal method, use the concept of momentum: each step combines the current gradient with information from earlier steps. We describe and analyze these methods using Lyapunov functions, treating the convex and strongly convex cases separately. We motivate these methods using continuous-time limits, which link gradient methods to dynamical systems described by differential equations. We also mention the conjugate gradient method, which was developed independently of the other methods but which also makes use of momentum. Finally, we discuss the concept of lower bounds on algorithmic complexity, introducing a function on which no method based on gradients can attain convergence faster than a certain given rate.
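As a concrete illustration (not drawn from the chapter itself), here is a minimal sketch of the heavy-ball iteration in Python; the stepsize and momentum values below are illustrative choices, not tuned constants from the text:

```python
import numpy as np

def heavy_ball(grad, x0, alpha, beta, iters=500):
    """Heavy-ball method: each step adds a momentum term
    beta * (x_k - x_{k-1}) to the usual gradient step."""
    x_prev = x0.copy()
    x = x0.copy()
    for _ in range(iters):
        x_next = x - alpha * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Minimize f(x) = 0.5 x^T A x for an ill-conditioned diagonal A;
# momentum damps the zig-zagging that plain gradient descent exhibits.
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
x = heavy_ball(grad, np.array([1.0, 1.0]), alpha=0.01, beta=0.9)
```

For this quadratic, the iterates converge to the minimizer at the origin.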
Here, we describe methods for minimizing a smooth function over a closed convex set, using gradient information. We first state results that characterize optimality of points in a checkable way, and describe the vital operation of projection onto the feasible set. We then describe the projected gradient algorithm, which is in a sense the extension of the steepest-descent method to the constrained case, analyze its convergence, and describe several extensions. Finally, we analyze the conditional-gradient method (also known as “Frank-Wolfe”) for the case in which the feasible set is compact and demonstrate sublinear convergence of this approach when the objective function is convex.
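As an illustration (not taken from the chapter), a minimal projected gradient sketch in Python; the objective, feasible set, and stepsize are illustrative:

```python
import numpy as np

def projected_gradient(grad, project, x0, alpha, iters=200):
    """Projected gradient: take a gradient step, then project
    the result back onto the feasible set."""
    x = project(x0)
    for _ in range(iters):
        x = project(x - alpha * grad(x))
    return x

# Minimize f(x) = 0.5 ||x - c||^2 over the unit ball; projection onto
# the ball is simply a rescaling when the point lies outside it.
c = np.array([3.0, 4.0])
grad = lambda x: x - c
project = lambda x: x / max(1.0, np.linalg.norm(x))
x = projected_gradient(grad, project, np.zeros(2), alpha=0.5)
```

Here the solution is the projection of c onto the ball, namely c / ||c|| = (0.6, 0.8).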
Here, we discuss concepts of duality for convex optimization problems, and algorithms that make use of these concepts. We define the Lagrangian function and its augmented Lagrangian counterpart. We use the Lagrangian to derive optimality conditions for constrained optimization problems in which the constraints are expressed as linear algebraic conditions. We introduce the dual problem, discuss the concepts of weak and strong duality, and show the existence of positive duality gaps in certain settings. Next, we discuss the dual subgradient method, the augmented Lagrangian method, and the alternating direction method of multipliers (ADMM), which are useful for several types of data science problems.
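As a sketch (not from the chapter itself), ADMM applied to the lasso problem, splitting the smooth and nonsmooth terms via the constraint x = z; the penalty parameter rho and the small test problem are illustrative:

```python
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, iters=200):
    """ADMM sketch for min 0.5||Ax - b||^2 + lam ||z||_1 s.t. x = z,
    alternating an x-update, a z-update, and a multiplier update."""
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    Q = np.linalg.inv(A.T @ A + rho * np.eye(n))   # factor once
    for _ in range(iters):
        x = Q @ (A.T @ b + rho * (z - u))          # x-update: ridge solve
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # soft-threshold
        u = u + x - z                              # scaled dual update
    return z

# With A = I, the lasso solution is soft-thresholding of b by lam.
A = np.eye(3)
b = np.array([3.0, 0.5, -2.0])
z = admm_lasso(A, b, lam=1.0)
```

For this toy instance the iterates approach (2, 0, -1), the soft-thresholded b.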
In this introductory chapter, we outline the ways in which various problems in data analysis can be formulated as optimization problems. Specifically, we discuss least squares problems, problems in matrix optimization (particularly those involving low-rank matrices), linear and kernel support vector machines, binary and multiclass logistic regression, and deep learning. We also outline the scope of the remainder of the book.
We describe the stochastic gradient method, the fundamental algorithm for several important problems in data science, including deep learning. We give several example problems for which this method is suitable, then describe its operation for the simple problem of computing the mean of a collection of values. We relate it to a classical method, the Kaczmarz method for solving a system of linear equalities and inequalities. Next, we describe the key assumptions used in convergence analysis, then the convergence rates attainable by several variants of stochastic gradient under several scenarios. Finally, we discuss several aspects of practical implementation of stochastic gradient, including minibatching and acceleration.
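The mean-computation example mentioned above can be sketched as follows (an illustration, not the book's code); with the diminishing stepsize 1/(k+1), each iterate is exactly the running average of the sampled values:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.0, size=10_000)

# Stochastic gradient for f(x) = (1/2n) * sum_i (x - a_i)^2, whose
# minimizer is the mean of the a_i. Each step samples a single term
# and follows its (stochastic) gradient.
x = 0.0
for k in range(len(data)):
    i = rng.integers(len(data))
    step = 1.0 / (k + 1)          # diminishing stepsize
    x -= step * (x - data[i])     # gradient of the sampled term
```

After enough samples, x approaches the sample mean (about 5.0 here).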
We outline theoretical foundations for smooth optimization problems. First, we define the different types of minimizers (solutions) of unconstrained optimization problems. Next, we state Taylor’s theorem, the fundamental theorem of smooth optimization, which allows us to approximate general smooth functions by simpler (linear or quadratic) functions based on information at the current point. We show how minima can be characterized by optimality conditions involving the gradient or Hessian, which can be checked in practice. Finally, we define the convexity of sets and functions, an important property that arises often in practice and that can be exploited by the algorithms described in the remainder of the book.
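A small numerical check (not from the chapter) of the quadratic Taylor approximation that underlies the analysis, using f(x) = exp(x) at x0 = 0, where f and all its derivatives equal 1:

```python
import numpy as np

# Second-order Taylor model of f(x) = exp(x) around x0 = 0:
# f(x0 + p) ≈ f(x0) + f'(x0) p + 0.5 f''(x0) p^2, with error O(|p|^3).
f = np.exp
x0, p = 0.0, 0.01
quadratic_model = f(x0) + f(x0) * p + 0.5 * f(x0) * p ** 2
error = abs(f(x0 + p) - quadratic_model)
```

The error is roughly p^3 / 6 ≈ 1.7e-7, consistent with the cubic remainder term.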
This chapter describes the coordinate descent approach, in which a single variable (or a block of variables) is updated at each iteration, usually based on partial derivative information for those variables, while the remainder are left unchanged. We describe two problems in machine learning for which this approach has potential advantages relative to the approaches described in previous chapters (which make use of the full gradient), and present convergence analyses for the randomized and cyclic versions of this approach. We show that convergence rates of block coordinate descent methods can be analyzed in a similar fashion to the basic single-component methods.
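A minimal randomized coordinate descent sketch (illustrative, not the book's code) for a least-squares objective, where each update uses only one partial derivative and the residual is maintained cheaply:

```python
import numpy as np

def coordinate_descent(A, b, x0, iters=2000, seed=0):
    """Randomized coordinate descent for f(x) = 0.5 ||Ax - b||^2:
    pick a coordinate at random and minimize exactly along it."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    col_norms = (A ** 2).sum(axis=0)   # per-coordinate curvature
    r = A @ x - b                      # maintained residual
    for _ in range(iters):
        i = rng.integers(len(x))
        g = A[:, i] @ r                # partial derivative w.r.t. x_i
        d = -g / col_norms[i]          # exact coordinate minimizer
        x[i] += d
        r += d * A[:, i]               # cheap residual update
    return x

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = coordinate_descent(A, b, np.zeros(2))
```

For this small full-rank system, the iterates converge to the least-squares solution.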
Here, we describe algorithms for minimizing nonsmooth functions and composite nonsmooth functions, which are the sum of a smooth function and a (usually elementary) nonsmooth function. We start with the subgradient descent method, whose search direction is the minimum-norm element of the subdifferential. We then discuss the subgradient method, which steps along an arbitrary direction drawn from the subdifferential. Next, we describe proximal-gradient algorithms for nonsmooth composite optimization, which make use of the gradient of the smooth part of the function and the proximal operator associated with the nonsmooth part. Finally, we describe the proximal point method, a framework for optimization that is valuable both as a fundamental method in its own right and as a building block for the augmented Lagrangian approach described in the next chapter.
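A proximal-gradient sketch (illustrative, not the chapter's code) for the composite problem min 0.5||Ax - b||^2 + lam ||x||_1, where the prox of the l1 term is soft-thresholding:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (componentwise)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(A, b, lam, alpha, iters=500):
    """Proximal-gradient iteration: gradient step on the smooth
    part, then prox step on the nonsmooth part."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x - alpha * A.T @ (A @ x - b), alpha * lam)
    return x

# With A = I and alpha = 1, the method lands on the solution
# soft_threshold(b, lam) immediately and stays there.
A = np.eye(2)
b = np.array([2.0, 0.3])
x = proximal_gradient(A, b, lam=0.5, alpha=1.0)
```

For this toy instance the solution is (1.5, 0): the small component is zeroed out, which is the sparsity-inducing effect of the l1 term.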
First derivatives (gradients) are needed for most of the algorithms described in the book. Here, we describe how these gradients can be computed efficiently for functions of the form arising in deep learning. The reverse mode of automatic differentiation, often called “back-propagation” in the machine learning community, is described for several problems with the nested-composite and progressive structure that arises in neural network training. We provide another perspective on these techniques, based on a constrained optimization formulation and optimality conditions for this formulation.
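A hand-coded reverse-mode sketch (illustrative, not from the chapter) for a tiny nested composition: a forward pass stores intermediates, then derivatives are propagated in reverse order:

```python
import numpy as np

def sigma(z):
    """Logistic sigmoid, with sigma'(z) = sigma(z) (1 - sigma(z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(w1, w2, x, y):
    """Loss f = 0.5 (sigma(w2 * sigma(w1 * x)) - y)^2, scalars
    throughout; returns the loss and its gradients w.r.t. w1, w2."""
    # Forward pass: store intermediates.
    z1 = w1 * x;  a1 = sigma(z1)
    z2 = w2 * a1; a2 = sigma(z2)
    loss = 0.5 * (a2 - y) ** 2
    # Backward pass: propagate d(loss)/d(quantity) in reverse.
    da2 = a2 - y
    dz2 = da2 * a2 * (1 - a2)
    dw2 = dz2 * a1
    da1 = dz2 * w2
    dz1 = da1 * a1 * (1 - a1)
    dw1 = dz1 * x
    return loss, dw1, dw2

loss, dw1, dw2 = forward_backward(0.5, -0.3, 1.2, 1.0)
```

The gradients can be checked against central finite differences, which agree to high accuracy.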
Here, we define subgradients and subdifferentials of nonsmooth functions. These generalize the concept of gradients for smooth functions and can be used as the basis of algorithms. We relate subgradients to directional derivatives and to the normal cones associated with convex sets. We introduce composite nonsmooth functions that arise in regularized optimization formulations of data analysis problems and describe optimality conditions for minimizers of these functions. Finally, we describe proximal operators and the Moreau envelope, objects associated with nonsmooth functions that form the basis of the algorithms for nonsmooth optimization described in the next chapter.
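As a scalar illustration (not from the chapter), the proximal operator and Moreau envelope of f(x) = |x|; the prox is soft-thresholding, and the envelope is the familiar smooth Huber-type approximation of the absolute value:

```python
import numpy as np

def prox_abs(v, lam):
    """prox_{lam |.|}(v) = argmin_x |x| + (1/(2 lam)) (x - v)^2,
    which works out to soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def moreau_abs(v, lam):
    """Moreau envelope of |.|: the objective above evaluated at the
    prox point; a smooth underestimate of |v|."""
    p = prox_abs(v, lam)
    return np.abs(p) + (p - v) ** 2 / (2 * lam)

v = np.linspace(-2, 2, 5)
env = moreau_abs(v, lam=0.5)
```

With lam = 0.5 the envelope is quadratic (v^2 / (2 lam)) for |v| <= 0.5 and linear (|v| - 0.25) beyond, so env = (1.75, 0.75, 0, 0.75, 1.75) at the grid points.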
In this chapter, we discuss fundamental methods, mostly based on gradient information, that yield descent: that is, the function value decreases at each iteration. We start with the most basic method, the steepest-descent method, analyzing its convergence under different convexity/nonconvexity assumptions on the objective function. We then discuss more general descent methods, based on descent directions other than the negative gradient, showing conditions on the search direction and the steplength that allow convergence results to be proved. We next discuss a method that also makes use of Hessian information, showing that it can find a point satisfying approximate second-order optimality conditions, and establishing an upper bound on the number of iterations required to do so. We then discuss mirror descent, a class of gradient methods based on more general distance metrics that are particularly useful in optimizing over the unit simplex – a problem that arises often in data science. We conclude by discussing the PL (Polyak–Łojasiewicz) condition, a generalization of the strong convexity condition that allows linear convergence rates to be proved.
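A steepest-descent sketch with a backtracking (Armijo) line search, as an illustration of the descent framework described above (not the book's code; the constants are conventional illustrative choices):

```python
import numpy as np

def steepest_descent(f, grad, x0, alpha0=1.0, c=1e-4, iters=100):
    """Steepest descent: step along the negative gradient, halving
    the step until the sufficient-decrease (Armijo) condition
    f(x + alpha d) <= f(x) + c * alpha * grad(x)^T d holds."""
    x = x0.copy()
    for _ in range(iters):
        g = grad(x)
        alpha = alpha0
        while f(x - alpha * g) > f(x) - c * alpha * (g @ g):
            alpha *= 0.5
        x = x - alpha * g
    return x

# A mildly ill-conditioned convex quadratic as a test problem.
H = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x
x = steepest_descent(f, grad, np.array([1.0, 1.0]))
```

Each accepted step guarantees decrease, so the function value shrinks toward the minimum value of zero.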
The opioid epidemic in the United States is getting worse: in 2020 opioid overdose deaths hit an all-time high of 92,183. This underscored the need for more effective and readily available treatments for patients with opioid use disorder (OUD). Prescription digital therapeutics (PDTs) are FDA-authorized treatments delivered via mobile devices (eg, smartphones). A real-world pilot study was conducted in an outpatient addiction treatment program to evaluate patient engagement and use of a PDT for patients with OUD. The objective was to assess the ability of the PDT to improve engagement and care for patients receiving buprenorphine medication for opioid use disorder (MOUD).
Methods
Patients with OUD treated at an ambulatory addiction treatment clinic were invited to participate in the pilot. The reSET-O PDT comprises 31 core therapy lessons plus 36 supplementary lessons, along with contingency management rewards. Patients were asked to complete at least 4 lessons per week for 12 weeks. Engagement and use data were collected via the PDT, and rates of emergency room visits were obtained from patient medical records. Data were compared to a similar group of 158 OUD patients treated at the same clinic who did not use the PDT. Abstinence data were obtained from deidentified medical records.
Results
Pilot participants (N = 40) completed a median of 24 lessons: 73.2% completed at least 8 lessons and 42.5% completed all 31 core lessons. Pilot participants had significantly higher rates of abstinence from opioids in the 30 days prior to discharge from the program than the comparison group: 77.5% vs 51.9% (P < .01). Clinician-reported treatment retention for pilot participants vs the comparison group was 100% vs 70.9% 30 days after treatment initiation (P < .01), 87.5% vs 55.1% at 90 days post-initiation (P < .01), and 45.0% vs 38.6% at 180 days post-initiation (P = .46). Emergency room visits within 90 days of discharge from the addiction program were significantly reduced in pilot participants compared to the comparison group (17.3% vs 31.7%, P < .01).
Conclusions
These results demonstrate substantial engagement with a PDT in a real-world population of patients with OUD being treated with buprenorphine. Abstinence and retention outcomes were high compared to patients not using the PDT. These results demonstrate the potential value of PDTs to improve outcomes among patients with OUD, a population for which a significant need for improved treatments exists.
Funding
Trinity Health Innovation and Pear Therapeutics Inc.
Optimization techniques are at the core of data science, including data analysis and machine learning. An understanding of basic optimization techniques and their fundamental properties provides important grounding for students, researchers, and practitioners in these areas. This text covers the fundamentals of optimization algorithms in a compact, self-contained way, focusing on the techniques most relevant to data science. An introductory chapter demonstrates that many standard problems in data science can be formulated as optimization problems. Next, many fundamental methods in optimization are described and analyzed, including: gradient and accelerated gradient methods for unconstrained optimization of smooth (especially convex) functions; the stochastic gradient method, a workhorse algorithm in machine learning; the coordinate descent approach; several key algorithms for constrained optimization problems; algorithms for minimizing nonsmooth functions arising in data science; foundations of the analysis of nonsmooth functions and optimization duality; and the back-propagation approach, relevant to neural networks.
Studies show the prevalence of Autism Spectrum Conditions in Early Intervention in Psychosis (EIP) populations is 3.6-3.7%, compared to approximately 1-1.5% in the general population. The CAARMS (Comprehensive Assessment of At Risk Mental States) is a national tool used by EIP services to screen patients into services and stratify their symptoms to determine which pathway may be most appropriate (the First Episode Psychosis pathway (FEP) or the At Risk Mental State pathway (ARMS)). As far as we are aware, the CAARMS has not been validated in an autistic population. It is our view that several of the questions in the CAARMS may be interpreted differently by people with autism, thus affecting the scores. The aim of this evaluation was to identify whether CAARMS scores differ between patients diagnosed with autism and matched controls in York EIP.
Method
From their mental health records, we identified all patients in the service with a diagnosis of autism. We then compared the CAARMS scores, at the time of referral, to those of age matched controls (matched by being in the age range 16-30) without an autism diagnosis, using continuous sampling by date of referral.
Result
Fourteen patients in the service had a diagnosis of autism and had completed a CAARMS. CAARMS domains are each scored between 0 and 6 (indicating increasing severity or frequency). Compared to the age-matched controls, autistic patients had higher mean scores for ‘Non-Bizarre Ideas’ (mean difference of 0.86 for severity and 0.57 for frequency) and ‘Disorganised Speech’ (mean difference of 0.28 for severity and 0.57 for frequency). These results did not reach statistical significance, which was unsurprising given the sample size. The gender split between groups was similar.
Conclusion
Our evaluation suggests a difference in CAARMS scores between patients in our service with a diagnosis of autism and those without. A larger study would be needed to confirm a statistically significant difference and multicentre results would be needed as evidence of generalisability. However, if such a difference were confirmed it might question the validity of CAARMS in autistic patients or suggest that modifications, perhaps in the form of reasonable adjustments to the questions or scoring, were needed to increase the validity in this population. We would suggest that spending extra time checking the patient has understood the intended meaning of the questions in the CAARMS may increase validity, particularly in the ‘Non-Bizarre Ideas’ domain.