We provide an introduction to the functioning, implementation, and challenges of convolutional neural networks (CNNs) for classifying visual information in the social sciences. This tool can help scholars make the tedious task of classifying images and extracting information from them more efficient. We illustrate the implementation and impact of this methodology by coding handwritten information from vote tallies. Our paper not only demonstrates the contributions of CNNs to scholars and policy practitioners alike, but also presents the practical challenges and limitations of the method, providing advice on how to deal with these issues.
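As a point of reference, a minimal sketch of the kind of CNN classifier the abstract describes might look as follows; the architecture, input size (28×28 grayscale crops of handwritten marks), and class count are illustrative assumptions, not the authors' actual model.

```python
# Minimal CNN sketch for classifying handwritten crops (e.g. marks on vote tallies).
# Architecture, input size, and class count are illustrative assumptions.
import torch
import torch.nn as nn

class TallyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = TallyCNN()
logits = model(torch.randn(8, 1, 28, 28))   # batch of 8 dummy crops
print(logits.shape)                          # torch.Size([8, 10])
```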
The user's gaze can provide important information for human–machine interaction, but the manual analysis of gaze data is extremely time-consuming, inhibiting wide adoption in usability studies. Existing methods for automated areas of interest (AOI) analysis cannot be applied to tangible products with a screen-based user interface (UI), which have become ubiquitous in everyday life. The objective of this paper is to present and evaluate a method to automatically map the user's gaze to dynamic AOIs on tangible screen-based UIs based on computer vision and deep learning. This paper presents an algorithm for automated Dynamic AOI Mapping (aDAM), which allows the automated mapping of gaze data recorded with mobile eye tracking to predefined AOIs on tangible screen-based UIs. The evaluation of the algorithm is performed using two medical devices, which represent two extreme examples of tangible screen-based UIs. The different elements of aDAM are examined for accuracy and robustness, as well as the time saved compared to manual mapping. The break-even point for an analyst's effort with aDAM compared to manual analysis is found to be 8.9 min of gaze data. The accuracy and robustness of both the automated gaze mapping and the screen matching indicate that aDAM can be applied to a wide range of products. aDAM allows, for the first time, automated AOI analysis of tangible screen-based UIs with AOIs that change dynamically over time. The algorithm requires some additional initial input for setup and training, but thereafter the amount of gaze data that can be analyzed is limited only by computation time and requires no additional manual work. The efficiency of the approach has the potential to enable broader adoption of mobile eye tracking in usability testing for the development of new products and may contribute to a more data-driven usability engineering process in the future.
Deep learning has pushed the scope of digital pathology beyond simple digitization and telemedicine. The incorporation of these algorithms into routine workflow is on the horizon and may be a disruptive technology, reducing processing time and increasing detection of anomalies. While the newest computational methods enjoy much of the press, incorporating deep learning into standard laboratory workflow requires many more steps than simply training and testing a model. Image analysis using deep learning methods often requires substantial pre- and post-processing in order to improve interpretation and prediction. As with any data processing pipeline, images must be prepared for modeling and the resultant predictions need further processing for interpretation. Examples include artifact detection, color normalization, image subsampling or tiling, removal of errant predictions, etc. Beyond such processing, prediction is complicated by image file size – typically several gigabytes when unpacked. This forces images to be tiled, meaning that a series of subsamples from the whole-slide image (WSI) are used in modeling. Herein, we review many of these methods as they pertain to the analysis of biopsy slides and discuss the multitude of unique issues that are part of the analysis of very large images.
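A minimal sketch of the tiling step mentioned above is shown below; it assumes the openslide-python library is available, and the file path, tile size, and background-intensity threshold are placeholders rather than values from the paper.

```python
# Sketch of whole-slide image (WSI) tiling with a naive background filter.
# Requires openslide-python; path, tile size, and threshold are assumptions.
import numpy as np
import openslide

slide = openslide.OpenSlide("example_biopsy.svs")   # hypothetical file
tile_size = 512
width, height = slide.dimensions

tiles = []
for y in range(0, height - tile_size, tile_size):
    for x in range(0, width - tile_size, tile_size):
        region = slide.read_region((x, y), 0, (tile_size, tile_size))
        tile = np.asarray(region.convert("RGB"))
        # Skip mostly-white background tiles before they reach the model.
        if tile.mean() < 230:
            tiles.append(((x, y), tile))
```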
Visual and auditory signs of patient functioning have long been used for clinical diagnosis, treatment selection, and prognosis. Direct measurement and quantification of these signals aims to improve the consistency, sensitivity, and scalability of clinical assessment. Here, we investigate whether machine learning-based computer vision (CV), semantic, and acoustic analysis can capture clinical features from free-speech responses to a brief interview 1 month post-trauma that accurately classify major depressive disorder (MDD) and posttraumatic stress disorder (PTSD).
N = 81 patients admitted to an emergency department (ED) of a Level-1 Trauma Unit following a life-threatening traumatic event participated in an open-ended qualitative interview with a para-professional about their experience 1 month following admission. A deep neural network was utilized to extract facial features of emotion and their intensity, movement parameters, speech prosody, and natural language content. These features were utilized as inputs to classify PTSD and MDD cross-sectionally.
Both video- and audio-based markers contributed to good discriminatory classification accuracy. The algorithm discriminates PTSD status at 1 month after ED admission with an AUC of 0.90 (weighted average precision = 0.83, recall = 0.84, and f1-score = 0.83) as well as depression status at 1 month after ED admission with an AUC of 0.86 (weighted average precision = 0.83, recall = 0.82, and f1-score = 0.82).
Direct clinical observation during post-trauma free speech using deep learning identifies digital markers that can be utilized to classify MDD and PTSD status.
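For reference, classification metrics like those reported in this abstract (AUC and weighted precision, recall, and F1) are typically computed as sketched below; the labels and scores are placeholders, not the study's data.

```python
# How AUC and weighted precision/recall/F1 are commonly computed;
# the labels and scores below are placeholders.
from sklearn.metrics import roc_auc_score, precision_recall_fscore_support

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # e.g. PTSD status at 1 month
y_score = [0.9, 0.2, 0.7, 0.8, 0.4, 0.1, 0.6, 0.3]    # model probabilities
y_pred  = [int(s >= 0.5) for s in y_score]

auc = roc_auc_score(y_true, y_score)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")
print(f"AUC={auc:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```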
Spot spraying POST herbicides is an effective approach to reduce herbicide input and weed control cost. Machine vision detection of grass or grass-like weeds in turfgrass systems is a challenging task due to the similarity in plant morphology. In this work, we explored the feasibility of using image classification with deep convolutional neural networks (DCNN), including AlexNet, GoogLeNet, and VGGNet, for detection of crabgrass species (Digitaria spp.), doveweed [Murdannia nudiflora (L.) Brenan], dallisgrass (Paspalum dilatatum Poir.), and tropical signalgrass [Urochloa distachya (L.) T.Q. Nguyen] in bermudagrass [Cynodon dactylon (L.) Pers.]. VGGNet generally outperformed AlexNet and GoogLeNet in detecting the selected grassy weeds. For detection of P. dilatatum, VGGNet achieved high F1 scores (≥0.97) and recall values (≥0.99). A single VGGNet model exhibited high F1 scores (≥0.93) and recall values (1.00), reliably detecting Digitaria spp., M. nudiflora, P. dilatatum, and U. distachya. Low weed density reduced the recall values of AlexNet for all weed species and of GoogLeNet for Digitaria spp. In comparison, VGGNet achieved excellent performance (overall accuracy = 1.00) in detecting all weed species in both high- and low-weed-density scenarios. These results demonstrate the feasibility of using DCNN for detection of grass or grass-like weeds in turfgrass systems.
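A minimal sketch of adapting an ImageNet-pretrained VGG16 to a small set of weed/turf classes, in the spirit of the VGGNet models compared above, is given below; the class count and weight specifier are assumptions, not the authors' training configuration.

```python
# Sketch: swap VGG16's final layer for a small weed/turf class set and fine-tune.
# Class count and weight specifier are illustrative assumptions.
import torch.nn as nn
from torchvision import models

num_classes = 5                      # e.g. four weed species + weed-free bermudagrass
model = models.vgg16(weights="IMAGENET1K_V1")
model.classifier[6] = nn.Linear(4096, num_classes)   # replace the final layer

# Optionally freeze the convolutional backbone and train only the classifier head.
for p in model.features.parameters():
    p.requires_grad = False
```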
Vision and language are two fundamental capabilities of human intelligence. Humans routinely perform tasks through the interaction of vision and language, supporting the uniquely human capacity to talk about what they see or to hallucinate a picture from a natural-language description. The question of how language interacts with vision motivates researchers to expand the horizons of the computer vision area. In particular, “vision to language” is probably one of the most popular topics of the past 5 years, with significant growth in both the volume of publications and the range of applications, e.g. captioning, visual question answering, visual dialog, language navigation, etc. Such tasks boost visual perception with more comprehensive understanding and diverse linguistic representations. Going beyond the progress made in “vision to language,” language can also contribute to vision understanding and offer new possibilities for visual content creation, i.e. “language to vision.” The process acts as a prism through which visual content is created, conditioned on the language inputs. This paper reviews the recent advances along these two dimensions: “vision to language” and “language to vision.” More concretely, the former mainly focuses on the development of image/video captioning, as well as typical encoder–decoder structures and benchmarks, while the latter summarizes the technologies of visual content creation. The real-world deployment and services of vision and language are elaborated as well.
In this paper, we study the semantic segmentation of 3D LiDAR point cloud data in urban environments for autonomous driving and propose a method utilizing the surface information of the ground plane. In practice, the resolution of a LiDAR sensor installed in a self-driving vehicle is relatively low, and thus the acquired point cloud is quite sparse. While recent work on dense point cloud segmentation has achieved promising results, the performance is relatively low when directly applied to sparse point clouds. This paper focuses on semantic segmentation of sparse point clouds obtained from a 32-channel LiDAR sensor with deep neural networks. The main contribution is the integration of ground information, which is used to group ground points that are far away from each other. Qualitative and quantitative experiments on two large-scale point cloud datasets show that the proposed method outperforms the current state of the art.
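One simple way to obtain the ground information that such a method integrates is a RANSAC-style plane fit over the point cloud, sketched below; the thresholds, iteration counts, and dummy data are illustrative assumptions, not the paper's procedure.

```python
# Minimal RANSAC-style ground-plane fit on an N x 3 LiDAR point array.
# Thresholds and iteration counts are illustrative assumptions.
import numpy as np

def fit_ground_plane(points, n_iter=200, dist_thresh=0.2, seed=0):
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iter):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p2 - p1, p3 - p1)
        norm = np.linalg.norm(normal)
        if norm < 1e-6:
            continue                        # degenerate (collinear) sample
        normal /= norm
        dist = np.abs((points - p1) @ normal)
        inliers = dist < dist_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers                      # boolean mask of ground points

points = np.random.rand(1000, 3) * [50, 50, 0.1]   # mostly flat dummy cloud
ground_mask = fit_ground_plane(points)
```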
We discuss the geometry of rational maps from a projective space of an arbitrary dimension to the product of projective spaces of lower dimensions induced by linear projections. In particular, we give an algebro-geometric variant of the projective reconstruction theorem by Hartley and Schaffalitzky.
Universal access on equal terms to audiovisual content is key to the full inclusion of people with disabilities in activities of daily life. A real challenge for the current Information Society, it has been identified but not achieved in an efficient way, because current access solutions are mainly based on the traditional television standard and other non-automated, high-cost solutions. The arrival of new technologies within the hybrid television environment, together with the application of different artificial intelligence techniques to the content, will enable the deployment of innovative solutions for enhancing the user experience for all. In this paper, a set of tools for image enhancement based on the combination of deep learning and computer vision algorithms is presented. These tools provide automatic descriptive information about the media content based on face detection for magnification and character identification. This information is finally fused to provide a customizable description of the visual information with the aim of improving the accessibility of the content, allowing an efficient and low-cost solution for all.
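The face-detection-for-magnification idea can be illustrated with a short sketch; here a classical OpenCV Haar cascade stands in for the deep-learning detector described in the paper, and the magnification factor is an arbitrary placeholder.

```python
# Detect faces and return enlarged crops. A Haar cascade is used here only as a
# stand-in for the paper's deep-learning detector; the cascade file is OpenCV's own.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def magnified_faces(frame_bgr, scale=2.0):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    crops = []
    for (x, y, w, h) in faces:
        crop = frame_bgr[y:y + h, x:x + w]
        crops.append(cv2.resize(crop, None, fx=scale, fy=scale))
    return crops
```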
Visual homing is a local navigation technique used to direct a robot to a previously seen location by comparing the image of the original location with the current visual image. Prior work has shown that exploiting depth cues such as image scale or stereo-depth in homing leads to improved homing performance. While it is not unusual to use a panoramic field of view (FOV) camera in visual homing, it is unusual to have a panoramic FOV stereo-camera. So, while the availability of stereo-depth information may improve performance, the concomitant restricted FOV may be a detriment to performance unless specialized stereo hardware is used. In this paper, we present an investigation of the effect on homing performance of varying the FOV widths in a stereo-vision-based visual homing algorithm using a common stereo-camera. We have collected six stereo-vision homing databases – three indoor and three outdoor. Based on over 350,000 homing trials, we show that while a larger FOV yields performance improvements for larger homing offset angles, the relative improvement falls off with increasing FOVs and in fact decreases for the widest FOV tested. We conduct additional experiments to identify the cause of this fall-off in performance, which we term the ‘blinder’ effect, and which we predict should affect other correspondence-based visual homing algorithms.
The paper reports on a new multi-view algorithm that combines information from multiple images of a single target object, captured at different distances, to determine the identity of an object. Due to the use of global feature descriptors, the method does not involve image segmentation. The performance of the algorithm has been evaluated on a binary classification problem for a data set consisting of a series of underwater images.
Safe and accurate navigation for autonomous trajectory tracking of quadrotors using monocular vision is addressed in this paper. A second-order sliding mode (2-SM) control algorithm is used to track desired trajectories, providing robustness against model uncertainties and external perturbations. The time-scale separation of the translational and rotational dynamics allows position controllers to be designed by giving a desired reference in roll and pitch angles, which is suitable for practical validation in quadrotors equipped with an internal attitude controller. A Lyapunov-based analysis proves the closed-loop stability of the system despite the presence of unknown external perturbations. Monocular vision fused with inertial measurements is used to estimate the vehicle's pose with respect to unstructured scenes. In addition, the distance to potential collisions is detected and computed using the sparse depth map also produced by the vision algorithm. The proposed strategy is successfully tested in real-time experiments using a low-cost commercial quadrotor.
In this paper we look at recent advances in artificial intelligence. Decades in the making, a confluence of several factors in the past few years has culminated in a string of breakthroughs on many longstanding research challenges. A number of problems that were considered too challenging just a few years ago can now be solved convincingly by deep neural networks. Although deep learning appears to reduce algorithmic problem solving to a matter of data collection and labeling, we believe that many insights learned from ‘pre-Deep Learning’ works still apply and will be more valuable than ever in guiding the design of novel neural network architectures.
Many plant diseases have distinct visual symptoms, which can be used to identify and classify them correctly. This paper presents a potato disease classification algorithm that leverages these distinct appearances and recent advances in computer vision made possible by deep learning. The algorithm uses a deep convolutional neural network, training it to classify the tubers into five classes: four disease classes and a healthy potato class. The database of images used in this study, containing potatoes of different shapes, sizes and diseases, was acquired, classified, and labelled manually by experts. The models were trained over different train–test splits to better understand the amount of image data needed to apply deep learning to such classification tasks.
This paper focuses on maturity evaluation using a color camera for a sweet pepper robotic harvester. Different color and morphological features for sweet pepper maturity were evaluated. Side and bottom views of the sweet pepper were analyzed and compared for their ability to classify fruit into four maturity classes. The goal of this study was to differentiate between the two center classes, which are difficult to separate. Statistical analysis of 13 different features in relation to the maturity classification and the views indicated the best features for classification. The results show that the features that can be used for classification between the two central classes from both bottom and side views are: hue range, Equal2Real – the ratio between the perimeter of the equivalent equal-sized circle and the shape perimeter – and Area2Peri – the ratio between the area and the perimeter.
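One way the listed features could be computed from a segmented fruit mask is sketched below; the definitions follow the abstract's descriptions, but the exact implementation details (for example, treating the equivalent circle as equal-area and ignoring hue wrap-around) are assumptions.

```python
# Sketch of the hue-range and shape-ratio features described above; details such
# as the equal-area circle and ignoring hue wrap-around are assumptions.
import cv2
import numpy as np

def pepper_features(image_bgr, mask):
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)
    area = cv2.contourArea(contour)
    perimeter = cv2.arcLength(contour, closed=True)

    equal_circle_perimeter = 2.0 * np.sqrt(np.pi * area)   # circle with equal area
    equal2real = equal_circle_perimeter / perimeter
    area2peri = area / perimeter

    hue = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)[:, :, 0][mask > 0]
    hue_range = float(hue.max() - hue.min())
    return {"HueRange": hue_range, "Equal2Real": equal2real, "Area2Peri": area2peri}
```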
We propose a novel IMU-assisted (Inertial Measurement Unit) stereo visual technique that extends the use of the KLT (Kanade–Lucas–Tomasi) tracker to large inter-frame motion. The constrained and coherent inter-frame motion acquired from the IMU is applied to detected features through a homogeneous transform using 3D geometry and stereoscopy properties. This efficiently predicts the projection of the optical flow in subsequent images. Accurate adaptive tracking windows limit the tracking areas, resulting in a minimum of lost features, and also prevent tracking of dynamic objects. This new feature tracking approach is adopted as part of a fast and robust visual odometry algorithm based on the double dogleg trust-region method. Comparisons with gyro-aided KLT and variant approaches show that our technique is able to maintain minimal loss of features and low computational cost even on image sequences presenting important scale change. A visual odometry solution based on this IMU-assisted KLT gives more accurate results than an INS/GPS solution for trajectory generation in certain contexts.
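A much-simplified sketch of IMU-seeded KLT tracking is given below: a rotation-only homography H = K R K⁻¹ built from the gyro predicts where features will move, and the Lucas–Kanade tracker refines from that initial guess. The paper's full prediction also exploits stereo depth; the camera matrix, rotation, and images here are placeholders.

```python
# Simplified IMU-seeded KLT: predict feature motion with a rotation-only homography
# and refine with Lucas-Kanade. K, R_imu, and the images are placeholders; the
# paper's method additionally uses stereo depth.
import cv2
import numpy as np

def track_with_imu_prediction(prev_gray, next_gray, prev_pts, K, R_imu):
    prev_pts = prev_pts.reshape(-1, 1, 2).astype(np.float32)

    # Rotation-only prediction of the new pixel locations: H = K R K^-1.
    H = K @ R_imu @ np.linalg.inv(K)
    predicted = cv2.perspectiveTransform(prev_pts, H).astype(np.float32)

    # Seed Lucas-Kanade with the prediction so small search windows suffice.
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, prev_pts, predicted,
        winSize=(15, 15),
        flags=cv2.OPTFLOW_USE_INITIAL_FLOW)
    return next_pts[status.ravel() == 1]
```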
This study introduces a passive autofocus method based on image analysis, calculating the Bayes spectral entropy (BSE). The method is applied to optical microscopy and, together with the specific construction of the opto-mechanical unit, allows the analysis of large samples with complicated surfaces without subsampling. This paper provides a short overview of the relevant theory of calculating the normalized discrete cosine transform when analyzing the obtained images in order to find the BSE measure. Furthermore, it is shown that the BSE measure is a strong indicator for determining the focal position of the optical microscope. To demonstrate the strength and robustness of the microscope system, tests have been performed using a 1951 USAF resolution test chart to determine the in-focus position of the microscope. Finally, the method and the optical microscope system are applied to analyze an optical grating (100 lines/mm), demonstrating the detection of the focal position. The paper concludes with an outlook on potential applications of the presented system within quality control and surface analysis.
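For orientation, one common Bayes-spectral-entropy-style focus measure is sketched below: take the 2-D DCT of the image, normalize the AC coefficient magnitudes into a distribution, and compute the quadratic (Bayes) entropy 1 − Σp². Whether this matches the authors' exact block structure and normalization is an assumption.

```python
# Assumed, common formulation of a Bayes-spectral-entropy focus measure based on
# the normalized 2-D DCT spectrum; not the authors' exact implementation.
import numpy as np
from scipy.fft import dctn

def bayes_spectral_entropy(image_gray):
    coeffs = np.abs(dctn(image_gray.astype(float), norm="ortho"))
    coeffs[0, 0] = 0.0                     # discard the DC term
    total = coeffs.sum()
    if total == 0:
        return 0.0
    p = coeffs / total                     # normalized spectral distribution
    return 1.0 - np.sum(p ** 2)            # quadratic (Bayes) entropy of the spectrum
```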
Visual Homing is a navigation method based on comparing a stored image of a goal location to the current image to determine how to navigate to the goal location. It is theorized that insects such as ants and bees employ visual homing techniques to return to their nest or hive, and inspired by this, several researchers have developed elegant robot visual homing algorithms. Depth information, from visual scale or another modality such as laser ranging, can improve the quality of homing. While insects are not well equipped for stereovision, stereovision is an effective robot sensor. We describe the challenges involved in using stereovision-derived depth in visual homing and our proposed solutions. Our algorithm, Homing with Stereovision (HSV), utilizes a stereo camera mounted on a pan-tilt unit to build composite wide-field stereo images and estimate distance and orientation from the robot to the goal location. HSV is evaluated in a set of 200 indoor trials using two Pioneer 3-AT robots, showing that it effectively leverages stereo depth information when compared to a depth-from-scale approach.
In the context of topological mapping, the automatic segmentation of an environment into meaningful and distinct locations is still regarded as an open problem. This paper presents an algorithm to extract places online from image sequences based on the algebraic connectivity of graphs or Fiedler value, which provides an insight into how well connected several consecutive observations are. The main contribution of the proposed method is that it is a theoretically supported alternative to tuning thresholds on similarities, which is a difficult task and environment dependent. It can accommodate any type of feature detector and matching procedure, as it only requires non-negative similarities as input, and is therefore able to deal with descriptors of variable length, to which statistical techniques are difficult to apply. The method has been validated in an office environment using exclusively visual information. Two different types of features, a bag-of-words model built from scale invariant feature transform (SIFT) keypoints, and a more complex fingerprint based on vertical lines, color histograms, and a few Star keypoints, are employed to demonstrate that the method can be applied to both fixed and variable length descriptors with similar results.
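The algebraic connectivity (Fiedler value) of a window of consecutive observations can be computed directly from their non-negative pairwise similarities, as sketched below; the similarity matrix is a placeholder, and the interpretation that a low value suggests a place boundary follows the abstract's intuition rather than the paper's exact decision rule.

```python
# Fiedler value (algebraic connectivity) of a similarity graph over consecutive
# observations; the similarity matrix below is a placeholder.
import numpy as np

def fiedler_value(similarity):
    W = np.asarray(similarity, dtype=float)
    np.fill_diagonal(W, 0.0)               # no self-loops
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian L = D - W
    eigenvalues = np.linalg.eigvalsh(L)     # sorted ascending
    return eigenvalues[1]                   # second-smallest eigenvalue

# Well-connected observations give a large Fiedler value; a drop suggests the
# sequence has entered a different place.
S = np.array([[0.0, 0.8, 0.7],
              [0.8, 0.0, 0.9],
              [0.7, 0.9, 0.0]])
print(fiedler_value(S))
```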