A survey of structure from motion

Onur Özyeşil; Vladislav Voroninski; Ronen Basri; Amit Singer

doi:10.1017/S096249291700006X


Published online by Cambridge University Press: 05 May 2017


Affiliations:
Onur Özyeşil: INTECH Investment Management LLC, One Palmer Square, Suite 441, Princeton, NJ 08542, USA. E-mail: oozyesil@intechjanus.com
Vladislav Voroninski: Helm.ai, Menlo Park, CA 94025, USA. E-mail: vlad@helm.ai
Ronen Basri: Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, 76100, Israel. E-mail: ronen.basri@weizmann.ac.il
Amit Singer: Department of Mathematics and PACM, Princeton University, Princeton, NJ 08544-1000, USA. E-mail: amits@math.princeton.edu


Abstract

The structure from motion (SfM) problem in computer vision is to recover the three-dimensional (3D) structure of a stationary scene from a set of projective measurements, represented as a collection of two-dimensional (2D) images, via estimation of motion of the cameras corresponding to these images. In essence, SfM involves the three main stages of (i) extracting features in images (e.g. points of interest, lines, etc.) and matching these features between images, (ii) camera motion estimation (e.g. using relative pairwise camera positions estimated from the extracted features), and (iii) recovery of the 3D structure using the estimated motion and features (e.g. by minimizing the so-called reprojection error). This survey mainly focuses on relatively recent developments in the literature pertaining to stages (ii) and (iii). More specifically, after touching upon the early factorization-based techniques for motion and structure estimation, we provide a detailed account of some of the recent camera location estimation methods in the literature, followed by discussion of notable techniques for 3D structure recovery. We also cover the basics of the simultaneous localization and mapping (SLAM) problem, which can be viewed as a specific case of the SfM problem. Further, our survey includes a review of the fundamentals of feature extraction and matching (i.e. stage (i) above), various recent methods for handling ambiguities in 3D scenes, SfM techniques involving relatively uncommon camera models and image features, and popular sources of data and SfM software.
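As a concrete illustration of the reprojection error mentioned in stage (iii), the following minimal sketch measures how far predicted image points fall from observed ones. Everything in it is an assumption made for illustration rather than something taken from the survey: the distortion-free pinhole model with camera coordinates R X + t, focal length f and principal point at the origin, the function names `project` and `reprojection_error`, and the toy numbers.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]


def project(R: Sequence[Sequence[float]], t: Vec3, f: float, X: Vec3) -> Vec2:
    """Project a 3D point X with an assumed pinhole model: x_cam = R X + t."""
    xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division followed by scaling with the focal length.
    return (f * xc[0] / xc[2], f * xc[1] / xc[2])


def reprojection_error(
    R: Sequence[Sequence[float]],
    t: Vec3,
    f: float,
    points: List[Vec3],
    observations: List[Vec2],
) -> float:
    """Sum of squared distances between observed and predicted image points."""
    total = 0.0
    for X, (u, v) in zip(points, observations):
        pu, pv = project(R, t, f, X)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total


# Toy data (hypothetical): two 3D points seen by a camera at the origin.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [(0.0, 0.0, 2.0), (1.0, -1.0, 4.0)]
observations = [project(IDENTITY, (0.0, 0.0, 0.0), 1.0, X) for X in points]

# The true camera reproduces the observations exactly.
exact = reprojection_error(IDENTITY, (0.0, 0.0, 0.0), 1.0, points, observations)
# A camera translated by 0.1 along x predicts slightly different image points.
shifted = reprojection_error(IDENTITY, (0.1, 0.0, 0.0), 1.0, points, observations)
```

In this toy setting the error is zero at the true camera and positive for the perturbed one. A full pipeline would pass such a cost, summed over all cameras and points, to a nonlinear least-squares solver instead of evaluating it at two hand-picked camera positions.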

Type: Research Article
Information: Acta Numerica, Volume 26, 01 May 2017, pp. 305–364

DOI: https://doi.org/10.1017/S096249291700006X
Copyright: © Cambridge University Press, 2017


