
A PROBABILISTIC ℓ1 METHOD FOR CLUSTERING HIGH-DIMENSIONAL DATA

Published online by Cambridge University Press: 05 April 2021

Tsvetan Asamov
Adi Ben-Israel

Affiliation: Department of Management Science & Information Systems, Rutgers Business School, 100 Rockafeller Road, Piscataway, NJ 08854, USA
E-mails: tsvetan.asamov@gmail.com; adi.benisrael@gmail.com

Abstract

In general, the clustering problem is NP-hard, and global optimality cannot be established for non-trivial instances. For high-dimensional data, distance-based methods for clustering or classification face an additional difficulty: the unreliability of distances in very high-dimensional spaces. We propose a probabilistic, distance-based, iterative method for clustering data in very high-dimensional space, using the ℓ1-metric, which is less sensitive to high dimensionality than the Euclidean distance. For K clusters in ℝ^n, the problem decomposes into K problems coupled by probabilities, and each iteration reduces to finding Kn weighted medians of points on a line. The complexity of the algorithm is linear in the dimension of the data space, and its performance was observed to improve significantly as the dimension increases.
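The abstract describes an iteration that alternates between cluster-membership probabilities and coordinate-wise weighted medians. The sketch below illustrates one such iteration under stated assumptions; it is not the authors' exact scheme. In particular, taking the membership probabilities inversely proportional to the ℓ1 distances and using squared probabilities as median weights follow the general probabilistic distance clustering idea, but the specific weighting in the paper may differ. The names `weighted_median` and `l1_probabilistic_clustering` are illustrative.

```python
import numpy as np

def weighted_median(values, weights):
    """Weighted median of points on a line: a minimizer of sum_i weights[i] * |values[i] - m|."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    # smallest index whose cumulative weight reaches half the total weight
    idx = np.searchsorted(cum, 0.5 * cum[-1])
    return v[idx]

def l1_probabilistic_clustering(X, K, n_iter=100, seed=0, eps=1e-12):
    """Alternate between membership probabilities and coordinate-wise weighted medians.

    X : (N, n) data matrix, K : number of clusters.
    Each iteration solves K*n one-dimensional weighted-median problems,
    so the cost per iteration is linear in the dimension n.
    """
    rng = np.random.default_rng(seed)
    N, n = X.shape
    centers = X[rng.choice(N, size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # l1 distances of every point to every center, shape (N, K)
        D = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2) + eps
        # assumed membership model: p_k(x) proportional to 1 / d_k(x),
        # i.e. p_k(x) * d_k(x) is constant over k for each point x
        inv = 1.0 / D
        P = inv / inv.sum(axis=1, keepdims=True)
        W = P ** 2  # assumed median weights: squared membership probabilities
        for k in range(K):
            for j in range(n):  # K*n weighted medians of points on a line
                centers[k, j] = weighted_median(X[:, j], W[:, k])
    return centers, P
```

The coordinate-wise medians are what make the ℓ1 choice convenient: each center update splits into independent one-dimensional problems, which is the source of the linear-in-dimension complexity noted in the abstract.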

Type: Research Article
Copyright: © The Author(s), 2021. Published by Cambridge University Press

