Social networking websites create new ways of engaging people who belong to different communities (Baumer et al., 2010). Social networks allow users to communicate with people who hold different moral and social values. The websites provide a very powerful medium for communication among individuals that leads to mutual learning and sharing of valuable knowledge (Sorensen, 2009). The most popular social networking websites are Facebook, LinkedIn, and MySpace, where people can communicate with each other by joining different communities and discussion groups. Social networking can solve coordination problems among people that may arise because of geographical distance (Evans et al., 2010; Li et al., 2011b) and can increase the effectiveness of social campaigns (Li & Khan, 2009a, 2009b; Baumer et al., 2010) by disseminating the required information anywhere and anytime. However, in social networking websites, people generally use unstructured or semi-structured language for communication. In everyday conversation, people do not pay attention to spelling or to the accurate grammatical construction of a sentence, which may lead to different types of ambiguity, such as lexical, syntactic, and semantic (Sorensen, 2009). Therefore, extracting logical patterns with accurate information from such unstructured text is a critical task.
Text mining can be a solution to the above-mentioned problems. Owing to the increasing amount of readily available electronic information (digital libraries, electronic mail, and blogs), text mining is gaining more importance. Text mining is a knowledge discovery process used to extract interesting and non-trivial patterns from natural language (Sorensen, 2009). The technique draws on multidisciplinary fields, such as information retrieval, text analysis, natural language processing (NLP), information classification, and database technology. In Liu and Lu (2011), the authors defined text mining as an extension of data mining. Data mining techniques are mainly used for the extraction of logical patterns from structured databases. Text mining techniques are more complex than data mining techniques owing to the unstructured and fuzzy nature of natural language text (Kano et al., 2009).
Social networking websites such as Facebook are rich in text and enable users to create various text contents in the form of comments, wall posts, social media, and blogs. Owing to the ubiquitous use of social networks in recent years, an enormous amount of data is available via the Web. Application of text mining techniques on social networking websites can reveal significant results related to person-to-person interaction behaviours. Moreover, text mining techniques in conjunction with social networks can be used for finding general opinion about any specific subject, human thinking patterns, and group identification (Aggarwal, 2011). Recently, researchers used decision trees (DTs) and hierarchical clustering (text mining techniques) for group recommendation in Facebook, where a user can join a group based on similar patterns in user profiles (Baatarjav et al., 2008).
For the past few years there has been a lot of research in the area of text mining. In the scientific literature (Yin et al., 2007; Tekiner et al., 2009; Jo, 2010; Ringel et al., 2010), various text mining techniques are suggested to discover textual patterns from online sources. In Baharum et al. (2010), the authors restrict the analysis to techniques that are specifically associated with text document classification. Brucher et al. (2002) described various clustering-based approaches for document retrieval and compared different clustering techniques for logical pattern extraction from unstructured text, but most of the techniques presented there are not recent. In Durga and Govardhan (2011), the authors proposed a new model for textual categorization that captures the relations between words by using the WordNet ontology (Xu et al., 2008). The proposed approach maps words that share the same concept into one dimension and shows better efficiency for text classification. In Xu et al. (2008), the authors indicated a best practice in the information extraction process based on semantic reasoning capabilities and highlighted various advantages in terms of intelligent information extraction. The authors explained the suggested methods, such as query expansion and extraction for semantic-based document retrieval, but did not report any experimental results. In Tekiner et al. (2009), the authors introduced a general text mining framework to extract relevant abstracts from a large collection of research papers. However, the proposed approach neglected the semantic relations between words in sentences.
Most of the scientific literature (Xu et al., 2008; Tekiner et al., 2009; Li et al., 2011a) focuses on specific techniques of text mining for information extraction from text documents. However, a thorough discussion is lacking on the actual analysis of different text mining approaches. Most of the surveys emphasize the application of different text mining techniques to unstructured data but do not specifically target the datasets in social networking websites. Moreover, the existing research papers cover the text mining techniques without mentioning the pre-processing phase (Yin et al., 2007; Xu et al., 2008), which is an important phase for the simplification of the text mining process. In contrast, this survey attempts to address all the above-mentioned deficiencies by providing a focused study on the application of all (classification and clustering) text mining techniques in social networks where data is unstructured.
The rest of the survey is organized as follows. Section 2 presents different pre-processing techniques. Section 3 describes different classification-based algorithms for text mining in social networks. In Section 4, the clustering techniques used for text mining are described. Section 5 presents current challenges and future directions. Finally, Section 6 concludes this survey.
2 Pre-processing in text mining
During the text gathering process, the text may be loosely organized, which can show up as inconsistently integrated text or missing information. If the text has not been scanned carefully to identify the problems (as reported in Section 1), then text mining might lead to the ‘garbage in garbage out’ phenomenon (Dai et al., 2011). Unstructured text may lead to poor text analysis that affects the accuracy of an output (Forman & Kirshenbaum, 2008). The pre-processing phase organizes documents into a fixed number of pre-defined categories. Pre-processing guarantees successful implementation of text analysis, but may consume considerable processing time (Forman & Kirshenbaum, 2008). There are two basic methods of text pre-processing: (a) feature extraction (FE) and (b) feature selection (FS), which are detailed in the subsequent sections.
2.1 Feature extraction
The process of FE can be further categorized as: (a) morphological analysis (MA), (b) syntactical analysis (SA), and (c) semantic analysis. MA deals with individual words represented in a text document and mainly consists of tokenization, remove-stop-word, and stemming-word (Forman & Kirshenbaum, 2008). In tokenization, the document is treated as a sequence of word strings and is split into words by removing punctuation (Negi et al., 2010). In the remove-stop-word phase, stop words, such as ‘the’, ‘a’, and ‘or’, are removed. The remove-stop-word phase improves the effectiveness and efficiency of text processing because the number of words in the document is reduced (Shekar & Shoba, 2009). Stemming-word is the linguistic normalization technique generally used to reduce a word to its root form; for example, the word ‘honesty’ can be reduced to the root form ‘honest’, and the word ‘walking’ can be reduced to the root form ‘walk’. Different stemming algorithms are available in the literature, such as brute-force, suffix-stripping, affix-removal, successor variety, and n-gram (Forman & Kirshenbaum, 2008; Shekar & Shoba, 2009).
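The three MA steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a production stemmer: the stop-word list, the suffix list, and the minimum stem length of three letters are all assumptions chosen for the example, and the function names are hypothetical.

```python
import re

# Illustrative stop-word list (an assumption); real systems use much larger lists.
STOP_WORDS = {"the", "a", "an", "or", "and", "is", "to", "of"}

# Illustrative suffixes (an assumption), tried in order with the longest first.
SUFFIXES = ("ing", "ed", "y", "s")


def tokenize(text):
    """Lowercase the text and split it into word strings, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Naive suffix stripping: remove the first matching suffix,
    provided at least three letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run tokenization, stop-word removal, and stemming in sequence."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("Walking the dog is a test of honesty."))
```

Such a naive suffix stripper over-stems many words (for example, it would cut ‘policy’ to ‘polic’), which is one reason the literature offers several alternative stemming algorithms.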
To interpret a logical meaning from a sentence, a grammatically correct sentence is required (Yuan, 2010). SA provides knowledge about the grammatical structure of a language, which is often termed syntax. For instance, the English language comprises nouns, verbs, adverbs, punctuation, and other parts of speech. The SA technique comprises: (a) part-of-speech tagging (POS tagging) and (b) parsing.
The POS tagging process is commonly used to add contextually related grammatical knowledge to each word in a sentence. If the lexical class of the word is known, then performing linguistic analysis becomes much easier (Yoshida et al., 2007). Various approaches are mentioned in the scientific literature for implementing POS tagging based on dictionaries (Yuan, 2010). The most promising approaches are rule-based MA and stochastic models, such as the Hidden Markov Model (HMM). In a rule-based approach, the text is decomposed into tokens that can be further used for analysis. HMM, in turn, is a stochastic tagging technique mainly used to discover the most likely sequence of POS tags for a sequence of input tokens (Yuan, 2010). Parsing is a technique used for examining the grammatical structure of a sentence. The sentence is represented in a tree-like structure, termed a parse tree, that is mainly used to analyze whether the sentence follows the correct grammatical order. A parse tree can be constructed by using a top-down or bottom-up approach (Ling et al., 2006).
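As a rough illustration of HMM-based tagging, the sketch below uses the Viterbi algorithm, the standard dynamic-programming method for finding the most probable tag sequence under an HMM. The two-tag model and all of its probabilities are invented for the example and are not taken from the cited work; the function assumes a non-empty token list.

```python
def viterbi(tokens, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for a non-empty token list."""
    # best[i][tag] = (probability of the best path ending in `tag` at position i,
    #                 previous tag on that path)
    best = [{t: (start_p[t] * emit_p[t].get(tokens[0], 0.0), None) for t in tags}]
    for token in tokens[1:]:
        column = {}
        for t in tags:
            # Choose the predecessor tag that maximizes the path probability.
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(token, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


# Toy model: every number below is a made-up assumption.
TAGS = ["NOUN", "VERB"]
START_P = {"NOUN": 0.8, "VERB": 0.2}
TRANS_P = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
EMIT_P = {
    "NOUN": {"dogs": 0.5, "walk": 0.1},
    "VERB": {"dogs": 0.05, "walk": 0.5},
}

print(viterbi(["dogs", "walk"], TAGS, START_P, TRANS_P, EMIT_P))
```

In practice the start, transition, and emission probabilities would be estimated from a tagged corpus, and log probabilities would be used to avoid numerical underflow on long sentences.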
To fulfil the needs of a distributed knowledge society, available natural communication tools must understand the meaning of a sentence (Strapparava & Ozbal, 2010). The keyword spotting technique is used to determine the useful contents of a textual message (Ling et al., 2006). The keyword spotting technique is completely based on WordNet-Affect, which is a semantic lexicon commonly used for the categorization of words that express similar emotions (Strapparava & Ozbal, 2010). Another example is SentiWordNet, which generally uses WordNet synonyms for measuring emotions on two scales: positive emotions (happiness) and negative emotions (hate) (Esuli & Sibastiani, 2006). A state-of-the-art comparison between keyword spotting and semantic analysis has been presented in Ling et al. (2006). Ling et al. analyzed the sentence syntactically and identified the basic emotions by analyzing words with respect to context and structure patterns (Ling et al., 2006).
The keyword spotting technique is based on keywords specifically used for the description of certain emotions in the text (Wollmer et al., 2009). For instance, in the English language, verbs, nouns, and adjectives can be used as the keywords for emotion detection. However, the basic disadvantage of the keyword spotting technique is the dependency on the presence of obvious affective words in the text. For instance, the emotion ‘sadness’ cannot be derived from the sentence ‘I lost my money’, as the sentence does not specifically mention the word ‘sad’.
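The limitation described above is easy to reproduce with a toy keyword spotter. The lexicon below is a hypothetical stand-in for a resource such as WordNet-Affect, and `spot_emotions` is an illustrative name rather than an existing API.

```python
import re

# Hypothetical mini-lexicon mapping affective keywords to emotions (an assumption).
EMOTION_KEYWORDS = {
    "sad": "sadness",
    "unhappy": "sadness",
    "happy": "happiness",
    "glad": "happiness",
    "hate": "hate",
}


def spot_emotions(sentence):
    """Return the set of emotions whose keywords appear in the sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {EMOTION_KEYWORDS[w] for w in words if w in EMOTION_KEYWORDS}


print(spot_emotions("I am sad today"))   # a keyword is present
print(spot_emotions("I lost my money"))  # no affective keyword, so nothing is detected
```

The second call returns an empty set even though a human reader would infer sadness, which is the dependency on explicit affective words noted in the text.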
To overcome the limitations of keyword spotting and to achieve true understanding, Ling et al. (2006) introduced a new paradigm named semantic networks. Semantic networks are used to represent concepts, events, and the relationships between them. Ling et al. (2006) concluded that semantic networks detect human emotions better than keyword spotting because semantic networks do not depend on keywords; instead, emotions are detected from contextual information. Ling et al. (2006) and Li and Khan (2009a, 2009b) explained the method in detail but did not report any experimental results. Moreover, a very large database, such as a combination of WordNet-Affect and SentiWordNet, is required to increase the accuracy of the results.
2.2 Feature selection
The basic purpose of FS is to eliminate irrelevant and redundant information from the target text. FS selects important features by scoring the words. The importance of the word in the document is represented by the assigned score (Hua et al., 2009).
The text document is represented as a vector space model. In a vector space model, each dimension represents a separate term, which can be a single word, a keyword, or a phrase. A document matrix can be represented with n documents and m terms, where any non-zero entry in the matrix indicates the presence of a term in the document (Hua et al., 2009). Feature vectors represent document features. Two basic methods have been proposed to calculate feature vectors: (a) term frequency (TF) and (b) inverse document frequency (IDF) (Shekar & Shoba, 2009).
TF measures how often a term occurs in a document. The topic of a document can be identified from the number of occurrences of the terms associated with that topic. IDF gives more weight to terms that occur in few documents of the collection, because such rare terms tend to carry more information about the topic. The TFIDF technique combines TF and IDF and is mainly used for calculating the frequency and relevancy of a given word in a document (Yoshida et al., 2007).
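A minimal sketch of one common TFIDF variant follows. It assumes TF is the raw count normalized by document length and IDF is the natural logarithm of N/df, where N is the number of documents and df is the number of documents containing the term; other weighting variants exist, and the function name `tf_idf` is illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: a list of token lists.
    Returns one {term: weight} dictionary per document."""
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # TF (relative frequency in this document) times IDF (rarity in the collection).
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [["text", "mining"], ["text", "network"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In this example ‘text’ occurs in every document, so its IDF is log(1) = 0 and it receives zero weight, whereas ‘mining’ and ‘network’ each occur in only one document and therefore receive a positive weight.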
Ma et al. (2005) and Li et al. (2011b) used similarity measuring techniques for text pre-processing. Similarity measuring is a technique used to measure the tendency of a group of words or phrases to occur together frequently. Similarity measuring can be further categorized on the basis of grammatical constructs, such as ‘because of’, and semantic relations, such as ‘student’ and ‘teacher’. Such terms have a high probability of occurring in close proximity and exhibit affinity for each other. In Ma et al. (2005), emotional weights are assigned to each word to determine specific emotions. However, the similarity measuring technique performs poorly when handling complicated sentence structures. For instance, the sentence ‘This was not a failure for our company’ most likely represents success. However, the word ‘failure’ has a 75% probability of expressing negative emotion. For the similarity measuring techniques to work well, the user needs a sufficiently large corpus. Moreover, scaling an algorithm to such large data sets, particularly those available on the Web, still needs to be addressed (Zhao et al., 2009).
In the scientific literature (Yoshida et al., 2007), different FS techniques, such as (a) latent semantic indexing (LSI) and (b) random mapping (RM), are discussed. LSI tends to improve lexical matching by adopting a semantic approach, as in the case of semantic analysis, while RM creates a map of the contents of a large document set. Any selected region of the map can further be used for the extraction of new documents on similar topics. A pre-processed document can be represented as in Figure 1. Durga and Govardhan (2011) present the two most commonly used text mining techniques for text analysis in social networking: (a) text mining using classification (supervised) and (b) text mining using clustering (unsupervised).
3 Text mining using classification
Supervised learning or classification is the process of learning a set of rules from a set of examples in a training set. Text classification is a mining method that assigns each text to a certain category (Yin et al., 2007). Classification can be further divided into two categories: (a) machine learning-based text classification (MLTC) and (b) ontology-based text classification (Xu et al., 2008), as illustrated in Figure 2.
3.1 Machine learning-based text classification
MLTC comprises quantitative approaches that use machine learning algorithms to automate NLP. Preferred supervised learning techniques for text classification are described in the subsequent text.
3.1.1 Rocchio algorithm
Different words with similar meanings in a natural language are termed synonyms, and the phenomenon is termed synonymy. Synonymy can be addressed by refining the query or document using the relevance feedback method. In the relevance feedback method, the user provides feedback that indicates relevant material regarding the specific domain area. The user asks a simple query and the system generates initial results in response to the query. The user marks the retrieved results as either relevant or irrelevant. Based on the user-marked results, the algorithm refines the query and may return better results in the next round. The relevance feedback method is an iterative process and plays a vital role by providing relevant material that tracks user information needs (Liu & Lu, 2011).
The Rocchio algorithm is an implementation of the relevance feedback method and is mainly used for document refinement. However, relevance feedback algorithms have drawbacks, as illustrated in Liu and Lu (2011). The user must have sufficient knowledge to indicate relevance feedback (Luger, 2008). Moreover, the relevance feedback algorithm may not work efficiently when the user spells a word in a different way. Various spelling correction techniques, such as hashing-based and context-sensitive spelling correction, can be used at the cost of computation and response time (Udupa & Kumar, 2010).
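The commonly cited form of the Rocchio update moves the query vector toward the centroid of the documents marked relevant and away from the centroid of those marked irrelevant. The sketch below follows that textbook formulation; the weights `alpha`, `beta`, and `gamma` are assumed default values, and clipping negative term weights to zero is a frequent convention rather than part of the source description.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a refined query vector.

    query: list of term weights.
    relevant / non_relevant: lists of document vectors of the same length,
    as marked by the user in one feedback round.
    """
    def centroid(vectors):
        # Mean vector of the marked documents; all zeros if none were marked.
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    rel = centroid(relevant)
    non = centroid(non_relevant)
    # Move toward relevant documents, away from irrelevant ones,
    # and clip negative weights to zero.
    return [
        max(0.0, alpha * q + beta * r - gamma * n)
        for q, r, n in zip(query, rel, non)
    ]


refined = rocchio(
    query=[1.0, 0.0, 0.0],
    relevant=[[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]],
    non_relevant=[[0.0, 0.0, 1.0]],
)
print(refined)
```

Here the second term, which is absent from the original query but present in both relevant documents, gains weight. This is how relevance feedback can pull in synonyms that the user did not type.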
3.1.2 Instance-based learning algorithm
Instance-based learning algorithms (also known as lazy algorithms) are based on the comparison between new problem instances and instances already stored during training (Chang & Poon, 2009). On arrival of a new instance, sets of related instances are retrieved from the memory and further processed so the new instance can be classified accordingly. Algorithms exhibiting instance-based learning approaches are described in the subsequent text.
The k-nearest neighbour algorithm is a form of instance-based learning. The algorithm categorizes an object according to the training examples that lie closest to it in the feature space. Closeness may be determined by measuring the angle between two feature vectors or by calculating the Euclidean distance between the vectors. For more details, we encourage the readers to browse Chang and Poon (2009).
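A minimal k-nearest neighbour sketch is given below, using the angle-based measure (cosine similarity) mentioned above and a simple majority vote. The value k = 3, the two-dimensional toy vectors, and the labels are assumptions made for the example.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(sample, training_set, k=3):
    """training_set: list of (feature_vector, label) pairs.
    Returns the majority label among the k most similar training instances."""
    ranked = sorted(
        training_set,
        key=lambda item: cosine_similarity(sample, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


training = [
    ([1.0, 0.0], "sports"),
    ([0.9, 0.1], "sports"),
    ([0.0, 1.0], "politics"),
    ([0.1, 0.9], "politics"),
]
print(knn_classify([1.0, 0.2], training, k=3))
```

Because the algorithm is lazy, all of the work happens at classification time: every stored instance is compared with the new one, which becomes costly for large training sets.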
Case-based reasoning comprises three basic steps: (1) classification of a new case by retrieving appropriate cases from data sets, (2) modification of the extracted case, and (3) transformation of an existing case (Forman & Kirshenbaum, 2008). Textual case-based reasoning (TCBR) primarily deals with textual knowledge sources in making decisions. A novel TCBR system named SOPHIA-TCBR has been detailed in Patterson et al. (2008) for organizing semantically related textual data into groups. Patterson et al. (2008) reported better knowledge discovery results with the SOPHIA-TCBR system. However, in the TCBR approach, extracting similar cases and representing knowledge without losing key concepts, while keeping the knowledge engineering overhead low, are still challenging issues for researchers (Patterson et al., 2008).
3.1.3 Decision trees and Support Vector Machine
Relationships, attributes, and classes in an ontology can be structured hierarchically as taxonomies (Forman & Kirshenbaum, 2008). The process of constructing a lexical ontology by analyzing unstructured text is termed ontology refinement. DT is a method to semantically describe the concepts and the similarities between the concepts (Forman & Kirshenbaum, 2008). Different DT algorithms are used for classification in many application areas, such as financial analysis, astronomy, molecular biology, and text mining. As text classification depends on a large number of relevant features, an insufficient number of relevant features in a DT may lead to poor performance in text classification (Forman & Kirshenbaum, 2008).
The Support Vector Machine (SVM) algorithm is used to analyze data in classification analysis. In contrast to other classification methods, the SVM algorithm uses both negative and positive training data sets to construct a hyperplane that separates the positive and negative data. The documents that lie closest to the decision surface are called support vectors.
3.1.4 Artificial neural networks
Artificial neural networks (ANN) are parallel distributed processing systems specifically inspired by biological neural systems (Jo, 2010). The network comprises a large number of highly interconnected processing elements (neurons) working together to solve a specific problem. Owing to their tremendous ability to extract meaningful information from a huge set of data, neural networks have been configured for specific application areas, such as pattern recognition, FE, and noise reduction. In a neural network, the connection between two neurons determines the influence of one neuron on another, while the weight on the connection determines the strength of that influence (Jo, 2010).
There are two basic categories of learning methods used in neural networks: (a) supervised learning and (b) unsupervised learning. In supervised learning, the ANN is trained with the help of a set of inputs and required output patterns provided by an external expert or an intelligent system. Different types of supervised learning ANNs include: (a) back propagation and (b) modified back propagation neural networks (Luger, 2008). Major application areas of supervised learning are pattern recognition and text classification (Jo, 2010; Kolodziej et al., 2012). In unsupervised learning (clustering), the neural network tends to perform clustering by adjusting the weights based on similar inputs and distributing the task among interconnected processing elements (Luger, 2008).
The field of text mining is gaining popularity among researchers because of the enormous amount of text available via the Web in the form of blogs, comments, communities, digital libraries, and chat rooms. ANN can be used for the logical management of text available on the Web. Jo proposed a new neural network architecture for text categorization with document presentation, called the Neural Text Categorizer (NTC) (Jo, 2010). NTC comprises three layers: (a) input layer, (b) output layer, and (c) learning layer. The input layer is directly connected with the output layer, whereas the learning layer determines the weights between the input and output layers. The proposed approach can also be used for organizing the text in social networks (Jo, 2010).
3.1.5 Genetic algorithms
A genetic algorithm (GA) is a heuristic search that simulates the natural environment of biological and genetic evolution (Luger, 2008; Kolodziej et al., 2011). Multiple solutions of a problem are presented in the form of a genome. The algorithm creates multiple solutions and applies genetic operators to determine the best offspring. GAs are widely used to solve optimization problems. Therefore, researchers are trying to exploit GAs in social networking websites (Luger, 2008; Guzek et al., 2010).
In Khalessizadeh et al. (2006), a GA was used for FS, in an approach termed the weight method, to assign weights to each concept in the document on the basis of relevant topics. The proposed fitness function was the weighted topic standard deviation, which expresses the concentration of a topic in a document. As the process is iterative, a stopping condition needs to be specified based on monitoring the improvement of results over consecutive generations. In Khalessizadeh et al. (2006), the authors reported better results when using a GA for text classification.
3.2 Ontology-based text classification
Statistical techniques for document representation (as described in Section 3.1) are not sufficient because the statistical approach neglects the semantic relations between words (Luger, 2008). Consequently, the learning algorithm cannot identify the conceptual patterns in the text (Luger, 2008). Ontology can be a solution to these problems by introducing an explicit specification of conceptualization based on concepts, descriptions, and the semantic relationships between the concepts (Zhao et al., 2009; Li et al., 2012a). Ontology represents the semantics of information and is categorized as: (a) domain ontology, which consists of the concepts and the relationships among them in a particular domain area, such as biological ontology or industrial ontology, and (b) ontology instance, which is related to the automatic generation of web pages (Luger, 2008).
Basic components of ontology include (a) classes, (b) attributes, (c) relations, (d) function terms, and (e) rules (Wimalasuriya & Dou, 2010). Ontology needs to be specified formally (Luger, 2008). Formal relations can be represented as (a) classes and (b) instances (Zhao et al., 2009). Ontology-based languages are declarative languages and generally express the logic of computation based on either first-order logic or description logic. For instance, the W3C introduced the standardized Web Ontology Language (OWL), which supports interpretability of language by providing additional vocabulary with formal semantics (Xu et al., 2008). Common Logic and Semantic Application Design Language (Wimalasuriya & Dou, 2010) are popular ontology-based languages commonly used for semantic evaluation of data sets available in social networking websites.
Online information usually resides in digital libraries in the form of online books, conference papers, and journal papers. In digital libraries, searching techniques are based on a traditional keyword matching approach that may not satisfy the requirements of users owing to a lack of semantic reasoning capabilities. Xu et al. recommended an ontology-based digital library system that analyzed the query with respect to semantic meanings and revealed better results when compared with the traditional keyword-based searching approach (Xu et al., 2008). However, semantic analysis is computationally expensive and challenging for researchers, especially for large text corpora such as text data in social networking websites (Xu et al., 2008).
3.3 Hybrid approach
Different classification algorithms have been used for text classification and analysis. However, the literature (Miao et al., 2009; Aci et al., 2010; Li et al., 2011b; Meesad et al., 2011) shows that a combination of different classification algorithms (hybrid approach) provides better results and higher text categorization performance than applying a single pure method. The result of applying a hybrid approach to large text corpora depends heavily on the test data sets. Therefore, there is no guarantee that a high level of accuracy achieved on one test set will also be obtained on another test set. Moreover, for better performance of the hybrid approach, several parameters need to be defined or initialized in advance. Table 1 provides an overview of different hybrid approaches used for text classification that can be further used for text analysis in social networking. However, selecting the classification approach for text analysis in social networks depends entirely on the data set and the nature of the problem being investigated (Miao et al., 2009).
ANN, artificial neural networks; RA, Rocchio algorithm; DT, decision tree; SVM, Support Vector Machine; K-NN, k-nearest neighbour; GA, genetic algorithm.
The result of the analysis shows that SVM and ANN performed well in several comparisons. The main purpose of comparing hybrid approaches is to highlight the applicability of different classification algorithms and how they complement one another's limitations (Aci et al., 2010).
4 Text mining using clustering
Document clustering includes specific techniques and algorithms based on unsupervised document management (Jain, 2010). In clustering, the number, properties, and memberships of the classes are not known in advance. Documents can be grouped together based on a specific category, such as medical, financial, and legal.
In the scientific literature (Sathiyakumari & Manimekalai, 2011), different clustering techniques use different strategies for identifying similar groups in the data. The clustering techniques can be divided into three broad categories: (a) hierarchical clustering, (b) partitional clustering, and (c) semantic-based clustering, which are detailed in the subsequent text.
4.1 Hierarchical clustering
Hierarchical clustering organizes the group of documents into a tree-like structure (dendrogram) where parent/child relationships can be viewed as a topic/subtopic relationship. Hierarchical clustering can be performed either by using (a) agglomerative or (b) divisive methods, which are detailed in the subsequent text (Kavitha & Punithavalli, 2010).
An agglomerative method uses a bottom-up approach by successively combining the closest pairs of clusters until all objects form one large cluster (Kavitha & Punithavalli, 2010). The closest clusters can be determined by calculating the distance between the objects in n-dimensional space. Agglomerative algorithms are generally classified on the basis of inter-cluster similarity measurements. The most popular inter-cluster similarity measures are single-link, complete-link, and average-link (Sathiyakumari & Manimekalai, 2011). Several algorithms have been proposed based on the above-mentioned approach; for example, Slink, Clink, and Voorhees use single-link, complete-link, and average-link, respectively. The Ward algorithm uses both the agglomerative and the divisive approach, as illustrated in Figure 3. The only difference between the aforementioned algorithms is the method of computing the similarity between the clusters.
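The bottom-up merging procedure and the three linkage measures can be sketched as follows. This is a naive illustration that recomputes all pairwise distances at every step; it is not the optimized Slink or Clink algorithm, and the function names and the stopping criterion (a target number of clusters) are assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points in n-dimensional space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters (lists of point indices) under a linkage rule."""
    distances = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # closest pair of members
        return min(distances)
    if linkage == "complete":    # farthest pair of members
        return max(distances)
    if linkage == "average":     # mean over all member pairs
        return sum(distances) / len(distances)
    raise ValueError("unknown linkage: " + linkage)


def agglomerative(points, n_clusters, linkage="single"):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]  # start with singletons
    while len(clusters) > n_clusters:
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: cluster_distance(
                clusters[pair[0]], clusters[pair[1]], points, linkage
            ),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerative(points, 2, linkage="single"))
```

Switching the `linkage` argument changes only how the distance between two clusters is computed, which mirrors the remark above that the algorithms differ only in their similarity computation.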
In Yonghong and Wenyang (Reference Yonghong and Wenyang2010), the authors suggested agglomerative hierarchical clustering techniques for text clustering. First, GA was applied to perform the FS phase on the text documents. Second, similar document sets were grouped together into small clusters. Finally, the authors proposed a text clustering algorithm to merge all clusters into the final text clusters (Yonghong & Wenyang, Reference Yonghong and Wenyang2010). The proposed approach can be used for grouping similar text from social networking websites, such as blogs, communities, and social media.
The divisive method uses a top-down approach by starting with a single cluster that contains all documents and recursively splitting the cluster into smaller clusters until each document is in a classified cluster (Sathiyakumari & Manimekalai, Reference Sathiyakumari and Manimekalai2011). The computations required by divisive clustering are more complex as compared with the agglomerative method. Therefore, the agglomerative approach is the more commonly used methodology.
Hierarchical clustering is very useful because of the structural hierarchical format. However, the approach adjusts poorly once merge or split operations have been performed, which generally leads to lower clustering accuracy (Sathiyakumari & Manimekalai, Reference Sathiyakumari and Manimekalai2011). Moreover, the clustering approach is not reversible and the derived results can be influenced by noise.
4.2 Partitional clustering
Partitional clustering is also known as non-hierarchical clustering (Kavitha & Punithavalli, Reference Kavitha and Punithavalli2010). To determine the relationship between objects, partitional clustering uses a feature vector matrix. The features of every object are compared, and objects exhibiting similar patterns are placed in a cluster (Liu & Lu, Reference Liu and Lu2011). Partitional clustering can be further categorized as iterative partitional clustering, where the algorithm repeats itself until the membership of each cluster stabilizes and remains constant across iterations. However, the number of clusters should be defined in advance (Liu & Lu, Reference Liu and Lu2011). Different forms of the iterative partitional clustering approaches are described as follows.
4.2.1 K-mean, k-medoid, c-mean, and c-medoid
In the k-mean approach, the data set is divided into k clusters (Jain, Reference Jain2010). Each cluster is represented by the mean of its points, termed the centroid. The algorithm proceeds as a two-step iterative process: (1) assign every point to the nearest centroid and (2) recalculate the centroid of each newly updated group (Jain, Reference Jain2010). The iterative process continues until the cluster centroids stabilize and remain constant (Liu & Lu, Reference Liu and Lu2011).
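The two-step iteration can be sketched in a few lines of Python. This is a minimal illustration and not a reference implementation: the function name `kmeans`, the random choice of initial centroids, the `max_iter` cap, and the use of Euclidean distance are all assumptions introduced here.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-mean sketch for points given as equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialisation: k random points
    clusters = []
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Step 2: recompute each centroid as the mean of its members.
        new_centroids = [
            tuple(sum(xs) / len(members) for xs in zip(*members))
            if members else centroids[c]  # keep the old centroid if a cluster is empty
            for c, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids are stable: stop iterating
            break
        centroids = new_centroids
    return centroids, clusters
```

For two well-separated groups of points, the loop typically stabilizes after a handful of iterations; the result on harder data depends on the initial centroids and on the value of k supplied by the caller.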
The k-mean algorithm is widely used because of the straightforward parallelization (Jain, Reference Jain2010). Moreover, k-mean algorithm is insensitive to data ordering and works conveniently only with numerical attributes. However, the optimum value of k needs to be defined in advance (Liu & Lu, Reference Liu and Lu2011).
The k-medoid algorithm selects the object closest to the centre of the cluster to represent the cluster (Jain, Reference Jain2010). In the algorithm, k objects are selected randomly as the initial representatives. The distance from every other object to the selected objects is computed, and each object joins the cluster of its nearest representative. The remaining objects then take the place of the representatives iteratively as long as the quality of the clustering improves (Liu & Lu, Reference Liu and Lu2011). The k-medoid algorithm has many improved versions, such as PAM (Partitioning Around Medoids), CLARA (Clustering Large Applications), and CLARANS (Clustering Large Applications Based Upon Randomized Search). K-medoid algorithms work well for small data sets, but give compromised results for large data sets (Liu & Lu, Reference Liu and Lu2011).
C-mean is a variation of k-mean based on a fuzzy clustering concept: it generates a given number of clusters with fuzzy boundaries and allows the clusters to overlap (Chen & Wang, Reference Chen and Wang2009). In overlapping clusters, the boundaries of the clusters are not clearly specified. Therefore, an object may belong to more than one cluster. The fuzzy c-mean (Chen & Wang, Reference Chen and Wang2009) and fuzzy c-medoids (Hang et al., Reference Hang, Honda, Ichihashi and Notsu2008) algorithms are widely used examples of the c-mean approach (Li et al., 2012), as illustrated in Figure 3.
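The swap-based search behind k-medoid methods can be sketched as follows. The code is a simplified, PAM-style local search written for illustration; it is not the published PAM, CLARA, or CLARANS procedure, and the names `total_cost` and `k_medoids` as well as the Euclidean cost are assumptions.

```python
import math
import random


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def k_medoids(points, k, seed=0):
    """Simplified swap-based k-medoid search (illustrative sketch)."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)  # k randomly selected objects
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for candidate in range(len(points)):
                if candidate in medoids:
                    continue
                # Try replacing one medoid with a non-medoid object.
                trial = medoids[:pos] + [candidate] + medoids[pos + 1:]
                cost = total_cost(points, trial)
                if cost < best:  # keep the swap only if quality improves
                    medoids, best, improved = trial, cost, True
    return sorted(medoids), best
```

Unlike a centroid, every medoid returned here is an index of an actual object in the data set. Each pass evaluates every possible swap against all points, which is one reason such methods become costly on large data sets.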
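The notion of overlapping membership can be made concrete with the membership rule commonly used in fuzzy c-mean, which the text above does not spell out. The sketch computes, for one object and a fixed set of cluster centres, a degree of membership in every cluster; the function name `fuzzy_memberships`, the fuzzifier default `m = 2.0`, and the Euclidean distance are assumptions for illustration.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Membership degrees of one point in each cluster (values sum to 1).

    Uses the standard fuzzy c-mean rule
        u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)),
    where d_i is the distance to centre i and m > 1 is the fuzzifier.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a centre belongs fully to that cluster.
    if any(d == 0 for d in dists):
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]
```

An object halfway between two centres receives equal membership in both clusters, whereas an object near one centre receives a membership close to 1 for that cluster and a small but non-zero membership for the other.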
4.2.2 Single-pass algorithm
The single-pass algorithm is the simplest form of partitional clustering (Mehmed, Reference Mehmed2011). The algorithm starts with empty clusters and randomly selects a document as a new cluster with only one member (Mehmed, Reference Mehmed2011). The algorithm then considers a second object and calculates its similarity coefficient with the existing cluster. If the calculated similarity coefficient is greater than the specified threshold value, then the object is added to the existing cluster; otherwise, a new cluster is created for the object. The BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies) algorithm is an example of the single-pass clustering algorithm (Sathiyakumari & Manimekalai, Reference Sathiyakumari and Manimekalai2011). The algorithm uses a hierarchical data structure called a CF (Clustering Feature) tree for partitioning the data sets (Mehmed, Reference Mehmed2011). Nearest neighbour clustering is iterative and similar to the hierarchical single-link method (Mehmed, Reference Mehmed2011).
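The threshold rule of the single-pass approach can be sketched as below. Several details are assumptions made only for illustration: documents are processed in the order given rather than in random order, each cluster is represented by its first member, and similarity is measured with a simple Jaccard coefficient over whitespace-separated tokens. The sketch is unrelated to the BIRCH CF tree.

```python
def jaccard(a, b):
    """Jaccard coefficient between the token sets of two strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def single_pass(docs, similarity, threshold):
    """One pass over the documents; each document is examined exactly once."""
    clusters = []
    for doc in docs:
        best_cluster, best_sim = None, threshold
        for cluster in clusters:
            # Assumption: the first member acts as the cluster representative.
            sim = similarity(doc, cluster[0])
            if sim > best_sim:
                best_cluster, best_sim = cluster, sim
        if best_cluster is None:
            clusters.append([doc])    # no cluster is similar enough: open a new one
        else:
            best_cluster.append(doc)  # join the most similar existing cluster
    return clusters
```

Because every document is visited once, the outcome depends on both the processing order and the threshold: a higher threshold yields more, smaller clusters.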
4.2.3 Probabilistic algorithm
Probabilistic clustering is an iterative method that calculates and assigns probabilities for the membership of an object (Sathiyakumari & Manimekalai, Reference Sathiyakumari and Manimekalai2011). Based on the probability measurements, an object can be a part of any specific cluster. The probabilistic clustering technique is popular because of its ability to handle records of a complex structure in a flexible manner. As probabilistic clustering has clear probabilistic foundations, finding the most suitable number of clusters becomes relatively easy (Liu & Lu, Reference Liu and Lu2011). Examples of probabilistic clustering are the Expectation-Maximization (EM) algorithm and the Multiple Cause Mixture Model. However, these approaches are computationally expensive (Sathiyakumari & Manimekalai, Reference Sathiyakumari and Manimekalai2011).
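The idea of assigning membership probabilities iteratively can be illustrated with the expectation-maximization (EM) procedure applied to a one-dimensional Gaussian mixture. This is a deliberately small sketch under stated assumptions: the data are scalar values rather than documents, the caller supplies the initial parameters, and the names `gaussian_pdf` and `em_gmm_1d` as well as the variance floor of `1e-6` are hypothetical choices.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50):
    """EM for a 1-D Gaussian mixture (illustrative sketch)."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    resp = []
    for _ in range(iterations):
        # E-step: probability that each object belongs to each cluster.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: re-estimate the parameters from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsing component
    # resp holds the membership probabilities from the final E-step.
    return means, variances, weights, resp
```

Each row of `resp` sums to 1 and gives the probability of one object belonging to each cluster. Every iteration touches every object for every cluster, which hints at why such methods are computationally expensive on large collections.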
4.3 Semantic-based clustering
Meaningful sentences are composed of logical connections to meaningful words (Liu & Lu, Reference Liu and Lu2011). A logical construction of words is generally provided by machine readable dictionaries, such as WordNet. In semantic-based clustering, the structured patterns are extracted from an unstructured natural language. Moreover, the approach emphasizes meaningful analysis of contents for information retrieval.
Researchers have proposed several algorithms for computing semantic similarities between texts. For example, the Resnik and Lin algorithms (Liu & Lu, Reference Liu and Lu2011) measure the semantic similarity of text in a specific taxonomy. Detailed descriptions of these algorithms are presented in Chen and Wang (2009).
Yu and Hsu (2011) introduced a novel approach to automate the ontology construction process based on data clustering and pattern tree mining. The study comprises two phases: (1) the document clustering phase creates groups of related documents using the k-mean clustering technique and (2) the ontology construction phase creates inter-concept relations from the clustered documents, where an inter-concept relation is termed a similar concept relationship. The authors implemented the proposed approach on weather news collected from an e-paper and reported remarkable results by extracting the regions with high temperature.
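As an illustration of taxonomy-based similarity, the sketch below computes the Resnik and Lin measures in their usual information-content form: Resnik similarity is the information content (IC) of the most informative common ancestor of two concepts, and Lin similarity normalizes twice that value by the sum of the ICs of the two concepts. The taxonomy and the concept probabilities are made up for this example and do not come from WordNet or any real corpus.

```python
import math

# Hypothetical toy taxonomy (child -> parent) with invented probabilities.
PARENT = {"dog": "mammal", "cat": "mammal", "mammal": "animal",
          "bird": "animal", "animal": None}
PROB = {"animal": 1.0, "mammal": 0.5, "bird": 0.5, "dog": 0.25, "cat": 0.25}


def ancestors(concept):
    """The concept itself followed by all of its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def ic(concept):
    """Information content: rarer concepts carry more information."""
    return -math.log(PROB[concept])


def resnik(c1, c2):
    """IC of the most informative concept that subsumes both c1 and c2."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(ic(c) for c in common)


def lin(c1, c2):
    """Resnik similarity normalised by the IC of the two concepts."""
    denom = ic(c1) + ic(c2)
    # Convention chosen here: return 0.0 when both concepts carry no information.
    return 2 * resnik(c1, c2) / denom if denom > 0 else 0.0
```

In this toy taxonomy, "dog" and "cat" share the ancestor "mammal" and therefore score higher than "dog" and "bird", which share only the root concept.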
5 Current challenges and future directions
Implementing text mining techniques in social networking poses several challenges for researchers.
Text in social networks: in social networks, textual data may be large, noisy, and dynamic. Moreover, interpreting emoticons (smile, sad) that express a specific concept or emotion is still a challenging issue for researchers. Privacy and trust in online communication are also major issues. Application of ethical values, such as integrity and veracity, in online communication is the only effective way to build trust online.
Text mining using cloud computing: another challenge of the current era is to implement text mining techniques in cloud-based infrastructures that allow people to access technology-enabled and scalable services via the Internet (Yoo, Reference Yoo2012). However, in cloud computing, users may have difficulty storing and retrieving documents (Yoo, Reference Yoo2012). Automatic document archiving can be performed using text mining techniques. Moreover, text processing and text aggregation in the cloud remain open issues for researchers.
To overcome the challenges, researchers need to apply different text mining techniques in social networks that can filter out relevant information from large text corpora. However, determining whether to use a clustering or a classification approach for text analysis in social networks is still a challenging task that depends entirely on the data set and the nature of the problem being investigated. In the future, text mining tools can also be used as intelligent agents that mine users' personal profiles from social networks and forward relevant information to the users without requiring an explicit request.
6 Concluding remarks
Electronic textual documents are extensively available owing to the emergence of the Web. Many technologies are developed for the extraction of information from huge collections of textual data using different text mining techniques. However, information extraction becomes more challenging when the textual information is not structured according to the grammatical convention. People do not care about the spellings and accurate grammatical construction of a sentence while communicating with each other using different social networking websites (Facebook, LinkedIn, MySpace). Extracting logical patterns with accurate information from such unstructured form is a critical task to perform.
This survey attempts to provide a thorough understanding of different text mining techniques as well as the application of these techniques in social networking websites. The survey investigates recent advancements in the field of text analysis and provides a comprehensive overview of the existing text mining techniques that can be used for the extraction of logical patterns from unstructured and grammatically incorrect textual data. This survey will definitely provide new ways for researchers to proceed and develop novel classification or clustering techniques that will be useful for the analysis of text in social networks.
We are grateful to Juan Li, Matthew Warner, and Daniel Grages for their feedback on a draft of this survey report. Samee U. Khan's work was partly supported by the Young International Scientist Fellowship of the Chinese Academy of Sciences (Grant No. 2011Y2GA01).