Text Analysis and Mining

David Stuart

doi:10.29085/9781783303465.008

As has been seen throughout this book, data is not just numbers but also text, and the internet gives access to vast quantities of it. This text ranges from the individual keywords entered into search engines or associated with bibliographic records to huge quantities of unstructured text in the form of web pages, blog posts, microblogging updates and articles in online databases. This chapter considers some of the ways this data may be analysed for insights.

Following a brief discussion of how text analysis may be applied by library and information professionals, the chapter is broadly split into two halves. The first half considers natural language processing and the analysis of chunks of text, discussing machine learning, sentiment analysis and topic analysis. The second half considers techniques related to keywords or n-grams, covering term frequency and burst detection. The two halves are not unrelated: for example, keywords may have been extracted through natural language processing or created independently as part of a classification process, whether formal (e.g. applying subject terms) or informal (e.g. tagging).

Text analysis and mining, and information professionals

Of the three approaches to data science considered in this book (clustering and social network analysis, predictions and forecasting, and text analysis and mining), it is probably text analysis that has the most widespread potential for library and information professionals. While each of the approaches has its applications in the library, and there is plenty of overlap (e.g. predictive search suggestions), text analysis has numerous potential applications, both because of the pivotal role of text in the history of the library and because of the huge growth in online content and unstructured data.

The goal of creating a universal library, which contains all the books and useful information ever published, is ultimately unattainable, but great strides have been made towards it. While the digital revolution has reduced the physical barriers to such a library, bringing vast reductions in the costs of digitising, copying and storing publications, there are inevitably certain limitations that will never be overcome: history is filled with lost works; copyright holders may object to their works being deposited in such an online library; and there is a variety of grey literature and unpublished manuscripts that are not covered by any library's collection policy.

7 - Text Analysis and Mining
