Variety: Classification and Clustering

Carlos Castillo

doi:10.1017/CBO9781316476840.005

In 2005 the Inter-Agency Standing Committee (IASC), a permanent forum that includes agencies from the United Nations (UN) and agencies outside the UN (such as the Red Cross), introduced a number of reforms designed to improve humanitarian response. A visible reform was the establishment of the Cluster System, which organizes large-scale multiagency humanitarian response into eleven areas of action, each with its own responsibilities: health; protection; food security; emergency telecommunication; early recovery; education; sanitation, water, and hygiene; logistics; nutrition; emergency shelter; and camp management and coordination. The Cluster System is not without critics, but it serves to structure response, and national governments like it because it introduces a single focal point that is accountable for a specific response area.

This chapter describes methods for automatic text categorization, which allow us to make sense of heterogeneous, varied messages by sorting them into categories. Just as coordination among humanitarian agencies is facilitated by abstracting from specific response actions to response areas, coping with typical crisis collections from social media, which involve millions of messages, is made easier by abstracting from the particular (a specific message) to the general (a class of messages).

There are two broad families of classification methods: supervised and unsupervised. In supervised classification, we first manually classify a set of items (messages in this case) into categories using human annotators, and then use these example items to automatically learn a model for classifying new, unseen items into the same categories. In unsupervised classification (or clustering), we do not provide any example item classified a priori, but instead allow a method to discover groups of related items based on their similarity.
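The contrast between the two families can be illustrated with a minimal sketch. It is not the method developed in this chapter: it assumes a plain bag-of-words representation with cosine similarity, uses a nearest-centroid rule as a stand-in for supervised classification, and uses a single-pass threshold grouping as a stand-in for clustering. The category names, the example messages, and the threshold of 0.3 are all hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a message as lowercase word counts (an assumed, very simple model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Supervised: learn from messages that annotators have already labeled.
def train_centroids(labeled_messages):
    """Sum the word counts of all example messages in each category."""
    centroids = {}
    for text, label in labeled_messages:
        centroids.setdefault(label, Counter()).update(bag_of_words(text))
    return centroids


def classify(text, centroids):
    """Assign a new message to the category whose centroid is most similar."""
    vec = bag_of_words(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Unsupervised: no labels; group messages purely by mutual similarity.
def cluster(texts, threshold=0.3):
    """Single-pass grouping: join the most similar group or start a new one."""
    groups = []  # each entry is (summed word counts, member messages)
    for text in texts:
        vec = bag_of_words(text)
        best = max(groups, key=lambda g: cosine(vec, g[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= threshold:
            best[0].update(vec)
            best[1].append(text)
        else:
            groups.append((Counter(vec), [text]))
    return [members for _, members in groups]


# Hypothetical labeled examples (categories invented for illustration).
labeled = [
    ("bridge collapsed road blocked", "infrastructure"),
    ("power lines down road closed", "infrastructure"),
    ("need water and food for families", "needs"),
    ("families need shelter and food", "needs"),
]
centroids = train_centroids(labeled)

# Hypothetical unlabeled messages for clustering.
unlabeled = [
    "road blocked by flood",
    "flood blocked the road",
    "need food and water",
    "water and food needed",
]
groups = cluster(unlabeled)
```

Here `classify("road closed near the bridge", centroids)` can only return one of the labels supplied by the annotators, whereas `cluster(unlabeled)` returns unnamed groups whose meaning must be interpreted afterward. That difference is the practical consequence of providing, or not providing, example items classified a priori.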

We begin with a description of the main information categories found in social media and short text messages during crises (§4.1). Next, we introduce supervised (§4.2) and unsupervised classification methods (§4.3).

Content Categories

The first question when categorizing content is how to determine which information categories to use. Many factors drive the design of these categories. The first and most important factor is the information needs of the users for whom the categorization is done, who may include emergency managers, humanitarian relief workers, policy makers, analysts, and the public.
