Skip to main content Accessibility help
×
Hostname: page-component-84b7d79bbc-x5cpj Total loading time: 0 Render date: 2024-07-28T12:33:02.747Z Has data issue: false hasContentIssue false

5 - Basic Data Cleansing

Published online by Cambridge University Press:  09 November 2021

Get access

Summary

Cleansing personal data

We have focused a lot on classification and normalisation, but what about other data cleansing such as supplier or customer addresses? I have seen first hand how many duplicates can occur if you’re working with multiple systems.

I had a client with nine different sources of data, which when merged totalled 2.8 million rows of customer information. When I cleansed this, it reduced to 1.3 million rows of data. I certainly didn't do this in Excel, and there are a number of tools available that can help, but if you’re on a budget or need to get something done relatively quickly, I can show you how to achieve this using Excel.

It would be hard to cover all the different formats that addresses can come in. Quite often you will see a mixture of formats, such as everything in one cell or spread across many cells or with the information in the wrong column. I’ll be focusing on names, addresses and e-mails for this exercise, but you will most likely have telephone numbers as well.

I’ll share as an example what I’m going to call my sample customer list (Figure 5.1 on the next page). It's got a name and address and an e-mail column and there is a mixture of upper and lower case data, as well as sentence case. This is not unusual and I see this in a lot of files. Again, it's down to creating some standards and making sure everyone applies them.

As all the formulas used for this process are in previous chapters, there will be less illustrations in this chapter, but feel free to refer back to Chapters 1, 2 and 4 if you’d like some visual support.

Cleansing names in Excel

There are a number of different ways to approach this and it very much depends on what you need for your business. In my sample file, the names are all in one column. Is that how you want them to stay? Or do you want to split them out into first name and last name? This is most likely something that is decided elsewhere in the business, but it doesn't mean you can't use their standards and formatting and be consistent.

Type
Chapter
Information
Between the Spreadsheets
Classifying and Fixing Dirty Data
, pp. 99 - 110
Publisher: Facet
Print publication year: 2021

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Save book to Kindle

To save this book to your Kindle, first ensure coreplatform@cambridge.org is added to your Approved Personal Document E-mail List under your Personal Document Settings on the Manage Your Content and Devices page of your Amazon account. Then enter the ‘name’ part of your Kindle email address below. Find out more about saving to your Kindle.

Note you can select to save to either the @free.kindle.com or @kindle.com variations. ‘@free.kindle.com’ emails are free but can only be saved to your device when it is connected to wi-fi. ‘@kindle.com’ emails can be delivered even when you are not connected to wi-fi, but note that service fees apply.

Find out more about the Kindle Personal Document Service.

  • Basic Data Cleansing
  • Susan Walsh
  • Book: Between the Spreadsheets
  • Online publication: 09 November 2021
  • Chapter DOI: https://doi.org/10.29085/9781783305049.006
Available formats
×

Save book to Dropbox

To save content items to your account, please confirm that you agree to abide by our usage policies. If this is the first time you use this feature, you will be asked to authorise Cambridge Core to connect with your account. Find out more about saving content to Dropbox.

  • Basic Data Cleansing
  • Susan Walsh
  • Book: Between the Spreadsheets
  • Online publication: 09 November 2021
  • Chapter DOI: https://doi.org/10.29085/9781783305049.006
Available formats
×

Save book to Google Drive

To save content items to your account, please confirm that you agree to abide by our usage policies. If this is the first time you use this feature, you will be asked to authorise Cambridge Core to connect with your account. Find out more about saving content to Google Drive.

  • Basic Data Cleansing
  • Susan Walsh
  • Book: Between the Spreadsheets
  • Online publication: 09 November 2021
  • Chapter DOI: https://doi.org/10.29085/9781783305049.006
Available formats
×