Data integrity sometimes can be a vexing problem to overcome. Data gets brought into our systems in many different ways: through data entry clerks, online by customers, purchased data, ETL, and on and on. And there is no guarantee, many times, that the data is stored consistently, even if we need it to be. For example, some data is stored with trailing blanks, some with trailing spaces, some with strange character that we have no idea how they go there!

Given that this is the reality of data these days, what can we do? Well there are some data integrity issues that can be cleaned up using a touch of crafty SQL. Consider the common data entry problem of extra spaces in a name field. Not only is it annoying, sometimes it can cause the system to ignore relationships between data elements. For example, “Craig Mullins” is not equivalent to “Craig Mullins” -- the first one has two spaces between the first and last name, but the second one only has one space. Such problems can be difficult to spot with the naked eye, but if you are relying on code to match the two a simple equivalency test will not work. Sure, there are software tools that you can buy that help to clean up data, but what if you don’t have the budget for that?

Consider Using Trusty Old SQL

With a little SQL code you can write an UPDATE statement to fix such problems… that is, if you know how to use the REPLACE function. REPLACE does what it sounds like it would do: it reviews a source string and replaces all occurrences of a one string with another. For example, to replace all occurrences of Z with A in the string RUTZBZGZ you would code:

     REPLACE('RUTZBZGZ', 'Z', 'A')

And the result would be RUTABAGA. But a simple REPLACE is not sufficient for the task at hand, so let’s create a SQL statement to get rid of any number of unwanted spaces in the NAME column of our EMPLOYEE table:

     UPDATE EMPLOYEE
            SET NAME = REPLACE(
                          REPLACE(
                              REPLACE(NAME, SPACE(1), '<>')
                              '><', SPACE(0))
                         '<>', SPACE(1));

What are all of those less-thans and greater-thans doing in this SQL and why do you need them? Let me explain.

The innermost REPLACE statement takes the NAME column and converts every occurrence of a single space into a left/right carat. The next REPLACE (working outward), takes the string we just created, and removes every occurrence of a right/left carat combination by replacing it with a zero length string. The final REPLACE function takes that string and replaces any left/right carats with a single space. The reversal of the carats is the key to removing all spaces except one – remember, we want to retain a single space anywhere there was a single space as well as anywhere that had multiple spaces. If you’ve got this type of data problem give it a try… it works.

Of course, you can use any two characters you like, but the left and right carat characters work well visually. Be sure that you do not choose to use characters that occur naturally in the string that you are acting upon, though! I have yet to meet anybody with a < or > sign in their name, so those should work well.

Finally, the SPACE function was used for clarity. You could have used strings encased in single quotes, but the SPACE function is easier to read. It simply returns a string of spaces, the length of which is specified as the integer argument.

Summary

This is just one type of data integrity problem that you can clean up using SQL and functions. With a little ingenuity and some adept coding you can probably do a lot of data cleansing using nothing but SQL. If you have a favorite SQL data cleansing method, help us out and share it as a comment here on the blog.