Friday, December 03, 2010

How to Stop Duplicates from Getting into Your Database

I’m starting this blog post by saying I was tempted to just write “You Can’t” and leave it at that! Thus you will see that there is no perfect answer to stopping duplicate records (a.k.a. dupes) from getting into your system, but that is no reason not to try and so I hope that the following will help you.

The first thing to confirm, therefore, is that it is almost impossible to have zero dupes in your database. This may be because of human error, unknown pieces of biographical/contact information, individuals not updating you with changes of address, individuals giving you different personal details on different occasions, or from technological problems such as poor database structure, lack of data integrity or weak duplicate checking tools. And data degrades. Fast. Millions of people move address every year and data management and data quality considerations on databases often slide after any initial implementation.

So first, consider how you are inputting new records and the different sources. You almost certainly will be doing some manual data entry and it is quite probable you might also be importing new records electronically, for example from a fulfilment house, your web site or another system within your organisation. Secondly, don’t forget that is not only when you enter new records when you might be creating a duplicate; updating records can also create dupes, and again this might come from manual keying of data or from loading data from an outside source.

The first line of defence against dupes is actually a simple one: user training. Make users aware of the issues of creating dupes and what it can mean to your organisation. Train them how to check for dupes, how to enter data consistently and accurately, how to ask for full and accurate supporter information and so on. But of course they can forget or do it badly or the name/address details may be too different/complex for them to be able to find a dupe simply, so don’t just rely on this approach.

Secondly, if you haven’t already, consider if there are ways of improving data integrity and accuracy through the database technology you currently have. For example, if you store counties, ensure they are in a look-up (a.k.a. drop-down) table; limit post code fields to 8 characters (ideally, check the data format!); split the name fields; and help users check for dupes by enabling them to use “wildcards” when searching for existing records. (e.g. w*wright will find Wainwright, Wainewright, Waynewright etc).

The next thing to have is a duplicate record checker which is native to your database and which is automatically invoked when you add/update records. At this point of data entry, you actually want such a check to be “broad” rather than “100% precise”. i.e. when you enter a new name and address, you don’t want the system to check for an exact match on all such data fields. You want it to be able to check whether there could be a duplicate record based on some of the criteria you have entered. Take my name for example: Ivan Wainewright. If you keyed a new record from me but forgot that my surname had an e in the middle, then an exact check would not find me. So, the dupe checker should check on a set number of characters within a name and address fields.

Ideally, such dupe checkers should even use fuzzy matching, something which the latest Enterprise level of SQL Server now comes with as standard. i.e. With fuzzy matching, a dupe check should find my surname with or without the e, and if I had a vanity address (e.g. Rose Cottage, 1 The Avenue etc), then it should be able to account for that too. It’s a pretty powerful tool and well worth using if you can.

But when you import data electronically you need to consider dupes slightly differently. If you have thousands of records being imported then clearly you can’t check each one by hand. But if you do want your import process to merge dupe records automatically, then in this instance, if you have too “broad” a dupe check then you could end up merging two records which the system thinks might be the same but which may well not be. And there isn’t a much worse thing to do in terms of supporter management! So, for electronic imports, you may decide that you do need a 100% match on data items (or as close to 100% as we can ever really be) in order to know that there is an existing record on your system; or maybe even consider asking for human interaction on a limited number of records which the system cannot definitely match but where it reports there could be a dupe.

Specialist software can also help with all of the above, the most common such system being PAF software, which will help you add and update addresses accurately by using the Royal Mail’s post code system. It isn’t the cheapest of options if you need many licenses but it’s well worthwhile. (And you might only need such software for users who do add/update biographical data). You can also get online PAF systems to help with data entry on web forms.

You might also want to consider using email addresses and mobile phone numbers as additional dupe checks; just because someone moves house doesn’t mean they get a new email address or mobile number, so see if that can help.

You could also introduce a supporter self-update system on your web site which supporters can fill in if they change address. And if they do then remember, there is very little which is more indicative of a keen supporter than someone who pro-actively tells you they have moved house. Treat them well! And if you do collect data online/electronically, whether it is on your web site or through a third-party, then it’s clearly going to be more accurate if you can add it to your central database electronically rather than re-keying it.

But despite all your best efforts, it is highly likely you will create dupes, so you also need to have a regular data cleaning protocol. Again, your database may be able to help by running a duplicate record report on your records (and at different ‘levels of confidence’ so you can automate some merges and check others manually), and/or you can extract your data to analyse and check it outside your system. There are plenty of good dupe checking software packages and there are lots of agencies and companies who will help you clean your data. You can also use services such as the NCOA (National Change of Address) register to check for people who have moved house and thus identify dupes that way.

And a final thought on a far too common issue for charities: if you have multiple databases in your organisation, then, amongst all sorts of other problems, if you transfer data between them then you can significantly increase the likelihood of dupes – cut down on multiple databases whenever possible.

No comments: