FileMaker Pro Corner

Go to my Published Articles list
Deleting Duplicate Records

There are times when databases, particularly large ones, need cleaning up. One aspect which should be given attention is the deletion of duplicate records. These can cause problems with the integrity of the database. For example, information could be split so that parts of it appeared on different records which were ostensibly the same. Mail-outs could include more than one copy being sent to the same addressee.

Defining a Duplicate

But what are "duplicate records"? Well, that's for the database manager to decide. For example, duplicate records may exist in a simple contacts database if the first name, last name and telephone number are the same in more than one record. Duplicate records may exist in a medical database if the first name, last name and birth date are the same. A motor mechanic's database may contain duplicate records if the vehicle registration number is the same.

To check for duplicate records of names and addresses, for example, the database must have a single field that contains enough information to uniquely identify each person and address. Obviously, a field containing the last name, or even the full name, is usually not adequate, and the address field is too prone to inconsistent data entry to be a good choice. However, a field containing the telephone number would be satisfactory as a distinguishing identifier.

Setting up Fields

In one form or another, the duplicate criterion will be based on the concatenation of a number of fields. Let's assume that we have a database for which we've decided on the criterion. It's a simple contacts database, with fields for names, addresses and telephone numbers. Figure 1 shows the fields in our file.

Figure 1: The fields defined in the database

 

Of particular importance is the field "Duplicate criterion", because it's defined as a calculation (with "text" result) based on the concatenation of three fields which we've decided will identify duplicity. The field definition is:

First Name & " " & Last Name & " " & Home Phone

The concatenation includes spaces only for the purpose of making it easier to read the contents of the field - they're not essential! We'll also need the last two fields, "gDuplicate" and "Mark?", when doing the search for duplicates.

Figure 2 shows one of the records in our file.

Figure 2: A typical record, showing the contents of the "Duplicate criterion" field

 

The fields "Duplicate criterion", "gDuplicate" and "Mark?" would not normally be on our main layout - we'd usually hide them, or place them on another layout, but in this example it's convenient to have them visible.

Writing the Script

We now need to find a way of comparing the data in the "Duplicate criterion" field to establish which records have that data more than once. FileMaker Pro has its own in-built facility for finding duplicates, but we'll proceed here to do it our way because it's a useful exercise in the use of scripting. The routine we'll use is set out in Script 1, which we'll now set up and examine in some detail.

Script Step / Options / Script Parameters

1 Set Error Capture
On
2 Go to Layout Layout #1
3 Find All
4 Replace Mark?
No dialog, Replace data:
Calculation: "NO"
5 Sort Sort Order:
Restore sort order: Duplicate criterion (Ascending)
No dialog
6 Go to Record/Request/Page
First
7 Set Field gDuplicate
Calculation: Duplicate criterion
8 Loop
9 Go to Record/Request/Page
Next
Exit after last
10 If Calculation:
Duplicate criterion = gDuplicate
11 Go to Record/Request/Page
Previous
12 Set Field Mark?
Calculation: "YES"
13 Go to Record/Request/Page
Next
14 Set Field Mark?
Calculation: "YES"
15 Else
16 Set Field gDuplicate
Calculation: Duplicate criterion
17 End If
18 End Loop
19 Enter Find Mode
20 Paste Literal Mark?
Select entire contents: YES
21 Perform Find
22 Sort Sort Order:
Restore sort order: Duplicate criterion (Ascending)
No dialog
23 Go to Layout original layout

Script 1: The script steps used to find duplicate records

 

In the "Script" pop-up menu, choose "ScriptMaker™...". In the "Script Name:" box in the resulting dialogue box, type "Find Duplicate Records" and click on the "Create" button. At the next screen, click on the "Clear All" button, and start entering the script steps shown in Script 1.

Adding script steps is easy - just find the one you need in the list provided and double-click on it. Then choose the options associated with the step you've chosen. For example, if you choose the step "Go to Layout", you can choose the name of the layout from the "Specify:" pop-up menu. If you choose the step "Set Field", you can choose the field to be set, and what it is to be set to, by clicking on the two buttons presented on the screen.

In our script, Line 1 suppresses normal FileMaker Pro alert messages - I invariably start my scripts with this line. "Set Error Capture" suppresses normal FileMaker Pro alert messages. Once a FileMaker Pro script is running, FileMaker Pro sends error-handling information to the "Status (CurrentError)" function. By checking the status of the "Status (CurrentError)" function, you can test each important step in your script. By using the "Status (CurrentError)" function with the control script steps, you can make sure your script performs correctly. When you decide to suppress alerts, it's important that you anticipate as many problems as possible, and that you include clear directions to the user as to what to do when an error condition is encountered.

Line 2 makes sure we go to the layout on which the fields "Duplicate criterion", "gDuplicate" and "Mark?" are located. Lines 3 and 4 cause the data in the field "Mark?" to be changed to "NO" in all records. The purpose of this is to ensure that all records are marked "NO" so that there are none which may be inadvertently still set to "YES" from a previous search for duplicates (or any other search we may perform which makes use of the field "Mark?").

Figure 3 shows the dialogue box for this step.

Figure 3: The "Replace" dialogue box, where 3 choices are offered

 

Note the choice of the third radio button, which will automatically cause the screen in Figure 4 to be revealed.

Figure 4: The dialogue box for specifying the calculation for the "Replace" function

 

Also note that "NO" is text, and therefore needs to be in quotation marks.

Line 5 sorts all the records by the field "Duplicate criterion" so that records with the same "Duplicate criterion" are grouped together. The aim of the script is now to progressively compare the data in the "Duplicate criterion" field with the data which is in that field in the previous record.

Starting with the first record, the script copies the value from the "Duplicate criterion" field into the global field "gDuplicate". The script then goes to the next record and compares the value in "Duplicate criterion" with the value in "gDuplicate":

- If the values match, the record is a duplicate, and the script puts "YES" in the "Mark?" field .

- If the values don't match, the script copies the new "Duplicate criterion" value into "gDuplicate", and doesn't change the "Mark?" field.

The script repeats this procedure until it reaches the last record in the file. This is accomplished in Lines 6 to 18 inclusive, making use of an "If" step inside a "Loop" routine. On the screen, these steps would appear as in Figure 5.

Figure 5: Some of the steps in the script, as seen in ScriptMaker

 

In Lines 19 to 22, the records marked "YES" in the preceding steps are found and sorted, and Line 23 returns the user to the layout from which the script was initiated.

Running the Script

Now we can run the script. It can be initiated from the "Script" pop-up menu, or from a button on the data layout. Having found the sorted records which satisfy the duplicate criterion, as in Figure 6, it's possible to proceed to examine them one by one, and to determine whether there are any true duplicates which can be deleted, or whether information needs to be transferred from one to another before doing so.

Figure 6: One of the duplicate records in the file (compare it with Figure 2)

 

Remember that deletion is final - you can't recover deleted records! Always have a current backup of your file available just in case you make an mistake.

As a matter of interest, Script 1 takes about ninety seconds to run through my contacts database of about one-thousand records on my Performa 6200 which has a 75 Mhz processor. How long did it take for you?

The Alternative Solution

I mentioned earlier in this article that FileMaker Pro has an in-built facility for finding duplicates. It involves the use of the exclamation mark ("!") as the "Find" criterion. We still need to use the field "Duplicate criterion" as the field on which to sort and search, but we don't need to use the field "gDuplicate".

To write the script, it's simply a matter of duplicating our previous script and calling it, say, "Find using "!"". We can then edit the copy by choosing unwanted steps, one at a time, with the mouse button and hitting the "Delete" key. The one change we need to make is to set the "Find" criterion to "!". The resulting script would look like Figure 7.

Figure 7: The script for finding duplicates using FileMaker's "!" criterion

 

Try running it - it's much quicker!


You can download a copy of this article in PDF format from the appropriate link below. To view the file you'll need Acrobat™ or Acrobat™ Reader from Adobe Systems Inc. Any FileMaker Pro files used in this article will also be downloaded.

Mac OS

MAC FILES

Windows

WIN FILES


Go to my Published Articles list

Copyright © 1998 Hilton Brown - All rights reserved