Among the many features we worked on for 6.3 was the import wizard. Hopefully you’ll find that it makes things a lot easier for you. One question that has come up is how to add fields for duplicate checking, so let’s talk a little about that.
First of all, I hate it when I read a post that goes on and on rambling and ranting (as I am about to) and doesn’t give me the details I want until the end. So if you just want the nitty-gritty, skip down to the section that reads ‘Just tell me how to add the fields I want’. I promise I won’t be offended if you skip ahead (just a little sad).
When would I want to use this?
If you want the full story, then let’s ask ourselves why we would want to add a duplicate check to begin with.
So let’s say we get handed a spreadsheet of 1,000 leads from our latest marketing efforts. We need to get these leads into the system, but at the same time we don’t want to create duplicate leads with the same first and last name. Or perhaps our business logic does not allow users to share the same email address, so we don’t want to import anyone whose email address is already in the system. Such scenarios are the kind where duplicate field checking during import is helpful.
How does it work?
It might be worth understanding (from a high level) how this works.
1. Field checking is currently done only on fields that are part of an index of type ‘index’. That sounds redundant, but in other words it ignores fields that appear only in indexes of type ‘primary’, ‘unique’, ‘fulltext’, and so on. We cannot add fields to the dupe check unless they are part of such an index. This is worth repeating: only fields that are part of an index of type ‘index’ can be added to the dupe check list!
The list also only shows fields that are indexed AND have been chosen for import. So if you have an index on the email address field but are not importing the email in your spreadsheet, it will not show up in the list.
2. Once you have selected the fields you wish to check for duplicates against, they are used during import processing. Import processes one row at a time and checks each field against other records in the database. So the more fields you choose to check for duplicates against, the more work is done, the longer the execution time, and the more resources needed.
3. A list of all records that were found to be duplicates will be presented to the user once processing is done.
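To make the cost concrete, here is an illustrative sketch (not the actual Sugar import code) of what the per-row check boils down to when first and last name are the chosen duplicate fields:

```sql
-- Illustrative only: the real import builds this query dynamically
-- from the fields you selected in the dupe check widget.
SELECT id
FROM leads
WHERE first_name = 'Jane'
  AND last_name  = 'Doe'
  AND deleted = 0
LIMIT 1;
-- If any row comes back, the imported row is flagged as a duplicate.
```

One query like this runs for every row in your spreadsheet, which is exactly why the checked columns need to be indexed.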
JUST TELL ME HOW TO ADD THE FIELDS I WANT!!
Alright already, no need to yell. Here you go.
Indexing a custom field:
Let’s say we wish to add a field called newText to the accounts table and want it to be available for dupe checking.
1. Go to Studio and add the new custom field as usual. Make sure the field is marked as importable.
2. Using phpMyAdmin, some other database GUI tool, or a manual query, add a new index on the custom table (‘accounts_cstm’) using the new column that was created. REMEMBER: THE INDEX TYPE MUST BE ‘INDEX’.
A sample manual query would look like this:
ALTER TABLE accounts_cstm ADD INDEX newTextIndex (newText);
3. Go through the import process and make sure you are importing a CSV file that has the field mapped. This is important: if the field that is available for dupe checking is not mapped to a column in the CSV, it will not show up as an option in the dupe check drag-and-drop widget.
4. You should see the index available to you.
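To double-check that the index came out right, you can inspect the table in MySQL (phpMyAdmin shows the same information on the table’s Structure page):

```sql
SHOW INDEX FROM accounts_cstm;
-- The new index should be listed with Key_name 'newTextIndex' and
-- Non_unique = 1, i.e. a plain index rather than unique or primary.
```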
Indexing an existing field:
This is slightly different and requires some extra steps. Let’s say that in the accounts table, we decide we also want the option to dupe check against the industry field. Not the most realistic use case, but it should help us demonstrate the steps.
1. If you have made prior changes in studio, make sure the field is still marked as importable.
2. Verify that the custom vardefs file and directory exist. We need to make sure the change is upgrade-safe by defining the index in the custom vardefs rather than only in the database. If the file or directory does not exist yet, you will have to create it. Since we are talking about accounts, we would use the custom vardefs file for the Accounts module. Note that you can call the file whatever you want as long as it is in the right directory.
3. Add the following entry to the PHP file in order to add the index to the Account dictionary vardef definition (note the [], which appends to any existing indices rather than overwriting them):
$dictionary['Account']['indices'][] = array('name' => 'idx_accnt_industry', 'type' => 'index', 'fields' => array('industry'));
Save your work.
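Put together, a minimal version of that custom vardefs file might look like the following (only the $dictionary entry is required; the spacing is just for readability):

```php
<?php
// Upgrade-safe index definition for the Accounts module.
// The [] appends our index to any indices already defined elsewhere.
$dictionary['Account']['indices'][] = array(
    'name'   => 'idx_accnt_industry',
    'type'   => 'index',
    'fields' => array('industry'),
);
```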
4. Now let’s go to the Admin section in the app and run the Quick Repair and Rebuild option. This should pick up your changes and present a database change to execute.
5. Choose to execute the SQL to create the new index in the database.
6. Go through the import process and make sure you are importing a CSV file that has the industry field mapped. Again, this is important: if the field that is available for dupe checking is not mapped to a column in the CSV, it will not show up as an option in the dupe check drag-and-drop widget.
7. You should see the index available to you.
Some notes on Performance
Just in case you are wondering about the indexed-fields-only restriction: we debated amongst ourselves and weighed performance versus flexibility. It would be nice to make any field available to the user right from the get-go without worrying about indexes. However, imagine importing 5,000 leads from a spreadsheet into a table that has 1,000,000 records, and running a query for each column being checked (which, if not indexed, results in a full table scan) for each row, and you can already see the timeouts and out-of-memory errors coming. At the very least, the process could take several hours.
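If you want to see the difference for yourself, MySQL’s EXPLAIN makes it visible (exact output varies by version; this is just a sketch):

```sql
-- Without an index, EXPLAIN reports type = ALL: a full scan of all
-- ~1,000,000 rows, repeated for every one of the 5,000 imported rows.
EXPLAIN SELECT id FROM accounts WHERE industry = 'Banking';

ALTER TABLE accounts ADD INDEX idx_accnt_industry (industry);

-- With the index in place, the same query reports type = ref and
-- touches only the rows that actually match.
EXPLAIN SELECT id FROM accounts WHERE industry = 'Banking';
```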
The other approach would be to automate index creation from Studio, but someone who does not understand the performance implications could easily add an index for each column in a table, resulting in serious performance degradation during inserts, updates, and deletes. The index creation and removal itself could take a long time as well, and possibly lock up the application during business hours.
Most importantly, we knew it was not a good idea if our own Cloud Server Team would most likely disable the feature and then make us sit in the corner with a funny hat on to think about what we have done (it builds character too).
Ultimately we consulted our platform architect for a balanced solution, and the suggestion was to keep the current approach of only using indexed fields (which we then extended to custom fields). The extra setup work discourages the novice user from adding fields indiscriminately, and that’s a good thing in this case :)
Are we done yet?
Almost. So to sum up the contents:
1. Existing and custom fields can be added to the list of fields available for duplicate checking through the use of indexes!
2. The new index(es) must be of type index (for searching, as opposed to types primary or unique).
3. If the index is on an existing column, it should also be defined in the custom vardefs.
4. If the index is on a custom field, then it will be picked up automatically by the application AFTER you add the index in the database.
So there you go. Please remember that abstinence is key, but if you must… please practice safe index creation.