How to perform Data Anonymization for Sugar development

edyaninet —  May 2, 2016 — 11 Comments

Here is another guest post from Emmanuel Dyan from the Elite SugarCRM Partner iNET Process. In it he addresses a common Sugar project requirement using an open source tool developed by iNET Process.

The problem that we will try to solve with this post is:

How do we make sure that we are never developing using actual customer data but, at the same time, work with data that reflects reality?

Data Anonymization

Usually, when we work on a customization project for a customer you have a minimum of 3 different environments: Development,  User Acceptance Testing (UAT), and Production. To make sure that we work in optimal and consistent conditions, we usually copy the database from one environment to another (preferably from production to other environments). Doing this type of manipulation has multiple drawbacks, including:

  • We have to collect a Database Dump which means that it contains unencrypted raw data. What would happen if we mistakenly expose this dump to someone who is unauthorized?
  • We have to test some functionality to make sure that it works. What would happen if we test a Campaign that sends thousand of e-mails … to the … actual customers of our customer?

Anonymizing the data is the best practice to avoid “playing” with customer data and to keep their trust in us.

The challenge with anonymizing data is figuring out how to overwrite the data with something that is completely unrecognizable. For example: “John Doe” will become “voluptatem accusantium“. His job title becomes “doloremque” and his country “magnam“. His phone number will become “569898520114457878744778” instead of “+123456789“.

Anonymization Tool

Now how do we work in realistic conditions with that kind of anonymization? Indeed, we need another solution that works with any Sugar instance. The solution we are demonstrating below is implemented in our open source CLI tool (sugarcli).

The anonymization architecture follows:

  • It uses another independent tool, called “neuralyzer” (by iNet Process) that is a command line tool to anonymize a database.
  • Neuralyzer uses a library called “Faker” to generate realistic data.
  • It is composed of two subcommands:
    • A configuration generator.  This is used to generate a configuration file automatically without destroying the system tables (config, relationships, etc).
    • The anonymizer that connects directly to the Sugar Database to perform the job. It uses iNET Process libsugarcrm to performs SQL queries (via a PDO connection). It also purges deleted records and cleans _cstm tables of deleted records. It finishes by then emptying all _audit tables.

Configuration Generator

The configuration generator reads all the Sugar database tables and tries to guess, from the field type or the field name what type of fake data needs to be generated. For example a field containing the string city will receive a random city.  If a field contains _user_id then it will be ignored in order to preserve unique IDs.

To use it, first download SugarCLI from https://github.com/inetprocess/sugarcli.

Then run:

./sugarcli.phar anonymize:config --path <sugarcrm_path>

The parameters are the following:

    --file=FILE                   Path to the configuration file [default: "../db/anonymization.yml"]
    --ignore-table=IGNORE-TABLE   Table to ignore. Can be repeated (multiple values allowed)
    --ignore-field=IGNORE-FIELD   Field to ignore. Can be repeated (multiple values allowed)
-p, --path=PATH                   Path to SugarCRM installation.

Example

The command ./sugarcli.phar anonymize:config creates a file that looks like:

guesser_version: 1.0.0
 entities:
     accounts:
         cols:
             name: { method: company }
             description: { method: sentence, params: [20] }
             facebook: { method: url }
             twitter: { method: url }
             googleplus: { method: url }
             account_type: { method: randomElement, params: [['', Analyst, Competitor, Customer, Integrator]] }
             industry: { method: randomElement, params: [['', Apparel, Banking, Biotechnology, Chemicals]] }
             annual_revenue: { method: randomNumber, params: [4] }
             phone_fax: { method: phoneNumber }
             billing_address_street: { method: streetAddress }
             billing_address_city: { method: city }
             billing_address_state: { method: state }
             billing_address_postalcode: { method: postcode }
             billing_address_country: { method: country }
             rating: { method: sentence, params: [8] }
             phone_office: { method: phoneNumber }
             phone_alternate: { method: phoneNumber }
             website: { method: url }

As you can see, the commands uses different methods to guess the type of faker methods to use:

  • If it is a dropdown then it gets the list of values from vardefs
  • If the field contains a known word it uses a pre-defined method (example .*_city = city)
  • Otherwise it will use the dbType (varchar = sentence)

You can change the content of the file once generated to match your criteria.

Run the anonymization

Warning! Do not run this on a production instance! This command will overwrite data in the target database. 

./sugarcli.phar anonymize:run --path <sugarcrm_path>

The parameters are the following:

    --file=FILE        Path to the configuration file [default: "../db/anonymization.yml"]
    --force            Run the queries
    --remove-deleted   Remove all records with deleted = 1. Won't be launched if --force is not set
    --clean-cstm       Clean all records in _cstm that are not in the main table. Won't be launched if --force is not set
    --sql              Display the SQL of UPDATE queries
    --table=TABLE      Anonymize only that table (repeat for multiple values) (multiple values allowed)
-p, --path=PATH        Path to SugarCRM installation.

 

Example

The command ./sugarcli.phar anonymize:run –table=accounts –force gives the following output:

Be careful, the anonymization is going to start
 That will overwrite every data in the Database !
If you are sure, please type "yes" in uppercase
 YES
Anonymizing accounts
50/50 [============================] 100%
Emptying accounts_audit
Emptying bugs_audit
Emptying campaigns_audit
Emptying cases_audit
Emptying contacts_audit
Emptying contracts_audit
....
Done in 0.42 sec (consuming 40.5Mb)


Now you can dump your database and send it to your development team! If you use the parameters –remove-deleted and –clean-cstm, it will be smaller too.

11 responses to How to perform Data Anonymization for Sugar development

  1. 

    Guys, when i download and install both cli, and the sugarcli.phar from github,
    I get an “Command “anonymize” is not defined” Error.
    I tried to compile sugarcli.phar, but it didn’T work.
    Any clues?

  2. 

    Hello, I tried to run this command : ./sugarcli.phar anonymize:config –path /var/www/html/sugarcrm but I got this error:
    =====================================================================
    PHP Fatal error: Can’t inherit abstract function Symfony\Component\Config\Resource\ResourceInterface::isFresh() (previously declared abstract in Symfony\Component\Config\Resource\SelfCheckingResourceInterface) in phar:///root/sugarcli.phar/vendor/symfony/config/Resource/SelfCheckingResourceInterface.php on line 21

    [RuntimeException]
    exit() or die() called before the end of the command.

    anonymize:config [–file FILE] [–ignore-table IGNORE-TABLE] [–ignore-field IGNORE-FIELD] [-p|–path PATH]
    =====================================================================

    Any idea?

    I use SugarCRM 6.5 and PHP 5.3.3

  3. 

    Hi,

    As indicated there: https://github.com/symfony/config/blob/2.8/composer.json

    Symfony 2.8 is not compatible with PHP 5.3.3. You should upgrade your version, sugar is compatible with all 5.3 versions and the latest is 5.3.29. The great majority of modern Opensource components need at least the 5.3.9.

    Thanks

  4. 

    When trying anonymize:config I get:

    PHP Warning: Invalid argument supplied for foreach() in phar:///home/rafal/docker/finalUpgrade/data/web/sugarcli.phar/vendor/inetprocess/libsugarcrm/src/Bean.php on line 620

    [Inet\Neuralyzer\Exception\InetAnonGuesserException]
    Can’t guess the type timestamp

    Any ideas?

  5. 

    Hi rszczypka and sorry for the delay, could you download the most recent version of sugarcli and try again ?

    Thanks

  6. 

    This is really impressive. My question (not having tested this out yet): how does it know which tables are “system-data-level” versus which ones are “business-data-level”? For example, what part of the config generator knows how to properly handle the ‘config’ table or the ‘user_preferences’ table, etc? I ask not only out of general curiosity, but also because I’m wondering how “intelligent” this part of the process is, eg “future-proof”. If Sugar adds in a new system-level table (like metadata_cache was added in 7.6 or field_value_cache in 7.7) or removes a once-essential table (like ‘relationships’ was removed in 7.7), is this something that the sugarcli tool (or config-generator) will handle gracefully, or if that’s not reasonable to expect (too magical), then are such changes something that can be easily “massaged” into working by making version-specific tweaks to the generated anonymize:config file (as needed)?

    I would very much like to dive into playing with this tool and–if its as awesome as it sounds like it might be–try to get it folded into Sugar’s KB article that overlaps with this topic:

    https://support.sugarcrm.com/Knowledge_Base/Platform_Management/Providing_a_Backup_Without_Sensitive_Data/

    But as you can see in the above article, the challenge is ensuring that certain tables be left alone and intact to preserve parity from the source instance, while other tables might still need to be mashed up a bit to ensure the anonymized instance is neutered from certain external behavior (like sending out a pending campaign, firing off workflow alerts, pulling in emails from external accounts and thus marking the messages as read, etc).

    Again, very pleased with this idea/tool as a concept and hope it meets the needs any developer or support agent would need to request a sanitized yet functionally-equivalent version of a client instance.

  7. 

    Hi Anthony,

    Thanks for your message.

    We know the “system-data-level” tables by experience, but the list could evolve as you have noticed and we have no warranty that our tool will not break something on future versions. But the config generator is there only to help not to provide a final config file that you’ll use blindly. It helps because with dozens of fields in dozens of tables, it’s almost impossible to generate the config yourself.

    Also, there are some fields that you don’t want to anonymize sometimes, because it helps the developers. In that case you need to edit and change the config.

    Conclusion: read the config and adapt it to your needs 🙂

    Thanks for your link, that’ll help us to make sure that we maintain our tables list up to date ! If you need help please post a message there, we’ll be pleased to answer.

    Have a nice week end
    Emmanuel

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s