Guardium Big Data Intelligence Trusted Connections Management

The Guardium Big Data Intelligence Profiling Engine includes an end-to-end implementation of trusted connections. The Guardium Big Data Intelligence Trusted Connections (TC) application lets you implement a closed-loop system: new connections are discovered by Guardium Big Data Intelligence and routed to an individual for a decision (or auto-approved by an algorithm that you define). Based on that decision, the connection is added to a trusted connection list, and a Guardium tuple group is automatically updated for use in a policy.

Guardium Big Data Intelligence supports four modes for deciding whether a connection is trusted, as shown in the figure below:

  1. Decisions on whether a new connection should be trusted are made by the owner of the database server or by an admin. New connections are routed through the Guardium Big Data Intelligence Justify (workflow) application.
  2. Tagging of new connections is done by the Guardium administrator using a spreadsheet-based interface.
  3. Whether a new connection is trusted is determined by a set of rules. For example, any connection where the client IP equals the server IP can be trusted.
  4. The determination is made by a machine learning model. To create the model, use the Re/Build button. Note that this requires existing data from which the model can learn.
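To make the rule-based mode (mode 3) concrete, here is a minimal sketch of an auto-approval rule. The product evaluates such rules internally; the field names below mirror connection tuple attributes but are assumptions for illustration, not the product's API.

```python
# Illustrative sketch only: field names ("Client_IP", "Server_IP") are
# assumptions, not Guardium Big Data Intelligence's internal schema.

def auto_approve(conn):
    """Return True if a new connection can be trusted without review.

    Example rule from the text: a connection whose client IP equals
    the server IP (i.e. a local connection) is trusted.
    """
    return conn.get("Client_IP") == conn.get("Server_IP")

local = {"Client_IP": "10.0.0.5", "Server_IP": "10.0.0.5", "DB_User": "app1"}
remote = {"Client_IP": "10.0.0.9", "Server_IP": "10.0.0.5", "DB_User": "app1"}
print(auto_approve(local))   # True
print(auto_approve(remote))  # False
```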

To implement a TC process using Guardium Big Data Intelligence you first have to make six decisions that affect the setup within the system (all of this setup is performed in the configuration of the SAGE Profiling engine):

  1. Will you use a workflow-based process, automated approval using a set of rules, bulk marking of trusted connections by the admin, or a machine learning model that makes these decisions for you? Configure this using the “Accept New TCs Using” radio buttons. To mark TCs in bulk, the Guardium Big Data Intelligence admin clicks the “Download Spreadsheet to Approve” link and receives an Excel sheet. This sheet lists all new unknown connections and has a “Trusted?” column. Enter Y to mark a connection as trusted or N to mark it as not trusted, then upload the spreadsheet using the “Upload Updates” button. Selecting “By Spreadsheet with Model Recommendation” adds a column with the model’s recommendation to the spreadsheet.
  2. When using workflow, will owners mark their servers’ connections as trusted, or will an admin do so? If you use owner justification you need to load a collection into the lmrm__ae database with documents that have a field called “Server_IP”, a field called “owner” containing the username the person uses to log into Guardium Big Data Intelligence, and a field called “email” for notification. Once you load and automate the population of this collection, enter the collection name in the “Server-to-Owner Mapping Collection” field. Any connection to a server for which an owner cannot be found is routed to the Guardium Big Data Intelligence admin user for approval.
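As an illustration of the document shape the server-to-owner mapping collection expects, the sketch below builds and checks one such document. The three field names come from the text above; the validator itself is an illustrative helper, not part of the product.

```python
# Required fields per the text: "Server_IP", "owner", "email".
REQUIRED_FIELDS = ("Server_IP", "owner", "email")

def is_valid_mapping(doc):
    """Check that a mapping document carries all fields the TC workflow needs."""
    return all(doc.get(f) for f in REQUIRED_FIELDS)

mapping = {
    "Server_IP": "192.168.1.20",
    "owner": "jdoe",             # Guardium Big Data Intelligence login name
    "email": "jdoe@example.com"  # used for workflow notification
}
print(is_valid_mapping(mapping))              # True
print(is_valid_mapping({"Server_IP": "x"}))   # False - owner/email missing
```

A connection to a server with no valid mapping document falls back to the admin user for approval, as described above.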
  3. Will you be using 5-tuples or 7-tuples within the Guardium policies to denote TCs? This affects the entire process as well as the grdapi calls that are made. While you may switch between the two setups, note that all approvals will start from scratch; the system will not use 5-tuple approvals for 7-tuple approvals and vice versa. Therefore, it is recommended to consider both options and decide which is preferred before starting the implementation. Note also that the group names used are different - one is “Guardium Big Data Intelligence Trusted Connections 5-Tuple” and the other is “Guardium Big Data Intelligence Trusted Connections 7-Tuple”. Finally, if you switch between the two modes, remember to change the pre and post scripts for the grdapi calls to use the appropriate group names.
  4. What is the threshold for a connection to be considered trusted? Since TC white lists are mostly used for filtering chatty data, the threshold value is used to exclude connections that are used more than the threshold number of times per cycle (typically a day).
  5. How long before a TC requires re-approval? If you leave this empty, a TC never expires. Otherwise, enter a number of days - a TC remains trusted only for this number of days after it was first approved, after which a new approval is required.
  6. How long before expiry should the system start asking you for re-approval?
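The interaction of decisions 5 and 6 can be sketched as follows. This is a hedged illustration of the timing rules described above, not the product's implementation; it assumes an empty expiry setting means "never expire" and that the warning window is measured backwards from the expiry date.

```python
from datetime import date, timedelta

def tc_status(approved_on, today, expiry_days=None, warn_days=0):
    """Return 'trusted', 'renew-soon', or 'expired' for a trusted connection."""
    if expiry_days is None:          # empty setting: a TC never expires
        return "trusted"
    expires = approved_on + timedelta(days=expiry_days)
    if today >= expires:
        return "expired"             # a fresh approval is required
    if today >= expires - timedelta(days=warn_days):
        return "renew-soon"          # system starts asking for re-approval
    return "trusted"

approved = date(2024, 1, 1)
print(tc_status(approved, date(2024, 3, 20), expiry_days=90, warn_days=14))  # renew-soon
print(tc_status(approved, date(2024, 4, 15), expiry_days=90))                # expired
```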

Once you turn on the profiling engine, check the Trusted Connections checkbox, and fill in the setup values, the system starts generating and managing trusted connections incrementally - i.e., a connection is never routed for re-approval once it has been marked, and only new connections require attention. The cycle for TCs is daily - i.e., once a day new connections are discovered, routed, and added to the TC list when approved. The cycle for a grdapi update is also daily. All of these can be changed by changing the job schedules, but if you later change the configuration, the schedules revert to the default daily cycle.

The final setup required is in dispatcher.conf, to allow the grdapi calls to be made from Guardium Big Data Intelligence to the Guardium CM. In dispatcher.conf, create a section called TrustedConnectionsGrdapi like so:

copy_host=<Guardium cm>
copy_username=<Guardium cli login>
copy_password=<Guardium cli password>

You then need to populate TrustedConnectionsGrdapiPre and TrustedConnectionsGrdapiPost with the appropriate lines, depending on whether you use 5-tuples or 7-tuples and on what your policy is called. The Pre file should contain a grdapi call that deletes all members from either the “SonarG Trusted Connections 5-Tuple” group or the “SonarG Trusted Connections 7-Tuple” group. The Post file needs to reinstall the policy that uses this group for filtering.
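For illustration only, a 5-tuple Pre/Post pair might look like the fragment below. The grdapi command names and the policy name here are assumptions, not confirmed by this document - verify the exact commands against the Guardium GuardAPI reference for your version before use.

```
# TrustedConnectionsGrdapiPre (command name is an assumption - verify for your Guardium version)
grdapi delete_all_members_from_group_by_desc desc="SonarG Trusted Connections 5-Tuple"

# TrustedConnectionsGrdapiPost (policy name is an example placeholder)
grdapi policy_install policy="SonarG TC Filter Policy"
```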

NOTE: If the trusted connection has a blank in one of the fields, it is converted to a % in the tuple - allowing you to filter even when Guardium may have omitted a value.
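The blank-to-wildcard behavior can be sketched as below. The "+" separator is Guardium's tuple group member convention; the field order shown is an assumption for illustration.

```python
def to_tuple_member(fields):
    """Replace blank fields with '%' and join into a '+'-separated tuple member."""
    return "+".join(f if f else "%" for f in fields)

# 5-tuple example (field order assumed for illustration)
print(to_tuple_member(["10.0.0.5", "10.0.0.9", "APP1", "sqlplus", ""]))
# 10.0.0.5+10.0.0.9+APP1+sqlplus+%
```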

NOTE: The Trust field must be exactly N, Y, or ? with no padded whitespace characters.
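A sketch of this strict check, for anyone generating spreadsheets programmatically (illustrative helper, not part of the product):

```python
VALID_TRUST = {"N", "Y", "?"}

def is_valid_trust(value):
    """Exact match only: ' Y' or 'Y ' fails because no stripping is done."""
    return value in VALID_TRUST

print(is_valid_trust("Y"))   # True
print(is_valid_trust(" Y"))  # False
```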

Machine Learning Capability

If many connections have already been classified as trusted or untrusted, a machine learning algorithm can be used to learn the underlying rules that decide the classification (Re/Build button). The result of this process is a decision tree that can be viewed (View Model button) and applied to new connections that have not yet been classified as trusted or untrusted. The model can be used in three possible ways (all found under “Accept New TCs Using”):

  1. By Machine Learning Model - once a day the model will classify all connections that have not yet been classified according to its rule system and write the results directly in the relevant collection.
  2. By Spreadsheet with Model Recommendation - each new connection is still marked using a spreadsheet, but the model’s recommendation is added, either to help the user or to validate that the model’s decisions align with the human’s decisions.
  3. By Workflow and Model Recommendation - TBD

Machine Learning Configurations

This section details advanced machine learning options configured in /etc/sonar/sage.conf.

Algorithm section:

Configuring this section determines which algorithm is used to create the decision-making model.

Parameters section:

Configuring this section determines the attributes of the decision-making model.

  1. criterion - measure of the quality of each split.
  2. splitter - strategy to choose the split at each node.
  3. max_depth - maximum depth of the tree.
  4. min_samples_split - minimum samples required to split a node.
  5. min_samples_leaf - minimum samples required to be at a leaf.
  6. min_weight_fraction_leaf - minimum fraction of weight required to be at a leaf.
  7. max_features - number of features considered when looking for best split.
  8. random_state - determines random number generator.
  9. max_leaf_nodes - maximum number of leaves.
  10. min_impurity_decrease - minimum decrease in impurity required to split a node.
  11. class_weight - weight for each class for classification.
  12. presort - whether to sort the data before fitting.
  13. n_estimators - number of trees in the forest.
  14. bootstrap - whether bootstrap samples are used when building trees.
  15. oob_score - use out of bag samples to estimate accuracy.
  16. n_jobs - number of jobs to run in parallel.
  17. verbose - verbosity of tree building process.
  18. warm_start - whether to fit a new model or reuse the previous model’s solution and add more estimators.
  19. base_estimator - base classifier to use for ensemble.
  20. learning_rate - diminishes contribution of each tree.
  21. algorithm - boosting algorithm when using ada_boost.
  22. max_samples - number of samples to use for each tree’s training.
  23. bootstrap_features - whether or not features are drawn with replacement.
  24. loss - the loss function to be optimized.
  25. subsample - fraction of the samples used to train each estimator.
  26. init - estimator used to compute initial predictions.
  27. contamination - amount of contamination in the dataset.
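For illustration, a minimal fragment of /etc/sonar/sage.conf using these sections might look like the following. The section syntax, the algorithm key, and the values shown are assumptions based on the descriptions above (the parameter names mirror scikit-learn-style decision tree options); verify against your release’s documentation.

```
[Algorithm]
# assumed key - selects the model type built by the Re/Build button
name = decision_tree

[Parameters]
criterion = gini
max_depth = 5
min_samples_leaf = 10
random_state = 42
```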