Guardium Big Data Intelligence Analytic Engines (SAGEs)

There are four analytics engines within Guardium Big Data Intelligence:

  • The Noise Canceling Engine
  • The Profiling Engine
  • The Machine Learning Engine
  • The Trusted Connections Engine

There are six additional services within Guardium Big Data Intelligence:

  • The Security 360 service including both DB 360 and VA 360
  • The GDPR Scanning service
  • The Risk Management service
  • The Workflow service
  • The Integrity service
  • The Syslog service

All analytic engines and services make use of Guardium Big Data Intelligence’s ability to run complex data analysis on large amounts of data in a short time. Each engine/service can be enabled separately. Each of the engines has a slightly different purpose and uses slightly different algorithms:

  • The Noise Canceling Engine uses reduce algorithms to compute signatures that summarize verbose data generated by DAM systems and enables SIEM and other downstream systems to receive and inspect highly optimized data sets tuned for information density and size, making the data more usable and less expensive to manage.
  • The Profiling Engine inspects connection information and creates a profile used for identifying new connections and for implementing trusted connection profiling as a built-in feature of Guardium Big Data Intelligence. The profiling engine also looks at moving averages and alerts you of large percentage deltas. You can enable moving average analytics for sessions, exceptions, violations, queries and full sql records. These analytics can help you discover when certain traffic suddenly drops significantly and/or when a policy change is causing too many records to be captured by Guardium. Moving average analytics run once a day and look at 35 days back. Only the same day of the week is used - e.g. if today is a Wednesday then today is compared with an average derived from the last five Wednesdays. The session moving average analyticts also alerts you when a certain traffic pattern has stopped completely (e.g. some KTAP driver stopped collecting data). The session alert will also include the collector which which this traffic was last heard of, the other collectors from which data arrived, and whether there are any inactive STAPs on this collector. All moving average analytics results are delivered via email. Finally, the profiling engine also performs analysis of captured sessions to alert you when data should be captured but is not.
  • The Machine Learning Engine applies advanced analytic and machine learning techniques to identify outliers and anomalous behavior that can be inspected without viewing detailed raw data.
  • The Trusted Connections engine allow you to categorize connections as trusted or not using workflow (usually to the app owner) and then build a machine learning model that can further classify new incoming connections.
  • The DB 360 service takes different data sets such as audit data, vulnerability assessment (VA) data and classification data and creates a single holistic view of each database in terms of how protected and compliant it is.
  • The VA 360 service takes data from multiple VA scanners and merges them into a single view of ulnerabilities, risk and aging data.
  • The GDPR scanning engine allows you to look for private data in the various feeds coming from Guardium. Specifically, you can look at the full sql data set, the policy violation data set and the classifier data set - all data sets that can easily contain private information. You can then search for names, addresses, IP addresses, phone numbers or any regular expression of value set that you upload into Guardium Big Data Intelligence. Searches can be precise or fuzzy. Outputs can be sent or recorded for further review with the values seen to be private or only with the IDs of the original data set. You can also use the GDPR engine on foreign data sets - i.e. feed data into Guardium Big Data Intelligence through any mechanism and use the GDPR engine on that data set.
  • The integrity service can be used to upload hashes of the data to the jSonar cloud. There is no sensitive data in the hashes and they can only be used for verification that the data has not be tampered with.
  • The Risk Management Service allows you to route discovered vulnerabilities and scans that uncover sensitive data to the owner for remediation or for marking an exception/false positive.

Configuring the SAGEs

Configure the SAGE engines using the Guardium Big Data Intelligence Oprations GUI as shown below:


Configure the SAGE services using the Guardium Big Data Intelligence Oprations GUI as shown below:


The default state for all engines/services is off and you can choose to enable any of them or all of them. The engines require minimal configuration settings in order to operate and can be easily configured using the above screen. The first three engines can work on connection (session) data, violation data, exception data or any combination of the DAM activity categories. The Noise Canceling Engine can also work on the raw query information which tends to be far more verbose than the other three data categories.

The Machine Learning Engine configuration panel is shown below:


While the engine runs continuously, the computation of outliers based on historical information runs in cycles every X hours. The recommended default is a daily cycle. When historical outliners are computed the algorithm considers the number of preceding days as the history window. Since outliers are based on statistical analysis, you control how many outliers are propagated for human inspection using three controls:

  • Zscore which is a metric of how many standard deviations an outlier is from the mean (i.e. how far “towards the tail”)
  • How many outliers to emit (after sorting by the zscore)
  • How far back to ignore an outlier (older anomalies may have less value to you)

Configuring Syslog

Guardium Big Data Intelligence data can be sent to downstream systems such as a SIEM over syslog or as CSV exports. When using Syslog you can send any data to syslog using a scheuler, specifying syslog as the copy-to target. The scheduler will compute the data set and stream JSON records over syslog. Configure the remote syslog to forward the records to your SIEM.

Additionally, you can configure specific data sets to be sent to your SIEM in the SAGE screen as shown above. The data sets you can send are:

  • Exceptions - Raw data as coming from the DAM system
  • Policy Violations - Raw data as coming from the DAM system
  • Queries - Raw data as coming from the DAM system
  • Signatures - Output from the Noise Canceling Engine
  • Outliers - Output from the Machine Learning Engine
  • Profiles - Output from the Profiling Engine

When sending data to syslog you can choose whether to output the data as JSON records, using the CEF format, the LEEF format or the RSA format for Guardium alerts. Raw data is sent over syslog every 10 minutes and processed data every 20 minutes.

Using SAGE Data

SAGE data can be used within Guardium Big Data Intelligence, exported to other systems or a combination of the two. All SAGE data is persisted within Guardium Big Data Intelligence and predefined reports and dashboards are available from within the Guardium Big Data Intelligence Predefined menu items.

SAGE data can be exported over syslog as mentioned above, and can be accessed through the REST API layer using the “Copy Link” or scheduling link in the appropriate predefined report.

SAGE data can be directly accessed from within Splunk. Click on the Splunk icon in the configuration screen for the definitions that should be added to your indexes.conf file as shown below:


Finally, you may use Tableau, Qlik or any other Business Intelligence (BI) tool to access SAGE data. To access data connect to the Guardium Big Data Intelligence host on port 3306, the sonargd database, and the following tables:

  • ae_dt for signature data for connections, exceptions and violations.
  • ae_dt_queries for query signature data
  • ae_pr for connection profiles
  • ae_dt_actor for outliers

The following image shows access to SAGE profile data by a Tableau user.


Feeding SAGE Data Back to Guardium

SAGE data can be useful to drive analytics-based monitoring. For example, SAGE profiles can be used to add members to a tuple group that is used within a Guardium policy and SAGE outlier data can be used to implement an adaptive policy where monitoring is done more closely for users that are “misbehaving”.

Like all Guardium Big Data Intelligence data, SAGE data is accessible using any number of methods and tools. This section describes two simple ways to get SAGE data back into a Guardium environment using SonarSQL. SonarSQL is a mysql-compatible layer allowing you to access Guardium Big Data Intelligence data using standard SQL statements. Both these methods are easy to use and easy to automate. There are additional methods relying on the Guardium Big Data Intelligence NoSQL toolkit - consult your Guardium Big Data Intelligence account manager for more information.

Using SonarSQL to generate grdapi scripts

Guardium allows you to add group members using the add_member_to_group_by_desc grdapi call (see

Using SonarSQL you can extract data from the Guardium Big Data Intelligence profiling collection and generate group members. As an example, the ae_pr collection in the lmrm__ae database has records of the form:

        "_id" : ObjectId("57041f5e0f037d57fcc36092"),
        "DB_User_Name" : "",
        "Server_IP" : "",
        "Client_IP" : "",
        "Type" : "",
        "Service_Name" : "",
        "OS_User" : "",
        "Client_Host_Name" : ""

and the ae_pr_new has records of the form:

        "_id" : ObjectId("5702a2db302d906600478fe1"),
        "DB_User_Name" : "hacker2",
        "Server_IP" : "",
        "Client_IP" : "",
        "Day" : ISODate("2016-04-04T00:00:00Z"),
        "Hour" : 11,
        "count" : 48,
        "Failed" : 48,
        "Type" : "exception",
        "Exception_Description" : "ORA-00955"

To use ae_pr to populate group members you can run the following from a script or command line, generating a file, and then using an ssh cli connection to run this on your Guardium appliance:

mysql -uqa -p<pwd here> lmrm__ae -e “select concat(‘grdapi add_member_to_group_by_desc desc=”MyGroup” member=”’ , Server_IP,’+’,Service_Name,’+’,DB_User_Name,’”’) from ae_pr limit 10” > tt.txt

Generated lines look like:

grdapi add_member_to_group_by_desc desc="jS" member=""
grdapi add_member_to_group_by_desc desc="jS" member=""
grdapi add_member_to_group_by_desc desc="jS" member=""
grdapi add_member_to_group_by_desc desc="jS" member=""
grdapi add_member_to_group_by_desc desc="jS" member=""
grdapi add_member_to_group_by_desc desc="jS" member=""
grdapi add_member_to_group_by_desc desc="jS" member=""
grdapi add_member_to_group_by_desc desc="jS" member=""
grdapi add_member_to_group_by_desc desc="jS" member=""
grdapi add_member_to_group_by_desc desc="jS" member=""

Using SonarSQL to Generate a Custom Table

You can use Guardium Custom Tables to extract data from Guardium Big Data Intelligence / SonarSQL and create a table within Guardium with the Guardium Big Data Intelligence data. First, create a custom table definition:


Then upload data using the desired SQL query and a MySQL datasource:


How the SAGEs Work

Each of the SAGE engines works slightly differently, as described below.

The Noise Canceling Engine (NCE) operates as an analytic reduction machine that evaluates very large volumes of raw data and reduces these into a much smaller set of signatures with counters using various analytical dimensions. Built in reducers are included for connection data, query data, exceptions and policy violations. Additional reducers may be added and existing reducers may be changed in terms of which dimensions are used and their associated parameters.

The Profiling Engine (PE) focuses on database connections and is primarily used for implementing Trusted Connection analysis. Data defining trusted connections can be uploaded into the system or the profiling engine itself can be used to automatically catalog all connections. Each time a new connection is observed it is both flagged as a “new connection” ( to be inspected and verified) and added to the connection profile as a “learned connection” to avoid being repeatedly flagged and to reduce the amount of manual work involved in managing the connection profiles.

The Machine Learning Engine (MLE) learns user behavior and persistently evaluates this behavior to identify anomalies or outliers. Factors considered include user connections, user exceptions, violations attributed to users or any combination of these parameters. The MLE compares user behavior based on the following dimensions:

  • Behavior of different users on the same database instance and determining if any user’s actions as compared to the norm are an outlier (user to population).
  • Evaluating behavior within a user category, i.e., comparing each DBA to other DBA’s but not to other categories of users such as functional accounts. This feature requires additional categorization data defining which category each user belongs to and this data is populated in the ae_category collection (user to category).
  • Behavior of users over time, i.e., seeing if any user’s behaviors change over time. A common outlier scenario occurs when a user has decided to leave the company and may exhibit unusual data access patterns (user to user).

The sensitivity of any of the three engines can be adjusted; for example flagging an outlier as two standard deviations from the mean as opposed to four standard deviations will normally produce a larger number of outliers. It is recommended to begin with a higher sensitivity threshold ( the default value is a zscore of 3) and lower the threshold as more data is collected and statistical outliers become more predictable. Since Guardium Big Data Intelligence is capable of maintaining large raw data sets for extended periods of time the MLE analysis will become very precise over time.

The MLE has two modes when it comes to user categories. By default categories are defined manually - i.e. the ae_category collection in the lmrm__ae database is populated usually using an extract from an external system (such as a CMDB). ae_category specifies documents of the form:

  "category" : "PRIVILEGED",
  "DB_User_Name" : "SYS"

The actual strings used for category do not matter - they are merely used to group users into categories in order to know which users need to be compared together.

MLE has an auto-catgeorization engine that uses a clustering algorithm that analyzes what users have been doing and groups them together based on query information. When you turn clustering on, you need to configure whether clustering occurs by COMMANDS commonality, by OBJECTS commonality or both. Additionally, you provide a set of Guardium COMMANDS and OBJECTS group names that are used as the clustering dimensions.

Think of this as a matrix of indicator variables. First you specify which command and/or object groups are indicative enough to serve for clustering. Name them and load them into a collection. This is used to compute a set of “bits” per user (did the user perform a command mentioned in one of the command groups or touch an object mentioned in one of the object groups). Each category is therefore a specific bit vector based on these indicators.

To configure/select the groups populate a collection in lmrm__scheduler that has documents of the form:

MLE also has an auto-catgeorization engine that uses a clustering algorithm that analyzes what users have been doing and groups them together based on query information. When you turn clustering on, you need to configure whether clustering occurs by COMMANDS commonality, by OBJECTS commonality or both:

   "name": "__ae_dt_actions",
   "type" : "\"COMMANDS\"" or "\"OBJECTS\"",
   "desc": "\"<group name>\""

You can click on the hyperlink in the config lightbox to get an Excel template for populating values and import this data into a collection using the Spreadsheets application in JSON Studio. As an example, upload the following spreadsheet to have the engine categorize based on DDL, DML and Administrative commands and based on administration and sensitive objects:


Additional Engines/Services

There are two additional services that you can enable/disable using the SAGE configuration screen:

  1. The management hierarchy service uses data from LDAP (or other source) for manager-managed relationship and builds a full management hierarchy that is then used for the Justify application. This allows a manager to see all tasks assigned to anyone in their management chain.
  2. The Guardium Big Data Intelligence Integrity Service allows you to upload hash signatures for future validation that data has not been tampered with. any Guardium Big Data Intelligence system can maintain hash signatures for the data. This is enabled by a configuration parameter in /etc/sonar/sonard.conf. By default this maintains a data set in a different location (on the same system). You can then validate that data has not changed by comparing it to the hash signature using the integrity_check runCommand. This is all available internally. In addition, if you enable the Guardium Big Data Intelligence Integrity Service, the hashes will be sent over email to Guardium Big Data Intelligence and maintained within a Google Vault environment for safe-keeping. This provides an additional copy and Guardium Big Data Intelligence will provide validation services per need. The data itself holds no sensitive information (just a GUUID for a block of data and the hash value) but it is further encrypted before emailed to Guardium Big Data Intelligence.