Accessing Guardium Big Data Intelligence Data Using SonarSQL

SonarSQL is part of Guardium Big Data Intelligence and provides SQL-based access to the Guardium data stored in the warehouse.

SonarSQL is a proxy that speaks the MySQL wire protocol. It can be used with any modern MySQL client exactly the same way a MySQL server is used.

SonarSQL is a thin layer that translates MySQL queries to SonarW aggregation pipelines, submits the pipelines to the backend server, and translates the results back to the MySQL protocol. SonarSQL uses SonarW extensions to the Aggregation Framework such as joins and subqueries.

Configuration

In Guardium Big Data Intelligence the entire configuration of SonarSQL occurs during the installation of Guardium Big Data Intelligence. In some cases you may need to reconfigure settings directly within SonarSQL (e.g. if a mistake was made during the installation).

The entire configuration for SonarSQL is given on the command line. Use --help to get information about the available options.
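For example:

sonarsql --help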

The Debian package also installs a configuration file that is read by the start scripts to supply the proper command line values. To configure SonarSQL on Ubuntu/Debian, run dpkg-reconfigure sonarsql, and you’ll be asked the most important configuration questions.

If you’d like more control, you can configure the system further by editing the Debian configuration file which is located at /etc/default/sonarsql. You’ll need to restart the service after modifying the file:

sudo service sonarsql restart
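For reference, a hypothetical excerpt of /etc/default/sonarsql (the variable name below is illustrative; the actual keys in your file may differ):

# Hypothetical example - extra command line options passed to the sonarsql daemon
SONARSQL_OPTS="--port 3306"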

Authentication

SonarSQL verifies users’ passwords, connects as the same user to SonarW, and allows privileged users to create users.

Connecting as root

When you install Guardium Big Data Intelligence, you will be asked for the SonarSQL root password. This is NOT your machine root password, but the one used by the SonarSQL root user. This is the privileged user that can create new user accounts in SonarSQL (and therefore also in SonarW).

The root SonarSQL password is hashed and stored in the config file.

You can modify the root password at any time by issuing:

sudo dpkg-reconfigure sonarsql

You can also leave the password empty, which will disable the option of logging in as root.

After creating the root password, you can log in as root using your MySQL client and the password you provided.
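For example, using the standard mysql command line client (the host is a placeholder, and port 3306 is an assumption based on the MySQL default - use the port SonarSQL is configured to listen on):

mysql -h <sonarsql-host> -P 3306 -u root -p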

Creating users

When logged in as root, you can create new users in your MySQL client using:

CREATE USER '<user>' IDENTIFIED BY PASSWORD '<password>'

This will create a user in both SonarSQL and in SonarW, with the same password. All the user details and password hashes are saved within SonarW in the system.users collection within the admin database. Only hashes are saved, and not the clear text password.
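For example, you can inspect a stored user document from the MongoDB shell; this is a sketch assuming the standard MongoDB system.users layout:

use admin
db.system.users.find({ "user" : "<user>" })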

After you’ve created users, you can log in with a MySQL client using the newly created username and password. After verifying the password, SonarSQL authenticates its session to SonarW as the same user that logged in to SonarSQL. This ensures an end-to-end user identity for the session.

Managing users

Fine-grained permissions can be set by manipulating the roles array for each user. This is done directly in SonarW using the standard MongoDB tools or the shell, and takes effect immediately.

If you want to modify or re-create SonarW users, you need to do that using the SonarW API or the MongoDB shell (dropUser, updateUser). The changes take effect immediately in SonarSQL. Currently users are created with the readWriteAnyDatabase role.
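For example, from a MongoDB shell connected to SonarW (the role and database names below are illustrative):

use admin
// replace the user's roles array; takes effect immediately in SonarSQL
db.updateUser("<user>", { roles: [ { role: "read", db: "reports" } ] })
// remove the user entirely
db.dropUser("<user>")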

Gathering information

Version Number

Check the version:

sonarsql --version

Log Files

Log files are located in /var/log/sonarsql and contain useful information for detecting errors. When contacting Guardium Big Data Intelligence support with an issue, please also include the query you tried to execute.
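For example, to watch the log while reproducing a failing query (assuming the log files carry a .log extension):

tail -f /var/log/sonarsql/*.log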

Pipeline Information

If you’d like to know what pipeline is being generated from the query you are using, you can use EXPLAIN in your MySQL client:

MySQL> EXPLAIN SELECT name FROM ghosts WHERE points>3 \G
*************************** 1. row ***************************
Collection: ghosts
  Pipeline: db.ghosts.aggregate([
{
    "$match" : {
        "points" : {
            "$gt" : 3
        }
    }
}, {
    "$project" : {
        "_0000" : "$name", "_id" : 0
    }
}])
1 row in set (0.00 sec)

sql2pipeline is a standalone tool that translates a SQL query into a pipeline without connecting to SonarW at all. It is convenient when you want to quickly check generated pipelines without actually running them, or without needing to create the underlying collections. To run it, execute sql2pipeline and supply the SQL query as the single parameter. Make sure it is quoted properly:

$ sql2pipeline 'select name from ghosts where points between 2 and 10'
db.ghosts.aggregate([ {
    "$match" : {
        "$and" : [ { "points" : { "$gte" : 2 } }, { "points" : { "$lte" : 10 } } ]
    }
}, {
    "$project" : {
        "_0000" : "$name", "_id" : 0
    }
}])

The pipelines are printed in a way that lets you copy and paste them directly into a MongoDB shell if you need to see the actual results returned from SonarW.

Note that since sql2pipeline does not communicate with SonarW, it has no knowledge of metadata, and thus cannot handle wildcards in field names such as SELECT * FROM...

Supported SQL Statements

SonarSQL is neither ANSI-SQL compliant nor does it cover the entirety of the MySQL dialect. It covers only those operators and capabilities supported by the SonarW warehouse. SonarSQL only supports SELECT commands - no DML commands are supported, and apart from the aforementioned CREATE USER command no DDL commands are supported.

Within SELECT, most constructs are supported, including joins. However, subqueries are generally not supported apart from IN constructs within the WHERE clause (e.g. SELECT ... WHERE X IN (SELECT I FROM J)).
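For example, the following queries illustrate supported constructs (the collection and column names are hypothetical):

-- a join between two collections
SELECT a.name, b.points FROM ghosts a JOIN scores b ON a._id = b.ghost_id;

-- an IN subquery within the WHERE clause
SELECT name FROM ghosts WHERE _id IN (SELECT ghost_id FROM winners);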

Additionally, because the underlying server is a SonarW warehouse, the semantics of queries conform to those of SonarW, which may differ from MySQL’s. The most fundamental difference is that since SonarW supports flexible data, operations must be lenient with data that may not conform to an explicit schema.

Integrating External RDBMS Data in Guardium Big Data Intelligence

You can import data from a relational database into Guardium Big Data Intelligence and then use it for joins and queries. Data is brought in by issuing queries through the Guardium Big Data Intelligence scheduling/dispatch mechanism. You can set up external queries that are invoked periodically (on a schedule) and that update collections in Guardium Big Data Intelligence.

To configure this integration you need to do two things:

  1. Add a section to dispatcher.conf that includes the information on how to connect to the RDBMS, what query to issue to retrieve the data, and how to insert or update the information in Guardium Big Data Intelligence, e.g.:

    [RDBMS-mssql]
    # RDBMS details
    DBType=MS-SQL
    DB_Address=<ip-address/DNS-name of rdbms server>
    DB_Port=1433
    SQL_DB=db_name
    UserName=user_name
    Password=user_password
    SQL_Query=select id, host, port from table1
    # SQL_Query_tail=SELECT id, host, port FROM table1 WHERE 1=1 AND column1 > {$$COLUMN1__previous} AND column1 <= {$$COLUMN1__current}
    # SQL_Query_tail_no_previous=SELECT id, host, port FROM table1 WHERE 1=1 AND column1 <= {$$COLUMN1__current}
    max_query_cycles=1000000
    
    # Sonar credentials for the target update db
    sonar_db=rdbms_db
    sonar_coll=rdbms_coll
    sonar_username=sonar_user
    sonar_password=sonar_pwd
    sonar_keyfile=
    sonar_authdb=sonar_authdb
    
    # JSON query from Sonar
    q={"_id": "$$id"}
    u={"host": "$$host", "port": "$$port"}
    d=
    upsert=true
    multi=true
    

Note: You can use DB authentication as well as Windows authentication. In the case of Windows authentication you need to provide “<domain>\<user_name>” in the “UserName” field.

  2. Use the handle in the config file to schedule a job. In the example above enter RDBMS-mssql in the RDBMS Target section as shown below.

If a “SQL_Query” field exists and has a value, the dispatcher will use it and will perform a full (not incremental) update.

Supported DBType(s) are: MS-SQL, Oracle, Redshift and MySQL; for additional database connectivity contact your Guardium Big Data Intelligence account manager.

[Image: rdbms_import.png - the RDBMS Target section when scheduling a job]
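The q, u, upsert and multi settings in the example above map onto a MongoDB-style update. As a sketch, for each row returned by the SQL query the dispatcher effectively issues the following against the target collection (the <row.*> placeholders stand for that row’s values after $$ substitution):

// sketch - one update per imported row, against rdbms_db.rdbms_coll
db.rdbms_coll.update(
    { "_id" : <row.id> },                           // q, with $$id substituted
    { "host" : <row.host>, "port" : <row.port> },   // u, with $$host and $$port substituted
    { "upsert" : true, "multi" : true }             // upsert/multi flags from the config
)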

Tailing an RDBMS Table

Guardium Big Data Intelligence also supports incremental updates (a.k.a. tailing) from the RDBMS. This is achieved by using the “SQL_Query_tail” and “SQL_Query_tail_no_previous” fields in the relevant RDBMS section in the dispatcher.conf file.

Note: for the dispatcher to perform the “tail” operation, the “SQL_Query” field must be empty (i.e. SQL_Query=)

In these fields you set a SQL query that returns only the “additional” data not imported previously, by appending condition(s) so that only “new” records are returned from the RDBMS.

The condition is set by adding the field name to the SQL query, followed by a “__previous” or “__current” suffix, surrounded by a pair of curly brackets and prefixed with a double dollar sign (e.g. {$$<FieldName>__previous} and {$$<FieldName>__current}).

The dispatcher will replace the “{$$<FieldName>__previous}” and “{$$<FieldName>__current}” placeholders with the appropriate values:

  • {$$COLUMN1__previous} - the dispatcher will retrieve the previous value of “COLUMN1” stored in SonarW during the last successful update
  • {$$COLUMN1__current} - the dispatcher will retrieve the current max value of the column “COLUMN1” from the RDBMS

The “SQL_Query_tail” query will be used when data has already been imported, and “SQL_Query_tail_no_previous” will be used in cases where “previous” values are not available.

For example, to have Guardium Big Data Intelligence tail new insertions made on the RDBMS, set a condition on the “COLUMN1” field:

SQL_Query_tail=SELECT id, host, port FROM table1 WHERE COLUMN1 > {$$COLUMN1__previous} AND COLUMN1 <= {$$COLUMN1__current}

Once a successful update is performed, Guardium Big Data Intelligence will keep track of the maximum value of each of the specified condition fields. The next time the synchronization job runs, Guardium Big Data Intelligence will only import new values that are returned from the RDBMS query.

If there are no previous values, the dispatcher will use the “SQL_Query_tail_no_previous” query, for example:

SQL_Query_tail_no_previous=SELECT id, host, port FROM table1 WHERE COLUMN1 <= {$$COLUMN1__current}

You can also specify multiple conditions. For example, if you want to import records that have a higher “ID” and have a later “DATE”, you can append both fields to your query as shown below:

SQL_Query_tail=SELECT id, host, port FROM table1 WHERE ID > {$$ID__previous} AND ID <= {$$ID__current} AND DATE >= TO_DATE('{$$DATE__previous}', 'YYYY-MM-DD hh24:mi:ss') AND DATE < TO_DATE('{$$DATE__current}', 'YYYY-MM-DD hh24:mi:ss')

Only records matching all conditions will be imported from the RDBMS to Guardium Big Data Intelligence.

Type conversions for SQL tail

Some data types require a conversion before they can be used in the WHERE clause of the tail query.

For example:

Oracle

Date:

TO_DATE('{$$DATE__previous}', 'YYYY-MM-DD hh24:mi:ss')

MS-SQL

Date:

convert(datetime2, '{$$DATE__previous}', 21)

Binary:

cast('{$$column_binary1__previous}' as varbinary(max))

Redshift

Date:

TO_DATE('{$$DATE__previous}', 'YYYY-MM-DD HH24:MI:SS:US')

Note: When using date fields for the tail condition, it is safer to change the condition rules to be date >= previous and date < current

For example (MS-SQL):

where DATE >= convert(datetime2, '{$$DATE__previous}', 21) and DATE < convert(datetime2, '{$$DATE__current}', 21)

Integrating with Oracle

In order to integrate with Oracle you need to install the Oracle Instant Client packages on the Guardium Big Data Intelligence server.

The required packages are:

oracle-instantclient12.1-basic-12.1.0.2.0-1.x86_64.rpm
oracle-instantclient12.1-odbc-12.1.0.2.0-1.x86_64.rpm

Both can be downloaded from:

http://www.oracle.com/technetwork/topics/linuxx86-64soft-092277.html

Once the packages are downloaded to the server, open a terminal and go to the download folder.

Then run the following commands:

sudo rpm -Uvh oracle-instantclient12.1-basic-12.1.0.2.0-1.x86_64.rpm
sudo rpm -Uvh oracle-instantclient12.1-odbc-12.1.0.2.0-1.x86_64.rpm
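You can verify that both packages were installed using standard rpm tooling:

rpm -qa | grep -i oracle-instantclient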

Note: the system is configured to use version 12.1. If you need to use a different version, please consult your Guardium Big Data Intelligence account manager.

Integrating with Amazon Redshift

In order to integrate with Amazon Redshift you need to perform the following steps:

  1. Install the ODBC driver. It can be downloaded and installed from:

https://docs.aws.amazon.com/redshift/latest/mgmt/install-odbc-driver-linux.html

Look for a 64-bit RPM package.

Note: Driver version must be 1.4.3.

  2. Add an ‘Amazon Redshift (x64)’ section in /etc/odbcinst.ini (a verification check is shown at the end of this section):

sudo vi /etc/odbcinst.ini

Append (RHEL 7):

[Amazon Redshift (x64)]
Description=Amazon Redshift ODBC Driver (64-bit)
Driver=/opt/amazon/redshiftodbc/lib/64/libamazonredshiftodbc64.so
  3. Configure the Redshift section in dispatcher.conf; see the previous, similar examples for other RDBMS types.

Note: the section should declare:

DBType = Redshift
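As a quick check after registering the driver in step 2, you can list the drivers known to unixODBC (assuming the odbcinst tool is available); the output should include the section name you added:

odbcinst -q -d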