GBDI 4.2

Ingesting Non-Guardium Data

The GBDI ETL processes can ingest CSV/JSON data from the incoming data folder, regardless of whether it comes from Guardium collectors or other sources. This flexible ingestion model allows GBDI to be used as a Security Data Lake for miscellaneous data, thus providing an excellent source for compliance reporting and/or security analytics in a Guardium deployment.

There are two ways to ingest non-Guardium CSVs: make them "look like" Guardium extracts, or use the misc-files (a.k.a. miscellaneous) feature of the GBDI ETL. You can ingest JSON data with the "miscellaneous" mechanism.


To process a miscellaneous CSV/JSON (thereby inserting its data into a collection within the sonargd database), create an entry under misc-files within /etc/sonar/sonargd.conf. The CSV/JSON will be processed from your regular incoming directory.


    misc-files:
      - match: StreetTree([a-zA-Z0-9_]*)
        collection: trees
      - match: Regis([a-zA-Z0-9_]*)
        collection: animals

The first entry states that if a filename within the incoming folder is a regex match for StreetTree([a-zA-Z0-9_]*), sonargd will attempt to ingest the data. For example, StreetTrees_Kerrisdale.csv matches this pattern. The data will be inserted into the trees collection under the sonargd database.

The second entry states that any filename matching Regis([a-zA-Z0-9_]*) will be processed into a collection called animals.

The GBDI Source field is taken from the part of the filename captured by the parentheses ( ). If the file is named StreetTree_gmac03, the GBDI Source field will have _gmac03 as its value.
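To make the matching behavior concrete, here is a small illustration in Python. The helper function is a hypothetical stand-in for sonargd's internal matching logic, not product code, and the character class includes an underscore so the StreetTree_gmac03 example captures as shown:

```python
import re

# Hypothetical stand-in for how sonargd matches a misc-files pattern
# against incoming filenames and derives the GBDI Source field.
PATTERN = re.compile(r"StreetTree([a-zA-Z0-9_]*)")

def gbdi_source(filename):
    """Return the captured Source portion, or None if the rule does not match."""
    m = PATTERN.match(filename)
    return m.group(1) if m else None

print(gbdi_source("StreetTrees_Kerrisdale.csv"))  # captures "s_Kerrisdale"
print(gbdi_source("StreetTree_gmac03"))           # captures "_gmac03"
print(gbdi_source("Unrelated.csv"))               # None: not ingested by this rule
```

Filenames that match no misc-files entry are simply not picked up by this rule.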

By default, a match is ingested using the sonargdm plugin. To use a different plugin, create a matching entry under the plugins section of sonargd.conf. In the example below, the trees collection has a matching entry under plugins and will be ingested using the basic plugin.


To ingest a miscellaneous CSV using a plugin other than sonargdm, an entry in the misc-files section above must have a corresponding entry under the plugins section of /etc/sonar/sonargd.conf:

Builtin Plugins:
 sonargdm: The internal processing (meant mostly for Guardium-originated data).
           Specify the binary if different from /usr/bin/sonargdm.
 basic: Write each line as-is to the collection.
        Assumes utf-8 by default. Specify 'encoding' if different (any standard Python encoding).
 upsert: Handle upserts. The parameters specify the key. Use encoding as above.
 i_dam: Write both to full_sql and instance. Use encoding as above.
 oracle: For Oracle XML.
 sonargateway: Process the file by running the sonargateway binary directly.

The following example includes both the default sonargdm entry and a match for trees, corresponding to the earlier misc-files example. The plugin used for trees will be basic, forcing the data to be inserted as-is into the trees collection:
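A minimal sketch of what such a plugins section might look like. The key layout and the key used to select the plugin are assumptions for illustration, not verbatim product syntax; consult your own sonargd.conf for the exact schema:

```yaml
plugins:
  sonargdm:
    binary: /usr/bin/sonargdm   # default internal plugin
  trees:
    plugin: basic               # assumed key name for selecting the basic plugin
    encoding: utf-8             # optional; utf-8 is the default
```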


Note that there is no entry called animals; therefore, the second entry from the misc-files example above will be ingested using the default sonargdm plugin.

You can use the sonargateway plugin to handle miscellaneous JSONs. To ingest JSON data via the sonargateway plugin, create an entry in the misc-files section and define the paths of the sonargateway binary and the JSON configuration file under the plugins section of /etc/sonar/sonargd.conf.

For example:

      binary: /usr/local/bin/sonargateway-new
      config: /etc/sonar/gateway/network-flow-config.json

Here is an example of a sonargateway JSON configuration file that directs JSON documents to the instance_coll collection in the sonargd database:

    "global_settings": {
      "sonar_URI": "mongodb://CN=admin@localhost:27117/admin? ... ",
      "target_db": "sonargd"
    "output_connection": {
      "event_format": {
        "standard": "JSON"
      "default_collection": "instance_coll",
      "unique_label": "instance_coll",
      "group_label": "instance_coll"

For more information about sonargateway configuration, see


Rather than waiting for a COMPLETE file, a misc file is processed only once it has not been updated for a certain (configurable) period of time. This prevents a file from being processed while it is still being copied. To configure how long sonargd waits before ingestion starts, edit:

min-age: 7

in sonargd.conf
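The age check can be illustrated with a short sketch. This is not product code, and it uses seconds for the demo; the actual unit of min-age is whatever sonargd defines:

```python
import os
import time

# Illustrative sketch of the min-age rule: a misc file becomes eligible
# for ingestion only after it has gone unmodified for min_age time units
# (seconds in this demo; the real unit is defined by sonargd).
def is_ready(path, min_age):
    age = time.time() - os.path.getmtime(path)
    return age >= min_age
```

A file still being copied keeps receiving mtime updates, so its age stays below the threshold until the copy finishes.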

Following Guardium DMv1 Conventions

CSVs must conform to certain rules, in terms of both format and file-naming conventions, in order to be ingested properly:

  1. CSVs must include a header line, e.g.:

    $ cat EXP_TRUSTED_CONNECTIONS_20160429100000.csv
  2. CSVs must be comma-separated. Strings must be enclosed in double quotes. Typing can be controlled through configuration in sonargd.conf; otherwise, all fields are ingested as strings.

  3. CSV file names must start with EXP_ as shown above.

  4. One or more CSVs of the same type and timeframe may be packaged together. CSVs must be packaged and compressed using tar -czf (i.e., a compressed tar), and the file must have a .gz extension, for example:

    $ ls -lrt *TRUST*
    -rw-rw-r-- 1 ubuntu ubuntu 1185 Apr 29 20:53 1762144738_host1_EXP_TRUSTED_CONNECTIONS_20160429100000.gz
    -rw-r--r-- 1 ubuntu ubuntu      0 Apr 29 21:09 1762144738_host1_EXP_TRUSTED_CONNECTIONS_20160429100000_COMPLETE.gz

    The first number is a unique source-machine identifier, followed by the hostname of the source machine, then the name of the data domain, and then the timeframe. Note that the filename determines which collection the data will be inserted into. This can be changed by editing sonargd.conf, but by default the filename determines where the data will go.

  5. A COMPLETE file must always accompany a data file of the same name. The content of the COMPLETE file is not used, and it can be zero-length. A data file will not be processed until the accompanying COMPLETE file arrives; this is how the ETL process knows that the copy of the data file has completed.
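The naming convention in rule 4 can be summarized with a hypothetical parser. The field names here are mine, chosen for illustration; sonargd's actual parsing is internal:

```python
import re

# Hypothetical parser for the DMv1 naming convention described above:
#   <source-id>_<hostname>_EXP_<DATA_DOMAIN>_<YYYYMMDDHHMMSS>[_COMPLETE].gz
NAME_RE = re.compile(
    r"^(?P<source_id>\d+)_"
    r"(?P<hostname>[^_]+)_"
    r"EXP_(?P<domain>.+)_"
    r"(?P<timestamp>\d{14})"
    r"(?P<complete>_COMPLETE)?\.gz$"
)

def parse_dm_filename(name):
    """Split a DMv1 file name into its parts, or return None if it does not conform."""
    m = NAME_RE.match(name)
    if not m:
        return None
    parts = m.groupdict()
    parts["complete"] = parts["complete"] is not None
    return parts

info = parse_dm_filename(
    "1762144738_host1_EXP_TRUSTED_CONNECTIONS_20160429100000.gz"
)
print(info["domain"])     # TRUSTED_CONNECTIONS
print(info["complete"])   # False
```

The COMPLETE marker file from rule 5 parses the same way, with the complete flag set to True.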