wiki:DatabaseBasedAnalysis/table2sql

Version 1 (modified by tbretz, 6 years ago) ( diff )

--

This page describes two tools which are - apart from their input file format - nearly identical: root2sql and fits2sql.

Purpose

Both file types, ROOT and FITS files, have storage engines for the storage of table like data. In ROOT files these are trees with branches and leaves and in FITS files these are FITS tables, the so-called extensions. Although, root offers a lot of printing and fitting options, sometimes the access through a database has advantages. Therefore, a tool which maps ROOT-files and FITS-files into SQL tables helps.

Introduction

First get the man page of root2sql either with root2sql --help or (if installed) with man root2sql. This page will mainly outline how root2sql works, because fits2sql is slightly less complex. In short, the differences will be explained at the end of the document.

Connection

How a connection to the database is established is outlined in: DatabaseBasedAnalysis/Connection. Per default, your connection will be compressed if the the host-address of the mysql-server is neither localhost nor 127.0.0.1. To force compression use

uri fact:password@ihp-pc45.ethz.ch/factdata?compress=1

Generally, compression should be enable if connecting from the world. To test compression, the "Hello World! " example can be used.

In the following it is assumed that the database connection is working and tested, so that the resources (uri=...) are omitted.

Basics

Root to SQL:

-u [ --uri ] arg Database link as in

user:password@server[:port]/database[?compress= 0|1].

--file arg The root file to read from --force Force processing even if there is no database

connection

--create Create the database if not existing --drop Drop the table (implies create) -t [ --tree ] arg (=Events) Name of the root tree to convert --table arg Name of the table to use (default is the tree

name)

--map arg A regular expression which is applied to the

leaf name befoee it is used as SQL column name)

--sql-type arg Allows to overwrite the calculated SQL type for

a given column e.g. 'sql-column-name/UNSIGNED IN'

--ignore arg Ignore the given leaf, if the given regular

expression matches

--primary arg List of columns to be used as primary keys

during table creation (in connection with --create)

--first arg (=0) First event to start with (default: 0), mainly

for test purpose

--max arg (=0) Maximum number of events to process (0: all),

mainly for test purpose

--engine arg Database engine to be used when a new table is

created

--row-format arg Defines the ROW_FORMAT keyword for table

creation

--duplicate arg Specifies an assignment_list for an 'ON

DUPLICATE KEY UPDATE' expression

--ignore-errors Adds the IGNORE keyword to the INSERT query

(turns errors into warnings, ignores rows with errors)

Debug options:

--no-insert Does not insert any data into the table --dry-run Skip any query which changes the databse (might

result in consecutive failures)

--print-connection Print database connection information --print-branches Print the branches found in the tree --print-leaves Print the leaves found in the tree (this is

what is processed)

--print-insert Print the INSERT query (note that it contains

all data)

--print-create Print the CREATE query -v [ --verbose ] arg (=1) Verbosity (0: quiet, 1: default, 2: more, 3,

Generally, ROOT-files can store very complicated object as a TH1F. Although, they could be entered in a database as a so called BLOB (a binary object), this is not the idea of the tool. Thus only basic data types are supported.

Input

The input file is given as the first positional argument

root2sql [--file] inputfile.root

Each root tree has branches and leaves (the basic data types). These leaves can be read independently of the classes which were used to write the root file. The default tree to read from is Events but the name can be overwritten using -t (an abbreviation of --tree). Which would look like this

root2sql inputfile.root -t MyTree

Output

Table Name

The default table which is filled with the data in this tree is expected to have the same name than the tree. If this is not the case, it can be changed with --table.

root2sql inputfile.root -t MyTree --table MyTable

In these three examples, the first one would read from a tree Events and write to table Events. The second case would read from a tree MyTree and write to a table MyTree and the third case would read from a tree MyTree and write to a table MyTable.

Column Names

The name of each column to which data is filled from a leave is obtained from the leaves' names. The leave names can be checked using --print-leaves. A --print-branches exists for convenience to print only the high-level branches.

If the leaf with the basic data type is called MTime.fTime.fMilliSec', the column would be called MTime.fTime.fMilliSec as well.

Changing Column Names

Sometimes the leaves' names might be quite inconvenient like in the example above. To allow to simplify column names, regular expressions (using boost's regex) can be defined to change the names. Note, that these regular expressions are applied one by one on each leaf's name. A valid expression could be:

--map=MHillas\.f/

which would remove all occurrences of MHillas.f.

--map MHillas\.f/MHillas_

would replace all occurrences of MHillas.f by MHillas_.

The corresponding call in C++ is boost::regex_replace(leafname, search, replace) for --map search/replace.

As the --map option can be given more than once, many and complex replacement rules can be defined.

They are applied in sequence, one by one and to all leaves. A single match does not stop the sequence.

Skipping Columns

Sometimes it might also be convenient to skip a leaf. This can be done with the --ignore regex option. If the given regular expression yields a match, the leaf will be ignored.

Note that the regular expression works on the raw-name of the leaf not the readily mapped SQL column names. Example:

--ignore=ThetaSq\..*

will skip all leaved which start with 'ThetaSq.'. This option can be usedmore than once.

Column Data Types

The data type of each column is kept as close as possible to the leaves' basic data types. If for some reason this is not wanted, the data type of the SQL column can be overwritten with --sql-type sql-column/sql-type. This works like the following:

--sql-type=FileId/UNSIGNED INT

while the first argument of the name of the SQL column to which the data type should be applied. The second column is the basic SQL data type. The option can be given more than once.

Changing column names

Sometimes the leaves' names might be quite inconvenient like MTime.fTime.fMilliSec or just MHillas.fWidth. To allow to simplify column names, regular expressions (using boost's regex) can be defined to change the names. Note, that these regular expressions are applied one by one on each leaf's name. A valid expression could be:

--map=MHillas\.f/

which would remove all occurances of 'MHillas.f'. This option can be used more than once. They are applied in sequence. A single match does not stop the sequence.

Sometimes it might also be convenient to skip a leaf. This can be done with the --ignore resource. If the given regular expresion yields a match, the leaf will be ignored. Note that the regular expression works on the raw-name of the leaf not the readily mapped SQL column names. Example:

--ignore=ThetaSq\..*

will skip all leaved which start with 'ThetaSq.'. This option can be usedmore than once.

The data type of each column is kept as close as possible to the leaves' data types. If for some reason this is not wanted, the data type of the SQL column can be overwritten with --sql-type sql-column/sql-ytpe, for example:

--sql-type=FileId/UNSIGNED INT

while the first argument of the name of the SQL column to which the data type should be applied. The second column is the basic SQL data type. The option can be given more than once.

Database interaction:

To drop an existing table, --drop can be used.

To create a table according to theSQL column names and data types, --create can be used. The query used can be printed with --print-create even --create has not been specified.

To choose the columns which should become primary keys, use --primary, for exmaple:

--primary=col1

To define more than one column as primary key, the option can be given more than once. Note that the combination of these columns must be unique.

All columns are created as NOT NULL as default. To force a database engine and/or a storage format, use --engine and --rot-format.

Usually, the INSERT query would fail if the PRIMARY key exists already. This can be avoided using the 'ON DUPLICATE KEY UPDATE' directive. With the --duplicate, you can specify what should be updated in case of a duplicate key. To keep the row untouched, you can just update the primary key with the identical primary key, e.g. --duplicate='MyPrimary=VALUES(MyPrimary)'. The --duplicate resource can be specified more than once to add more expressions to the assignment_list. For more details, see the MySQL manual.

For debugging purpose, or to just create or drop a table, the final insert query can be skipped using --no-insert. Note that for performance reason, all data is collected in memory and a single INSERT query is issued at the end.

Another possibility is to add the IGNORE keyword to the INSERT query by --ignore-errors, which essentially ignores all errors and turns them into warnings which are printed after the query succeeded.

Using a higher verbosity level (-v), an overview of the written columns or all processed leaves is printed depending on the verbosity level. The output looks like the following

Leaf name [root data type] (SQL name)

for example

MTime.fTime.fMilliSec [Long64_t] (MilliSec)

which means that the leaf MTime.fTime.fMilliSec is detected to be a Long64_t which is filled into a column called MilliSec. Leaves with non basic data types are ignored automatically and are marked as (-n/a-). User ignored columnd are marked as (-ignored-).

If a query failed, the query is printed to stderr together with the error message. For the main INSERT query, this is only true if the verbosity level is at least 2 or the query has less than 80*25 bytes.

In case of succes, 0 is returned, a value>0 otherwise.

Note: See TracWiki for help on using the wiki.