This page describes two tools which are, apart from their input file format, nearly identical: root2sql and fits2sql.
Purpose
Both file types, ROOT and FITS, provide storage engines for table-like data: in ROOT files these are trees with branches and leaves, in FITS files these are FITS tables, the so-called extensions. Although ROOT offers a lot of printing and fitting options, sometimes access through a database has advantages. Therefore, a tool which maps ROOT files and FITS files into SQL tables helps.
Introduction
First get the man page of root2sql, either with root2sql --help or (if installed) with man root2sql. This page mainly outlines how root2sql works, because fits2sql is slightly less complex. In short, the differences will be explained at the end of the document.
Connection
How a connection to the database is established is outlined in DatabaseBasedAnalysis/Connection. By default, your connection will be compressed if the host address of the MySQL server is neither localhost nor 127.0.0.1. To force compression use
uri=fact:password@ihp-pc45.ethz.ch/factdata?compress=1
Generally, compression should be enabled if connecting from the outside world. To test compression, the "Hello World!" example can be used. If you are using an ssh tunnel, make sure that you do not compress your stream twice!
In the following it is assumed that the database connection is working and tested, so that the resources (uri=...) are omitted.
Basics
Generally, ROOT files can store very complicated objects such as a TH1F. Although they could be entered into a database as a so-called BLOB (a binary object), this is not the idea of the tool. Thus only basic data types are supported.
Input
The input file is given as the first positional argument
root2sql [--file] inputfile.root
Each ROOT tree has branches and leaves (the basic data types). These leaves can be read independently of the classes which were used to write the ROOT file. The default tree to read from is Events, but the name can be overwritten using -t (an abbreviation of --tree), which would look like this:
root2sql inputfile.root -t MyTree
Output
Table Name
The default table which is filled with the data in this tree is expected to have the same name as the tree. If this is not the case, it can be changed with --table:
root2sql inputfile.root -t MyTree --table MyTable
In these three examples, the first one would read from a tree Events and write to table Events. The second case would read from a tree MyTree and write to a table MyTree and the third case would read from a tree MyTree and write to a table MyTable.
Column Names
The name of each column to which data from a leaf is filled is obtained from the leaf's name. The leaf names can be checked using --print-leaves. A --print-branches option exists for convenience to print only the high-level branches. If you have not yet configured database access, you can work with an empty uri: --uri ""
If the leaf with the basic data type is called MTime.fTime.fMilliSec, the column would be called MTime.fTime.fMilliSec as well.
Both ROOT leaves and FITS extensions can contain fixed-size arrays of basic data types. If the array is called ARR and has 3 fields, the SQL columns will be called ARR[0], ARR[1] and ARR[2]. All modifier options working on columns always alter all corresponding columns at once.
Changing Column Names
Sometimes the leaves' names might be quite inconvenient, like in the example above. To simplify column names, regular expressions (using boost's regex) can be defined to change the names. A helpful resource for regex testing is https://www.regextester.com/. Note that these regular expressions are applied one by one to each leaf's name. A valid expression could be:
--map=MHillas\.f/
which would remove all occurrences of MHillas.f.
--map MHillas\.f/MHillas_
would replace all occurrences of MHillas.f by MHillas_.
The corresponding call in C++ is boost::regex_replace(leafname, search, replace) for --map search/replace.
As the --map option can be given more than once, many and complex replacement rules can be defined. They are applied in sequence, one by one, to all leaves; a single match does not stop the sequence.
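The sequential application of the rules can be sketched in Python, where re.sub plays the role of boost::regex_replace (both replace every occurrence of the pattern). The rules and leaf names below are only examples, not rules the tool ships with:

```python
import re

# Example --map rules (pattern, replacement); purely illustrative.
maps = [
    (r"MHillas\.f", "MHillas_"),   # --map MHillas\.f/MHillas_
    (r"\.fVal", ""),               # --map \.fVal/
]

def column_name(leaf):
    # Every rule is applied in sequence; a match does not stop the chain.
    for search, replace in maps:
        leaf = re.sub(search, replace, leaf)
    return leaf

print(column_name("MHillas.fWidth"))     # MHillas_Width
print(column_name("ThetaSquared.fVal"))  # ThetaSquared
```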
Skipping Columns
Sometimes it might also be convenient to skip a leaf. This can be done with the --ignore regex option. If the given regular expression matches a leaf's name, the leaf will be ignored. Note that the regular expression works on the raw name of the leaf, not on the readily mapped SQL column names.
--ignore=ThetaSq\..*
will skip all leaves whose name starts with 'ThetaSq.'. This option can be given more than once.
The corresponding C++ call is boost::regex_match(leafname, regex) for --ignore regex.
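The same full-match semantics can be sketched in Python with re.fullmatch: boost::regex_match only succeeds if the expression covers the entire raw leaf name, not just a substring. The patterns and names below are examples:

```python
import re

# Example --ignore patterns; boost::regex_match succeeds only if the
# whole leaf name matches, which re.fullmatch mimics.
ignores = [r"ThetaSq\..*", r".*\.fBits"]

def is_ignored(leaf):
    return any(re.fullmatch(p, leaf) for p in ignores)

print(is_ignored("ThetaSq.fVal"))    # True
print(is_ignored("MHillas.fWidth"))  # False
```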
Column Data Types
The data type of each column is kept as close as possible to the leaves' basic data types. If for some reason this is not wanted, the data type of the SQL column can be overwritten with --sql-type column/type, like the following:
--sql-type=FileId/UNSIGNED INT
In this example, FileId is the name of the SQL column for which the data type UNSIGNED INT should be used. The second field (UNSIGNED INT) is a basic SQL data type. The option can be given more than once.
Table Creation
It is not very handy if the SQL table has to be created manually beforehand. Therefore, root2sql can create the table automatically; to do this, specify --create. The query which would be used to create the table can be printed with --print-create even when --create has not been specified. Tables are created only if they do not yet exist (CREATE TABLE IF NOT EXISTS). To ensure recreation of a table, the old table can be dropped with --drop. As any insert operation would fail if the required table is dropped and no new one is created, --drop always implies --create.
All columns of the table are created as NOT NULL by default.
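As a rough illustration, the kind of statement --print-create would show can be sketched as follows; the table name, column list and exact formatting here are hypothetical, not actual root2sql output:

```python
# Hypothetical column list derived from the leaves of a tree; the
# real list depends on the input file.
columns = [
    ("FileId", "INT UNSIGNED"),
    ("EvtNumber", "INT UNSIGNED"),
    ("Width", "FLOAT"),
]

# All columns are created as NOT NULL, and the table is only created
# if it does not exist yet.
defs = ",\n".join(f"   `{name}` {sqltype} NOT NULL" for name, sqltype in columns)
query = f"CREATE TABLE IF NOT EXISTS `MyTable`\n(\n{defs}\n)"
print(query)
```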
Database Engine
By default, the table is created with the database's default engine. If a database engine (ENGINE=) should be forced for the table, the --engine ENGINE option can be used. Similarly, the row format (ROW_FORMAT=) can be defined using --row-format FORMAT. For example:
--engine InnoDB --row-format=COMPRESSED
Primary Keys
For convenience, a primary key can be defined by naming the columns which belong to the primary key. To choose these columns, use --primary, for example:
--primary=column1 --primary=column2
Note that the combination of these columns must be unique.
Duplicate Entries
Ignore
Usually, the INSERT query would fail if another row with the same primary key already exists. This can be avoided by adding the IGNORE keyword to the INSERT query with --ignore-errors, which essentially ignores all errors and turns them into warnings which are printed after the query succeeded.
Update
Another option is to update duplicate entries, using the ON DUPLICATE KEY UPDATE directive. Giving --duplicate, you can specify what should be updated in case of a duplicate key. To keep the row untouched, you can just update the primary key with the identical primary key, e.g.
--duplicate="MyPrimary=VALUES(MyPrimary)"
The --duplicate resource can be specified more than once to add more expressions to the assignment_list. For more details, see the MySQL manual.
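As an illustration of the two strategies, the resulting query shapes can be sketched as plain strings (the table and column names are made up):

```python
# Hypothetical INSERT statement; --ignore-errors turns it into an
# INSERT IGNORE, --duplicate appends an ON DUPLICATE KEY UPDATE clause.
base = "INSERT INTO `MyTable` (`FileId`, `EvtNumber`) VALUES (1, 2)"

ignore = base.replace("INSERT", "INSERT IGNORE", 1)
update = base + " ON DUPLICATE KEY UPDATE FileId=VALUES(FileId)"

print(ignore)
print(update)
```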
Constant value columns
Add a constant value
Sometimes it might be necessary to insert a constant value for each file which is processed (for example a file ID which is not part of the stored data). This can be achieved with the --fixed.COLUMN=VALUE directive, where COLUMN is the column name and VALUE is the value filled into the column. By default, the column is created as INT UNSIGNED, but this can be overwritten as usual with the --sql-type directive.
A special case is when VALUE has the form /regex/[format]/. In this case the input file name is transformed, according to the given regular expression, into the format using boost::regex_replace. Replacement uses the standard formatting rules to replace matches (those used by ECMAScript's replace method). If the format is omitted, the default is $0, which corresponds to just taking the match. For example,
--const.mycolumn=/_([0-9]+)_/$1/
would match the _12345_ in the filename file_12345_name.ext. It would be replaced by the first sub-sequence ($1), which is '12345' ($0 would be the whole match _12345_).
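The extraction can be sketched in Python, where \1 corresponds to boost's $1 format; the file name here is just the example from above:

```python
import re

# --const.mycolumn=/_([0-9]+)_/$1/ applied to the input file name:
# the first match of the regex is formatted with the group reference.
filename = "file_12345_name.ext"
match = re.search(r"_([0-9]+)_", filename)

value = match.expand(r"\1")  # boost's $1; the default $0 would be match.group(0)
print(value)                 # 12345
```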
Conditional execution
Based on the constant values, conditional execution can be enabled with --conditional. After dropping/creating the table, a SELECT query checks whether at least one entry matching the constant value columns exists. If at least one exists, further execution is stopped.
Pre-deleting entries
Usually the previously defined constant values are helpful to create an index which unambiguously relates the inserted data to the file. It might be useful to delete all data which belongs to this particular file before new data is entered. This can be achieved with the --delete directive: before inserting new data, it deletes all data from the table which fulfills the condition defined by the --const directives.
Corresponding Index
To do the pre-deletion efficiently, it makes sense to have an INDEX created for fast access. This can be achieved with the --index directive, which adds the corresponding statement to the table creation (CREATE TABLE).
Splitting
Sometimes it is necessary to split the data into several ROOT trees or ASCII files, for example to produce a test and a training sample. For this, two options exist: --split-sequence (shortcut -S) and --split-quantile (shortcut -Q).
The first defines a fixed sequence; for example, -S 2 -S 1 -S 4 will write the first two events (2) to the first tree/file, the third event (1) to the second tree/file and the fourth to seventh events (4) to the third tree/file. To split even and odd events into two trees/files, use -S 1 -S 1.
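This bookkeeping can be sketched in Python, under the assumption that the sequence repeats cyclically once it is exhausted (the row index is 0-based here):

```python
# -S 2 -S 1 -S 4: two events to file 0, one to file 1, four to file 2.
sequence = [2, 1, 4]

def target(row):
    # Assumption: after sum(sequence) events the pattern starts over.
    pos = row % sum(sequence)
    for i, n in enumerate(sequence):
        if pos < n:
            return i
        pos -= n

print([target(r) for r in range(7)])  # [0, 0, 1, 2, 2, 2, 2]
```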
To randomly split the data, use quantiles. For example, -Q 0.5 splits the data equally into two samples, while -Q 0.2 -Q 0.5 -Q 0.9 splits the data into four samples containing 20%, 30%, 40% and 10% of the events, respectively.
Note that splitting is defined on the rows which are received(!) from the database, i.e. before rows with NULL entries are excluded.
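The quantile logic can be sketched as follows; drawing a uniform random number per row is an assumption about how the tool decides, used here only to illustrate the resulting sample sizes:

```python
import random

quantiles = [0.2, 0.5, 0.9]  # -Q 0.2 -Q 0.5 -Q 0.9

def target(u):
    # u is a uniform random number in [0, 1)
    for i, q in enumerate(quantiles):
        if u < q:
            return i
    return len(quantiles)    # the remaining 10% go to the last sample

rng = random.Random(1)
counts = [0] * (len(quantiles) + 1)
for _ in range(10000):
    counts[target(rng.random())] += 1
print(counts)  # roughly [2000, 3000, 4000, 1000]
```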
Debugging
For debugging purposes, or to just create or drop a table, the final insert query can be skipped using --no-insert. All operations which would alter anything in the database are skipped with --dry-run (note that this might cause subsequent errors). Additional information about the connection is printed using --print-connection. Note that this implies additional queries and more traffic and is therefore slightly slower.
Using a higher verbosity level (-v), an overview of the written columns or of all processed leaves is printed, depending on the verbosity level. The output looks like the following:
Leaf name [root data type] (SQL name)
for example
MTime.fTime.fMilliSec [Long64_t] (MilliSec)
which means that the leaf MTime.fTime.fMilliSec is detected to be a Long64_t and is filled into a column called MilliSec. Leaves with non-basic data types are ignored automatically and marked as (-n/a-). User-ignored columns are marked as (-ignored-).
The complete INSERT query can be printed with --print-insert. Note that this can be a quite long output.
For test purposes, the number of events and the events which are processed can be altered with --first and --max.
Performance
Note that for performance reasons, all data is collected in memory and a single INSERT query is issued at the end.
In case of failure
If a query fails, it is printed to stderr together with the error message. For the main INSERT query, this is only done if the verbosity level is at least 2 or the query has less than 80*25 bytes.
Returns
In case of success, 0 is returned; otherwise a value greater than 0.
fits2sql
The fits2sql executable works mostly identically to the root2sql executable, except that the ROOT tree here is a so-called FITS extension (table). Therefore, the --tree option is called --extension.
For debugging, --print-extension shows the columns of a FITS extension and --print-columns a list of all columns.
By default, integer columns in FITS files contain signed values. To convert them to unsigned columns, the --unsigned option can be given, followed by a column name. It can be given more than once to apply this to more than one column.
Example
For example, my resource file to convert ganymed files to SQL tables looks like this:
# Database uri and credentials
#uri=...

# Regular expression to match columns which shall be ignored
ignore=ThetaSquared\..*
ignore=Ghostbuster\..*
ignore=MHillasSrc\..*
ignore=MHillasSrcAnti\..*
ignore=MSrcPosCam\..*
ignore=MSrcPosAnti\..*
ignore=DataType\..*
ignore=.*\.fUniqueID
ignore=.*\.fBits
ignore=MPointingPos\.fHa

# Regular expressions to simplify column names in table
map=\.fVal/
map=MHillas\.f/
map=MHillasExt\.f/
map=MPointingPos\.f/
map=MImagePar\.f/
map=MNewImagePar\.f/
map=MTime\.fTime\.f/
map=MTime\.f/

# Overwrites the SQL data type for some columns
sql-type=FileId/INT UNSIGNED
sql-type=EvtNumber/INT UNSIGNED

# Columns to be set up as primary keys
primary=FileId
primary=EvtNumber