wiki:DatabaseBasedAnalysis

Version 52 (modified by tbretz, 6 years ago) ( diff )

--

Purpose

The idea is to write all image parameters for data and Monte Carlo files into a database, so that the access to the data becomes more easy. To some extend (also not necessarily very efficient) the database server can already do the analysis. Everybody is free to use the programming language or tool of their choice to do the analysis. Writing the data to be analysed into a root file has the advantage that no additional software installation other than root itself is required. However, be aware that the files can easily become large.

Status

Most of the data is in the database. Due to some technical reasons still about 1/5th of the data is missing. Most of it from 2016 (seems 2016 the data was more noisy). No automatic process is currently updating the tables and no process is replicating the existing data from La Palma.

Basics

Although this page will give some basic examples how to implement an analysis, this is neither an analysis tutorial nor a reference guide to SQL or the required tools. For this, please refer to their manuals. Here are a few examples:

Connecting to the Database

DatabaseBasedAnalysis/Connection

Tools to access the database

mysql

The mysql client is the classical way to connect to the database and get data. Data is retrieved from a MySQl database always as pure ascii. mysql can print the ascii table to the console either as a nicely formatted text or as a machine readable ascii table.

rootifysql

A convenient way to retrieve data is the rootifysql tool which is part of the FACT++ package. As the name suggests, it writes the data into root files (but can also write the data into ascii files as the mysql-client). More details can be found either calling it with the --help option or at https://www.fact-project.org/logbook/showthread.php?tid=4192.

To build only the FACT++ tools (fitsdump, root2sql, fits2sql, rootifysql) you can use cmake -DTOOLS_ONLY=1. You can create a debian and package and a tar file then with make package. You will certainly need the development packages of curses, libnova, libmysql-client, libmysql++ and libopenssl. In addition, you need help2man and groff.

Some more detailed instructions can be found at DatabaseBasedAnalysis/rootifysql.

Other alternatives

Many possibilities exist to access a mysql database as a C API, MySQL++, Python (MySQL.Connector), PHP and others. You are free to use whatever tool you like. In the following, an analysis will be outlined using the rootifysql client and because it is most convenient.

PhpMyAdmin

To get a fast glimpse on the accessible databases and the contents of the tables, you can log-in to PhpMyAdmin at http://iph-pc45.ethz.ch/phpmyadmin

The Analyis

Data Selection

For data selection only run-wise information should be relevant which are stored in the table RunInfo. The reason is that if you select data on a more fine grained level (e.g. event-wise zenith angle), right now there is no easy method to determine the corresponding observation time. So whenever data is selected event-wise make sure that you do not cut the data in a variable which cuts out events systematically and not randomly. For example, an event-wise cut in zenith angel usually keeps or discards two consecutive events because their zenith angle is correlated. For a cut in any image parameter (Width, Length, Size, ...), the result on two consecutive events is random because their image parameters are not correlated.

As an example we analyse the Crab data from our public sample (01/11/2013 - 06/11/2013).

Let's first have a look at the total observation time of all Crab data in this period:

SELECT 
   COUNT(*), 
   SUM(TIME_TO_SEC(TIMEDIFF(fRunStop,fRunStart))*fEffectiveOn/3600) AS EffOnTime, 
   MIN(fZenithDistanceMin) AS MinZd, 
   MAX(fZenithDistanceMax) AS MaxZd,
   MIN(fR750Cor/fR750Ref)  AS MinQ,
   MAX(fR750Cor/fR750Ref)  AS MaxQ
FROM 
   RunInfo 
WHERE 
   fSourceKey=5 
AND 
   fRunTypeKey=1 
AND 
   FileId BETWEEN 131101000 AND 131107000

The result using mysql is

+----------+-------------+-------+-------+---------+---------+
| COUNT(*) | EffOnTime   | MinZd | MaxZd | MinQ    | MaxQ    |
+----------+-------------+-------+-------+---------+---------+
|      435 | 32.53992354 |  6.36 | 67.89 | 0.01477 | 1.10584 |
+----------+-------------+-------+-------+---------+---------+
1 row in set (0.01 sec)

So we have 435 data runs from Crab with a total effective observation time of 32.5 hours in a zenith angle range between 6° and 68° and a bad weather factor between 0.01 (really bad) to 1.1 (extremely good).

Taking only good data by adding "AND fR750Cor/fR750Ref>0.9" to the WHERE-clause gives us

+----------+-------------+-------+-------+---------+---------+
| COUNT(*) | EffOnTime   | ZdMin | ZdMax | MinQ    | MaxQ    |
+----------+-------------+-------+-------+---------+---------+
|      328 | 24.58955887 |  6.36 | 67.86 | 0.90084 | 1.10584 |
+----------+-------------+-------+-------+---------+---------+
1 row in set (0.00 sec)

But we also want to restrict ourselves to "good" zenith angles (zenith angles at which there is no significant efficiency loss). So we add "AND fZenithDistanceMax<35" to the WHERE-clause which yields

+----------+-------------+-------+-------+---------+---------+
| COUNT(*) | EffOnTime   | ZdMin | ZdMax | MinQ    | MaxQ    |
+----------+-------------+-------+-------+---------+---------+
|      244 | 19.06608557 |  6.36 | 34.90 | 0.90084 | 1.10584 |
+----------+-------------+-------+-------+---------+---------+
1 row in set (0.00 sec)

Now we need to get a list of these runs with

SELECT 
   FileId
FROM 
   RunInfo 
WHERE 
   fSourceKey=5 
AND 
   fRunTypeKey=1 
AND 
   FileId BETWEEN 131101000 AND 131107000
AND 
   fR750Ref/fR750Cor>0.9
AND 
   fZenithDistanceMax<35

This can later be JOINed with the following queries.

Let's write the list of runs into a file. There are plenty of options. Here are a few (assuming the query is in a file query.sql)

mysql -N [...] < query.sql > Crab.txt
cat query.sql | mysql -N [...] > Crab.txt
rootifysql [...] -n -v0 -d query.sql > Crab.txt
rootifysql [...] -n -w Crab.txt query.sql

The [...] is placeholder for additional options, in particular the login credentials (ideally they are kept in a file which can not be read by everyone).

An alternative is to put

#!/path/to/rootifysql --config=/path/to/resources.rc

in the first line of your query.sql. Make it executable "chmod u+x query.sql" and put your credentials (uri=) into resources.rc. Now you can call it directly

./query.sql -n -w Crab.txt

Data statistics

Now we want to get some statistics about the Crab data between 01/11/2013 and 06/11/2013 and see if we can do simple plots. For this we write the table into a root file.

rootifysql [...] --query " \
   SELECT               \
      *                 \
   FROM                 \
      RunInfo           \
   WHERE                \
      fSourceKey=5      \
   AND                  \
      fRunTypeKey=1     \
   AND                  \
      FileId BETWEEN 131101000 AND 131107000 \
   AND                  \
      fR750Ref/fR750Cor>0.9 \
   AND                  \
      fZenithDistanceMax<35 \
"

If the file already exists, either give it a different name (see --help for details) or overwrite it with --force.

The output should look similar to this

Reading global  options from 'fact++.rc'.
Reading default options from 'rootifysql.rc'.

------------------------ Rootify SQL -------------------------
Connecting to database...
Client Version: 5.7.23
Server Version: 5.7.23-0ubuntu0.18.04.1
Requesting data...
Opening file 'rootify.root' [compression=1]...
Trying to setup 120 branches...
Configured 115 branches.
Filling branches...
317 rows fetched.
317 rows skipped due to NULL field.
0 rows filled into tree.
10 kB written to disk.
File closed.
Execution time: 0.0537751s (169.6 us/row)
--------------------------------------------------------------

Per default rows which contain any NULL are not written to the file because all values are converted to a DOUBLE and there is no representation for a NULL-value in double. So, we need to force the output with --ignore-null and will get something like:

Reading global  options from 'fact++.rc'.
Reading default options from 'rootifysql.rc'.

------------------------ Rootify SQL -------------------------
Connecting to database...
Client Version: 5.7.23
Server Version: 5.7.23-0ubuntu0.18.04.1
Requesting data...
Opening file 'rootify.root' [compression=1]...
Trying to setup 120 branches...
Configured 115 branches.
Filling branches...
317 rows fetched.
317 rows filled into tree.
86 kB written to disk.
File closed.
Execution time: 0.072247s (227.9 us/row)
--------------------------------------------------------------

Now we can open the file in root and do plots. The easiest ist to use the tree viewer:

root rootify.root
root [0] 
Attaching file rootify.root as _file0...
root [1] TTree *T = (TTree*)_file0->Get("Result");
root [2] T->StartViewer();
root [3]

Or plot the zenith angle distribution directly:

root rootify.root
root [0] 
Attaching file rootify.root as _file0...
root [1] TTree *T = (TTree*)_file0->Get("Result");
root [2] T->Draw("fZenithDistanceMean");
root [3]

or zenith angle versus time (Hint: DATETIME columns are converted to Unix-time in seconds):

root rootify.root
root [0] 
Attaching file rootify.root as _file0...
root [1] TTree *T = (TTree*)_file0->Get("Result");
root [2] T->Draw("fZenithDistanceMean:fRunStart");
root [3]

Data retrieval

The events themselves are stored in a table named Events. The position of the source in camera coordinates is stored in Position. To get them, you can run the following query

SELECT
   Events.*,
   Position.X,
   Position.Y
FROM RunInfo
LEFT JOIN Events   USING (FileId)
LEFT JOIN Position USING (FileId, EvtNumber)
WHERE 
   fSourceKey=5 
AND 
   fRunTypeKey=1 
AND 
   FileId BETWEEN 131101000 AND 131107000
AND 
   fZenithDistanceMax<35
AND 
   fR750Ref/fR750Cor>0.9

or with the list you wrote before

SELECT
   Events.*,
   Position.X,
   Position.Y
LEFT JOIN Events   USING (FileId)
LEFT JOIN Position USING (FileId, EvtNumber)
WHERE 
   FileId IN ($MyList)

using --list.MyList=Crab.txt as command-line option to rootifysql. Both queries are similar in execution time.

Let's assume the output file is crab-data-only.root (rootifysql --out=crab-data-only.root). Requesting the data and writing the file took me about 60s.

To run an analysis on the data you can use the following root macro "ganymed.C". It produces a theta-square plot. Its execution took about 5s (root ganymed.C++)

#include <iostream>

#include <TMath.h>
#include <TH1.h>
#include <TChain.h>
#include <TStopwatch.h>

void ganymed()
{
    // Create chain for the tree Result
    // This is just easier than using TFile/TTree
    TChain c("Result");

    // Add the input file to the
    c.AddFile("crab-data-only.root");

    // Define variables for all leaves to be accessed
    // By definition rootifysql writes only doubles
    double X, Y, MeanX, MeanY, Width, Length, CosDelta, SinDelta,
        M3Long, SlopeLong, Leakage1, SlopeSpreadWeighted, Size,
        ConcCore, ConcCOG, NumIslands, NumUsedPixels;

    // Connect the variables to the cordesponding leaves
    c.SetBranchAddress("X", &X);
    c.SetBranchAddress("Y", &Y);
    c.SetBranchAddress("MeanX", &MeanX);
    c.SetBranchAddress("MeanY", &MeanY);
    c.SetBranchAddress("Width", &Width);
    c.SetBranchAddress("Length", &Length);
    c.SetBranchAddress("CosDelta", &CosDelta);
    c.SetBranchAddress("SinDelta", &SinDelta);
    c.SetBranchAddress("M3Long", &M3Long);
    c.SetBranchAddress("SlopeLong", &SlopeLong);
    c.SetBranchAddress("Leakage1", &Leakage1);
    c.SetBranchAddress("NumIslands", &NumIslands);
    c.SetBranchAddress("NumUsedPixels", &NumUsedPixels);
    c.SetBranchAddress("SlopeSpreadWeighted", &SlopeSpreadWeighted);
    c.SetBranchAddress("Size", &Size);
    c.SetBranchAddress("ConcCOG", &ConcCOG);
    c.SetBranchAddress("ConcCore", &ConcCore);

    // Set some constants (they could be included in the database
    // in the future)
    double mm2deg = +0.0117193246260285378;
    double abberation = 1.02;

    // -------------------- Source dependent parameter calculation -------------------

    // Create a histogram for on- and off-data
    TH1F hon("on",   "", 55, 0, 1);
    TH1F hoff("off", "", 55, 0, 1);

    // Loop over all events
    TStopwatch clock;
    for (int i=0; i<c.GetEntries(); i++)
    {
        // read the i-th event from the file
        c.GetEntry(i);

        // First calculate all cuts to speedup the analysis
        double area = TMath::Pi()*Width*Length;

        bool cutq = NumIslands<3.5 && NumUsedPixels>5.5 && Leakage1<0.1;

        bool cut0 = log10(area)<log10(Size)*898-1535;

        if (!cutq || !cut0)
            continue;

        // Loop over all wobble positions in the camera
        for (int angle=0; angle<360; angle+=60)
        {
            // -------------------- Source dependent parameter calculation -------------------

            double cr = cos(angle*TMath::DegToRad());
            double sr = sin(angle*TMath::DegToRad());

            double px = cr*X-sr*Y;
            double py = cr*Y+sr*X;

            double dx = MeanX - px/abberation;
            double dy = MeanY - py/abberation;

            double dist = sqrt(dx*dx + dy*dy);

            double alpha = asin((CosDelta*dy - SinDelta*dx)/dist);
            double sgn   = TMath::Sign(1., (CosDelta*dx + SinDelta*dy)/dist);

            // ------------------------------- Application ----------------------------------

            double m3l   = M3Long*sgn*mm2deg;
            double slope = SlopeLong*sgn/mm2deg;

            // --------------------------------- Analysis -----------------------------------

            double xi = 1.39252 + 0.154247*slope + 1.67972 *(1-1/(1+4.86232*Leakage1));

            double sign1 = m3l+0.07;
            double sign2 = (dist*mm2deg-0.5)*7.2-slope;

            double disp  = (sign1<0 || sign2<0 ? -xi : xi)*(1-Width/Length)/mm2deg;

            double thetasq = (disp*disp + dist*dist - 2*disp*dist*cos(alpha))*mm2deg*mm2deg;

            // Fill the on- and off-histograms
            if (angle==0)
                hon.Fill(thetasq);
            else
                hoff.Fill(thetasq, 1./5);
        }
    }
    clock.Print();

    // Plot the result
    hon.SetMinimum(0);
    hon.DrawCopy();
    hoff.DrawCopy("same");
}

You can of course include all the calculations into your query already

SELECT
   Events.*,
   Angle,
   weight,
   @PX      := cosa*X - sina*Y,
   @PY      := cosa*Y + sina*X,
   @DX      := MeanX-@PX/1.02,
   @DY      := MeanY-@PY/1.02,
   @Norm    := SQRT(@DX*@DX + @DY*@DY),
   @Dist    := @Norm*0.0117193246260285378 AS Dist,
   PI()*Width*Length*0.0117193246260285378*0.0117193246260285378 AS Area,
   @LX      := TRUNCATE((CosDelta*@DY - SinDelta*@DX)/@Norm, 6),
   @LY      := TRUNCATE((CosDelta*@DX + SinDelta*@DY)/@Norm, 6),
   @Alpha   := ASIN(@LX) AS Alpha,
   @Sign    := SIGN(@LY) AS Sign,
   @M3L     := M3Long*@Sign*0.0117193246260285378,
   @Slope   := SlopeLong*@Sign/0.0117193246260285378 AS Slope,
   @Xi      := 1.39 + 0.154*@Slope + 1.679*(1-1/(1+4.86*Leakage1)),
   @Sign1   := @M3L+0.07,
   @Sign2   := (@Dist-0.5)*7.2-@Slope,
   @Disp    := IF (SIGN(@Sign1)<0 || SIGN(@Sign2)<0, -@Xi, @Xi) * (1-Width/Length),
   @ThetaSq := (@Disp*@Disp + @Dist*@Dist - 2*@Disp*@Dist*SQRT(1-@LX*@LX)) AS ThetaSq
FROM RunInfo
LEFT JOIN Events   USING (FileId)
LEFT JOIN Position USING (FileId, EvtNumber)
CROSS JOIN
(
   SELECT   0 AS Angle UNION ALL
   SELECT  60 AS Angle UNION ALL
   SELECT 120 AS Angle UNION ALL
   SELECT 180 AS Angle UNION ALL
   SELECT 240 AS Angle UNION ALL
   SELECT 300 AS Angle
) Wobble
WHERE 
   fSourceKey=5 
AND 
   fRunTypeKey=1 
AND 
   FileId BETWEEN 131101000 AND 131107000
AND 
   fZenithDistanceMax<35
AND 
   fR750Ref/fR750Cor>0.9

Or you use the existing table for the standard 60° Wobble positions and do just CROSS JOIN Wobble.

This will give you all you need in crab.root (rootifysql --out=crab.root), but significantly increases computing time and the output file will be about six times larger.

A simple macro just applying all the cuts would then be enough to do a theta-square plot

void ganymed3()
{
    // Create chain for the tree Result
    // This is just easier than using TFile/TTree
    TChain c("Result");

    // Add the input file to the
    c.AddFile("crab.root");

    // Set some constants (they could be included in the database
    // in the future)
    c.SetAlias("mm2deg", "+0.0117193246260285378");

    // Define the cuts
    c.SetAlias("CutQ", "NumIslands<3.5 && NumUsedPixels>5.5 && Leakage1<0.1");

    c.SetAlias("Cut0", "log10(Area)<log10(Size)*898-1535");

    // Do one plot for each wobble position
    c.Draw("ThetaSq", "(ThetaSq<1 && CutQ && Cut0 && Angle==0)*( weight)");
    c.Draw("ThetaSq", "(ThetaSq<1 && CutQ && Cut0 && Angle!=0)*(-weight)", "same");
}

Combining everything into a single query is a bit tricky but works:

SELECT
    Counter.*,
    `Signal` - `Background`/5      AS `Excess`,
    LiMa(`Signal`, `Background`/5) AS `Significance`
FROM
(

    SELECT
        @S := COUNT(IF(Weight>0, 1, NULL)) AS `Signal`,
        @B := COUNT(IF(Weight<0, 1, NULL)) AS `Background`
    FROM
    (

        SELECT
            Size,
            NumUsedPixels,
            NumIslands,
            Leakage1,
            Weight,
            PI()*Width*Length AS Area,
            @PX      := cosa*X - sina*Y,
            @PY      := cosa*Y + sina*X,
            @DX      := MeanX-@PX/1.02,
            @DY      := MeanY-@PY/1.02,
            @Norm    := SQRT(@DX*@DX + @DY*@DY),
            @Dist    := @Norm*0.0117193246260285378 AS Dist,
            @LX      := TRUNCATE((CosDelta*@DY - SinDelta*@DX)/@Norm, 6),
            @LY      := TRUNCATE((CosDelta*@DX + SinDelta*@DY)/@Norm, 6),
            @Alpha   := ASIN(@LX) AS Alpha,
            @Sign    := SIGN(@LY) AS Sign,
            @M3L     := M3Long*@Sign*0.0117193246260285378,
            @Slope   := SlopeLong*@Sign/0.0117193246260285378 AS Slope,
            @Xi      := 1.39252 + 0.154247*@Slope + 1.67972*(1-1/(1+4.86232*Leakage1)),
            @Sign1   := @M3L+0.07,
            @Sign2   := (@Dist-0.5)*7.2-@Slope,
            @Disp    := IF (SIGN(@Sign1)<0 || SIGN(@Sign2)<0, -@Xi, @Xi) * (1-Width/Length),
            @ThetaSq := (@Disp*@Disp + @Dist*@Dist - 2*@Disp*@Dist*SQRT(1-@LX*@LX)) AS ThetaSq
        FROM RunInfo
        LEFT JOIN Events   USING (FileId)
        LEFT JOIN Position USING (FileId, EvtNumber)
        CROSS JOIN Wobble
        WHERE
            fSourceKey=5
        AND
            fRunTypeKey=1
        AND
            FileId BETWEEN 131101000 AND 131107000
        AND
            fZenithDistanceMax<35
        AND
             fR750Cor>0.9*fR750Ref
    ) TableAlias

    WHERE
        ThetaSq<0.024
    AND
        Area < LOG10(Size)*898-1535
    AND
        NumUsedPixels>5.5
    AND
        NumIslands<3.5
    AND
        Leakage1<0.1

) Counter

The output in mysql looks something like:

+--------+------------+----------+-------------------+
| Signal | Background | Excess   | Significance      |
+--------+------------+----------+-------------------+
|    984 |       2205 | 543.0000 | 19.72239008502298 |
+--------+------------+----------+-------------------+
1 row in set (2 min 5.48 sec)

Or like this if you finish the query with "\G" instead of a semicolon:

*************************** 1. row ***************************
      Signal: 984
  Background: 2205
      Excess: 543.0000
Significance: 19.72239008502298
1 row in set (2 min 0.02 sec)

I am sure there is also a query which in addition prints the effective on-time.

Note: See TracWiki for help on using the wiki.