Changeset 19796


Timestamp: 10/26/19 12:22:24 (5 years ago)
Author: tbretz
Message: Updated the help text and improved the handling of the update option (it cannot be combined with the split rules if the tree already exists); made the tree the third positional argument.
File: 1 edited

  • trunk/FACT++/src/csv2root.cc

    r19793 r19796  
    3131        ("tree,t",         var<string>("Events"),     "Name of the root tree to convert")
    3232        ("compression,c",  var<uint16_t>(1),          "zlib compression level for the root file")
    33         ("no-header",      po_switch(),               "Use if the first line contains no header")
     33        ("no-header,n",    po_switch(),               "Use if the first line contains no header")
    3434        ("dry-run",        po_switch(),               "Do not create or manipulate any output file")
    3535        ;
     
    5050    p.add("file", 1); // All positional options
    5151    p.add("out",  1); // All positional options
     52    p.add("tree", 1); // All positional options
    5253
    5354    conf.AddOptions(control);
     
    6061{
    6162    cout <<
    62         "csv2root - Reads data from a root tree and writes a csv file\n"
     63        "csv2root - Converts a data table from a csv file to a root tree\n"
    6364        "\n"
    6465        "For convenience, this documentation uses the extended version of the options, "
    6566        "refer to the output below to get the abbreviations.\n"
    6667        "\n"
    67         "This is a general purpose tool to fill the contents of a root file into a database "
    68         "as long as this is technically possible and makes sense. Note that root can even "
    69         "write complex data like a TH1F into a database, this is not the purpose of this "
    70         "program.\n"
    71         "\n"
    72         "Each root tree has branches and leaves (the basic data types). These leaves can "
    73         "be read independently of the classes which were used to write the root file. "
    74         "The default tree to read from is 'Events' but the name can be overwritten "
    75         "using --tree. The default table name to fill the data into is identical to "
    76         "the tree name. It can be overwritten using --table.\n"
    77         "\n"
    78         "To get a list of the contents (keys and trees) of a root file, you can use --print-ls. "
    79         "The name of each column to which data is filled from a leave is obtained from "
    80         "the leaves' names. The leave names can be checked using --print-leaves. "
    81         "A --print-branches exists for convenience to print only the high-level branches. "
    82         "Sometimes these names might be quite unconvenient like MTime.fTime.fMilliSec or "
    83         "just MHillas.fWidth. To allow to simplify column names, regular expressions "
    84         "(using boost's regex) can be defined to change the names. Note that these regular "
    85         "expressions are applied one by one on each leaf's name. A valid expression could "
    86         "be:\n"
    87         "   --map=MHillas\\.f/\n"
    88         "which would remove all occurances of 'MHillas.f'. This option can be used more than "
    89         "once. They are applied in sequence. A single match does not stop the sequence.\n"
    90         "\n"
    91         "Sometimes it might also be convenient to skip a leaf. This can be done with "
    92         "the --ignore resource. If the given regular expresion yields a match, the "
    93         "leaf will be ignored. Note that the regular expression works on the raw-name "
    94         "of the leaf not the readily mapped SQL column names. Example:\n"
    95         "   --ignore=ThetaSq\\..*\n"
    96         "will skip all leaved which start with 'ThetaSq.'. This option can be used"
    97         "more than once.\n"
    98         "\n"
    99         "The data type of each column is kept as close as possible to the leaves' data "
    100         "types. If for some reason this is not wanted, the data type of the SQL column "
    101         "can be overwritten with --sql-type sql-column/sql-ytpe, for example:\n"
    102         "   --sql-type=FileId/UNSIGNED INT\n"
    103         "while the first argument of the name of the SQL column to which the data type "
    104         "should be applied. The second column is the basic SQL data type. The option can "
    105         "be given more than once.\n"
    106         "\n"
    107         "Database interaction:\n"
    108         "\n"
    109         "To drop an existing table, --drop can be used.\n"
    110         "\n"
    111         "To create a table according to theSQL  column names and data types, --create "
    112         "can be used. The query used can be printed with --print-create even --create "
    113         "has not been specified.\n"
    114         "\n"
    115         "To choose the columns which should become primary keys, use --primary, "
    116         "for example:\n"
    117         "   --primary=col1\n"
    118         "To define more than one column as primary key, the option can be given more than "
    119         "once. Note that the combination of these columns must be unique.\n"
    120         "\n"
    121         "All columns are created as NOT NULL as default. To force a database engine "
    122         "and/or a storage format, use --engine and --row-format.\n"
    123         "\n"
    124         "Usually, the INSERT query would fail if the PRIMARY key exists already. "
    125         "This can be avoided using the 'ON DUPLICATE KEY UPDATE' directive. With the "
    126         "--duplicate, you can specify what should be updated in case of a duplicate key. "
    127         "To keep the row untouched, you can just update the primary key "
    128         "with the identical primary key, e.g. --duplicate='MyPrimary=VALUES(MyPrimary)'. "
    129         "The --duplicate resource can be specified more than once to add more expressions "
    130         "to the assignment_list. For more details, see the MySQL manual.\n"
    131         "\n"
    132         "For debugging purpose, or to just create or drop a table, the final insert "
    133         "query can be skipped using --no-insert. Note that for performance reason, "
    134         "all data is collected in memory and a single INSERT query is issued at the "
    135         "end.\n"
    136         "\n"
    137         "Another possibility is to add the IGNORE keyword to the INSERT query by "
    138         "--ignore-errors, which essentially ignores all errors and turns them into "
    139         "warnings which are printed after the query succeeded.\n"
    140         "\n"
    141         "Using a higher verbosity level (-v), an overview of the written columns or all "
    142         "processed leaves is printed depending on the verbosity level. The output looks "
    143         "like the following\n"
    144         "   Leaf name [root data type] (SQL name)\n"
    145         "for example\n"
    146         "   MTime.fTime.fMilliSec [Long64_t] (MilliSec)\n"
    147         "which means that the leaf MTime.fTime.fMilliSec is detected to be a Long64_t "
    148         "which is filled into a column called MilliSec. Leaves with non basic data types "
    149         "are ignored automatically and are marked as (-n/a-). User ignored columns "
    150         "are marked as (-ignored-).\n"
    151         "\n"
    152         "A constant value for the given file can be inserted by using the --const directive. "
    153         "For example --const.mycolumn=42 would insert 42 into a column called mycolumn. "
    154         "The column is created as INT UNSIGNED as default which can be altered by "
    155         "--sql-type. A special case is a value of the form `/regex/format/`. Here, the given "
    156         "regular expression is applied to the filename and it is newly formated with "
    157         "the new format string. Uses the standard formatting rules to replace matches "
    158         "(those used by ECMAScript's replace method).\n"
    159         "\n"
    160         "Usually the previously defined constant values are helpful to create an index "
    161         "which relates unambiguously the inserted data to the file. It might be useful "
    162         "to delete all data which belongs to this particular file before new data is "
    163         "entered. This can be achieved with the `--delete` directive. It deletes all "
    164         "data from the table before inserting new data which fulfills the condition "
    165         "defined by the `--const` directives.\n"
    166         "\n"
    167         "The constant values can also be used for a conditional execution (--conditional). "
    168         "If any row with the given constant values are found, the execution is stopped "
    169         "(note that this happend after the table drop/create but before the delete/insert.\n"
    170         "\n"
    171         "To ensure efficient access for a conditonal execution, it makes sense to have "
    172         "an index created for those columns. This can be done during table creation "
    173         "with the --index option.\n"
    174         "\n"
    175         "To create the index as a UNIQUE INDEX, you can use the --unique option which "
    176         "implies --index.\n"
    177         "\n"
    178         "If a query failed, the query is printed to stderr together with the error message. "
    179         "For the main INSERT query, this is only true if the verbosity level is at least 2 "
    180         "or the query has less than 80*25 bytes.\n"
     68        "As a default, the first row in the file is considered to contain the column "
     69        "names separated by a whitespace. Column names must not contain whitespaces "
     70        "themselves and special characters (':') are replaces by an underscore. "
     71        "If the first line contains the first data row, the --no-header directive "
     72        "can be used to instruct the program to consider the first line as the first "
     73        "data row and use it only for column count. The branch names in the tree "
     74        "are then 'colN' where N is the column index starting from 0.\n"
     75        "\n"
     76        "Each consecutive row in the file is supposed to contain an identical number "
     77        "of floating point values. Leading and trailing whitespaces are ignored. "
     78        "Empty lines or lines starting with a '#' are discarded.\n"
     79        "\n"
     80        "Input and output file are given either as first and second positional argument "
     81        "or with the --file and --out command line option. If no output file name is "
     82        "provided then the input file is used instead and the extension replaced by .root. "
     83        "The target tree name of the root file is given with the --tree command line "
     84        "option or the third positional argument. The default tree name is 'Events'.\n"
     85        "\n"
     86        "As a default, existing files are not overwritten. If overwriting is intended, "
     87        "it can be turned on with --force. To update an existing root file, the "
     88        "--update option can be used. If a tree with the same name already exists, "
     89        "the tree is updated. The compression level for a new root file can be set "
     90        "with --compression.\n"
     91        "\n"
     92        "For several purposes, it might be convenient to split the output to several "
     93        "different root-treess. This can be done using the --split-sequence (-S) "
     94        "and the --split-quantile (-Q) options. If a split sequence is defined as "
     95        "-S 1 -S 2 -S 1 the events are split by 1:2:1 in this sequence order. If "
     96        "quantiles are given as -Q 0.5 -Q 0.6, the first tree will contain 50% of "
     97        "the second one 10% and the third one 40%. The corresponding seed value can "
     98        "be set with --seed.\n"
    18199        "\n"
    182100        "In case of success, 0 is returned, a value>0 otherwise.\n"
    183101        "\n"
    184         "Usage: root2sql [options] -uri URI rootfile.root\n"
     102        "Usage: csv2root input.csv [output.root] [-t tree] [-u] [-f] [-n] [-vN] [-cN]\n"
    185103        "\n"
    186104        ;
     
    200118
    201119
    202 void AddTree(vector<TTree*> &ttree, TFile &file, const string &tree, bool update, int verbose)
     120bool AddTree(vector<TTree*> &ttree, TFile &file, const string &tree, bool update, int verbose)
    203121{
     122    bool found = false;
     123
    204124    TTree *T = 0;
    205125    if (update)
     
    209129        {
    210130            ttree.emplace_back(T);
     131            found = true;
    211132            if (verbose>0)
    212133                cout << "Updating tree: " << tree << endl;
     
    215136    if (!T)
    216137        ttree.emplace_back(new TTree(tree.c_str(), "csv2root"));
     138
     139    return found;
    217140}
    218141
     
    237160    const bool force             = conf.Get<bool>("force");
    238161    const bool update            = conf.Get<bool>("update");
    239     const bool dryrun            = conf.Get<bool>("dry-run");
     162//    const bool dryrun            = conf.Get<bool>("dry-run");
    240163    const bool noheader          = conf.Get<bool>("no-header");
    241164
     
    319242    }
    320243
     244    buf = buf.Strip(TString::kBoth);
    321245    TObjArray *title = buf.Tokenize(" ");
    322246    if (title->GetEntries()==0)
     
    335259
    336260    if (noheader)
     261    {
    337262        fin.seekg(0);
     263        if (verbose>0)
     264            cout << "No header line interpreted." << endl;
     265    }
    338266
    339267    // -------------------------------------------------------------------------
     
    342270    gSystem->ExpandPathName(path);
    343271
    344     if (!dryrun)
     272//    if (!dryrun)
    345273    {
    346274        FileStat_t stat;
     
    377305    vector<TTree*> ttree;
    378306
     307    size_t entries = 0;
    379308    if (num_split==0)
    380         AddTree(ttree, tfile, tree, update, verbose);
     309    {
     310        if (AddTree(ttree, tfile, tree, update, verbose))
     311        {
     312            entries = ttree[0]->GetEntries();
     313            if (verbose>0)
     314                cout << "Tree has " << entries << " entries." << endl;
     315        }
     316    }
    381317    else
    382318    {
     319        bool found = false;
    383320        for (size_t i=0; i<num_split; i++)
    384             AddTree(ttree, tfile, tree+"["+to_string(i)+"]", update, verbose);
    385     }
    386 
     321            found |= AddTree(ttree, tfile, tree+"["+to_string(i)+"]", update, verbose);
     322
     323        if (found && update)
     324        {
     325            cerr << "Trees can not be updated in split mode, only files!" << endl;
     326            return 7;
     327        }
     328    }
    387329
    388330    vector<float> vec(numcol);
     
    406348
    407349    size_t line = 0;
     350    size_t valid = 0;
    408351
    409352    while (1)
     
    413356            break;
    414357
     358        line++;
     359
     360        buf = buf.Strip(TString::kBoth);
     361        if (buf.IsNull() || buf[0]=='#')
     362            continue;
     363
     364        valid++;
     365
    415366        TObjArray *arr = buf.Tokenize(" ");
    416367        if (arr->GetEntries()!=numcol)
    417368        {
    418369            cerr << "Column count mismatch in line " << line+1 << "!" << endl;
    419             return 6;
     370            return 7;
    420371        }
    421372
     
    429380            {
    430381                cerr << "Conversion of '" << arr->At(i)->GetName() << "' failed!" << endl;
    431                 return 7;
     382                return 8;
    432383            }
    433384        }
     
    448399
    449400        ttree[index]->Fill();
    450         line++;
    451401    }
    452402
    453403    if (verbose>0)
    454404    {
    455         cout << line << " data rows read from file." << endl;
    456         for (size_t i=0; i<ttree.size(); i++)
    457             cout << ttree[i]->GetEntries() << " rows filled into tree #" << i << "." << endl;
    458     }
     405        cout << valid << " data rows found in " << line << " lines (excl. title)." << endl;
     406        if (!update || !entries)
     407        {
     408            for (size_t i=0; i<ttree.size(); i++)
     409                cout << ttree[i]->GetEntries() << " rows filled into tree #" << i << "." << endl;
     410        }
     411    }
     412
     413    if (entries && entries!=line)
     414        cerr << "\nWARNING - Number of updated entries does not match number of entries in tree!\n" << endl;
    459415
    460416    for (auto it=ttree.begin(); it!=ttree.end(); it++)