
Preprocessing

If you want to avoid producing and preprocessing your own samples, you can find ready-to-train h5 files at /eos/user/u/umami/training-samples/gnn. Note that training files are suggested to follow a certain directory structure, which is based on the output structure of umami preprocessing jobs:

- base_dir/
    - train_sample_1/
        # umami configuration
        - PFlow-Preprocessing-GNN_write.yaml
        - PFlow-scale_dict.json
        - GNN_Variables.yaml

        # tdd output datasets
        - source_data/
            - tdd_output_ttbar/
            - tdd_output_zprime/

        # umami hybrid samples
        - hybrids/
            - umami_test_ttbar.h5
            - umami_test_zprime.h5

        # umami preprocessed samples
        - preprocessed/
            - train.h5
            - val.h5
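
A quick way to check that a sample follows this structure is to list the contents of one of the preprocessed files. The sketch below is only an illustration, assuming the paths from the structure above; the group and dataset names inside the file depend on your umami configuration.

# Minimal sketch: list the contents of a preprocessed h5 file to confirm it
# matches the expected layout. The path follows the directory structure above;
# the names of the objects inside the file depend on the umami config used.
import h5py

with h5py.File("base_dir/train_sample_1/preprocessed/train.h5", "r") as f:
    f.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))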

Dumping Training Samples

You can create training samples using the FTAG training dataset dumper (TDD). You need to use the EMPFlowGNN.json config file to dump the additional information required to train the GNN.

r21 trainings

r21 samples have been produced in a fork of the official dumper, which is not planned to be merged upstream due to the deprecation of r21. r22 samples are supported in the official TDD r22 branch.

Note that predumped h5 samples are available here.
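
Before preprocessing, it can be worth checking that a dumped file actually contains the extra track information. The sketch below assumes the dump contains a 'jets' dataset and a track collection named 'tracks'; the actual names depend on the EMPFlowGNN.json config and the dumper version, so treat them and the file path as placeholders.

# Sanity-check sketch for a dumped h5 file (names and path are assumptions).
import h5py

with h5py.File("source_data/tdd_output_ttbar/output.h5", "r") as f:
    print(list(f.keys()))                # top-level datasets written by the dumper
    print(f["jets"].dtype.names[:10])    # first few jet variables
    print(f["tracks"].dtype.names[:10])  # first few track variables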

Preprocessing with Umami

The h5 files are processed by the umami framework to produce training files. The umami framework handles jet selection, kinematic resampling, and input normalisation. Use this preprocessing config and this variable config to create training samples for the GNN.

For more information, take a look at the umami docs.
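
Since umami scales and shifts the inputs, a quick sanity check on the preprocessed training file is to confirm that the inputs are roughly zero mean and unit variance. The dataset name 'X_train' below is an assumption and depends on the umami version and config, so adapt it to your output.

# Rough normalisation check on the preprocessed inputs (dataset name assumed).
import h5py
import numpy as np

with h5py.File("preprocessed/train.h5", "r") as f:
    x = f["X_train"][:10_000]   # a few thousand jets is enough for a quick check
print(np.mean(x, axis=0))       # should be close to 0
print(np.std(x, axis=0))        # should be close to 1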

Creating the Validation Sample

The umami preprocessing script outputs a single training file, and a validation file for each sample. For the GNN, we use a combined validation sample which follows the distribution of the training sample, making it easier to monitor the trainings. We create this small validation sample by splitting the training file output by umami. To create the separate training and validation files, use the script as follows:

cd gnn_tagger
python postprocessing/split_train_val.py -f [path to hybrid-resampled-shuffled .h5] 
The new training and validation files will be saved to the same path, with the same file name suffixed with '_train' and '_val' respectively. By default, 500k jets will be used for validation and the rest used for training. The number of validation jets can be changed either to a fraction with --f_val or to a number of jets with --n_val. All remaining jets not used for validation will be used in the training dataset. To limit the number of jets used for training, use --n_train.

By default, other characteristics such as dataset precision and compression will be copied over from the train file. To change these, use --comp [gzip, lzf, None] to override the compression format, and --precision [16, 32, 64] to set the output precision.

To change the output file name, use --out_name, which will save the training and validation files to the same directory under the new name. To save the training and validation files to a different directory, use --out_file_path. Use -d to enable debug output, which will provide more information while converting.
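
Conceptually, the split does something like the following sketch: a fixed number of jets is carved off for validation and the rest kept for training. This is only an illustration, assuming the top-level objects in the file are datasets and that the validation jets are taken from the end of the file; split_train_val.py additionally handles the output precision and compression described above.

# Illustration only of the train/val split (see assumptions above).
import h5py

n_val = 500_000
with h5py.File("hybrid-resampled-shuffled.h5", "r") as f_in, \
     h5py.File("hybrid-resampled-shuffled_train.h5", "w") as f_train, \
     h5py.File("hybrid-resampled-shuffled_val.h5", "w") as f_val:
    for name, ds in f_in.items():
        f_train.create_dataset(name, data=ds[:-n_val])   # all but the last n_val jets
        f_val.create_dataset(name, data=ds[-n_val:])     # the last n_val jets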

Generating Configs

Generate yaml config files that describe the GNN model, including paths to the training files and many options to configure the training. Some outputs from the umami preprocessing pipeline are required, so you will need to run that first for your samples.

You need a variable config file, which specifies the input shape for the network, and also a scale dict, which is written when the inputs are scaled and shifted during umami preprocessing. The path of the scale dict is also specified within umami. Once you have run the umami preprocessing yourself, you will need to edit generate_configs.py to point towards your own training samples, variable config, and scale dict.
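
As a purely hypothetical illustration, the kind of paths you point generate_configs.py at looks something like the following; the actual variable names and structure inside the script will differ, so adapt them to what you find there.

# Hypothetical example paths only; the real names in generate_configs.py differ.
train_file = "base_dir/train_sample_1/preprocessed/train.h5"
val_file = "base_dir/train_sample_1/preprocessed/val.h5"
var_config = "base_dir/train_sample_1/GNN_Variables.yaml"
scale_dict = "base_dir/train_sample_1/PFlow-scale_dict.json"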

Then, you will be ready to generate the yaml configs that are used to train.

cd gnn_tagger/training/
python generate_configs.py

The config files are saved to the configs/ dir. There are different config files for the different versions of the GNN. Each can be trained with train.py as documented.

Using new umami vs old umami

By default, generate_configs now assumes that you are using the updated umami outputs, which are not compatible with the original framework. If you are using the old format, ensure you set 'new_umami' and 'preconcat_jet_tracks' in the config to False. Also ensure your labels are correct: umami originally used 0, 1, 2 for b, c, and light jets, but this has now been switched to 0, 1, 2 for light, c, and b jets.
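
If you need to convert labels from the old convention to the new one, the remap is simply new = 2 - old, as in this small sketch (the example array here is made up):

# Old convention: 0, 1, 2 = b, c, light. New convention: 0, 1, 2 = light, c, b.
import numpy as np

old_labels = np.array([0, 1, 2, 2, 0])   # example labels in the old convention
new_labels = 2 - old_labels              # same jets in the new convention
print(new_labels)                        # [2 1 0 0 2]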

Pre-building graphs

Once the config file has been created, you can train the model by following the steps in the training documentation. To greatly improve training speed, you should prebuild the graphs using the PrepareAllGraphs.py script. Note that this requires a certain directory structure, as outlined here.

cd gnn_tagger/postprocessing
python PrepareAllGraphs.py -c [config file] 
By default, all available jets in the training file will be converted to graphs, and 500,000 jets will be converted to graphs for the validation dataset. To change these amounts, use the --n_train, --n_val, and --n_test arguments. The script saves 50,000 graphs per file by default; this can be changed with the --gpf argument.

The training config will be edited to include the paths to graphs, as well as setting the prebuilt_graphs flag to True, resulting in prebuilt graphs being the default data loading mode when training the model.

By default, graphs will be produced on only one thread, which will take a significant amount of time. More threads can be used via the --threads argument of the script. Memory usage per thread is high, so we recommend using ~10-20 threads. If more than one compute node is available, the graph preparation can be split using the --idx_range (i_lower, i_upper) flag, which results in graph files i_lower to i_upper - 1 being produced by that call. Validation and test graphs will be produced only in the script call where i_lower = 0.
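
When planning the --idx_range values, note that with --gpf graphs per file the training jets are spread over ceil(n_train / gpf) graph files, which can then be divided across the available nodes. A small sketch of this arithmetic, using the numbers from the example below:

# Plan --idx_range values for splitting graph production across nodes.
import math

n_train, gpf, n_nodes = 15_000_000, 50_000, 3
n_files = math.ceil(n_train / gpf)                   # 300 graph files in total
step = math.ceil(n_files / n_nodes)                  # 100 files per node
ranges = [(i, min(i + step, n_files)) for i in range(0, n_files, step)]
print(ranges)                                        # [(0, 100), (100, 200), (200, 300)]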

For example, the usage for 15,000,000 training graphs spread over 3 nodes would be:

python PrepareAllGraphs.py -c [config file] --threads 10 --n_train 15_000_000 --idx_range 0 100
python PrepareAllGraphs.py -c [config file] --threads 10 --n_train 15_000_000 --idx_range 100 200
python PrepareAllGraphs.py -c [config file] --threads 10 --n_train 15_000_000 --idx_range 200 300
The --debug flag can be used when calling the script to print more debug information.

If part of the PrepareAllGraphs script fails, the following can be used:

python prepare_graphs.py --config [training config] --file [file to convert] --sample [train, val, or test] -t [number of threads] --out_name [save folder]
The saved path of the graphs should then be manually added to the config file.
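
Before adding the path to the config, it can be useful to quickly check that the graph files were actually written to the save folder. The sketch below is not part of the tooling; replace the placeholder with the folder you passed to --out_name.

# List the contents of the graph output folder (placeholder path).
from pathlib import Path

graph_dir = Path("[save folder]")
print(sorted(p.name for p in graph_dir.iterdir()))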