Enabling configurable projects for ML experiments through GIN
Motivation
We often find ourselves in a highly experimental setup: the task at hand has many published approaches, and on top of that we have our own experience and gut feeling about what might help achieve better results. There are plenty of tools out there to track experiments and log hyper-parameters. But is that enough? Two questions come to mind:
- Can we log everything? E.g. what kind of augmentation do we use for a particular run - Gaussian blur? If yes, what's the kernel size?
- Even if we can, how do we ensure maximal flexibility and visibility of all that? How do we make most of the code configurable without turning the code-base into a total mess?
A desirable setup, combining a configurable project with an experiment tracking tool, would be to:
- keep all configurables in a configuration file,
- use the experiment tracking tool to track metrics, losses and maybe some hyper-parameters. But hyper-parameters are not the only configurables, right? What if I want to change the architecture of the generator network? Or what if I have changed the dataset by adding 100k images to it?
- upload the configuration file as an artifact.
This approach addresses the first concern from above - now we can literally keep all configurable aspects of the project in the configuration file, and log everything we want to track over time.
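As an illustration (not part of the original setup), the tracking side could look like this with MLflow; any tracker that supports artifacts would work the same way:

```python
# A minimal sketch, assuming MLflow as the tracking tool; the metric names
# and file path are illustrative.
import mlflow

with mlflow.start_run():
    # metrics and losses go to the tracking tool over time...
    mlflow.log_metric("g_loss", 0.42, step=1)
    mlflow.log_metric("d_loss", 0.58, step=1)
    # ...while the configuration file is uploaded as an artifact, so every
    # run carries the full description of its configurable aspects.
    mlflow.log_artifact("config.gin")
```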
Now, how do we make the configuration flexible without harming the code-base?
Let’s go over an example…
PyTorch Lightning example - vanilla
Here I would like to talk about gin-config, which does not get the attention it deserves. Maybe it is too complicated, or maybe the Google folks were not able to motivate it well enough in their “documentation”. So I decided to spend some time creating a walkthrough of a “real world example” and show where it shines. “Real world” is a bit of a stretch here, because it’s going to be… the good old MNIST example.
The basic-gan example can serve as our guinea pig for understanding the benefits of having gin-config around. Instead of keeping all the code from the example in one single file, I have split it into a project-like structure:
- run.py - entry point of the application
- data_module.py - the PyTorch Lightning DataModule, responsible for DataSet and DataLoader creation
- model.py - contains the PyTorch Lightning module and the training logic
The red parts above mark the pieces I would ideally like to change through configuration. One can argue that this is already a good setup: when someone changes something, they commit the change and keep track of it through git. Sure, and with the same luck you can try to convince me that there are dragons. In practice, in experimental mode many ideas are tried out together and the code is not committed until a “stable” version is reached (whatever “stable” means here).
So how do we do it?
Option 1: Command line arguments
Let’s try to imagine the number of arguments we would need to achieve the flexibility we want - that is, to make the red areas configurable. For the sake of demonstration I will only show what the steps might look like for making the DataModule configurable, plus some aspects of the model training. Hopefully this will be enough to demonstrate the shortcomings.
- run.py - entry point of the application
- model.py - contains the PyTorch Lightning module and the training logic
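The original snippets are not reproduced here; a minimal sketch of what run.py could look like with plain argparse (module and class names are my assumptions, not the original code):

```python
# A sketch of an argparse-based entry point; MNISTDataModule, GAN and
# get_generator are hypothetical stand-ins for the example's modules.
import argparse
import pytorch_lightning as pl

from data_module import MNISTDataModule  # hypothetical module
from model import GAN, get_generator     # hypothetical module

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--path_dataset", type=str, default="../")
    parser.add_argument("--batch_size", type=int, default=256)
    parser.add_argument("--num_workers", type=int, default=64)
    parser.add_argument("--latent_dim", type=int, default=100)
    parser.add_argument("--generator_base_depth", type=int, default=128)
    parser.add_argument("--generator_arch", type=str, default="arch1")
    args = parser.parse_args()

    data = MNISTDataModule(
        path_dataset=args.path_dataset,
        batch_size=args.batch_size,
        num_workers=args.num_workers,
    )
    # a dispatch function maps the string flag to an actual generator class
    generator = get_generator(
        args.generator_arch, args.latent_dim, args.generator_base_depth
    )
    model = GAN(generator=generator)
    pl.Trainer(max_epochs=5).fit(model, datamodule=data)

if __name__ == "__main__":
    main()
```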
Pros
- Injecting the generator and discriminator into the LightningModule. Hardcoding the generator and discriminator was a bad design choice from the very beginning, so I would not attribute this improvement purely to configurability.
- There is another option: passing a configuration object to the LightningModule and creating the generator and discriminator inside it. That is a BAD design choice, as it forces the user to rely completely on the configuration argument - to know its structure and to blow it up further whenever something additional is required.
- The obvious one: it made some aspects of the project configurable.
Now, to understand the cons, let’s look back at the contents of run.py.
Cons
- In order to run this application, the command will look like this:
python run.py --path_dataset ../ --batch_size 256 --num_workers 64 --latent_dim 100 --generator_base_depth 128 --generator_arch arch1
And this is only for configuring small aspects of the dataset/dataloader and really negligible aspects of the generator. Imagine we need to add more information about the structure of the generator, the discriminator, the loss functions, the optimizers:
- different generators have different configurable parameters
- different losses have different parameters
- different optimizers have different parameters
Handling everything will turn into a nightmare and the size of the command will blow up into a poem, resulting in confusion and high cognitive load. One can keep some arguments at their defaults, but this only adds to the confusion rather than alleviating it. Imagine going through the code and trying to figure out why a particular value was passed to a certain function.
- Looking only at the command line, there is no clear attribution of arguments to the entities they are intended for. One mitigation for this is using sub-parsers.
- Pay attention to get_generator: you will be forced to have a series of such functions for each customizable component (losses, optimizers, …). Each generator might have a different parametric structure, which results in the following (see the sketch after this list):
  - Every time one needs to try out a different generator, one has to go back and forth between get_generator and argparse to understand which key values to pass.
  - Every time one adds a new generator, one needs to accommodate the same knowledge about its configuration in BOTH argparse and get_generator.
- Having functions with the structure of get_generator is not good engineering practice and goes against the SOLID principles.
- Last but not least, it creates cognitive dissonance between the purpose of a particular argument and its implementation: an argument intended for the DataLoader might end up being used for constructing the model architecture, and there is no way of preventing that.
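For illustration, a sketch of such a dispatch function (class names, import path and parameters are hypothetical):

```python
# The kind of dispatch function the bullets above criticize.
from arch.generators.generator import Generator1, Generator2  # hypothetical

def get_generator(arch: str, latent_dim: int, base_depth: int):
    # Every new generator means another branch here AND new argparse flags.
    if arch == "arch1":
        return Generator1(latent_dim=latent_dim, base_depth=base_depth)
    elif arch == "arch2":
        # Generator2 needs a different parameter set - exactly the
        # mismatch the bullets above complain about.
        return Generator2(latent_dim=latent_dim, num_blocks=4)
    raise ValueError(f"Unknown generator architecture: {arch!r}")
```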
Option 2: Config YAMLs, TOMLs…
Let’s see what gets better and what gets worse…
- config.yaml
- run.py - entry point of the application
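The original snippets are not reproduced here; a sketch of what this could look like, assuming a hypothetical config.yaml shape and the same hypothetical modules as before:

```python
# A minimal sketch, assuming config.yaml looks something like:
#
#   data:
#     path_dataset: ../
#     batch_size: 256
#     num_workers: 64
#   generator:
#     arch: arch1
#     latent_dim: 100
#     base_depth: 128
#
import yaml
import pytorch_lightning as pl

from data_module import MNISTDataModule  # hypothetical module
from model import GAN, get_generator     # hypothetical module

def main():
    with open("config.yaml") as f:
        cfg = yaml.safe_load(f)

    data = MNISTDataModule(**cfg["data"])
    generator = get_generator(**cfg["generator"])  # same dispatch pain
    model = GAN(generator=generator)
    pl.Trainer(max_epochs=5).fit(model, datamodule=data)

if __name__ == "__main__":
    main()
```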
Pros
- A cleaner CLI - removes the confusion by adding conciseness
- The ability to naturally express nested structures
Cons
Besides the first point, nothing has really improved compared to command line arguments:
- One still has to deal with functions like get_generator and all the headache associated with them.
- The problems with cognitive load and dissonance did not go away.
- There are still a lot of if/elses in the code that creates objects based on configuration values, e.g. the optimizer-selection code sketched below.
- Additionally, the parsed configuration is now a dictionary, which means the user is free to add key/values to it or modify existing ones. I have witnessed code examples where this gets abused.
Solution: gin-configs
Now let’s see how gin-configs help with the problems above…
- config.gin
- run.py - entry point of the application
- model.py - contains the PyTorch Lightning module and the training logic
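The original snippets are not reproduced here, but a minimal sketch of the wiring might look like this (module, class, and binding names are my assumptions, not the original code):

```python
# A sketch of the gin wiring, with model.py and run.py shown together.
import gin
import pytorch_lightning as pl

# --- model.py ----------------------------------------------------------
@gin.configurable
class GAN(pl.LightningModule):
    # generator and discriminator are injected; gin builds them from the
    # bindings in config.gin instead of them being hardcoded here
    def __init__(self, generator, discriminator):
        super().__init__()
        self.generator = generator
        self.discriminator = discriminator

# --- run.py ------------------------------------------------------------
# config.gin might contain bindings such as:
#
#   MNISTDataModule.path_dataset = "../"
#   MNISTDataModule.batch_size = 256
#   MNISTDataModule.num_workers = 64
#   GAN.generator = @Generator1()
#   Generator1.latent_dim = 100
#   Generator1.base_depth = 128

def main():
    gin.parse_config_file("config.gin")  # one file holds all configurables
    data = MNISTDataModule()  # hypothetical DataModule; args come from gin
    gan = GAN()               # generator/discriminator injected by gin
    pl.Trainer(max_epochs=5).fit(gan, datamodule=data)

if __name__ == "__main__":
    main()
```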
Pros (in addition to the pros from YAML)
- A clear dependency between components, defined in the configuration file.
- The possibility to define a value as a macro and pass it to other objects’ arguments (see the sketch after this list).
- No more functions like get_generator - entities and their interactions are defined directly in the config file.
- The logic is defined in one place: from the perspective of cognitive load, one only has to look at the object definition and fill in the parameter values for object/function creation in the config.
- It is very hard to alter configuration values at runtime, unless one of them is defined as a dictionary and altered in the application afterwards.
- Minimal entry barrier: an object/function becomes gin-configurable as soon as @gin.configurable is added as a decorator.
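For instance, the macro mechanism mentioned above might be used like this (a runnable sketch with a hypothetical Generator1):

```python
import gin

@gin.configurable
class Generator1:  # hypothetical stand-in
    def __init__(self, latent_dim=64):
        self.latent_dim = latent_dim

# A macro defines a value once; any binding can reference it with %.
gin.parse_config("""
    latent_dim = 100
    Generator1.latent_dim = %latent_dim
""")

print(Generator1().latent_dim)  # -> 100
```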
Cons
- The one shortcoming I found is that one has to import all configurables, e.g. from arch.generators.generator import Generator1, Generator2, even though only one generator is used during a given run. This makes tools like pylint complain about unused imports:
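```python
# All candidate generators must be imported so gin can resolve the one
# named in the config, even though a single run uses only one of them:
from arch.generators.generator import Generator1, Generator2  # pylint: disable=unused-import
```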
Above you saw the nasty code snippet with optimizers… let’s see how we can handle this elegantly with gin.
- config.gin
- model.py - contains the PyTorch Lightning module and the training logic
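The original snippets are not reproduced here; a sketch of how scoped optimizers might be wired, assuming hypothetical names and using gin’s external_configurable for classes we don’t own:

```python
# Two Adam optimizers with different hyper-parameters via gin scopes.
import gin
import torch

# Register a class we don't own as a gin configurable.
gin.external_configurable(torch.optim.Adam, module="torch.optim")

@gin.configurable
class GAN:  # reduced to the optimizer wiring for brevity
    def __init__(self, generator_opt_cls, discriminator_opt_cls):
        # each *_opt_cls is a scoped optimizer class; calling it later
        # (e.g. in configure_optimizers) applies that scope's bindings
        self.generator_opt_cls = generator_opt_cls
        self.discriminator_opt_cls = discriminator_opt_cls

# The gen/ and disc/ scopes let the two Adam instances carry different
# learning rates; without scopes the last binding would win for both.
gin.parse_config("""
    GAN.generator_opt_cls = @gen/torch.optim.Adam
    GAN.discriminator_opt_cls = @disc/torch.optim.Adam
    gen/torch.optim.Adam.lr = 2e-4
    disc/torch.optim.Adam.lr = 1e-4
""")
```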
Notes
- Namespaces (scopes) are needed when multiple objects of the same class have to be created with different parameters; otherwise the last configuration values would be applied to every object.
- Classes vs objects:
  - Objects - created at runtime by gin, with the parameter values defined in the config file.
  - Classes - the class itself is passed around; whenever an object is created from it, the default parameter values from the config file are applied, unless specified differently.
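A tiny runnable sketch of the difference, with a hypothetical Generator1:

```python
import gin

@gin.configurable
class Generator1:  # hypothetical stand-in
    def __init__(self, latent_dim=64):
        self.latent_dim = latent_dim

@gin.configurable
def build(gen_obj, gen_cls):
    return gen_obj, gen_cls

gin.parse_config("""
    Generator1.latent_dim = 100
    build.gen_obj = @Generator1()   # object: gin builds the instance itself
    build.gen_cls = @Generator1     # class: instantiation is deferred
""")

obj, cls = build()
print(obj.latent_dim)    # 100 - created by gin with the configured value
print(cls().latent_dim)  # 100 - the configured default applies at call time
```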