A short, trivial note on keeping data science experiments organized.

Lately, I’ve found myself running a lot of potentially throw-away batch jobs with quite a few different hyper-parameters. As we know, things can get a bit tangled and disorganized rather quickly. That prompted me to take a look at hydra, mlflow, the cookiecutter data science project, and sundry make and config utilities. They are all fine tools, but I came away thinking that most were overkill for what I was doing just at the moment.

What might be the root of the problem? I suspect most configuration patterns suffer from a single flaw: they violate the Don’t Repeat Yourself (DRY) maxim, because they ask you to edit the name of a file and also to specify, separately, what that file does. I’m too lazy to keep the two in sync, and so are most people. So a mess eventually ensues.

In the end, I simply decided to recycle the same stub script over and over again, and change the name of the file instead of the contents. So my experiment directory is now self-explanatory:
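For instance, a directory might look something like this (these particular file names and parameter names are hypothetical, but follow the pattern of encoding the hyper-parameters in the name itself):

```text
experiments/
    fit?lr:float=0.01&epochs:int=50.py
    fit?lr:float=0.001&epochs:int=100.py
    fit?lr:float=0.01&epochs:int=50&solver=adam.py
```

Each file name is the experiment's configuration, so there is nothing to keep in sync.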

The names of the files are `parse_qs` compatible (from `urllib.parse`). I added the optional `key:type=value` convention you see to specify a type conversion from str to int or float.

The code in each experimental script is identical. It infers parameters:
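A minimal sketch of the idea, assuming the query-string file-name convention above (this hand-rolled `infer_params` merely illustrates the mechanism; the `runthis` package wraps it up more tidily):

```python
import os
from urllib.parse import parse_qs

def infer_params(path: str) -> dict:
    """Parse a 'name?key:type=value&...' style file name into a kwargs dict.

    Keys may carry an optional ':int' or ':float' suffix requesting a
    type conversion; anything else is left as a string.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    query = stem.split('?', 1)[1] if '?' in stem else ''
    casts = {'int': int, 'float': float}
    params = {}
    for key, values in parse_qs(query).items():
        name, _, kind = key.partition(':')
        params[name] = casts.get(kind, str)(values[0])
    return params

if __name__ == '__main__':
    # Each experiment script infers its own hyper-parameters from its file name.
    params = infer_params(__file__)
    print(params)
```

So a script named `fit?lr:float=0.01&epochs:int=50.py` would see `{'lr': 0.01, 'epochs': 50}`, and renaming the file is the only edit an experiment ever needs.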

I took the extra twenty minutes to package runthis up on PyPI to make this a little pithier than it would otherwise be. So, if you choose to adopt the same pattern, then `pip install runthis`, cut and paste the example, and you’re done.

For full reproducibility, just commit and push.

(By the way, I don’t claim this pattern is original — how could it be? If it is already subsumed in some other package, let me know and I’ll put a note in the repo.) Enjoy.

Chief Data Scientist, Intech Investments