Getting Started¶
This guide introduces the config-versioned package, which simplifies management of
project settings and file I/O by combining them in a single Config
object.
Data pipelines commonly require reading and writing data to versioned directories. Each
directory might correspond to one step of a multi-step process, where the version
corresponds to particular settings for that step and a chain of prior steps that each
have their own respective versions. The Config class makes it easy
to read and write versioned data based on YAML configuration files that can be saved
alongside each versioned output folder.
Loading a config file¶
YAML is a natural format for storing project settings, since it can represent numeric, string, and boolean settings as well as hierarchically nested values. The config-versioned package ships with an example config file you can use to follow along:
import importlib.resources as r
from config_versioned import Config
example_config_path = str(r.files("config_versioned") / "data" / "example_config.yaml")
The example YAML file looks like this:
a: 'foo'
b: ['bar', 'baz']
group_c:
d: 1e5
e: false
directories:
raw_data:
versioned: false
path: '~/versioning_test/raw_data'
files:
a: 'example_input_file.csv'
prepared_data:
versioned: true
path: '~/versioning_test/prepared_data'
files:
prepared_table: 'example_prepared_table.csv'
summary_text: 'summary_of_rows.txt'
versions:
prepared_data: 'v1'
Create a Config object by passing either a path to a YAML file
or a plain Python dict. The full config is stored in the config attribute:
config = Config(example_config_path)
print(config) # pprint-formatted view of the config dict
Retrieving settings¶
You can access the config dict directly (config.config["a"]), but
get() is preferable — it raises a clear KeyError if a
setting is missing rather than returning None silently:
config.get("a") # 'foo'
config.get("b") # ['bar', 'baz']
config.get("group_c", "d") # 100000.0 (nested access)
# Returns None instead of raising, when fail_if_none=False
config.get("nonexistent", fail_if_none=False) # None
Settings can be updated in place by editing config.config directly:
config.config["a"] = 12345
config.get("a") # 12345
Working with directories¶
Two top-level keys — directories and versions — give the
Config object its versioning capability. Each entry under
directories must have:
versioned (bool) — whether the directory has version subdirectories
path (str) — base path to the directory (tilde expansion applied)
files (dict) — named file stubs within the directory
For versioned directories the full path is {path}/{version}, where the version
string comes from the versions dict. For non-versioned directories the full
path is just path.
Use get_dir_path() and
get_file_path() to build these paths:
import tempfile, shutil
from pathlib import Path
# Redirect both directories to temporary folders for this example
tmp = Path(tempfile.mkdtemp())
raw_dir = tmp / "raw_data"
prepared_dir = tmp / "prepared_data"
raw_dir.mkdir()
prepared_dir.mkdir()
config.config["directories"]["raw_data"]["path"] = str(raw_dir)
config.config["directories"]["prepared_data"]["path"] = str(prepared_dir)
# get_dir_path() returns a pathlib.Path
config.get_dir_path("raw_data") # tmp/raw_data (not versioned)
config.get_dir_path("prepared_data") # tmp/prepared_data/v1 (versioned → appends version)
# Create the versioned subdirectory
config.get_dir_path("prepared_data").mkdir()
# get_file_path() appends the named file stub
config.get_file_path("raw_data", "a")
# tmp/raw_data/example_input_file.csv
config.get_file_path("prepared_data", "prepared_table")
# tmp/prepared_data/v1/example_prepared_table.csv
Notice that the “prepared_data” path ends in v1 because
config.get("versions", "prepared_data") is "v1". Changing that setting changes
where all subsequent reads and writes for that directory go.
Reading and writing files¶
Copy the bundled example CSV into the raw data directory, then use
read() and write() to move data
through the pipeline:
# Copy the example input file into the raw_data directory
example_csv = str(r.files("config_versioned") / "data" / "example_input_file.csv")
shutil.copy(example_csv, config.get_file_path("raw_data", "a"))
# Read the CSV (returns a pandas DataFrame)
df = config.read("raw_data", "a")
# Write a prepared table and a plain-text summary to the versioned directory
config.write(df, "prepared_data", "prepared_table")
config.write(
f"The prepared table has {len(df)} rows and {len(df.columns)} columns.\n",
"prepared_data",
"summary_text",
)
# Both files now appear in the versioned directory
list(config.get_dir_path("prepared_data").iterdir())
These methods delegate to autoread() and
autowrite(), which dispatch on file extension. To see every
supported extension:
from config_versioned import get_file_reading_functions, get_file_writing_functions
sorted(get_file_reading_functions().keys())
sorted(get_file_writing_functions().keys())
Saving the config alongside outputs¶
write_self() writes the current config dict as
config.yaml into a named directory. This is useful for reproducibility — you can
always see exactly which settings produced a given set of outputs:
config.write_self("prepared_data")
# config.yaml now appears alongside the outputs
list(config.get_dir_path("prepared_data").iterdir())
Overriding versions at runtime¶
Rather than editing the YAML file between runs, you can pass a versions dict when
constructing a Config object. This sets or overwrites the
specified versions while leaving all other settings unchanged — useful for command-line
scripts or automated pipelines:
config_v2 = Config(example_config_path, versions={"prepared_data": "v2"})
config_v2.get_dir_path("prepared_data")
# tmp/prepared_data/v2
# Other versions (and all other settings) are untouched
config_v2.get("versions", "prepared_data") # 'v2'
Next steps¶
Installation — full installation options and supported file extensions
API Reference — complete API reference