from IPython.display import Markdown
import numpy
import pandas
from src import PROCESSED_DATA_DIR
jobs = (
pandas.read_feather(PROCESSED_DATA_DIR / "jobs.feather")
.set_index("id")
.sort_index()
)
assert jobs.index.is_unique
A job is the execution of an action.
An action is a stage in a pipeline.
One job is associated with zero or one actions
(zero, because of missing pipelines and parsing errors).
Hence, an action is a concrete concept: whilst action a associated with job j1 may have the same invocation as action a associated with job j2, a-j1 and a-j2 are different actions.
A workspace is a collection of jobs and, hence, a collection of actions; it is a proxy for a study.
We could assume that actions with the same ID that are associated with the same workspace are different executions of the same invocation: that is, they are different executions of the same underlying action. However, we should be cautious because both IDs and invocations may change. For example:
- The same ID may have different invocations, such as when a jupyter action type is changed to a python action type.
- The same invocation may have different IDs, such as when a more general ID is replaced by a more specific ID as more actions are added to a pipeline.
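Both cases can be illustrated on toy data. The following is a minimal sketch, not the analysis itself: it assumes a hypothetical `invocation` column, and every ID and invocation string is made up for illustration.

```python
import pandas

# Toy data. The "invocation" column and all values are hypothetical;
# they only illustrate the two cases described above.
toy_jobs = pandas.DataFrame(
    {
        "workspace_id": [1, 1, 1, 1],
        "action_id": [
            "generate_report",
            "generate_report",
            "extract",
            "extract_2020",
        ],
        "invocation": [
            "jupyter: report.ipynb",
            "python: report.py",
            "cohortextractor: generate_cohort",
            "cohortextractor: generate_cohort",
        ],
    }
)

# Case 1: the same ID has different invocations.
invocations_per_id = toy_jobs.groupby(["workspace_id", "action_id"])[
    "invocation"
].nunique()
ids_with_many_invocations = invocations_per_id[invocations_per_id > 1]

# Case 2: the same invocation has different IDs.
ids_per_invocation = toy_jobs.groupby(["workspace_id", "invocation"])[
    "action_id"
].nunique()
invocations_with_many_ids = ids_per_invocation[ids_per_invocation > 1]
```

In this toy frame, `generate_report` falls under the first case and the shared `cohortextractor` invocation falls under the second.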
Markdown(
f"""
There are {len(jobs):,} jobs.
They were created between {jobs.created_at.min().strftime("%x")} and {jobs.created_at.max().strftime("%x")}
({(jobs.created_at.max() - jobs.created_at.min()).days} days).
"""
)
There are 38,204 jobs. They were created between 10/16/20 and 08/03/22 (655 days).
How many times have actions of each type been executed?
jobs.groupby("action_type").size().sort_values(ascending=False).rename(
"count"
).to_frame()
action_type | count |
---|---|
r | 22160 |
stata-mp | 6415 |
cohortextractor | 4797 |
python | 2337 |
jupyter | 1430 |
cohort-joiner | 174 |
deciles-charts | 140 |
cohort-report | 48 |
cohortextractor-v2 | 25 |
dataset-report | 17 |
databuilder | 13 |
cox-ipw | 5 |
Recognising the need to be cautious, we'd expect underlying actions to be executed more than once per workspace. However, what is a typical number of executions? Are some types of action executed more often than others?
num_runs_per_workspace = (
jobs.groupby(["workspace_id", "action_id", "action_type"]).size().rename("count")
)
# Named aggregations avoid the deprecation warning newer pandas raises for callables
num_runs_per_workspace.groupby("action_type").aggregate(["mean", "max", "min"])
action_type | mean | max | min |
---|---|---|---|
cohort-joiner | 5.800000 | 20 | 1 |
cohort-report | 4.363636 | 10 | 1 |
cohortextractor | 4.193182 | 82 | 1 |
cohortextractor-v2 | 5.000000 | 10 | 1 |
cox-ipw | 1.666667 | 3 | 1 |
databuilder | 13.000000 | 13 | 13 |
dataset-report | 3.400000 | 7 | 1 |
deciles-charts | 1.489362 | 13 | 1 |
jupyter | 12.118644 | 116 | 1 |
python | 4.085664 | 35 | 1 |
r | 4.237904 | 136 | 1 |
stata-mp | 3.344630 | 56 | 1 |
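For several action types the maximum is far larger than the mean (for example, 136 against roughly 4.2 for r), which suggests that a few heavily re-run actions pull the mean upwards. A median may therefore be a better guide to a "normal" number of executions. The following is a minimal sketch on made-up counts, not on the real data; `toy_counts` and its values are assumptions for illustration only.

```python
import pandas

# Toy counts of executions per (workspace, action); the values are made up.
toy_counts = pandas.DataFrame(
    {
        "action_type": ["r", "r", "r", "r", "python", "python", "python"],
        "count": [1, 2, 3, 130, 2, 4, 6],
    }
)

# Compare the mean with the median: one outlier (130) inflates the mean
# for "r", whereas the median stays close to the bulk of the counts.
summary = toy_counts.groupby("action_type")["count"].agg(["mean", "median", "max"])
```

The same aggregation could be applied to `num_runs_per_workspace` by adding `"median"` to the list of aggregations.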
In which cases do underlying actions have different invocations?
num_runs_per_workspace.reset_index().loc[
num_runs_per_workspace.reset_index().duplicated(
["workspace_id", "action_id"],
keep=False,
)
].set_index(["workspace_id", "action_id", "action_type"])
workspace_id | action_id | action_type | count |
---|---|---|---|
16 | run_model | r | 2 |
16 | run_model | stata-mp | 1 |
28 | generate_report | cohort-report | 4 |
28 | generate_report | jupyter | 49 |
90 | generate_results | jupyter | 15 |
90 | generate_results | python | 19 |
142 | generate_measures | cohortextractor | 5 |
142 | generate_measures | python | 8 |
158 | generate_charts | jupyter | 30 |
158 | generate_charts | python | 1 |
169 | plot_measure_deciles | jupyter | 2 |
169 | plot_measure_deciles | python | 6 |
172 | initial_counts | r | 8 |
172 | initial_counts | stata-mp | 14 |
172 | split_codelist | r | 2 |
172 | split_codelist | stata-mp | 1 |
183 | check_EGFR | jupyter | 1 |
183 | check_EGFR | python | 13 |
183 | test_indicator_g | jupyter | 1 |
183 | test_indicator_g | python | 6 |
351 | generate_report_bmi | jupyter | 4 |
351 | generate_report_bmi | python | 5 |
351 | generate_report_height_weight | jupyter | 7 |
351 | generate_report_height_weight | python | 1 |