from IPython.display import Markdown
import numpy
import pandas
from src import PROCESSED_DATA_DIR
jobs = (
pandas.read_feather(PROCESSED_DATA_DIR / "jobs.feather")
.set_index("id")
.sort_index()
)
assert jobs.index.is_unique
A job is the execution of an action.
An action is a stage in a pipeline.
One job is associated with zero or one actions
(zero, because of missing pipelines and parsing errors).
Hence, an action is a concrete concept:
whilst action a associated with job j1 may have the same invocation as action a associated with job j2,
a-j1 and a-j2 are different actions.
A workspace is a collection of jobs and, hence, a collection of actions; it is a proxy for a study.
We could assume that actions with the same ID that are associated with the same workspace are different executions of the same invocation: that is, they are different executions of the same underlying action.
However, we should be cautious because both IDs and invocations may change. For example:
- The same ID may have different invocations, such as when a jupyter action type is changed to a python action type.
- The same invocation may have different IDs, such as when a more general ID is replaced by a more specific ID as more actions are added to a pipeline.
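The two cases above can be sketched on toy data. This is a minimal illustration, not the notebook's method: the `invocation` column and every value in the frame are made-up placeholders, and `jobs.feather` may not contain an invocation column at all.

```python
import pandas

# Toy data. The `invocation` column and all values are hypothetical placeholders.
toy = pandas.DataFrame(
    {
        "workspace_id": [1, 1, 1, 1],
        "action_id": ["generate_report", "generate_report", "extract", "extract_2020"],
        "invocation": [
            "notebook: report.ipynb",
            "script: report.py",
            "extract: study_definition",
            "extract: study_definition",
        ],
    }
)

# Case 1: the same ID has more than one invocation within a workspace.
invocations_per_id = toy.groupby(["workspace_id", "action_id"])["invocation"].nunique()
ids_with_many_invocations = invocations_per_id[invocations_per_id > 1]

# Case 2: the same invocation has more than one ID within a workspace.
ids_per_invocation = toy.groupby(["workspace_id", "invocation"])["action_id"].nunique()
invocations_with_many_ids = ids_per_invocation[ids_per_invocation > 1]

print(ids_with_many_invocations)
print(invocations_with_many_ids)
```

Either result being non-empty is a sign that grouping by ID alone, or by invocation alone, would merge or split underlying actions.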
Markdown(
f"""
There are {len(jobs):,} jobs.
They were created between {jobs.created_at.min().strftime("%x")} and {jobs.created_at.max().strftime("%x")}
({(jobs.created_at.max() - jobs.created_at.min()).days} days).
"""
)
There are 38,204 jobs. They were created between 10/16/20 and 08/03/22 (655 days).
How many times have actions of each type been executed?
jobs.groupby("action_type").size().sort_values(ascending=False).rename(
"count"
).to_frame()
| action_type | count |
|---|---|
| r | 22160 |
| stata-mp | 6415 |
| cohortextractor | 4797 |
| python | 2337 |
| jupyter | 1430 |
| cohort-joiner | 174 |
| deciles-charts | 140 |
| cohort-report | 48 |
| cohortextractor-v2 | 25 |
| dataset-report | 17 |
| databuilder | 13 |
| cox-ipw | 5 |
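The counts above sum to 37,561, which is fewer than the 38,204 jobs reported earlier. One likely explanation, given that a job may have no action, is that `groupby` drops rows whose key is missing by default. The sketch below shows that behaviour on toy data and how `dropna=False` keeps those rows visible; the frame and its values are assumptions for illustration only.

```python
import numpy
import pandas

# Toy data: one job has no action type, as with a missing pipeline or a parsing error.
toy = pandas.DataFrame(
    {"action_type": ["r", "r", "python", numpy.nan]},
    index=pandas.Index([1, 2, 3, 4], name="id"),
)

# By default, rows with a missing group key are excluded from the result.
counts = toy.groupby("action_type").size()

# With dropna=False, the missing key gets its own group.
counts_with_missing = toy.groupby("action_type", dropna=False).size()

# A direct count of jobs without an action type.
num_missing = toy["action_type"].isna().sum()

print(counts_with_missing)
```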
Recognising the need to be cautious, we'd expect underlying actions to be executed more than once per workspace. However, what is a typical number of executions? Are some types of action executed more often than others?
num_runs_per_workspace = (
jobs.groupby(["workspace_id", "action_id", "action_type"]).size().rename("count")
)
num_runs_per_workspace.groupby("action_type").aggregate(["mean", "max", "min"])
| action_type | mean | max | min |
|---|---|---|---|
| cohort-joiner | 5.800000 | 20 | 1 |
| cohort-report | 4.363636 | 10 | 1 |
| cohortextractor | 4.193182 | 82 | 1 |
| cohortextractor-v2 | 5.000000 | 10 | 1 |
| cox-ipw | 1.666667 | 3 | 1 |
| databuilder | 13.000000 | 13 | 13 |
| dataset-report | 3.400000 | 7 | 1 |
| deciles-charts | 1.489362 | 13 | 1 |
| jupyter | 12.118644 | 116 | 1 |
| python | 4.085664 | 35 | 1 |
| r | 4.237904 | 136 | 1 |
| stata-mp | 3.344630 | 56 | 1 |
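Several maxima in the table above are far larger than the corresponding means, which suggests skewed distributions in which the mean may not describe a typical underlying action. A minimal sketch, on made-up counts, of adding the median alongside the mean:

```python
import pandas

# Toy per-action run counts; the values are invented to show a skewed distribution.
toy_counts = pandas.DataFrame(
    {
        "action_type": ["r", "r", "r", "r", "python", "python"],
        "count": [1, 2, 3, 130, 4, 6],
    }
)

# The median is less sensitive than the mean to a few heavily re-run actions.
summary = toy_counts.groupby("action_type")["count"].agg(["mean", "median", "max"])

print(summary)
```

The same aggregation could be applied to `num_runs_per_workspace`, assuming it is grouped by `action_type` as in the cell above.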
In which cases does the same action ID, within the same workspace, have more than one action type (and hence more than one invocation)?
# Keep (workspace_id, action_id) pairs that appear with more than one action type.
counts = num_runs_per_workspace.reset_index()
counts.loc[
    counts.duplicated(["workspace_id", "action_id"], keep=False)
].set_index(["workspace_id", "action_id", "action_type"])
| workspace_id | action_id | action_type | count |
|---|---|---|---|
| 16 | run_model | r | 2 |
| | | stata-mp | 1 |
| 28 | generate_report | cohort-report | 4 |
| | | jupyter | 49 |
| 90 | generate_results | jupyter | 15 |
| | | python | 19 |
| 142 | generate_measures | cohortextractor | 5 |
| | | python | 8 |
| 158 | generate_charts | jupyter | 30 |
| | | python | 1 |
| 169 | plot_measure_deciles | jupyter | 2 |
| | | python | 6 |
| 172 | initial_counts | r | 8 |
| | | stata-mp | 14 |
| | split_codelist | r | 2 |
| | | stata-mp | 1 |
| 183 | check_EGFR | jupyter | 1 |
| | | python | 13 |
| | test_indicator_g | jupyter | 1 |
| | | python | 6 |
| 351 | generate_report_bmi | jupyter | 4 |
| | | python | 5 |
| | generate_report_height_weight | jupyter | 7 |
| | | python | 1 |
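The table above shows which action types share an ID, but not the order in which they were used. A possible follow-up, sketched here on toy data, orders the types for each ID by the date of their first job. The workspace, ID, and dates below are invented; only the column names mirror those used in this notebook.

```python
import pandas

# Toy data: one action ID that was first run as jupyter and later as python.
toy = pandas.DataFrame(
    {
        "workspace_id": [1, 1, 1, 1],
        "action_id": ["generate_results"] * 4,
        "action_type": ["jupyter", "jupyter", "python", "python"],
        "created_at": pandas.to_datetime(
            ["2021-01-01", "2021-01-05", "2021-02-01", "2021-02-03"]
        ),
    }
)

# Date of the first job for each (workspace, ID, type), sorted chronologically.
first_use = (
    toy.groupby(["workspace_id", "action_id", "action_type"])["created_at"]
    .min()
    .sort_values()
    .reset_index()
)

# Join the types for each (workspace, ID) in order of first use.
type_history = first_use.groupby(["workspace_id", "action_id"])["action_type"].agg(
    " -> ".join
)

print(type_history)
```

A history such as `jupyter -> python` would be consistent with an action type being changed, although interleaved use of both types would need a closer look at the individual jobs.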