Exploration

Prerequisites

In [1]:
from IPython.display import Markdown
import numpy
import pandas
from src import PROCESSED_DATA_DIR
In [2]:
jobs = (
    pandas.read_feather(PROCESSED_DATA_DIR / "jobs.feather")
    .set_index("id")
    .sort_index()
)
assert jobs.index.is_unique

Nomenclature

A job is the execution of an action. An action is a stage in a pipeline. Each job is associated with at most one action; a job has no action when its pipeline is missing or could not be parsed. Hence, an action is a concrete concept: whilst action a associated with job j1 may have the same invocation as action a associated with job j2, a-j1 and a-j2 are different actions.

A workspace is a collection of jobs and, hence, a collection of actions; it is a proxy for a study.

We could assume that actions with the same ID that are associated with the same workspace are different executions of the same invocation: that is, they are different executions of the same underlying action. However, we should be cautious because both IDs and invocations may change. For example:

  • the same ID may have different invocations, such as when a jupyter action type is changed to a python action type.

  • the same invocation may have different IDs, such as when a more general ID is replaced by a more specific ID, as more actions are added to a pipeline.
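
To make the two cases above concrete, here is a minimal sketch on made-up records. The workspace number, action IDs, and invocation strings are hypothetical and are not taken from the jobs data; the sketch only assumes that each job carries a workspace, an action ID, and an invocation.

```python
from collections import defaultdict

# Hypothetical (workspace_id, action_id, invocation) records, for illustration only.
records = [
    (1, "generate_report", "jupyter report.ipynb"),
    (1, "generate_report", "python report.py"),
    (1, "run_model", "r model.R"),
    (1, "run_model_adjusted", "r model.R"),
]

invocations_by_id = defaultdict(set)
ids_by_invocation = defaultdict(set)
for workspace_id, action_id, invocation in records:
    invocations_by_id[(workspace_id, action_id)].add(invocation)
    ids_by_invocation[(workspace_id, invocation)].add(action_id)

# Case 1: the same ID has different invocations.
changed_invocation = {k: v for k, v in invocations_by_id.items() if len(v) > 1}
# Case 2: the same invocation has different IDs.
changed_id = {k: v for k, v in ids_by_invocation.items() if len(v) > 1}

print(changed_invocation)
print(changed_id)
```

Either dictionary being non-empty is a sign that grouping by ID alone, or by invocation alone, would merge or split underlying actions incorrectly.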

In [3]:
Markdown(
    f"""
There are {len(jobs):,} jobs.
They were created between {jobs.created_at.min().strftime("%x")} and {jobs.created_at.max().strftime("%x")}
({(jobs.created_at.max() - jobs.created_at.min()).days} days).
"""
)
Out[3]:

There are 38,204 jobs. They were created between 10/16/20 and 08/03/22 (655 days).

Analysis

How many times have actions of each type been executed?

In [4]:
jobs.groupby("action_type").size().sort_values(ascending=False).rename(
    "count"
).to_frame()
Out[4]:
                   count
action_type
r                  22160
stata-mp            6415
cohortextractor     4797
python              2337
jupyter             1430
cohort-joiner        174
deciles-charts       140
cohort-report         48
cohortextractor-v2    25
dataset-report        17
databuilder           13
cox-ipw                5

With that caution in mind, we would still expect an underlying action to be executed more than once per workspace. But how many executions are typical, and are some types of action executed more often than others?

In [5]:
num_runs_per_workspace = (
    jobs.groupby(["workspace_id", "action_id", "action_type"]).size().rename("count")
)
In [6]:
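# String aliases avoid the deprecation warning that recent pandas versions raise
# when numpy.mean or the builtin max/min are passed as callables.
num_runs_per_workspace.groupby("action_type").aggregate(["mean", "max", "min"])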
Out[6]:
                         mean  max  min
action_type
cohort-joiner        5.800000   20    1
cohort-report        4.363636   10    1
cohortextractor      4.193182   82    1
cohortextractor-v2   5.000000   10    1
cox-ipw              1.666667    3    1
databuilder         13.000000   13   13
dataset-report       3.400000    7    1
deciles-charts       1.489362   13    1
jupyter             12.118644  116    1
python               4.085664   35    1
r                    4.237904  136    1
stata-mp             3.344630   56    1
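
The means above sit well below the maxima (for r, a mean of about 4 against a maximum of 136), which suggests that a few heavily re-run actions pull the mean upwards. A median may therefore be a better answer to "how many times is normal". The sketch below shows the idea on hypothetical counts; the `counts` frame is invented for illustration and is not derived from the jobs data.

```python
import pandas

# Hypothetical run counts per (workspace, action ID) pair; illustration only.
counts = pandas.DataFrame(
    {
        "action_type": ["r", "r", "r", "r", "jupyter", "jupyter", "jupyter"],
        "count": [1, 2, 3, 100, 1, 4, 10],
    }
)

# One outlier (100) drags the mean for r far above its median.
summary = counts.groupby("action_type")["count"].agg(["mean", "median", "max"])
print(summary)
```

Assuming `num_runs_per_workspace` is the Series built in the cell above, adding `"median"` to its list of aggregations should give the equivalent summary for the real data.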

In which cases does the same action ID, within the same workspace, have more than one action type, and hence different invocations?

In [7]:
# Keep every (workspace, action ID) pair that appears with more than one action type.
runs = num_runs_per_workspace.reset_index()
runs.loc[
    runs.duplicated(["workspace_id", "action_id"], keep=False)
].set_index(["workspace_id", "action_id", "action_type"])
Out[7]:
                                                           count
workspace_id action_id                     action_type
16           run_model                     r                   2
                                           stata-mp            1
28           generate_report               cohort-report       4
                                           jupyter            49
90           generate_results              jupyter            15
                                           python             19
142          generate_measures             cohortextractor     5
                                           python              8
158          generate_charts               jupyter            30
                                           python              1
169          plot_measure_deciles          jupyter             2
                                           python              6
172          initial_counts                r                   8
                                           stata-mp           14
             split_codelist                r                   2
                                           stata-mp            1
183          check_EGFR                    jupyter             1
                                           python             13
             test_indicator_g              jupyter             1
                                           python              6
351          generate_report_bmi           jupyter             4
                                           python              5
             generate_report_height_weight jupyter             7
                                           python              1
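
A more direct way to ask the same question is to count the distinct action types for each (workspace, action ID) pair and keep the pairs with more than one. The following is a minimal sketch on a hypothetical `toy_jobs` frame, which is invented for illustration and only mirrors the three column names used in the grouping above.

```python
import pandas

# Hypothetical jobs; illustration only, not taken from the jobs data.
toy_jobs = pandas.DataFrame(
    {
        "workspace_id": [1, 1, 1, 2, 2],
        "action_id": [
            "generate_report",
            "generate_report",
            "run_model",
            "run_model",
            "run_model",
        ],
        "action_type": ["jupyter", "python", "r", "r", "r"],
    }
)

# Number of distinct action types seen for each (workspace, action ID) pair.
types_per_action = toy_jobs.groupby(["workspace_id", "action_id"])[
    "action_type"
].nunique()

# Pairs whose ID has been used with more than one action type.
ambiguous = types_per_action[types_per_action > 1]
print(ambiguous)
```

Applied to `jobs`, and assuming the same three columns, the pairs returned should correspond to the (workspace, action ID) pairs listed in the output above.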