Exploration¶

Prerequisites¶

In [1]:

from IPython.display import Markdown
import numpy
import pandas
from src import PROCESSED_DATA_DIR

In [2]:

jobs = (
    pandas.read_feather(PROCESSED_DATA_DIR / "jobs.feather")
    .set_index("id")
    .sort_index()
)
assert jobs.index.is_unique

Nomenclature¶

A job is the execution of an action. An action is a stage in a pipeline. Each job is associated with at most one action (it has none when its pipeline is missing or could not be parsed). Hence, an action is a concrete concept: although action a associated with job j1 may have the same invocation as action a associated with job j2, a-j1 and a-j2 are different actions.

A workspace is a collection of jobs and, hence, a collection of actions; it is a proxy for a study.

We could assume that actions with the same ID that are associated with the same workspace are different executions of the same invocation: that is, they are different executions of the same underlying action. However, we should be cautious because both IDs and invocations may change. For example:

- The same ID may have different invocations, such as when a jupyter action type is changed to a python action type.
- The same invocation may have different IDs, such as when a more general ID is replaced by a more specific ID as more actions are added to a pipeline.

In [3]:

Markdown(
    f"""
There are {len(jobs):,} jobs.
They were created between {jobs.created_at.min().strftime("%x")} and {jobs.created_at.max().strftime("%x")}
({(jobs.created_at.max() - jobs.created_at.min()).days} days).
"""
)

Out[3]:

There are 38,204 jobs. They were created between 10/16/20 and 08/03/22 (655 days).

Analysis¶

How many times have actions of each type been executed?

In [4]:

jobs.groupby("action_type").size().sort_values(ascending=False).rename(
    "count"
).to_frame()

Out[4]:

	count
action_type
r	22160
stata-mp	6415
cohortextractor	4797
python	2337
jupyter	1430
cohort-joiner	174
deciles-charts	140
cohort-report	48
cohortextractor-v2	25
dataset-report	17
databuilder	13
cox-ipw	5
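
The counts above sum to 37,561, which is 643 fewer than the 38,204 jobs reported earlier. One plausible explanation, not verified here, is that `groupby` drops rows whose key is missing by default, so jobs with no associated action would not appear in the table. The sketch below illustrates that behaviour on a small hypothetical frame (`toy_jobs` and its values are invented for illustration; they are not taken from `jobs`).

```python
import pandas

# Hypothetical toy data standing in for `jobs`; None marks a job with no action.
toy_jobs = pandas.DataFrame(
    {"action_type": ["r", "r", "python", None, "stata-mp", None]},
    index=pandas.Index([1, 2, 3, 4, 5, 6], name="id"),
)

# groupby(...).size() drops rows whose grouping key is missing (dropna=True).
with_action = toy_jobs.groupby("action_type").size()

# value_counts(dropna=False) keeps those rows under a missing-value label.
all_jobs = toy_jobs["action_type"].value_counts(dropna=False)

num_without_action = len(toy_jobs) - with_action.sum()
print(all_jobs)
print(f"{num_without_action} of {len(toy_jobs)} toy jobs have no action type")
```

If the same pattern holds for the real data, passing `dropna=False` to `groupby` would add a row for jobs without an action.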

Bearing that caution in mind, we'd expect each underlying action to be executed more than once per workspace. But how many executions is typical? And are some types of action executed more often than others?

In [5]:

num_runs_per_workspace = (
    jobs.groupby(["workspace_id", "action_id", "action_type"]).size().rename("count")
)

In [6]:

num_runs_per_workspace.groupby("action_type").aggregate(["mean", "max", "min"])

Out[6]:

	mean	max	min
action_type
cohort-joiner	5.800000	20	1
cohort-report	4.363636	10	1
cohortextractor	4.193182	82	1
cohortextractor-v2	5.000000	10	1
cox-ipw	1.666667	3	1
databuilder	13.000000	13	13
dataset-report	3.400000	7	1
deciles-charts	1.489362	13	1
jupyter	12.118644	116	1
python	4.085664	35	1
r	4.237904	136	1
stata-mp	3.344630	56	1
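
Several maxima in the table above are far larger than the corresponding means (for example, 116 against roughly 12 for jupyter), which suggests that the distributions are skewed and that the mean may overstate a "normal" number of executions. A minimal sketch of a more robust summary, using the median alongside the mean, follows. The series `toy_counts` is hypothetical and merely mimics the shape of `num_runs_per_workspace`.

```python
import pandas

# Hypothetical counts, indexed like num_runs_per_workspace.
toy_index = pandas.MultiIndex.from_tuples(
    [
        (1, "a", "jupyter"),
        (1, "b", "jupyter"),
        (2, "a", "jupyter"),
        (2, "b", "jupyter"),
        (3, "a", "jupyter"),
        (1, "c", "python"),
        (2, "c", "python"),
        (3, "b", "python"),
        (3, "c", "python"),
        (4, "a", "python"),
    ],
    names=["workspace_id", "action_id", "action_type"],
)
toy_counts = pandas.Series(
    [1, 1, 2, 3, 116, 1, 2, 2, 4, 35], index=toy_index, name="count"
)

# The median is less sensitive than the mean to a few heavily re-run actions.
summary = toy_counts.groupby(level="action_type").agg(["mean", "median", "max"])
print(summary)
```

In this toy example a single large count pulls each mean well above the median; whether the real data behave the same way would need to be checked against `num_runs_per_workspace`.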

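In which cases is the same action ID, within the same workspace, associated with more than one action type (suggesting that the underlying action's invocation has changed)?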

In [7]:

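# Flatten the index once, then keep every row whose (workspace_id, action_id)
# pair appears with more than one action_type.
counts = num_runs_per_workspace.reset_index()
counts.loc[
    counts.duplicated(["workspace_id", "action_id"], keep=False)
].set_index(["workspace_id", "action_id", "action_type"])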

Out[7]:

			count
workspace_id	action_id	action_type
16	run_model	r	2
16	run_model	stata-mp	1
28	generate_report	cohort-report	4
28	generate_report	jupyter	49
90	generate_results	jupyter	15
90	generate_results	python	19
142	generate_measures	cohortextractor	5
142	generate_measures	python	8
158	generate_charts	jupyter	30
158	generate_charts	python	1
169	plot_measure_deciles	jupyter	2
169	plot_measure_deciles	python	6
172	initial_counts	r	8
172	initial_counts	stata-mp	14
172	split_codelist	r	2
172	split_codelist	stata-mp	1
183	check_EGFR	jupyter	1
183	check_EGFR	python	13
183	test_indicator_g	jupyter	1
183	test_indicator_g	python	6
351	generate_report_bmi	jupyter	4
351	generate_report_bmi	python	5
351	generate_report_height_weight	jupyter	7
351	generate_report_height_weight	python	1
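
As a cross-check on the table above, the same question can be asked directly of the jobs: for each workspace and action ID, how many distinct action types occur? The sketch below is one possible way to do this, not the approach used in this notebook. It runs on a hypothetical frame (`toy_jobs`, with invented IDs) that has the same three columns as `jobs`.

```python
import pandas

# Hypothetical toy data with the same column names as `jobs`.
toy_jobs = pandas.DataFrame(
    {
        "workspace_id": [1, 1, 1, 2, 2],
        "action_id": ["run_model", "run_model", "run_model", "report", "extract"],
        "action_type": ["r", "r", "stata-mp", "jupyter", "cohortextractor"],
    }
)

# For each (workspace_id, action_id), count the distinct action types and
# list them; missing types are ignored.
types_per_action = toy_jobs.groupby(["workspace_id", "action_id"])[
    "action_type"
].agg(
    num_types="nunique",
    types=lambda s: ", ".join(sorted(s.dropna().unique())),
)

# Keep only the action IDs whose type changed within a workspace.
changed = types_per_action.loc[types_per_action["num_types"] > 1]
print(changed)
```

This gives one row per affected action ID rather than one row per action type, which may be easier to scan when many IDs are involved.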