Survival curve example
In this notebook we show how to retrieve data on breast cancer patients from The Cancer Genome Atlas (TCGA) and compare the Kaplan-Meier curves of two cohorts. These cohorts are white and African American patients who were 50 years old or younger at diagnosis.
In [1]:
!pip install lifelines -q
In [2]:
!pip install fhir-aggregator-client==0.1.8 --no-cache-dir --quiet
Use Fhir-query to retrieve the necessary data
Export TCGA-BRCA data to a local database
In [3]:
# run query against released data
# !rm /root/.fhir-aggregator/fhir-graph.sqlite
%env FHIR_BASE=https://google-fhir.fhir-aggregator.org
!fq run patient-survival-graph /ResearchStudy?identifier=TCGA-BRCA
env: FHIR_BASE=https://google-fhir.fhir-aggregator.org
warning: Database already exists at /home/docs/.fhir-aggregator/fhir-graph.sqlite and will be used. If this is not what you intended, please remove the existing database or provide a new path.
patient-survival-graph is valid FHIR R5 GraphDefinition
ℹ Fetching https://google-fhir.fhir-aggregator.org/ResearchStudy?identifier=TCGA-BRCA
ℹ Processing ResearchStudy with 2 resources
ℹ Processing 2 links for ResearchStudy in parallel.
ℹ Processing link: Patient/part-of-study={ref}&_count=1000&_total=accurate with 3 ResearchStudy(s)
ℹ Processing link: Observation/part-of-study={ref}&code=NCIT_C156418,NCIT_C156419&_count=1000&_total=accurate with 3 ResearchStudy(s)
✔ Processed link: Patient/part-of-study={ref}&_count=1000&_total=accurate
✔ Processed link: Observation/part-of-study={ref}&code=NCIT_C156418,NCIT_C156419&_count=1000&_total=accurate
Aggregated Results: {'DocumentReference': 7448, 'Group': 1614, 'Observation': 28152, 'Patient': 8033, 'ResearchStudy': 3, 'ResearchSubject': 5837, 'Specimen': 6413}
database available at: /home/docs/.fhir-aggregator/fhir-graph.sqlite
Create a TSV file from the extracted data
In [4]:
# The previous query included a Specimen; the dataframe type defaults to Specimen.
# Since the optimized query only has Patient, we ask for a Patient dataframe type.
# Note: default output is in the current directory and is a TSV
!fq results dataframe Patient
Saved fhir-graph.tsv
Survival analysis
After retrieving the data, we use the Python library lifelines to plot Kaplan-Meier curves for our two cohorts (white and African American patients who were 50 years old or younger at diagnosis).
In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()

# read the data into a dataframe
df = pd.read_csv('fhir-graph.tsv', sep='\t')

# get days to death data in the necessary format:
# strip the " days" suffix, treat empty strings as missing, convert to float
df['days_to_death'] = (
    df['observation_days_between_diagnosis_and_death']
    .str.replace(' days', '', regex=False)
    .replace('', np.nan)
    .astype(float)
)

# get age data in the necessary format (same conversion as above)
df['age_at_diagnosis'] = (
    df['observation_days_between_birth_and_diagnosis']
    .str.replace(' days', '', regex=False)
    .replace('', np.nan)
    .astype(float)
)

# keep one row per patient (the first occurrence of each patient_id)
df_unique = df.drop_duplicates(subset=['patient_id'])
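The string-to-number conversion in the cell above can be illustrated without pandas. The following is a minimal sketch, assuming the column values look like "1234 days" or are empty; the helper name `parse_days` and the sample values are hypothetical and are not taken from the TCGA-BRCA export.

```python
import math


def parse_days(value):
    """Convert a string such as '1234 days' to a float; return NaN if missing."""
    if value is None:
        return math.nan
    if isinstance(value, float):
        # Empty cells are typically read as NaN (a float); pass numbers through
        return value
    text = str(value).replace(" days", "").strip()
    return float(text) if text else math.nan


# Hypothetical sample values for illustration only
samples = ["1234 days", "-18250 days", "", None]
parsed = [parse_days(v) for v in samples]
print(parsed)
```

Missing values become NaN rather than 0, so a later step has to decide explicitly how to treat patients without a recorded value.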
Select breast cancer patients who are white or African American, not Hispanic or Latino, and 50 years old or younger at diagnosis.
In [6]:
# age_at_diagnosis is in days; the >= -50*365 bound assumes negative offsets (diagnosis back to birth)
df_cohort = df_unique[
    (df_unique['age_at_diagnosis'] >= -50 * 365)
    & (df_unique['patient_us_core_race'].isin(['black or african american', 'white']))
    & (df_unique['patient_us_core_ethnicity'] == 'not hispanic or latino')
]
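The age condition in the filter above can be read as a small helper. This sketch assumes, as the `>= -50*365` bound implies, that the age column holds negative day counts; the function name `diagnosed_at_or_before` is hypothetical.

```python
DAYS_PER_YEAR = 365  # same approximation as the filter (ignores leap days)


def diagnosed_at_or_before(days_between_birth_and_diagnosis, max_age_years=50):
    """True if a negative day offset corresponds to an age <= max_age_years."""
    return days_between_birth_and_diagnosis >= -max_age_years * DAYS_PER_YEAR


print(diagnosed_at_or_before(-40 * 365))     # about 40 years old at diagnosis
print(diagnosed_at_or_before(-50 * 365))     # exactly on the threshold
print(diagnosed_at_or_before(-60 * 365))     # about 60 years old at diagnosis
print(diagnosed_at_or_before(float("nan")))  # missing age
```

Comparisons with NaN evaluate to False, so patients with a missing age are excluded from the cohort by this condition.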
Get the necessary data for the lifelines package to generate a plot.
In [7]:
# Fill in NAs in days_to_death with the largest observed days_to_death in the cohort
T = df_cohort['days_to_death'].fillna(df_cohort['days_to_death'].max())
# Convert the vital status to a boolean event flag (True = deceased)
E = df_cohort['patient_deceasedBoolean'].astype(bool)
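The construction of T and E above can be sketched in plain Python to show how the durations and event flags line up. The records below are hypothetical toy values, not data from the export.

```python
import math

# Hypothetical records: (days_to_death, deceased)
records = [(300.0, True), (math.nan, False), (1200.0, True), (math.nan, False)]

# Largest observed time among patients with a recorded death
max_days = max(d for d, _ in records if not math.isnan(d))

# Durations: missing values are replaced by the largest observed time
durations = [max_days if math.isnan(d) else d for d, _ in records]

# Event flags: True means the death was observed, False means censored
events = [bool(deceased) for _, deceased in records]

print(durations)
print(events)
```

Because the event flag is False for patients without a recorded death, the filled-in value is treated as a censoring time rather than as a death.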
Plot the survival curves
In [8]:
fig = plt.figure(figsize=(13, 8), dpi=80)
# plt.style.use('seaborn-colorblind')
ax = plt.subplot(111, title="Survival Curve")

# fit and plot one Kaplan-Meier curve per race category on the same axes
for r in df_cohort['patient_us_core_race'].sort_values().unique():
    if r is not None:
        cohort = df_cohort['patient_us_core_race'] == r
        kmf.fit(T.loc[cohort], E.loc[cohort], label=r)
        kmf.plot(ax=ax)

# the Kaplan-Meier estimate is a probability between 0 and 1
ax.set_ylabel("Survival probability")
ax.set_xlabel("Days")
Out[8]:
Text(0.5, 0, 'Days')
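To make explicit what each plotted curve represents, here is a minimal pure-Python sketch of the Kaplan-Meier product-limit estimate, S(t) = product over event times t_i <= t of (1 - d_i / n_i), where d_i is the number of deaths at t_i and n_i the number of patients still at risk. It is an illustration only and does not reproduce the lifelines implementation; the function name `kaplan_meier` and the toy data are hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] for each time with at least one death."""
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Gather every subject whose duration equals t (deaths and censored)
        while i < len(pairs) and pairs[i][0] == t:
            deaths += 1 if pairs[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            # Multiply by the fraction of at-risk subjects surviving past t
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Subjects who died or were censored at t leave the risk set
        at_risk -= removed
    return curve


# Hypothetical toy data: False marks a censored observation
durations = [5, 8, 8, 12, 15]
events = [True, True, False, True, False]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored subjects never lower the curve themselves; they only shrink the risk set used at later event times.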