Survival curve example

In this notebook we show how to retrieve data on breast cancer patients from The Cancer Genome Atlas (TCGA) and compare the Kaplan-Meier curves of two cohorts. These cohorts are white and African American patients who are 50 years old or younger.

In [1]:

!pip install lifelines -q

In [2]:

!pip install fhir-aggregator-client==0.1.8 --no-cache-dir --quiet

Use Fhir-query to retrieve the necessary data

Export TCGA-BRCA data to a local database

In [3]:

# run query against released data
# !rm /root/.fhir-aggregator/fhir-graph.sqlite
%env  FHIR_BASE=https://google-fhir.fhir-aggregator.org
!fq run patient-survival-graph    /ResearchStudy?identifier=TCGA-BRCA

env: FHIR_BASE=https://google-fhir.fhir-aggregator.org

warning: Database already exists at /home/docs/.fhir-aggregator/fhir-graph.sqlite and will be used. If this is not what you intended, please remove the existing database or provide a new path.
patient-survival-graph is valid FHIR R5 GraphDefinition

ℹ Fetching https://google-fhir.fhir-aggregator.org/ResearchStudy?identifier=TCGA-BRCA

ℹ Processing ResearchStudy with 2 resources

ℹ Processing 2 links for ResearchStudy in parallel.

ℹ Processing link: Patient/part-of-study={ref}&_count=1000&_total=accurate with 3 ResearchStudy(s)

ℹ Processing link: Observation/part-of-study={ref}&code=NCIT_C156418,NCIT_C156419&_count=1000&_total=accurate with 3 ResearchStudy(s)


✔ Processed link: Patient/part-of-study={ref}&_count=1000&_total=accurate


✔ Processed link: Observation/part-of-study={ref}&code=NCIT_C156418,NCIT_C156419&_count=1000&_total=accurate

Aggregated Results: {'DocumentReference': 7448, 'Group': 1614, 'Observation': 28152, 'Patient': 8033, 'ResearchStudy': 3, 'ResearchSubject': 5837, 'Specimen': 6413}
database available at: /home/docs/.fhir-aggregator/fhir-graph.sqlite

Create a TSV file from the extracted data

In [4]:

# The previous query included a Specimen, so the dataframe type defaults to Specimen.
# Since the optimized query only has Patient, we ask for a Patient dataframe type.
# Note: default output is in the current directory and is a TSV
!fq results dataframe Patient

Saved fhir-graph.tsv

Survival analysis

After retrieving the data, we use the Python library lifelines to plot Kaplan-Meier curves for our two cohorts (white and African American patients who are 50 years old or younger).
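To make clear what a Kaplan-Meier curve represents, here is a minimal pure-Python sketch of the product-limit estimate. It is an illustration only, not the lifelines implementation; the function name `kaplan_meier` and the toy durations and events are made up for this example.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time a death was observed.

    durations: time to death or to censoring for each subject
    events:    True if the death was observed, False if the subject was censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # subjects leaving the risk set at each time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # product-limit step: multiply by the fraction surviving this time
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # deaths and censored subjects both leave the risk set after time t
        at_risk -= exits[t]
    return curve


# hypothetical toy data: five subjects, two of them censored
durations = [5, 8, 8, 12, 15]
events = [True, True, False, True, False]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored subjects never lower the curve directly; they only shrink the number at risk for later time points.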

In [5]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()


# read the data into a dataframe
df = pd.read_csv('fhir-graph.tsv', sep='\t')

# get days to death data in the necessary format
df['days_to_death'] = (
    df['observation_days_between_diagnosis_and_death']
    .str.replace(' days', '', regex=False)
    .replace('', np.nan)
    .astype(float)
)
# get age data in the necessary format
df['age_at_diagnosis'] = (
    df['observation_days_between_birth_and_diagnosis']
    .str.replace(' days', '', regex=False)
    .replace('', np.nan)
    .astype(float)
)

# keep one row per patient_id
df_unique = df.drop_duplicates(subset=['patient_id'])
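As a quick illustration of what the cleanup does, the sketch below applies the same steps to a small made-up table. The patient ids and the values such as "612 days" are assumptions about the column format, inferred from the string replacement in the cell above.

```python
import numpy as np
import pandas as pd

# hypothetical rows: patient p1 appears twice, patient p2 has no recorded death
toy = pd.DataFrame({
    "patient_id": ["p1", "p1", "p2", "p3"],
    "observation_days_between_diagnosis_and_death": [
        "612 days", "612 days", np.nan, "45 days",
    ],
})

# strip the unit suffix, turn empty strings into NaN, and convert to float
toy["days_to_death"] = (
    toy["observation_days_between_diagnosis_and_death"]
    .str.replace(" days", "", regex=False)
    .replace("", np.nan)
    .astype(float)
)

# keep the first row seen for each patient
toy_unique = toy.drop_duplicates(subset=["patient_id"])
print(toy_unique[["patient_id", "days_to_death"]])
```

Missing values stay as NaN, which matters later when deciding how to treat patients without a recorded death.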

Select breast cancer patients who are white or African American, not Hispanic or Latino, and 50 years old or younger at diagnosis.

In [6]:

# age_at_diagnosis is in days and is compared against a negative threshold,
# so values >= -50*365 correspond to patients aged 50 or younger
df_cohort = df_unique[ (df_unique['age_at_diagnosis'] >= -50*365 )
                      & (df_unique['patient_us_core_race'].isin(['black or african american','white']) )
                      & (df_unique['patient_us_core_ethnicity'] == 'not hispanic or latino')   ]
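The age filter can look backwards at first glance. The sketch below spells out the threshold under the assumption, inferred from the filter itself, that the days between birth and diagnosis are stored as negative numbers. The helper name `is_at_most_years_old` is hypothetical and is not part of the notebook.

```python
DAYS_PER_YEAR = 365  # same approximation as the filter above; leap days are ignored


def is_at_most_years_old(days_between_birth_and_diagnosis, years=50):
    """True when the (negative) day count corresponds to an age of `years` or less."""
    return days_between_birth_and_diagnosis >= -years * DAYS_PER_YEAR


# hypothetical values: roughly 40 years, exactly 50 * 365 days, one day older
checks = [is_at_most_years_old(d) for d in (-14600, -18250, -18251)]
print(checks)  # [True, True, False]
```

A comparison against NaN is always False, so patients with a missing age drop out of the cohort.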

Prepare the durations (T) and event indicators (E) that the lifelines package needs to generate a plot.

In [7]:

# Fill in NAs in days_to_death with the max from the days to death
T = df_cohort['days_to_death'].fillna(df_cohort['days_to_death'].max())

# Convert the vital status to a boolean event indicator (True = deceased)
E = df_cohort['patient_deceasedBoolean'].astype(bool)
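The toy example below shows the effect of these two lines on made-up data. Patients without a recorded death receive the largest observed time and, because their event flag is False, are treated as censored at that point. This mirrors the notebook's simplification; a last-follow-up time, where available, is the more usual choice for the censoring time.

```python
import numpy as np
import pandas as pd

# hypothetical cohort: two deaths, two patients still alive
toy = pd.DataFrame({
    "days_to_death": [120.0, np.nan, 900.0, np.nan],
    "patient_deceasedBoolean": [True, False, True, False],
})

# durations: missing values become the maximum observed time
T = toy["days_to_death"].fillna(toy["days_to_death"].max())
# event indicator: True means the death was observed
E = toy["patient_deceasedBoolean"].astype(bool)

print(T.tolist())  # [120.0, 900.0, 900.0, 900.0]
print(E.tolist())  # [True, False, True, False]
```

Note that `astype(bool)` turns a missing value into True, so it is worth checking `patient_deceasedBoolean` for NaN before relying on E.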

Plot the survival curves

In [8]:

fig = plt.figure(figsize=(13, 8), dpi=80)
# plt.style.use('seaborn-colorblind')
ax = plt.subplot(111, title="Survival Curve")

# fit and plot one Kaplan-Meier curve per race category
for r in df_cohort['patient_us_core_race'].sort_values().unique():
    if r is not None:
        cohort = df_cohort['patient_us_core_race'] == r
        kmf.fit(T.loc[cohort], E.loc[cohort], label=r)
        kmf.plot(ax=ax)

# the fitted curve is a probability between 0 and 1, not a percentage
ax.set_ylabel("Survival probability")
ax.set_xlabel("Days")

Out[8]:

Text(0.5, 0, 'Days')

[Figure: Kaplan-Meier survival curves, one per race category, with days on the x-axis.]