Accessing the FHIR Aggregator Vocabulary

The Vocabulary DataFrame: A Researcher's Guide to Data Elements

Imagine you have a vast collection of FHIR data, containing medical records, research studies, and various observations. Within this data, there are numerous CodeableConcepts and Extensions that provide structure and meaning to the information. However, as a researcher, it's crucial to have a clear overview of these key data elements and how they're used.

This is where the Vocabulary DataFrame comes in.

The Vocabulary DataFrame is essentially a summary table that catalogs the important CodeableConcepts and Extensions found within the FHIR dataset. It acts as a central inventory, providing researchers with valuable insights into the structure and content of the data.

Here's how it helps:

Identifying Key Data Elements: The DataFrame lists all the significant CodeableConcepts and Extensions used within the data, giving researchers a comprehensive view of the elements present.

Understanding Code Systems and Terminology: It provides information about the code systems and terminologies associated with each CodeableConcept (e.g., SNOMED CT, LOINC), helping researchers interpret the coded data.

Exploring Data Structure and Usage: The DataFrame reveals where these CodeableConcepts and Extensions are used within different FHIR resources and elements (e.g., Condition.code, Observation.valueCodeableConcept). This helps researchers understand how the data is structured and how these elements relate to each other.

Navigating to Specific Data: It often includes FHIR queries that can be used to directly access the data associated with each CodeableConcept or Extension, making it easier to locate specific information.

Facilitating Data Analysis: By providing a structured inventory of key data elements, the Vocabulary DataFrame simplifies data exploration, analysis, and the formulation of research questions.

Analogy:

Think of the Vocabulary DataFrame as a library catalog. Just as a catalog helps you find books based on author, title, or subject, the DataFrame helps you find specific data elements within the FHIR dataset based on their code system, display, or where they are used.

Example:

The Vocabulary DataFrame contains columns like:

research_study_identifiers: Linking the CodeableConcept/extension to the study.
path: Showing the FHIR element where the code is used (e.g., Condition.code).
system: Indicating the code system (e.g., http://snomed.info/sct).
display: Providing a human-readable label for the code (e.g., Diabetes mellitus).
url: Linking to a FHIR query to retrieve more information.

This structure empowers researchers to:

Quickly identify all the conditions documented in studies using SNOMED CT codes.
Explore the range of medications recorded in the dataset.
Locate specific observations related to tumor grades.
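As a rough illustration of the first task above, the following sketch filters a small stand-in DataFrame for conditions coded with SNOMED CT. The rows, the study identifiers STUDY_A and STUDY_B, and the http://example.org/local-codes system are hypothetical placeholders; only the column names and the SNOMED CT and LOINC system URLs come from this guide.

```python
import pandas as pd

# Hypothetical rows standing in for the real vocabulary; actual values come from the server.
vocabulary = pd.DataFrame(
    [
        {"research_study_identifiers": "STUDY_A", "path": "Condition.code",
         "system": "http://snomed.info/sct", "display": "Diabetes mellitus"},
        {"research_study_identifiers": "STUDY_A", "path": "Observation.valueCodeableConcept",
         "system": "http://loinc.org", "display": "Example observation value"},
        {"research_study_identifiers": "STUDY_B", "path": "Condition.code",
         "system": "http://example.org/local-codes", "display": "Example local condition"},
    ]
)

# Conditions coded with SNOMED CT: match on both the FHIR path and the code system.
is_condition = vocabulary["path"] == "Condition.code"
is_snomed = vocabulary["system"] == "http://snomed.info/sct"
snomed_conditions = vocabulary[is_condition & is_snomed]

print(snomed_conditions[["research_study_identifiers", "display"]])
```

The same pattern should carry over to the other tasks by changing the path (and, where relevant, the system) used in the masks.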

**In essence, the Vocabulary DataFrame gives researchers a structured overview of the key data elements in the FHIR dataset, so they can explore, analyze, and understand the available information.**

Retrieve vocabularies used on commonly used resources

When a study is submitted to the site, we survey the data and create an Observation of the data in the Study. These summary Observations are published on the server.

These summaries help researchers formulate queries. For example, across all studies:

What are the conditions?
What are the condition stages?
What are the tumor grades?
What are the medications?
What are the document types?

We can query the data using a FHIR query. These queries leverage the fq and jq tools, so be sure to install both.

In [2]:

!pip install fhir-aggregator-client --no-cache-dir --quiet
!fq

Usage: fq [OPTIONS] COMMAND [ARGS]...

  FHIR-Aggregator utilities.

Options:
  --help  Show this message and exit.

Commands:
  ls          List all the installed GraphDefinitions.
  run         Run GraphDefinition queries.
  results     Work with the results of a GraphDefinition query.
  vocabulary  FHIR-Aggregator's key Resources and CodeSystems.

Vocabulary Dataframe

As a convenience, the fhir-aggregator-client's vocabulary command queries these summary Observations and saves the result to a local file.

This generates a tab-separated file named vocabulary.tsv, which serves as an inventory of the server's data elements and their usage, including example FHIR queries to retrieve those values.

In [3]:

%%capture [--no-stdout]
!fq vocabulary vocabulary.tsv --fhir-base-url $FHIR_BASE

This dataframe provides a catalog of the data elements present in the FHIR server, along with other useful information.

Load vocabulary.tsv into a pandas DataFrame. Note that the url field is a FHIR query that returns the resources matching that vocabulary entry, and the documentation column links to the data dictionary documentation for that field.

Column	Description
research_study_identifiers	A comma-separated list of identifiers that uniquely identify the research studies where this data element is found. This helps to link the data element back to specific studies.
path	The FHIR resource and element where the code is used (e.g., Condition.code, Observation.valueCodeableConcept). This shows the structural context of the data element within FHIR resources.
documentation	A URL linking to the official FHIR data dictionary documentation for the specific data element. This provides detailed information about the meaning and usage of the element according to the FHIR standard.
code	The actual code value within a CodeableConcept. This is the coded representation of the data element (e.g., a SNOMED CT code for a specific disease).
display	A human-readable label or description associated with the code. This makes it easier to understand the meaning of the code without needing to look up code system definitions.
system	The code system or terminology from which the code originates (e.g., http://snomed.info/sct, http://loinc.org). This helps to identify the source and context of the code.
extension_url	If the data element is an Extension, this column contains the URL that defines the Extension. Extensions provide a way to add custom data elements to FHIR resources.
count	The number of times this specific code value was found within the FHIR data. This gives an indication of the prevalence or frequency of the data element.
low	If the data element is numeric and has a defined range, this column represents the lower bound of the range. This is helpful for understanding the possible values for numeric data elements.
high	If the data element is numeric and has a defined range, this column represents the upper bound of the range. Similar to low, this helps to understand the potential values.
research_study_title	The title of the research study associated with this data element. This provides a more descriptive context for understanding where the data element is used.
research_study_description	A brief description of the research study associated with the data element. This offers additional context for the data element's usage.
observation	The ID of the Observation resource which is used to store the vocabulary. This provides a way to see where this data element was extracted from in the FHIR server.
research_study	The ID of the research study where the vocabulary is used. This allows you to retrieve the specific study that contains the vocabulary.
url	A readily available FHIR query that can be used to retrieve the resources that contain the itemized data for this code. This makes it easy to access the data for analysis.
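Because research_study_identifiers is described as a comma-separated list, an exact equality test only matches rows that belong to a single study. The sketch below shows one possible way to select every row that mentions a given study. The helper rows_for_study, the sample rows, and the identifiers STUDY_A and STUDY_B are hypothetical, and the code assumes missing values have already been replaced with empty strings.

```python
import pandas as pd

# Hypothetical rows; "STUDY_A" and "STUDY_B" are placeholder identifiers.
vocabulary = pd.DataFrame(
    {
        "research_study_identifiers": ["STUDY_A", "STUDY_A,STUDY_B", "STUDY_B"],
        "path": ["Condition.code", "Condition.code", "Specimen.type"],
        "display": ["Diabetes mellitus", "Example condition", "Example specimen type"],
    }
)


def rows_for_study(df, study_id):
    """Return rows whose comma-separated identifier list contains study_id.

    Assumes the column holds strings only (no NaN), e.g. after fillna('').
    """
    identifiers = df["research_study_identifiers"].str.split(",")
    mask = identifiers.apply(lambda ids: study_id in [i.strip() for i in ids])
    return df[mask]


study_a = rows_for_study(vocabulary, "STUDY_A")
print(study_a["display"].tolist())
```

Splitting on commas rather than using a substring test avoids accidental matches when one identifier is a prefix of another.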

In [4]:

import pandas as pd
df = pd.read_csv('vocabulary.tsv', sep='\t').fillna('')
df.loc[df['research_study_identifiers'] == 'GTEX_V10']
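Once a row of interest is found, its url value can be followed to retrieve the matching resources. The sketch below is one minimal way to do that with the standard library. It assumes the url is absolute, that the server needs no authentication, and that it returns a JSON FHIR Bundle. The functions fetch_bundle and bundle_resources, the opener parameter, and the example.org address are hypothetical and are not part of fhir-aggregator-client.

```python
import io
import json
from urllib.request import urlopen


def fetch_bundle(url, opener=urlopen):
    """Fetch a FHIR search URL and return the parsed Bundle as a dict.

    `opener` is a hypothetical hook so the function can be exercised offline.
    """
    with opener(url) as response:
        return json.load(response)


def bundle_resources(bundle):
    """Return the resources carried in the Bundle's entries (first page only)."""
    return [entry["resource"] for entry in bundle.get("entry", []) if "resource" in entry]


def fake_opener(url):
    """Stand-in for the network call: returns a tiny hand-written Bundle."""
    payload = {
        "resourceType": "Bundle",
        "type": "searchset",
        "entry": [{"resource": {"resourceType": "Condition", "id": "example"}}],
    }
    return io.BytesIO(json.dumps(payload).encode("utf-8"))


# Offline demonstration; the URL is a placeholder and is never contacted.
bundle = fetch_bundle("https://example.org/fhir/Condition?code=example", opener=fake_opener)
resources = bundle_resources(bundle)
print([r["resourceType"] for r in resources])
```

Against a live server, something like fetch_bundle(df.iloc[0]['url']) would be the equivalent call. This sketch does not follow the Bundle's "next" links, so large result sets would need paging on top of it.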
