Building a Medical Imaging Data Pipeline — Using Cloud Healthcare API and DICOM Store

Jay Jayakumar
Jul 9, 2021


Insights from medical images are critical to improving the quality of care and health outcomes. Despite constant advances in technology, the medical imaging industry still faces interpretation errors in diagnostic images, with research showing that false-positive rates increase with reading volume.

“A radiologist’s false-positive rate, and therefore specificity, is greatly impacted by experience and annual volumes.” ¹

AI-powered imaging decision support: Advances in Artificial Intelligence (AI) and computer vision applied to CT scans, MRIs, and X-rays have great potential to enable faster and deeper patient insights at scale. Radiology-specific predictive models could help spot subtle patterns in MRI scans, assist in deciding between a warranted surgery and active surveillance/therapy based on an MRI report, or analyze a chest X-ray to produce a risk score for developing pneumonia.

Photo by Anna Shvets from Pexels

It comes as no surprise that the AI in medical imaging market is estimated to grow from $21.48 billion in 2018 to a projected $264.85 billion by 2026, according to a report by Data Bridge Market Research.

For successful AI adoption in the imaging space, a scalable and robust data pipeline is necessary. These pipelines allow medical images to be transferred efficiently from various source systems to the analytics applications in your cloud environment. An MRI scan should be ready for analysis as soon as possible, but it must also be sent securely, protecting the patient’s privacy without any loss of information. This post looks at how scalable data pipelines can be built for the analysis of medical imaging data.

Three best practices for a successful image pipeline:

1. Standardized DICOM Images: Sharing is caring, but healthcare is still not equipped to address the inconsistencies in the data that gets shared. No algorithm can train properly or make accurate predictions if images arrive in inconsistent formats. In the medical imaging world, the industry has adopted DICOM (Digital Imaging and Communications in Medicine), a standard that supports data exchange and facilitates the transfer of digital medical data. The DICOM format preserves the metadata needed for analytics, and converting images to other formats such as JPEG is not ideal for scientific studies because of the lossy compression methods used. Images that can be used within the DICOM framework include X-rays, ultrasounds, radiographs, MRIs, and other imaging modalities. Standardizing on the DICOM format is a basic requirement for any image analytics pipeline.
Figure 1: A decoded DICOM image from NIH’s chest X-ray data [2], decoded with the TensorFlow I/O library in a Jupyter notebook.
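For reference, here is a minimal sketch of the kind of notebook code that produces a rendering like Figure 1, assuming a local chest_xray.dcm file downloaded from the NIH dataset (the decoder itself is covered in more detail later in this post):

# A minimal sketch: decode a DICOM chest X-ray and display it in a notebook
# "chest_xray.dcm" is a placeholder path for a file from the NIH dataset
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_io as tfio

image_bytes = tf.io.read_file("chest_xray.dcm")
image = tfio.image.decode_dicom_image(image_bytes, dtype=tf.uint16)

# The decoder returns a leading frames dimension; squeeze it away for display
plt.imshow(tf.squeeze(image).numpy(), cmap="gray")
plt.axis("off")
plt.show()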

2. Continuous Data Flow using Healthcare APIs: The next step after standardization is to move on-premise medical images to your cloud platform for ongoing analytics workflows. Downloading datasets in a one-time bulk mode is fine to start with, but continuous delivery of images is needed to support real-time and batch workflows. The good news is that major cloud vendors have started to support the transfer and storage of digital medical data. Here, I have used Google Cloud Platform (GCP), which provides the Cloud Healthcare API, a data storage and processing framework for healthcare data that includes medical images. As shown in Figure 2, the API adapter lets us transfer images from a PACS (picture archiving and communication system) to a cloud-based DICOM store for further image processing to support AI/ML use cases.

Figure 2: Loading medical images from different health systems into the cloud platform.
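Before any images can flow, the target resources have to exist on the cloud side. A minimal sketch using the gcloud CLI, reusing the XRAY_DATA dataset and DICOM_STORE_ID store names that appear later in this post:

# Create a Cloud Healthcare API dataset to hold the DICOM store
gcloud healthcare datasets create XRAY_DATA --location=us-central1
# Create the DICOM store inside that dataset
gcloud healthcare dicom-stores create DICOM_STORE_ID \
--dataset=XRAY_DATA \
--location=us-central1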

3. Accelerated Predictions: Pneumothorax is a medical emergency that can be detected through X-rays and requires immediate attention. Applications detecting pneumothorax from chest X-rays need to get the predictions right and the results out fast, even if that means spinning up high-performance compute instances to do it. The switch to cloud computing has enabled such quick prediction workflows. By running complex image processing algorithms on GPUs, which are far faster than CPUs for this kind of parallel workload, radiologists can get insights very quickly.

Figure 3: The prediction workflow for medical images.
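To make that concrete, here is a hedged sketch of provisioning a GPU-backed inference VM with the gcloud CLI; the instance name, machine type, and T4 accelerator below are illustrative assumptions, not a prescription:

# Hypothetical example: provision a GPU-backed VM for image inference
gcloud compute instances create xray-inference-vm \
--zone=us-central1-a \
--machine-type=n1-standard-8 \
--accelerator=type=nvidia-tesla-t4,count=1 \
--maintenance-policy=TERMINATE \
--image-family=common-cu110 \
--image-project=deeplearning-platform-release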

Other design considerations:

Event-Driven Architecture: Data coming from emergency departments and urgent/stat requests from physicians require results in real time, which means analyzing the images as soon as they land in your storage bucket. A pub/sub model publishes any newly produced image to a topic immediately, and any analytics application listening on that topic can kick off its workflow as soon as the data is ready. The workflow can also be retriggered if the images are not clear and a new scan is needed.

Figure 4: How the pub/sub model can help with top-priority requests.
# Create the topic
gcloud pubsub topics create projects/XRAY_PROJECT/topics/DICOM_TOPIC

# Create the subscription
gcloud pubsub subscriptions create DICOM_SUB \
--topic=projects/XRAY_PROJECT/topics/DICOM_TOPIC

# Configure analytics apps to pull messages whenever a DICOM instance is created
gcloud pubsub subscriptions pull --auto-ack projects/XRAY_PROJECT/subscriptions/DICOM_SUB
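One step the snippet above leaves implicit is pointing the DICOM store at the topic, so that the Healthcare API publishes a notification for each newly created instance. A sketch of that wiring, reusing the same store and topic names:

# Point the DICOM store at the Pub/Sub topic so new instances trigger notifications
gcloud healthcare dicom-stores update DICOM_STORE_ID \
--dataset=XRAY_DATA \
--location=us-central1 \
--pubsub-topic=projects/XRAY_PROJECT/topics/DICOM_TOPIC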

Bulk Import: Loading historical data in a single operation is key for retrospective analysis of patients. When you have tens of thousands of patient records to sift through, it takes a lot of resources and time to assemble a cohort of those whose X-rays show pneumonia, especially when patients have more than a single X-ray image. Cloud Storage buckets allow bulk import and export of data to and from DICOM stores in a single operation, versus doing this programmatically for every file. It saves you from going into each patient’s imaging history, which may contain hundreds of X-rays over time, just to look for a single impression of a diagnosis such as pneumonia. A bulk mode is a crucial element for many healthcare systems, especially ones with large historical datasets.

# Importing data from Google Cloud Storage into the DICOM store
gcloud healthcare dicom-stores import gcs DICOM_STORE_ID \
--dataset=XRAY_DATA \
--location=us-central1 \
--gcs-uri=gs://chest_xray_bucket/dicom/image*.dcm
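The reverse direction works the same way. A sketch of the export counterpart, assuming a hypothetical gs://chest_xray_bucket/export destination prefix:

# Exporting the DICOM store's contents back out to Google Cloud Storage
gcloud healthcare dicom-stores export gcs DICOM_STORE_ID \
--dataset=XRAY_DATA \
--location=us-central1 \
--gcs-uri-prefix=gs://chest_xray_bucket/export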

Image Compression and Decoding:

Image compression reduces the size of an image without degrading its quality too much. Compression methods supported by the DICOM standard are fine to use as long as they don’t impact the underlying information and its metadata. At the destination system, there are a couple of ways we can decode images for analysis:

  • The pydicom library converts DICOM files into Python objects for further manipulation, making it easy to read these complex files and work with their pixel data and metadata.
# Using the pydicom library to decode DICOM images
import pydicom

# Read the file into a pydicom Dataset ("file path" is a placeholder)
dicom_data = pydicom.dcmread("file path")
pixels = dicom_data.pixel_array  # pixel data as a NumPy array
  • TensorFlow I/O is another library that can handle DICOM images for further processing; its DICOM decoder turns raw file bytes into image tensors.
# Using the TensorFlow I/O library to decode DICOM images
import tensorflow as tf
import tensorflow_io as tfio
contents = tf.io.read_file("file path")  # raw DICOM bytes; "file path" is a placeholder
image = tfio.image.decode_dicom_image(contents, on_error='skip',
                                      scale='preserve', dtype=tf.uint16)

Securing Patient Data and Protecting Privacy: The security part is a bit tricky due to the sheer size and volume of DICOM images. We can’t just apply a complex encryption algorithm to the images without compromising speed for analytics. The explosive growth in image volume and 3D imaging complicates this further, so it is important to find the right balance between security and speed.

Traffic flowing from a PACS system to a cloud-based DICOM store travels over the public internet. Cloud VPN connects the PACS system to your DICOM store and secures this traffic through an IPsec VPN tunnel. Radiology images often contain a patient’s personal data, so it is important to de-identify the images during the preprocessing stages to remove any patient data from them. GCP allows de-identification at the dataset level (using datasets.deidentify) or at the store level (dicomStores.deidentify). The Cloud Healthcare API, through its native integration with Cloud Audit Logs, can track and monitor different accesses and actions as images flow through the system.

Figure 5: Workflow showing the encryption of traffic and de-identification within the DICOM store
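As a sketch of the dataset-level option, the datasets.deidentify method mentioned above can be invoked over REST; the destination dataset name and the tag keep-list below are illustrative assumptions:

# Calling datasets.deidentify over REST; names and keep-list are illustrative
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://healthcare.googleapis.com/v1/projects/XRAY_PROJECT/locations/us-central1/datasets/XRAY_DATA:deidentify" \
--data '{
  "destinationDataset": "projects/XRAY_PROJECT/locations/us-central1/datasets/XRAY_DATA_DEID",
  "config": {"dicom": {"keepList": {"tags": ["PatientSex", "StudyDate"]}}}
}'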

What’s next? The applications of AI in the medical imaging space will continue to grow. I’m really excited about how this space is evolving and its impact on reducing physician burnout. In addition to having a great algorithm, it is critical to enforce a standardized data pipeline for medical images that is reliable, scalable, secure, and fast, enhancing clinical diagnosis and decision support.

References

  1. Research Shows Connection Between Radiology Reading Volume and False-Positive Rates in Mammography. RSNA News, 2019. Retrieved from https://www.rsna.org/news/2019/october/mammo-readings-hoff
  2. Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, Ronald Summers. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. IEEE CVPR, pp. 3462–3471, 2017. The source data is provided by the NIH Clinical Center and is available through the NIH download site: https://nihcc.app.box.com/v/ChestXray-NIHCC

Disclaimer: The views and opinions expressed in this blog post are my own and do not necessarily reflect the official policy or position of my employer.
