Working with Sensitive Data and LLMs

Sarthak Arora & Marcus Zethraeus
January 12, 2024

In an era where data is as critical as currency, its potential to unlock transformative insights is unparalleled. Yet, with great power comes great responsibility, especially when the data in question is sensitive by nature. The healthcare sector, where patient data is both invaluable and confidential, stands at the forefront of this conundrum. The integration of Large Language Models (LLMs) such as ChatGPT and advanced code interpreters in medical data analysis presents a promising yet precarious frontier that demands a nuanced approach. It is imperative to tread carefully, ensuring the safeguarding of Personally Identifiable Information (PII) to maintain the ethical use of data and protect individuals' privacy.

Insights are Valuable, Sharing Data can be Risky

The insights gleaned from data analytics are the driving force behind personalised medicine, operational efficiency, and potentially drastic improvements in patient outcomes through concepts such as genetic twinning. However, the sensitive nature of medical data means that the stakes for privacy and security are sky-high, and many data owners are unwilling to share data outside of their own environments. Data breaches have had, and continue to have, devastating consequences, ranging from the violation of patient privacy to legal and financial repercussions for healthcare providers. For executives, though, the biggest threat is often the loss of operational continuity in patient care if data is held to ransom, and this often drives some of the highest ransomware demands.

Sensitive data is not confined to the medical domain; it permeates many other sectors, notably finance and law. While medical information is undoubtedly crucial and requires stringent safeguards, similar considerations are paramount in other fields where sensitive data plays a pivotal role. In the financial sector, for instance, individuals entrust institutions with a wealth of personal and financial information, necessitating robust security measures to protect against unauthorised access and potential misuse. Similarly, the legal domain harbours sensitive data encompassing confidential case details, client information, and privileged communications. As technology continues to advance, ensuring the confidentiality and integrity of sensitive data remains a universal challenge that demands proactive measures and vigilant protection mechanisms across diverse sectors.

To navigate this minefield, the implementation of robust security protocols is essential. Encryption, stringent access controls, and sophisticated data anonymization processes are some of the shields that protect the sanctity of patient data. It is also important to consider that data regulations vary globally and continue to evolve, and that institutions differ in their ability to comply with them depending on their size and financial situation. This often leads to institutions in more developed countries not sharing their data, which reduces the data available to train models and unlock new insights.

Using LLMs on Medical Datasets

The rise of LLMs heralds a new era in medical data analysis. These models can sift through vast amounts of medical literature, synthesise patient information, and even assist in diagnostic processes. Their ability to process natural language can make them invaluable allies for healthcare professionals who need to distil complex medical data into actionable insights.

At the intersection of healthcare and technology, code interpreters are the unsung heroes that facilitate the creation of machine learning models. These tools enable data scientists and healthcare professionals to collaborate seamlessly, translating medical datasets into algorithms that can predict patient outcomes, identify disease patterns, and personalise treatments.

Protecting data privacy is a critical consideration in today's digital landscape. There are several options and strategies to safeguard data privacy, and two key approaches are anonymization of data and deploying local machine learning models. But there are multiple challenges with using these approaches:

Anonymization: Anonymization involves removing or modifying personally identifiable information (PII) in a dataset, making it difficult to associate specific information with an individual.


     - Achieving a balance between preserving utility and protecting privacy.

     - The risk of re-identification if not done properly.

     - Maintaining data quality and usefulness for analysis after anonymization.
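To make the trade-offs above concrete, here is a minimal sketch of one way anonymization could look in Python, using only the standard library. It is an illustration under stated assumptions, not a recommended production pipeline: the records, field names, key, and helper functions (`pseudonymise_id`, `age_band`, `anonymise`, `smallest_group`) are all hypothetical. Direct identifiers are dropped or replaced with a keyed hash, quasi-identifiers (age, postcode) are generalised, and a simple group-size check hints at re-identification risk.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical toy records; every name and value here is made up for illustration.
RECORDS = [
    {"patient_id": "P001", "name": "Alice Smith", "age": 34, "postcode": "SW1A 1AA", "diagnosis": "asthma"},
    {"patient_id": "P002", "name": "Bob Jones", "age": 37, "postcode": "SW1A 2BB", "diagnosis": "diabetes"},
    {"patient_id": "P003", "name": "Carol White", "age": 61, "postcode": "M1 4WP", "diagnosis": "asthma"},
    {"patient_id": "P004", "name": "Dan Brown", "age": 68, "postcode": "M1 7ED", "diagnosis": "hypertension"},
    {"patient_id": "P005", "name": "Eve Black", "age": 45, "postcode": "EH1 1YZ", "diagnosis": "diabetes"},
]

# Assumption: in practice this key would be stored securely, outside the dataset.
SECRET_KEY = b"replace-with-a-secret-kept-outside-the-dataset"


def pseudonymise_id(patient_id, key=SECRET_KEY):
    """Replace an identifier with a keyed hash (stable token, not reversible without the key)."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def age_band(age, width=10):
    """Generalise an exact age into a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def anonymise(record):
    """Drop the name, pseudonymise the ID, and coarsen age and postcode."""
    return {
        "pid": pseudonymise_id(record["patient_id"]),
        "age_band": age_band(record["age"]),
        "postcode_area": record["postcode"].split()[0],
        "diagnosis": record["diagnosis"],
    }


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values (the 'k' in k-anonymity)."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())


anonymised = [anonymise(r) for r in RECORDS]
k = smallest_group(anonymised, ["age_band", "postcode_area"])
print(f"smallest quasi-identifier group size (k): {k}")
```

In this toy dataset one record is the only member of its age band and postcode area, so its group has size one and it could still be singled out even though the name and ID are gone. Coarsening further (wider age bands, larger regions) would raise the group size at the cost of analytical detail, which is the utility-versus-privacy balance described above.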

Local/On-Premise LLMs: Deploying LLMs on local servers or on-premise infrastructure, as opposed to relying on cloud-based solutions.


     - Scalability: On-premise solutions may face challenges in scaling resources compared to cloud-based options.

     - Maintenance and Updates: Organisations are responsible for maintaining and updating hardware and software components, requiring dedicated resources.

     - Initial Infrastructure Costs: Setting up and maintaining on-premise infrastructure may involve higher initial costs compared to cloud solutions.

One can run or train models locally/on-premise using GPT implementations hosted either on local machines or on private cloud servers. For instance, tools such as PrivateGPT aim to provide ChatGPT-style capabilities without sending data to a third party, since they run locally.

Synthetic Data: 

Another approach to working with sensitive data is to use synthetic data to train machine learning models. Synthetic data refers to artificially generated data that mimics the characteristics of real-world data but is not derived from actual observations. The purpose of creating synthetic data is to preserve privacy, confidentiality, or proprietary information while still allowing for analysis, testing, or training of machine learning models. It can be particularly useful in situations where access to real data is restricted or when there are concerns about data privacy and security.

Synthetic datasets follow a distribution similar to that of the real data. Several approaches can be used to generate synthetic data:

  1. Rule-based methods: In this approach, synthetic data is generated based on predefined rules and patterns derived from the original data. For example, if the original data has a certain distribution or statistical properties, these characteristics can be replicated in the synthetic data.
  2. Statistical methods: Statistical methods involve using mathematical models to capture the statistical properties of the original data. Techniques such as bootstrapping, Monte Carlo simulations, and copula models may be employed to generate synthetic data that closely resembles the statistical properties of the real data.
  3. Machine learning-based methods: Some platforms leverage machine learning models to generate synthetic data. These models can learn the underlying patterns and relationships in the real data and then generate synthetic data with similar characteristics. Generative models like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) are commonly used for this purpose. 
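As a small illustration of the statistical approach described above, the sketch below fits a two-variable normal distribution to a handful of made-up "real" measurements and then samples new rows from it. Everything here is an assumption for demonstration purposes: the data, the column meanings, and the function names `fit_bivariate_normal` and `sample_synthetic` are hypothetical, and a plain Gaussian fit is only one simple choice among the statistical techniques mentioned. It also provides no formal privacy guarantee on its own.

```python
import math
import random
import statistics

# Hypothetical "real" measurements: (age in years, systolic blood pressure in mmHg).
REAL = [(34, 118), (45, 125), (52, 131), (61, 140),
        (29, 115), (70, 148), (48, 128), (57, 135)]


def fit_bivariate_normal(rows):
    """Estimate the means, standard deviations, and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    rho = cov / (sx * sy)
    return mx, my, sx, sy, rho


def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows from the fitted distribution, preserving the correlation."""
    mx, my, sx, sy, rho = params
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Guard against tiny negative values from floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + residual * z2)  # mixes z1 and z2 to induce correlation rho
        rows.append((x, y))
    return rows


params = fit_bivariate_normal(REAL)
synthetic = sample_synthetic(params, n=5000)
refit = fit_bivariate_normal(synthetic)
print("real  mean age / bp / correlation:", round(params[0], 1), round(params[1], 1), round(params[4], 2))
print("synth mean age / bp / correlation:", round(refit[0], 1), round(refit[1], 1), round(refit[4], 2))
```

The synthetic rows are new draws rather than copies of the real records, yet their means and correlation stay close to the originals, which is what makes them usable for model training or exploratory analysis. Real datasets with skewed, categorical, or many interdependent columns would likely need the richer techniques listed above, such as copula models or generative models.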

There are several commercial services that provide synthetic data models as a service, such as Syndata, as well as open source libraries for synthetic data.


The synergy between sensitive data and LLMs, powered by sophisticated code interpreters, is poised to redefine the healthcare landscape. The insights that can be harvested from medical datasets have the power to transform patient care, making it more personalised, efficient, and effective. However, the journey towards this bright future must be paved with ethical considerations, robust data protection, and a commitment to upholding the privacy of patients. As we continue to explore the vast potential of LLMs and machine learning in healthcare, the goal is clear: to harness the power of technology to heal, without causing harm. 

In an upcoming blog article, we will take a deep dive into the technical details of how one can train models and perform exploratory data analysis (EDA) on top of synthetic data.
