From Causal Understanding of Healthy Longevity to Smart Health
1. Introduction
Dramatic breakthroughs in medicine, public health, and social and economic development have been associated with unprecedented lengthening of the human lifespan across the world over the past century, including in Vietnam and the United States. At its current pace, population aging is poised to impose a significant strain on economies, health systems, and social structures worldwide. But this need not be.
The U.S. National Academy of Medicine (NAM) and related organizations worldwide envision an explosion of potential new medicines, treatments, technologies, and preventive and social strategies that transform the way we age and ensure better health, function, and productivity during a period of extended healthy longevity. To transform our society from the gains of just longevity in the past century to healthy longevity in the forthcoming century, however, it is important to understand not just the association between healthy longevity and these various medical, public health, and social improvements, but also the causal relationship. This will allow us to tease apart how various medical, public health, and social factors are causally related to each other in longitudinal data, to infer counterfactuals in what might happen under various interventions (and how this generalizes), and to design optimal causal interventions. Notably, phenotype is a complicated function of genotype, environment, and interventions that must be understood.
New data collection efforts in the U.S. are capturing multivariate longitudinal data that describe the complex system of human health and various interventions, and we propose to build platforms for similar data collection in Vietnam that expand on several existing datasets such as the Vietnam National Survey. Notably, the “All of Us” research program from the U.S. National Institutes of Health (NIH) is inviting one million people across the U.S. to help build one of the most diverse longitudinal health databases in history. The aim is to learn how our biology, lifestyle, and environment affect health, and data will be available to all researchers, including the proposed VISHC team. Now is the time to advance causality methods in the service of smart health.
Indeed, since solutions are urgently needed to maximize the number of years lived in good health and a state of well-being, now is the time to causally characterize the next breakthroughs in healthy longevity, so we can all benefit from the tremendous opportunities that a variety of potentially causally interlinked interventions can offer. The proposed work has several intertwined methodological research thrusts.
- Causal structure discovery to disentangle how various factors interact with one another in the complex system of human health evolution over time.
- Counterfactual inference to estimate the treatment effect at the individual patient level, thereby allowing personalized decision (treatment), risk assessment, and prevention.
- Causal evaluation framework to determine how to evaluate the performance or generalization of a causal model.
- Causal intervention optimization where adjustable properties of interventions, such as thresholds for action, can be optimized based on their causal impacts.
These methodological thrusts will be brought together for cross-cutting healthy longevity applications. Bringing thrusts and cross-cuts together will provide significant insight and design principles into the causal structure of healthy longevity by advancing causality methodology and applying it to large-scale longitudinal health data.
The remainder of the proposal outlines these research thrusts and provides a student recruiting / co-mentoring plan and future collaboration plan between VinUni and Illinois. For brevity, references are largely omitted, but suffice it to say that all of the PIs have deep expertise and significant past work that will be built upon.
2. Causal Structure Discovery
Causal structure discovery is predicated on the premise that causality is essentially a symmetry-breaking attribute. That is, the past affects the future but the future does not affect the past. Discerning this from data is non-trivial, since most existing techniques are excellent at capturing symmetric dependencies through metrics such as correlation or mutual information. To overcome this challenge, “asymmetric” information-theoretic measures have been established. This includes the characterization of dependencies using directed acyclic graphs (DAGs). Drawing on recent work in another field (Kumar et al. 2023), we pose the following hypothesis: the causal structure of interaction among driving factors determines the functional response (like a mental health disorder) of the human health system, and different functional responses are attributable to different causal structures. In other words, causal structures are uniquely associated with functional outcomes, and adverse functional outcomes can be traced back uniquely to structures of causal interactions.
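To make the contrast between symmetric and asymmetric measures concrete, the following is a minimal sketch of one well-known asymmetric information-theoretic measure, lag-1 transfer entropy, estimated by simple counting on binary sequences. The choice of transfer entropy, the function name `transfer_entropy`, and the toy data (where `y` is a noisy one-step-delayed copy of `x`) are all illustrative assumptions; the proposal does not specify which measure will be used.

```python
import math
import random
from collections import Counter

def transfer_entropy(source, target):
    """Plug-in estimate (in bits) of lag-1 transfer entropy source -> target."""
    triples = Counter()   # counts of (target_next, target_now, source_now)
    pairs_ts = Counter()  # counts of (target_now, source_now)
    pairs_tt = Counter()  # counts of (target_next, target_now)
    singles = Counter()   # counts of target_now
    n = len(target) - 1
    for t in range(n):
        triples[(target[t + 1], target[t], source[t])] += 1
        pairs_ts[(target[t], source[t])] += 1
        pairs_tt[(target[t + 1], target[t])] += 1
        singles[target[t]] += 1
    te = 0.0
    for (y1, y0, x0), count in triples.items():
        p_joint = count / n
        p_full = count / pairs_ts[(y0, x0)]          # p(y1 | y0, x0)
        p_self = pairs_tt[(y1, y0)] / singles[y0]    # p(y1 | y0)
        te += p_joint * math.log2(p_full / p_self)
    return te

# Hypothetical toy data: x is random, y copies x one step later (10% flips).
rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(5000)]
y = [0] + [xi if rng.random() > 0.1 else 1 - xi for xi in x[:-1]]

te_xy = transfer_entropy(x, y)  # information flowing from x to y
te_yx = transfer_entropy(y, x)  # information flowing from y to x
print(f"TE(x->y) = {te_xy:.3f} bits, TE(y->x) = {te_yx:.3f} bits")
```

In this toy setting the estimate in the direction `x -> y` is clearly larger than in the reverse direction, which is the kind of asymmetry a symmetric measure such as correlation cannot expose.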
Our goal here is to take this framework further using generative AI algorithms like Transformers, for multivariate systems. Present advances in Transformers do not allow causal analyses, even though the causal dependencies are encoded. We propose to modify Transformer structure to allow encoding/decoding of causal dependencies. The major advantage of this approach will be that we can discern cause-effect relationships from noisy and heterogeneous data. This will be further combined with counterfactual analysis as indicated next.
3. Counterfactual Inference
3.1. Problem Motivations
Machine learning, especially deep neural networks (DNN), has rapidly advanced and transformed our daily lives in various fields and applications, including healthcare. However, in applications such as precision medicine, predicting a disease with perfect accuracy is not our aim. Our goal is to build machine learning methods that support actionable decisions at the patient level, i.e., estimating the treatment effect and its significance. This individual treatment effect (ITE), given the treatment, is the difference between two potential outcomes: the factual outcome and the counterfactual outcome. Estimating ITE allows us to answer the what-if question “What would have happened if this patient had received alternative treatment X?”, which in turn supports personalized treatment decisions, risk assessment, and prevention. Existing counterfactual inference methods are not suitable for modern healthcare applications due to the following challenges.
- Multi-modal representation learning: Health data is high-dimensional and multimodal. We must integrate data from modalities such as image, text, and genomics to embed them in a shared semantic space with disentangled/compositional representations where information can be combined to estimate counterfactual effect.
- Complex high-level interaction: Estimating ITE with complex data representations requires more flexible models. Existing works in counterfactual inference rely on simple linear functions and cannot capture high-level interactions or compositional concepts of the input. Interventions or treatments usually act on (or are influenced by) these high-level concepts, rather than purely on (or by) the input raw data. For example, the complex interaction between environmental factors and genetics can greatly influence the treatment effect for obesity.
- Distributional shift: Learning a counterfactual model for each environment, e.g., a single hospital, increases the potential to learn spurious correlations, resulting in models that may not generalize well to new hospitals. We must capture a causal structure that is invariant to distributional changes among environments. For example, estimating the effectiveness of the second dose of COVID-19 vaccine should depend on the patient and the vaccine manufacturer, but not on the location where the vaccine was taken.
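The individual treatment effect introduced at the start of this subsection can be written compactly in standard potential-outcomes notation. The symbols below are assumptions made for illustration (the proposal does not fix a notation): a binary treatment indicator $T_i \in \{0, 1\}$, potential outcomes $Y_i(1)$ and $Y_i(0)$ for patient $i$, and covariates $X$.

```latex
\begin{align}
\tau_i &= Y_i(1) - Y_i(0), \\
Y_i &= T_i \, Y_i(1) + (1 - T_i) \, Y_i(0), \\
\tau(x) &= \mathbb{E}\left[\, Y(1) - Y(0) \mid X = x \,\right].
\end{align}
```

The second line makes the core difficulty explicit: only the factual outcome $Y_i$ is observed, so the counterfactual term in $\tau_i$ must be estimated. The third line is the conditional average version that is typically targeted when the ITE itself is not identifiable.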
3.2. Research Contributions
To tackle these challenges, we aim to extend existing counterfactual methods to the healthy longevity domain. We will develop counterfactual inference algorithms that incorporate modern multi-modal and complex healthcare data, sampled from our proposed causal framework. We first propose a comprehensive and flexible representation learning approach composed of two levels: intra-modal and inter-modal. We aim to extend existing single-modal representation learning in healthcare to knowledge-enhanced multi-modal representation by constructing a shared common semantic space among multiple data modalities: (1) 2D medical images; (2) text-based genetic descriptors; (3) text-based clinical notes; (4) structured properties from demographic data; and (5) time-series properties from EHR coding data. Then, we propose to learn compositional and invariant structured representations via adversarial learning. While existing representation-learning methods constrain both data representations and models to be linear, our proposed approach can capture nonlinear and complex representations of the data. This has the potential to vastly improve representation learning and expand these tools to other domains.
Figure 1: Multi-modal Causal Evaluation Framework.
The proposed multi-modal representation learning model, represented as neural networks, is learned by balancing three objectives: (i) low prediction error on the factual outcomes, (ii) low prediction error on the counterfactual outcomes, and (iii) similar, or balanced, distributions across the treatment populations. The distance between the factual and counterfactual distributions is based on how much the factual and counterfactual outcomes may disagree. Any discrepancy measure, such as Maximum Mean Discrepancy, can be used. However, instead of selecting a discrepancy measure that may introduce inductive biases for the inference task, our proposed adversarial training framework will search for representations under which the two distributions are indistinguishable.
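As a point of reference for the balancing objective, here is a minimal sketch of the Maximum Mean Discrepancy mentioned above, computed between two sets of representation vectors with an RBF kernel. This is the fixed-discrepancy baseline, not the proposed adversarial approach. The function names, the kernel bandwidth `gamma`, and the synthetic two-dimensional "representations" are hypothetical choices for illustration only.

```python
import math
import random

def rbf(a, b, gamma):
    """RBF kernel exp(-gamma * ||a - b||^2) between two vectors."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def mmd2(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(us, vs):
        total = sum(rbf(u, v, gamma) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

rng = random.Random(0)

def sample(mean, n=150):
    """Hypothetical 2-D representations drawn around a given mean."""
    return [[rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)] for _ in range(n)]

treated = sample(0.0)
control_balanced = sample(0.0)   # same distribution as the treated group
control_shifted = sample(1.5)    # systematically different from the treated group

mmd_balanced = mmd2(treated, control_balanced)
mmd_shifted = mmd2(treated, control_shifted)
print(f"balanced: {mmd_balanced:.4f}, shifted: {mmd_shifted:.4f}")
```

A representation learner that adds such a term to its loss is pushed toward representations in which the treatment groups look alike. The kernel and its bandwidth are exactly the kind of fixed choice that the adversarial formulation is meant to avoid.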
Besides counterfactual inference, the proposed adversarial learning framework also offers a means of “explaining” the response (i.e., the ITE) by inspecting the counterfactual samples, an approach called counterfactual explanation. Specifically, we can search for small perturbations to specific parts of the input (such as environmental factors) that result in a noticeable change in ITE (such as a significant reduction in diabetes risk). These explanations prove valuable for both patients and healthcare professionals.
4. Causal Evaluation Framework
4.1. Problem Motivations
It is challenging to evaluate the performance or generalization of a causal model (or counterfactual model) on real-world healthcare data, due to the nature of the causal inference problem and the multi-modal, complex nature of healthcare data. A suitable causal evaluation framework is crucial in large-scale implementations of causal models in healthcare, where any potential interventions are conducted directly on human patients. Shortcomings of the existing evaluation process of causal models and challenges in healthcare are as follows.
- Lack of suitable, real-world datasets: We generally do not have access to the ground-truth causal effect or counterfactual answers. Existing methods rely on either synthetic or semi-synthetic datasets for the empirical evaluation of the models. However, synthetic datasets may give us an unreliable generalization to real-world settings, where there exist non-linear and complex relationships between variables. Existing semi-synthetic datasets are also not suitable as replacements of real-world evaluation since they are single-modal and small-scale. Finally, dataset designers should consider important issues such as “missing-not-at-random” that may cause spurious correlation in the data collection process.
- Multi-modal healthcare data: Modern healthcare includes multi-modal, high-dimensional, and often unstructured data. Diseases such as Alzheimer’s, diabetes, and cancer have causes arising from complex interactions between environmental and genetic factors. Features from medical images, genomics, clinical notes, and demographics must be considered in both the causal models and the evaluation framework. Further, several diseases (e.g., chronic illnesses such as diabetes, rheumatism, or ischemic heart disease) progress over time, a dimension that must be considered in causal studies. Collecting multi-modal healthcare data is, however, very challenging since it requires close collaborations between causal inference researchers and healthcare practitioners.
4.2. Research Contribution
We aim to develop an extensive evaluation platform, as in Figure 1, to collect a variety of large-scale healthcare datasets to allow researchers to study and evaluate their developed causal-learning algorithms and models. Our platform will facilitate discoveries of causal and counterfactual methods for taking actionable decisions in real-world healthcare environments. The detailed research objectives are as follows.
- Evaluation Framework for Causal Model Development and Testing: To reliably evaluate causal models and ensure their generalization to real-world health applications, we first propose to standardize the evaluation process of causal models in healthcare. We will develop a framework that allows causal learning researchers to model, train, and evaluate their causal methods, including Transformer models, on real-world collected healthcare data. We focus on three important considerations in designing the framework: dataset design, awareness of confounding/proxy variables, and removal of spurious correlation (in the second objective). The specification of datasets covers several types of causal models, including structured-causal and counterfactual models where there exist both confounding and proxy variables. When training causal models, the identities of these variables can be hidden. For example, for some chronic illnesses such as diabetes and hyperlipidemia, confounding variables such as socioeconomic status and last year’s income can influence both the treatments and the corresponding outcomes. We also consider two other important factors, missing-at-random and cohort design, which have previously been associated with creating spurious correlations in the data.
- Connecting Randomized-Controlled-Trial and Causal Evaluation: Most existing healthcare causal datasets are observational or experimental, resulting in modeling bias due to confounding and proxy variables. We propose to rely on the principles of Randomized Controlled Trial (RCT) and Response-Adaptive Randomization data collection. Specifically, randomized data collection should be conducted whenever possible. For example, patient data should be collected from randomly selected environments (e.g., hospitals in our network) or with different demographics (e.g., socioeconomic status) to remove possible spurious correlations. The purposes of this design are twofold. First, we can confirm the existence of a causal relationship between the treatment and outcome, removing the modeling bias. Second, such data collection allows the causal model to study real, interesting phenomena, encouraging its generalization to real-world settings.
- Multi-modal Data Collection: We propose to collect and represent knowledge with health data from multiple modalities, including medical images, genomics, clinical notes, and demographics. Each modality may capture only a portion of the overall causal structure. Inadequate data collection from multiple modalities can generate spurious inconsistencies within and across modalities, and the resulting data can be hard to integrate and fuse. Existing approaches focus on single modalities and cannot distinguish causal relationships from innocuous ones between modalities.
- In-house Existing Multi-modal Data: Collecting a diverse longitudinal health database is challenging and time-consuming, and requires considerable effort and resources. We propose to evaluate the proposed causal models and our platform on in-house datasets provided by our colleagues at VinUni, VinUni-Illinois Smart Health Center, VinMec Hospital System, and VinBigData. This resource includes the longitudinal Bone Cancer Biobank at VinMec Hospitals, 2D/3D medical images (https://vindr.ai/), clinical text and lifestyle data from the VAIPE project (https://vaipe.org/), and survey data from the Cancer Wellness Program (https://cwpvietnam.com/). The compilation of these diverse, multimodal medical datasets provides a unique and extensive empirical setting for the development of causal learning methods within the medical domain.
The causal evaluation framework gives researchers in causal inference a reliable, large-scale platform for their studies. While our data collection process will first be realized in Vietnam’s hospital network, the principles behind our design, and its successes, can be carried over to other hospital networks around the world, further promoting reliable studies of causal learning in healthcare. Notably, the data collected through the platform will be used in the other thrusts of the proposed research.
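The motivation for randomized data collection in the second objective above can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the proposed platform: the confounder standing in for socioeconomic status, the outcome model, the treatment probabilities, and the true effect of 1.0 are all hypothetical.

```python
import random

def difference_in_means(randomized, n=20000, true_effect=1.0, seed=0):
    """Simulate patients and return the naive treated-minus-control outcome gap."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        ses = rng.gauss(0.0, 1.0)  # hypothetical confounder (e.g., socioeconomic status)
        if randomized:
            # Treatment assigned by a fair coin, independent of the confounder.
            is_treated = rng.random() < 0.5
        else:
            # Observational setting: higher-SES patients are treated more often.
            is_treated = rng.random() < (0.8 if ses > 0 else 0.2)
        outcome = true_effect * is_treated + 2.0 * ses + rng.gauss(0.0, 1.0)
        (treated if is_treated else control).append(outcome)
    return sum(treated) / len(treated) - sum(control) / len(control)

observational_gap = difference_in_means(randomized=False)
randomized_gap = difference_in_means(randomized=True)
print(f"observational: {observational_gap:.2f}, randomized: {randomized_gap:.2f}")
```

Under these assumptions the observational gap substantially overstates the true effect because the confounder drives both treatment and outcome, while the randomized gap stays close to 1.0.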
5. Causal Intervention Design
5.1. Problem Motivations
Whether one receives an important health intervention on the basis of a fairly arbitrary blood pressure threshold is linked to significant differences in health outcomes and mortality; patients on either side of the threshold may be largely similar otherwise. Similar arbitrary discontinuities and cohort assignments abound in public policy, economics, and healthcare settings and may have significant impacts. More broadly, categorization on the basis of fairly arbitrary partitioning of certain attributes abounds in social life. This arbitrariness is precisely why discontinuities and other interventional clusterings have been used to infer causal relationships among variables in numerous settings; regression discontinuity from econometrics assumes the existence of a discontinuous variable that splits the population into distinct partitions to estimate the causal effect of a given phenomenon. Several settings in health and its social determinants, however, might allow the (re)design and optimization of thresholds to create new discontinuities or clusterings, and hence have people be subject to the effects of different partitions. The proposed research addresses this lacuna at the intersection of causal inference and mechanism design, in the service of healthy longevity.
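For readers less familiar with regression discontinuity, the following is a minimal sketch of a sharp design: a separate line is fitted on each side of the cutoff within a bandwidth, and the effect is the jump between the two fitted values at the cutoff. The blood pressure cutoff of 140, the bandwidth, the outcome model, and the true jump of -2.0 are hypothetical values chosen only to make the example concrete.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rd_estimate(running, outcome, cutoff, bandwidth):
    """Sharp regression-discontinuity estimate: jump in fitted outcome at the cutoff."""
    left = [(r - cutoff, y) for r, y in zip(running, outcome)
            if cutoff - bandwidth <= r < cutoff]
    right = [(r - cutoff, y) for r, y in zip(running, outcome)
             if cutoff <= r <= cutoff + bandwidth]
    # The running variable is centered at the cutoff, so each intercept is the
    # fitted outcome exactly at the threshold.
    left_at_cutoff, _ = fit_line([r for r, _ in left], [y for _, y in left])
    right_at_cutoff, _ = fit_line([r for r, _ in right], [y for _, y in right])
    return right_at_cutoff - left_at_cutoff

# Hypothetical data: patients at or above the cutoff receive an intervention
# that lowers a risk score by 2.0; risk otherwise rises smoothly with pressure.
rng = random.Random(0)
bp = [rng.uniform(120.0, 160.0) for _ in range(4000)]
risk = [0.05 * b - (2.0 if b >= 140.0 else 0.0) + rng.gauss(0.0, 1.0) for b in bp]

effect = rd_estimate(bp, risk, cutoff=140.0, bandwidth=10.0)
print(f"estimated effect at the threshold: {effect:.2f}")
```

The estimate relies on patients just below and just above the threshold being comparable, which is the same arbitrariness argument made in the paragraph above.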
5.2. Research Contribution
To take an example of discontinuity design, we can consider the design of partitions for a given discontinuous variable to optimize a certain effect previously studied using regression discontinuity with a characterization of counterfactual impact. To do so, we propose a quantization-theoretic approach to optimize the effect of interest, first learning the causal effect size of a given discontinuous variable and then applying dynamic programming for optimal quantization design of discontinuities that balance the gain and loss in the effect size. We will also develop a computationally efficient reinforcement learning algorithm for the dynamic programming formulation of optimal quantization. We would initially demonstrate our approach by designing optimal discontinuities for counterfactuals of social capital, social mobility, and health through a spatial structure, and would then extend the approach to longitudinal data. We also aim to develop analogous techniques at the intersection of causal inference and optimal quantization theory to design mechanisms with other kinds of interventional designs. Information-theoretic characterizations of health factors that emerge from causal structure discovery may be especially useful for optimal causal intervention design.
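The dynamic programming step can be sketched with the classical one-dimensional optimal quantization problem: split sorted values into `k` contiguous cells so that a per-cell cost is minimized. The cost used here, within-cell squared error, is only a stand-in assumption; in the proposed work the cost would instead encode the learned causal effect sizes, whose form the proposal does not specify. The function name and the toy values are likewise hypothetical.

```python
def optimal_partition(values, k):
    """Split sorted values into k contiguous cells minimizing total squared error.

    Returns (total_cost, thresholds), where each threshold is the smallest
    value belonging to the second, third, ... cell.
    """
    xs = sorted(values)
    n = len(xs)
    # Prefix sums allow the cost of any cell to be computed in O(1).
    prefix = [0.0]
    prefix_sq = [0.0]
    for x in xs:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    def cell_cost(i, j):
        """Squared error of xs[i:j] around its own mean."""
        total = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - total * total / (j - i)

    inf = float("inf")
    # best[g][j]: minimal cost of splitting xs[:j] into g cells.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for g in range(1, k + 1):
        for j in range(g, n + 1):
            for i in range(g - 1, j):
                cost = best[g - 1][i] + cell_cost(i, j)
                if cost < best[g][j]:
                    best[g][j] = cost
                    cut[g][j] = i

    # Backtrack to recover where each cell starts.
    starts = []
    j = n
    for g in range(k, 0, -1):
        j = cut[g][j]
        starts.append(j)
    thresholds = [xs[s] for s in sorted(starts)[1:]]  # drop the first cell's start
    return best[k][n], thresholds

cost, thresholds = optimal_partition([1, 2, 3, 10, 11, 12, 20, 21, 22], k=3)
print(cost, thresholds)
```

This exact recursion takes on the order of `k * n^2` cell evaluations, which is one reason a more computationally efficient approximate solver is attractive for large longitudinal datasets.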
6. Healthy Longevity Applications
Our proposal opens a wide array of applications across various clinical scenarios, with the following key applications.
- Personalized and optimal treatment: The proposed causal platform will be used to develop personalized treatment plans for individuals based on their specific health profiles. It can help identify which treatment or intervention is most likely to benefit a specific patient (hence “personalized”). By analyzing causal relationships in patient data, clinicians can estimate how a patient’s health will respond to different treatment options and then make more informed decisions about the most suitable one. The system can also help design more effective preventive measures and lifestyle interventions.
- Causal relationships between risk factors and outcomes: Our models can be used to learn causal relationships between risk factors (such as lifestyle choices, genetics, environmental factors, and medical treatments) and outcomes (like diseases, recovery rates, or quality of life), and to help identify and understand the causal effect of different variables on outcomes of interest. This information is crucial for making informed decisions about preventive measures, treatments, and health/social policies.
- Explainable AI (XAI) with causal and counterfactual explanations for CADs: XAI systems aim to make AI models more transparent and interpretable so that their decisions and reasoning can be understood by humans. This is especially important in healthcare applications. Our approach can improve transparency and interpretability in XAI and CAD systems by explaining AI decisions based on causal relationships. It helps identify important features, mitigates bias, and supports counterfactual reasoning. In healthcare CAD, it optimizes treatment recommendations and aids in ethical decision-making. The proposed causal platform enhances model robustness and generalization, leading to more reliable AI systems. However, its application should take into account domain-specific context and expertise so that causal relationships are inferred accurately.
7. Student Recruiting and Co-advising Plan / Collaboration Plan
For VinUni students in the VISHC pool, Profs. Doan, Pham, and Buntine will take the lead in recruiting top-notch candidates with interests in the intersection of causality and smart health in service of healthy longevity. The fact that Prof. Buntine is the director of the Computer Science program and that Prof. Pham is the associate director of the VISHC will facilitate such recruiting. For Illinois students in the VISHC pool, the Illinois PIs will take the lead, where the fact that Prof. Do leads VISHC and that Prof. Varshney is well-tied to the NAM Healthy Longevity ecosystem will facilitate recruiting. Before finalizing offers, the VinUni and Illinois teams will consult each other to ensure mutual agreement on the best candidates. Student travel/residence between Vietnam and Illinois will be governed by standard VISHC procedures of 2-2-1 years, but co-advising will involve close coupling among the PIs through a regular teleconferencing schedule and ad hoc in-person visits. Although the entire project team will work closely on different facets of the proposed research (including some “follow the sun” schedules that might enable fast progress), each trainee will be assigned two specific mentors, one on each side, so as to have strong accountability.
Beyond the work proposed herein, PIs from VinUni and Illinois will also collaborate on proposals for external funding (e.g., NIH and NAM in the U.S., and Gates Foundation and Grand Challenges with strong global funding programs for personalized medicine and medical treatments in LMICs), which will help solidify long-term relationships and research/teaching impact. Prof. Varshney is involved in strengthening U.S.-Vietnam relations on AI at the government level, building on his work at the White House, which will also be brought to bear on this effort.