
Large language models (LLMs) can write persuasive articles from prompts, pass professional licensing examinations, and produce patient-friendly, empathetic information. Beyond the well-known risks of confabulation (hallucination), brittleness, and factual inaccuracy, however, other unresolved issues are coming into focus: AI models embed potentially discriminatory "human values" in their creation and use, and even if an LLM no longer fabricates content and avoids clearly harmful output, its "values" may still diverge from human values.

 

Countless examples illustrate how the data used to train AI models encode individual and societal values, which can become baked into the models. These examples span a range of applications, including automated interpretation of chest radiographs, classification of skin lesions, and algorithmic decisions about the allocation of medical resources. As a recent article in our journal noted, biased training data can amplify and expose the values and biases present in society. Conversely, research has also shown that AI can be used to reduce bias. For example, researchers applied deep learning to knee radiographs and identified features within the knee, missed by standard severity grades assigned by radiologists, that reduced unexplained differences in pain between Black and White patients.

Although awareness of bias in AI models is growing, particularly bias arising from training data, many other points at which human values enter the development and deployment of AI models receive far less attention. Medical AI has recently achieved impressive results, but for the most part it has not explicitly considered or modeled human values and their interaction with risk assessment and probabilistic reasoning.

 

To make these abstract ideas concrete, imagine that you are an endocrinologist asked to prescribe recombinant human growth hormone for an 8-year-old boy whose height is below the 3rd percentile for his age. The boy's stimulated growth hormone level is below 2 ng/mL (reference value, >10 ng/mL; in many countries outside the United States, >7 ng/mL), and a rare inactivating mutation has been detected in his growth hormone gene. We believe that growth hormone therapy in this clinical setting is obvious and uncontroversial.

Growth hormone therapy becomes contentious in a scenario like the following: a 14-year-old boy whose height has consistently tracked at the 10th percentile for his age and whose peak stimulated growth hormone level is 8 ng/mL. He has no known functional mutations affecting height and no other known causes of short stature, and his bone age is 15 years (i.e., no delay in skeletal maturation). Only part of the controversy stems from disagreement among experts, drawing on dozens of studies, about the growth hormone threshold used to diagnose isolated growth hormone deficiency. At least as much arises from how the risks and benefits of therapy are weighed from the perspectives of the patient, his parents, clinicians, pharmaceutical companies, and payers. A pediatric endocrinologist may weigh the burden of 2 years of daily injections and rare adverse effects against the probability of little or no gain in adult height. The boy may feel that even a 2-cm gain in height is worth the injections, while the payer and the pharmaceutical company may see it differently.

 

Consider creatinine-based estimated glomerular filtration rate (eGFR), a widely used index of kidney function that is used to diagnose and stage chronic kidney disease, to set eligibility for kidney transplantation or donation, and to define dose reductions and contraindications for many prescription drugs. eGFR is a simple regression equation that estimates measured glomerular filtration rate (mGFR), the reference standard, which is relatively cumbersome to obtain. The regression equation is not an AI model, but it illustrates many of the same principles about human values and probabilistic reasoning.

The first point at which human values enter eGFR is the selection of the data used to fit the equation. The original cohorts used to derive the eGFR equations consisted mostly of Black and White participants, and the applicability of the equations to many other groups is unclear. Subsequent entry points include the choice of accuracy against mGFR as the primary goal of evaluating kidney function, what level of accuracy is acceptable, how accuracy is measured, and the use of eGFR thresholds to trigger clinical decisions (such as eligibility for kidney transplantation or drug prescribing). Finally, human values also enter through the choice of which inputs to include in the model.

For example, before 2021, guidelines recommended adjusting creatinine-based eGFR for the patient's age, sex, and race (classified only as Black or non-Black). The race adjustment was intended to improve accuracy against mGFR, but in 2020 major hospitals began questioning the use of race-based eGFR, citing, among other reasons, that it delayed patients' eligibility for transplantation and reified race as a biological category. Research has shown that how race is handled in the eGFR model has profound and varying effects on accuracy and on clinical outcomes; selectively emphasizing accuracy, or only a subset of outcomes, therefore reflects value judgments and can obscure transparent decision-making. Ultimately, a national task force proposed a new equation, refit without race, to balance performance and fairness. This example shows that even a simple clinical formula has many entry points for human values.
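
To make the "simple regression equation" concrete, below is a minimal Python sketch of the general form of the race-free 2021 CKD-EPI creatinine equation referred to above. The coefficients are reproduced from memory of the published refit equation and are given only for illustration; they should be verified against the original publication and the code is not intended for clinical use.

def egfr_ckd_epi_2021(scr_mg_dl: float, age_years: float, is_female: bool) -> float:
    """Illustrative sketch of the 2021 race-free CKD-EPI creatinine equation
    (eGFR in mL/min/1.73 m^2). Coefficients quoted from memory; verify before use."""
    kappa = 0.7 if is_female else 0.9          # sex-specific creatinine "knot"
    alpha = -0.241 if is_female else -0.302    # sex-specific exponent below the knot
    sex_factor = 1.012 if is_female else 1.0
    ratio = scr_mg_dl / kappa
    return (142
            * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.200
            * 0.9938 ** age_years
            * sex_factor)

# Example: a 60-year-old woman with serum creatinine 1.1 mg/dL
print(round(egfr_ckd_epi_2021(1.1, 60, is_female=True), 1))

Every element of such a formula, the predictors included, the coefficients, and the thresholds at which it triggers decisions, reflects choices made by its developers.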


In contrast to clinical formulas with a handful of predictors, LLMs may have billions to hundreds of billions of parameters (model weights), or more, which makes them difficult to interpret: in most LLMs, there is no way to map exactly how a given prompt elicits a given response. The number of parameters in GPT-4 has not been announced; its predecessor GPT-3 had 175 billion. More parameters do not necessarily mean greater capability; smaller models trained with more compute (such as the LLaMA [Large Language Model Meta AI] family) or models fine-tuned with human feedback can outperform larger ones. For example, according to human assessors, outputs from the InstructGPT model, which has 1.3 billion parameters, were preferred over those of GPT-3.
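
As a concrete illustration of what "parameters" means here, the short sketch below loads a small open model with the Hugging Face transformers library and counts its weights. The choice of GPT-2 is an assumption made only because it is small and freely available; it is not one of the models discussed in this article.

# Count the parameters (model weights) of a small open language model.
# Requires the `transformers` and `torch` packages; GPT-2 is illustrative only.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
n_params = sum(p.numel() for p in model.parameters())
print(f"GPT-2 has about {n_params / 1e6:.0f} million parameters")  # roughly 124 million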

The specific training details of GPT-4 have not been disclosed, but details of earlier models, including GPT-3 and InstructGPT, and of many open-source LLMs have been published. Many AI models now come with model cards; evaluation and safety data for GPT-4 were published in a similar system card by OpenAI, the company that created the model. The creation of an LLM can be roughly divided into two stages: initial pre-training and a fine-tuning stage aimed at shaping the model's outputs. In pre-training, the model is fed a huge corpus, including raw internet text, and trained to predict the next word. This seemingly simple "autocomplete" process yields a powerful foundation model, but it can also produce harmful behavior. Human values enter the pre-training stage through, for example, the selection of GPT-4's pre-training data and the decision to remove inappropriate content, such as pornography, from those data. Despite these efforts, the base model may still be unhelpful and capable of harmful output. Much of the model's helpful and harmless behavior emerges in the subsequent fine-tuning stage.
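
To make the pre-training objective concrete, the sketch below shows the core of next-word (next-token) prediction in PyTorch: the model is scored on how well it predicts each token from the tokens before it. This is a toy illustration of the objective only, not the training pipeline of any model named above; the tiny embedding-plus-linear "model" stands in for a real transformer.

# Minimal sketch of the pre-training ("autocomplete") objective: predict the
# next token at every position. Real LLM training adds tokenization of huge
# corpora, transformer attention layers, and distributed optimization.
import torch
import torch.nn.functional as F

vocab_size, d_model = 1000, 64
embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)   # stand-in for a transformer

tokens = torch.randint(0, vocab_size, (1, 16))   # a toy "document" of 16 tokens
hidden = embed(tokens)                           # a real model would apply attention here
logits = lm_head(hidden)

# Shift by one position: the prediction at position t is scored against token t+1.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                       tokens[:, 1:].reshape(-1))
loss.backward()  # gradients would update the billions of weights of a real LLM
print(float(loss))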

In fine-tuning, the behavior of a language model is often profoundly altered through supervised fine-tuning and reinforcement learning from human feedback. In supervised fine-tuning, hired contractors write example responses to prompts, and the model is trained on them directly. In reinforcement learning from human feedback, human evaluators rank the model's outputs for example prompts; these comparisons are used to learn a "reward model," which is then used to improve the model further through reinforcement learning. A surprisingly small amount of human input can fine-tune these large models. The InstructGPT model, for example, used a team of roughly 40 contractors recruited from crowdsourcing websites who passed a screening test designed to select annotators sensitive to the preferences of different demographic groups.
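
The comparison step of reinforcement learning from human feedback can be written down compactly. The sketch below shows the standard pairwise (Bradley-Terry-style) loss commonly used to train a reward model from human rankings; it is a generic illustration of the technique, not the undisclosed training code of GPT-4 or InstructGPT.

# Sketch of the reward-model objective used in RLHF: for a given prompt, the
# human-preferred ("chosen") response should receive a higher scalar reward
# than the "rejected" one. Generic illustration only.
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise loss: maximize the probability that the chosen response outranks
    # the rejected one, i.e. minimize -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scalar rewards for a batch of 3 human-labeled comparisons
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.5, 1.1])
print(float(reward_model_loss(chosen, rejected)))
# The trained reward model then serves as the objective for a reinforcement
# learning step that further adjusts the language model.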

As these two extreme examples, the simple clinical formula [eGFR] and the powerful LLM [GPT-4], demonstrate, human decisions and human values play an indispensable role in shaping model outputs. Can these AI models capture the diverse values of patients and clinicians? How can the application of AI in medicine be guided transparently? As discussed below, a re-examination of medical decision analysis may offer a principled approach to these questions.

 

Medical decision analysis is unfamiliar to many clinicians, but it provides a systematic approach to complex medical decisions by separating probabilistic reasoning about the uncertain outcomes tied to a decision (such as whether to give growth hormone in the contested clinical scenario shown in Figure 1) from the subjective values attached to those outcomes, which are quantified as "utilities" (such as the value of a 2-cm gain in a boy's height). In decision analysis, the clinician first lays out all possible decisions and the probabilities associated with each outcome, then incorporates the utilities that the patient (or another party) attaches to each outcome to select the most appropriate option. The validity of a decision analysis therefore depends on whether the set of outcomes is comprehensive and whether the utilities are measured and the probabilities estimated accurately. Ideally, this approach helps ensure that decisions are evidence-based and aligned with patient preferences, narrowing the gap between objective data and personal values. The method was introduced into medicine decades ago and has been applied both to individual patient decisions and to population-level assessments, such as recommendations for colorectal cancer screening in the general population.
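
As a toy illustration of the mechanics, the sketch below computes expected utility for the two options in the contested growth hormone scenario. All probabilities and utilities are invented purely for illustration; they are not estimates from any study discussed in this article.

# Toy decision analysis for the contested growth hormone scenario.
# All probabilities and utilities below are hypothetical.

def expected_utility(outcomes):
    """Each outcome is a (probability, utility) pair; probabilities sum to 1."""
    return sum(p * u for p, u in outcomes)

treat = [
    (0.30, 0.90),  # meaningful height gain, minor burden of daily injections
    (0.55, 0.70),  # minimal gain (~2 cm), burden of 2 years of injections
    (0.15, 0.40),  # no gain, burden plus rare adverse effects
]
no_treat = [
    (1.00, 0.65),  # no treatment; utility the patient attaches to expected adult height
]

print("treat:", expected_utility(treat))        # 0.715 with these made-up numbers
print("no treat:", expected_utility(no_treat))  # 0.65
# The preferred option flips as the utilities change, which is exactly where
# patient, payer, and clinician values enter the analysis.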

 

Various methods have been developed to elicit utilities for medical decision analysis. Most traditional methods derive values directly from individual patients. The simplest uses a rating scale: patients rate their preference for an outcome on a numeric scale (for example, a linear scale from 1 to 10) anchored at the extremes by the best and worst health outcomes (such as full health and death). The time trade-off method is another common approach, in which patients decide how much time in full health they would accept in exchange for a longer period in poorer health. The standard gamble is a third commonly used method. Patients are asked which of two options they prefer: a gamble with probability p of living a given number of years (t) in full health and probability 1 − p of death, or the certainty of living t years in the impaired health state. The question is repeated at different values of p until the patient is indifferent between the options, and the utility is then calculated from the patient's responses.
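
The arithmetic behind these two elicitation methods is simple, as the sketch below shows; it converts a time trade-off answer and a standard gamble indifference point into utilities, using made-up responses for illustration.

# Converting elicitation responses into utilities (illustrative numbers only).

def time_trade_off_utility(years_full_health: float, years_impaired: float) -> float:
    """If a patient is indifferent between x years in full health and t years in
    the impaired state, the utility of that state is u = x / t."""
    return years_full_health / years_impaired

def standard_gamble_utility(indifference_p: float) -> float:
    """At indifference, the utility of the certain impaired state equals the
    gamble's expected utility: u = p * 1 + (1 - p) * 0 = p."""
    return indifference_p

# A patient would give up 2 of 10 years to avoid the impaired state...
print(time_trade_off_utility(8.0, 10.0))   # 0.8
# ...and is indifferent at a 75% chance of full health versus certain impairment.
print(standard_gamble_utility(0.75))       # 0.75
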
Beyond methods for eliciting individual patients' preferences, methods have also been developed to obtain utilities for patient populations. In particular, focus groups, which bring patients together to discuss specific experiences, can help in understanding their perspectives, and various structured group-discussion techniques have been proposed for aggregating group utilities effectively.
In practice, eliciting utilities directly during clinical care is time-consuming. As a workaround, questionnaires are typically administered to randomly selected populations to obtain utility scores at the population level. Examples include the EuroQol five-dimension questionnaire (EQ-5D), the Short Form six-dimension (SF-6D) utility index, the Health Utilities Index, and the cancer-specific European Organisation for Research and Treatment of Cancer Quality of Life Questionnaire Core 30 (EORTC QLQ-C30).

