Humans and machines: Ethical collaborations in evaluation
By Petra van Nierop, Vice President, Lead Analyst, and Katerina Mantouvalou, Vice President, BD Operations
Oct 8, 2024
7 MIN. READ

Ethical collaborations between humans and AI in EU policy evaluations at ICF enhance data collection, analysis, and dissemination, while addressing challenges like bias and data privacy through robust governance and transparency.

As public policy evaluators at ICF, predominantly working for the European Commission and other EU and international institutions, we have been eagerly following the rise of AI. Our collaboration with data scientists has yielded some very promising results, yet also brought some challenges and dilemmas to the fore, which we share in this article.

The unique features of EU evaluations

EU evaluations usually require collecting data in all Member States of the EU (and beyond, depending on the evaluation object), which means covering 27 different national contexts and nearly as many languages. While EU evaluations are usually extremely data rich (think, for example, of the large databases compiling details on all projects for EU programmes like Horizon or Erasmus+), there are significant data comparability and quality issues. Evaluations of EU policies, programmes, and legislation also always require a mixed-method, multi-source approach. Evaluating EU policies, programmes, and legislation thus comes with some rather unique challenges that, at least in part, AI can help tackle. Moreover, EU evaluations often address pressing social, economic, and environmental challenges, and their timely completion is crucial to serving the public interest. Incorporating AI can help serve that interest more efficiently.

Applying AI in evaluations

Although AI is not a new phenomenon, it gained significant public attention with the advent of technologies like ChatGPT. The term “machine learning” was coined as early as 1959 by Arthur Samuel, a computer gaming and AI pioneer at IBM, and the first natural language processing (NLP) techniques date back to the late 1950s as well.

At ICF, we have been doing evaluations and research for many years and have been using artificial intelligence technologies in our work since the early 2000s. We started with text mining software that involved some degree of machine learning and, by the mid-2000s, had progressed to using sentiment analysis to collect and analyse data in evaluations. We know the strengths and weaknesses of these technologies and how to employ them to our advantage.

The most recent AI technologies offer a raft of new and exciting possibilities for conducting evaluations, as not only can AI process vast amounts of data within a very short time period; it can also be used to support data collection, assist in more complex triangulation and analysis, and even help with dissemination. On the other hand, its use also raises some new ethical dilemmas and comes with new risks.

From everyday to transformative AI solutions

To date, at ICF we have mostly been applying AI tools and technologies in evaluations in ways we would categorise as “everyday AI use”: supporting particular types of data collection and analysis, such as surveys, interview write-ups, and social media analysis, using off-the-shelf large language models (LLMs). These uses yield important benefits, allowing us to analyse large volumes of data and deliver quick insights that would not otherwise have been achievable within the available time and resources.
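
To make this “everyday” pattern concrete, the minimal sketch below sends interview write-ups to an off-the-shelf LLM for thematic summarisation. The call_llm helper, the prompt wording, and the theme list are hypothetical placeholders rather than a description of the actual tools or templates we use; outputs of this kind are always treated as drafts.

from typing import List

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an off-the-shelf LLM provider's API call."""
    raise NotImplementedError("Replace with your provider's client call.")

def summarise_interview(write_up: str, themes: List[str]) -> str:
    """Ask the model for a short summary structured around agreed themes."""
    prompt = (
        "You are assisting an EU policy evaluation. Summarise the interview "
        "write-up below in at most 150 words, grouped under these themes: "
        + ", ".join(themes)
        + ". Flag any statement you are unsure about rather than guessing.\n\n"
        + "INTERVIEW WRITE-UP:\n"
        + write_up
    )
    return call_llm(prompt)

def summarise_batch(write_ups: List[str], themes: List[str]) -> List[str]:
    """Produce draft summaries for a batch of interviews, for human review."""
    return [summarise_interview(text, themes) for text in write_ups]

In practice, every draft generated this way goes to an evaluator for checking, in line with the oversight measures discussed below.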

Increasingly, we have also been rolling out more “transformative AI uses”. These offer important advantages because the models can be supplied with extensive factual information on the wider context, so that the LLM understands the evaluation object better. This deeper understanding is achieved through tailor-made AI software, as opposed to the more generic, off-the-shelf models used in everyday applications. Tailor-made AI systems are specifically trained and fine-tuned to address particular evaluation questions, thereby enhancing their relevance and accuracy. By adapting the underlying models and algorithms to the unique context and objectives of each evaluation, these bespoke AI solutions can offer more precise and insightful analyses than their off-the-shelf counterparts. This helps the model focus on retrieving information that is highly relevant, avoid so-called “hallucinations,” and provide much better and more complete responses to queries. This approach has allowed us, for example, to obtain a highly reliable analysis, per evaluation question, of hundreds of programme reports. It has also enabled us to formulate specific follow-up queries to obtain more in-depth information on trends and developments that seemed of interest in the initially generated analysis.
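
The exact architecture behind these tailored systems is not described here, but one common way to ground an LLM in the relevant context is retrieval-augmented generation: index the programme reports, retrieve the passages most relevant to each evaluation question, and have the model answer only from that retrieved evidence. The sketch below illustrates that general pattern; embed and generate are hypothetical stand-ins for an embedding model and an LLM.

from typing import List
import numpy as np

def embed(texts: List[str]) -> np.ndarray:
    """Hypothetical embedding call returning one vector per input text."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Hypothetical LLM call."""
    raise NotImplementedError

def top_passages(question: str, passages: List[str],
                 passage_vectors: np.ndarray, k: int = 5) -> List[str]:
    """Rank report passages by cosine similarity to the evaluation question."""
    q = embed([question])[0]
    sims = passage_vectors @ q / (
        np.linalg.norm(passage_vectors, axis=1) * np.linalg.norm(q) + 1e-9
    )
    return [passages[i] for i in np.argsort(sims)[::-1][:k]]

def answer_evaluation_question(question: str, passages: List[str],
                               passage_vectors: np.ndarray) -> str:
    """Answer one evaluation question strictly from the retrieved evidence."""
    evidence = top_passages(question, passages, passage_vectors)
    prompt = (
        "Answer the evaluation question using ONLY the evidence below. "
        "Cite the passages you rely on, and say so if the evidence is "
        "insufficient.\n\n"
        "QUESTION: " + question + "\n\nEVIDENCE:\n" + "\n---\n".join(evidence)
    )
    return generate(prompt)

Constraining the model to supplied evidence in this way is one reason such pipelines tend to hallucinate less than free-form prompting.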

AI technologies offer tremendous opportunities: not only can they support data collection and initial analysis; if trained well, these tools can also help with more sophisticated analytical techniques such as trend analysis, predictive modelling and foresight, and real-time evaluation. In the U.S. we have even been developing a chatbot (though not as part of an evaluation) that can interact with users and answer queries on HIV by retrieving documentation relevant to their query in a fully private and anonymous way. Such technology could possibly also be considered in the future for surveying people about sensitive subjects.
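
That chatbot's design is not detailed here; the short loop below simply illustrates one way a retrieval-based assistant can keep interactions anonymous, by holding each exchange only in memory and never persisting user identifiers or transcripts. The retrieve_documents and generate_answer helpers are hypothetical.

from typing import List

def retrieve_documents(query: str) -> List[str]:
    """Hypothetical lookup of documentation passages relevant to the query."""
    raise NotImplementedError

def generate_answer(query: str, documents: List[str]) -> str:
    """Hypothetical LLM call that answers only from the retrieved documents."""
    raise NotImplementedError

def anonymous_session() -> None:
    """Run a question-and-answer loop without logging or identifying users."""
    while True:
        query = input("Your question (or 'quit'): ").strip()
        if not query or query.lower() == "quit":
            break
        documents = retrieve_documents(query)
        print(generate_answer(query, documents))
        # Nothing is persisted: no user ID, no transcript, no analytics event.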

Ethical considerations and risk mitigation strategies

Using AI, whether for professional or personal purposes, is accompanied by well-documented risks. Gender, racial, and cultural bias, hallucinations, and data privacy and security breaches are the most frequently cited issues, and they are all highly relevant in evaluations.

Based on our experience, here are some risks of using AI in evaluations and our strategies for mitigating them.

1. Ensuring human oversight to mitigate bias and hallucination risks

One of the most pressing issues is the risk of bias, which can lead to unbalanced and unfair judgements in evaluations, especially with regard to minoritised and vulnerable groups, and to analyses which do not reflect contextual subtleties, e.g., in terms of political and cultural understanding. Another related risk to avoid is AI producing inaccurate findings (also called ‘hallucinations’).

It is crucial to have a deep technical understanding of AI capabilities and limitations, to make sure that the data used to train AI systems are diverse and representative, and to monitor and audit those data continuously if the system is used over a longer period. It is equally important to ensure human oversight and involvement at all stages of data collection, analysis, triangulation, and synthesis. While the efficiency gains of AI are significant, it lacks the subtle, contextual understanding that only humans can provide. For example, analyses produced by AI should be double-checked for bias and the factors contributing to it identified, and, where possible, parallel human test analyses should be undertaken to compare results.
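
One simple way to operationalise such checks, sketched below as an illustration rather than a description of our internal tooling, is to have analysts code a sample of the same material by hand, measure agreement between the human and AI codings, and compare outcome rates across groups to surface potential bias. The labels and grouping variable are illustrative.

from collections import defaultdict
from typing import Dict, List

from sklearn.metrics import cohen_kappa_score

def agreement(ai_labels: List[str], human_labels: List[str]) -> float:
    """Cohen's kappa between AI and human codings of the same items."""
    return cohen_kappa_score(ai_labels, human_labels)

def positive_rate_by_group(labels: List[str], groups: List[str],
                           positive: str = "favourable") -> Dict[str, float]:
    """Share of 'positive' codings per group; large gaps warrant human review."""
    counts = defaultdict(lambda: [0, 0])
    for label, group in zip(labels, groups):
        counts[group][0] += int(label == positive)
        counts[group][1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}

# Illustrative use: low agreement or a marked gap between groups triggers
# a closer human review of the AI-generated analysis.
ai = ["favourable", "unfavourable", "favourable", "favourable"]
human = ["favourable", "unfavourable", "unfavourable", "favourable"]
groups = ["group_a", "group_a", "group_b", "group_b"]
print(agreement(ai, human))
print(positive_rate_by_group(ai, groups))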

A collaborative approach between evaluators, data scientists, policy specialists, and ethical experts throughout the evaluation cycle works best to mitigate bias and hallucination risks.

2. Establishing a robust governance framework for data privacy and security

Any evaluation that includes stakeholder consultation, whether it uses AI or not, should make sure that all those interviewed, surveyed, or participating in workshops provide informed consent to their participation. However, AI has brought to light some new challenges, such as whether consent is needed from social media users when AI is used for sentiment analysis on platforms like X, Facebook, or LinkedIn, and whether evaluation stakeholders should be made aware that the information they provide will be analysed by an AI tool.

At ICF, we rely on responsible use principles, backed by privacy information notices and consent forms, to ensure that the data we use comply with the licence terms of social media and web platforms and with intellectual property rules. This governance framework is overseen by an Internal AI Review Board that reviews and approves the use of AI in evaluation projects (or parts of projects) that pose a high risk, for example AI chatbots that interact with vulnerable groups.

3. Fostering trust through transparency and accountability

In evaluations, it is essential that the evaluator can explain the link between the judgement and the evidence base, but this can be a challenge with off-the-shelf tools that are designed for everyday AI use.

This is why the choice of AI tool is very important: some models offer interfaces that allow the interpretive process to be traced. Increasingly, however, we prefer working with our own fine-tuned models, so that our data scientists, when developing AI solutions, can make sure that all algorithms and analytical steps are well documented and explained in terms a non-expert reader can follow.
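
A lightweight way to keep that evidence trail, sketched here as an assumption rather than a description of our internal systems, is to store every AI-assisted finding together with the source passages, model identifier, and prompt it was derived from, so a reviewer can retrace the path from evidence to judgement.

from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import List
import json

@dataclass
class Finding:
    """One AI-assisted finding plus the trail needed to audit it."""
    evaluation_question: str
    finding: str
    evidence_passages: List[str]   # verbatim excerpts the finding rests on
    source_documents: List[str]    # e.g. programme report file names
    model_id: str                  # which model produced the draft
    prompt: str                    # exact prompt, for reproducibility
    reviewed_by: str = ""          # human analyst who validated the finding
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def export_audit_log(findings: List[Finding], path: str) -> None:
    """Write the full evidence trail to JSON for review by non-experts."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump([asdict(f) for f in findings], handle, indent=2,
                  ensure_ascii=False)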

An ethical approach to AI

In summary, our comprehensive approach integrates ethical considerations, rigorous technical oversight, and a strong commitment to transparency. By fostering continuous learning and improvement, we ensure that our AI systems enhance human judgment while maintaining ethical integrity and inclusivity.

Meet the authors
  1. Petra van Nierop, Vice President, Lead Analyst

    Petra is a European public policy expert with more than 20 years of experience as an evaluator and researcher of European Union policies, legislation, programs, and actions.

  2. Katerina Mantouvalou, Vice President, BD Operations

    Katerina leads research, evaluation, and analytical services in gender equality and human rights for the European Commission and its agencies, the World Bank, and the United Nations.
