Natural Language Processing to align national plans in Serbia with Global Goals

17 October 2018

How we used Natural Language Processing to support Serbia in mainstreaming the SDGs into national and subnational planning

The Republic of Serbia is on a mission to map the work that is underway to implement the 17 Sustainable Development Goals (SDGs). As a first step, the UN Country Team in Serbia reviewed the compliance of the country’s policy framework with the 169 targets and 230 indicators of the 17 SDGs and assessed the country’s readiness to proceed with their implementation. Because Serbia is a candidate country for accession to the EU, the Republic of Serbia also called for a special review of the compliance and complementarities between the 35 EU accession negotiation chapters, which are implemented through a series of ongoing reform processes, and their linkages to Agenda 2030.

We conducted a comprehensive, real-time analysis from late 2017 to early 2018 using a methodology developed by UNDP called the Rapid Integrated Assessment (RIA). The objective of this assessment tool is to support countries in mainstreaming the SDGs into national and subnational planning by helping them assess their readiness for SDG implementation.

We engaged a dozen national experts who reviewed over 100 national policy documents, including those setting targets relevant to EU accession. Most of the documents were in Serbian, with a few available in English. The exercise outlined the areas that are well covered by existing policy instruments, identified areas where policy makers need to pay more attention, detected bottlenecks and accelerators, and reviewed the institutional capacities in place to implement the SDGs.

Scaling up and gaining efficiency with artificial intelligence

We know this is important work, so we were on the lookout for innovative ways to make this process, and other similar policy mapping exercises, easier and more efficient. In January 2018, we heard about a pilot initiative between UNDP and IBM Research which demonstrated that an artificial intelligence (AI) approach could save time and provide accurate mapping information. AI based on natural language processing (NLP) techniques could automate the rapid integrated assessment process, which provides a baseline to measure future progress. The assessment, which helps define a roadmap for a country to implement the SDGs, was our starting point. The initiative piloted the automated assessment in five countries where policy documents were available in English, and the results were satisfactory.

We teamed up with local policy experts from the SeConS Development Initiative Group, an independent organization that aims to contribute to long-term socio-economic development and to improve the living conditions of individuals and social groups in Serbia and the region. Our team at the UN also met with natural language processing experts from the School of Electrical Engineering at the University of Belgrade to take this initial pilot to the next level and to research how the assessments could be transferred from English to another language, thus facilitating, for the first time, an automated mapping of policy documents in Serbian. The development of the methodology and the testing of the automated policy mapping exercise in Serbia are being implemented between August and November 2018.

Talking to a computer in Serbian is not as easy as Siri makes it look

Thanks to an abundance of language tools, resources, and algorithmic NLP models available in English, the initial pilot allowed for automation in countries where English is the predominant language for official documents. In attempting to transfer the automated text processing to Serbian, our team noticed several linguistic traits that make this work particularly challenging:

Unlike English, Serbian is a highly inflected language: words change form according to grammatical functions such as tense, mood, number and gender.
Serbian is a fully digraphic language, meaning that it can be written using two different alphabets (Serbian Latin and Serbian Cyrillic script). Latin characters often appear in Cyrillic texts, especially where foreign terms (usually from one of the European languages) are presented verbatim.
Although Serbian grammar often uses the same default Subject-Verb-Object word order as English, the very nature of the language makes word ordering more flexible.
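
One common way to handle the two scripts is to normalize everything to a single alphabet before any further processing. The following is a minimal sketch of that idea, not the project's actual pipeline: the function name to_latin is hypothetical, and the mapping is the standard letter-by-letter correspondence between Serbian Cyrillic and Serbian Latin. Latin text that is already embedded in a Cyrillic document passes through unchanged.

```python
# Standard Serbian Cyrillic -> Latin correspondence (lowercase letters).
# Three Cyrillic letters map to Latin digraphs: љ -> lj, њ -> nj, џ -> dž.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic characters to Serbian Latin.

    Hypothetical helper for illustration. Characters that are not Serbian
    Cyrillic (Latin letters, digits, punctuation) are left untouched.
    Simplification: an uppercase digraph letter becomes "Lj", "Nj" or "Dž",
    even inside an all-caps word.
    """
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in CYR_TO_LAT:
            lat = CYR_TO_LAT[lower]
            # Preserve capitalisation of the original letter.
            out.append(lat.capitalize() if ch != lower else lat)
        else:
            out.append(ch)
    return "".join(out)


print(to_latin("Стратегија"))  # Strategija
```

After this step, a Cyrillic document and a Latin document that use the same words produce the same tokens, so later stages only need to handle one script.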

In addition to the language-related challenges mentioned above, we also identified the following context-specific advantages:

The automated policy mapping will focus on specific sectors: social protection, health and education. In these areas we have adequate data both in quality and in quantity. Given the specific focus of the automated analysis, we will be able not only to compare automated versus manual policy mapping results, but also to get a more specific idea of the data gaps in the social, health and education sectors, which is very important for localizing Agenda 2030 in Serbia.
By closing sectoral data gaps in the process of nationalizing the Global Goals, the pilot project in Serbia will also create a baseline to support the country’s SDG reporting obligations. This is particularly relevant given that Serbia will present its first voluntary national review on its SDG progress to date at the High-Level Political Forum in New York in 2019. The voluntary national reviews aim to facilitate the sharing of experiences, including successes, challenges and lessons learned, with a view to accelerating the implementation of the 2030 Agenda. These reviews also seek to strengthen policies and institutions of governments and to mobilize multi-stakeholder support and partnerships for the implementation of the SDGs. The Republic of Serbia will present the results produced by the automated mapping on achievements in the area of reducing inequalities in the country.

Getting started, getting technical

Our first step was to choose a sample of the 17 SDGs to be analyzed, limiting the dataset. Taking into consideration the quality and format of data available, and keeping in mind that next year’s voluntary national review discussion will focus on inequality, the team selected five SDGs that are clustered under the heading People, including:

SDG 1: No Poverty: end poverty in all its forms everywhere
SDG 2: Zero Hunger
SDG 3: Good Health and Well-being: ensure healthy lives and promote well-being for all at all ages
SDG 4: Quality Education
SDG 5: Gender Equality: achieve gender equality and empower all women and girls

The second step was to consolidate the document database previously used in the manual assessment process to ensure that documents were available in a machine-readable format. This presented our team with a significant technical problem, since most documents were available in PDF format, which is not great for precise text extraction.

Initial tests indicate that combining Adobe Acrobat Pro’s text extraction mechanism with a replacement procedure, in which particularly problematic PDF files are swapped for an easier-to-read alternative (e.g. Word files), could prove successful in tackling this problem.

The months ahead

We expect a number of technical innovations to surface from the process of adapting the proposed AI approach to texts in Serbian. The complexity of texts in Serbian will be decreased through the use of stemmers, tools that reduce each word to its stem (a stem is similar to a word’s root form). Such tools have been found to increase natural language processing model performance on several semantic tasks in Serbian, so there is good reason to believe they may also be effective in the similar, albeit more complex, rapid integrated assessment exercise.
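
To illustrate what a stemmer does, here is a deliberately tiny suffix-stripping sketch. It is a toy, not one of the Serbian stemmers referred to above: the suffix list and the names SUFFIXES and toy_stem are assumptions made for this example, and a real stemmer uses a far larger, linguistically informed rule set.

```python
# Illustrative suffix list only; NOT a real Serbian suffix inventory.
SUFFIXES = ("ama", "ima", "om", "a", "e", "i", "u", "o")


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    # Try longer suffixes first so that "ama" wins over "a".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


# Several inflected forms of "strategija" (strategy) collapse to one stem.
forms = ["strategija", "strategije", "strategiju", "strategijom", "strategijama"]
print({toy_stem(w) for w in forms})  # {'strategij'}
```

Collapsing inflected forms this way means a model sees one token where the raw text has five, which is the reduction in complexity described above.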

Our initial efforts show that flexible word ordering is not likely to be a major issue in terms of transferring the (English-centric) automated pilot exercise to Serbian, since the AI method focuses on sentence or paragraph-level semantics, where the exact word ordering becomes less important.
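
A simple way to see why word order matters less at sentence level is a bag-of-words representation, which counts words and ignores their positions. This is a simplified stand-in chosen for illustration, not the representation used in the pilot; the helper names and the example sentences are assumptions.

```python
import math
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Count lowercase whitespace-separated tokens, discarding word order."""
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)  # a Counter returns 0 for missing tokens
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# "The government adopts the strategy" in two valid Serbian word orders.
svo = "vlada usvaja strategiju"
ovs = "strategiju usvaja vlada"

# Both orderings yield identical bags, so any order-insensitive
# sentence representation treats them as the same sentence.
print(bag_of_words(svo) == bag_of_words(ovs))  # True
```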

Finally, we will work around the lack of available data from the manually-conducted rapid integrated assessment in Serbian by setting up a simulation, dividing the available Serbian document collection into two groups: a training set and test set. By conducting a manual rapid integrated assessment for the training set, a foundation will be established for the automated assessment for the test set in Serbian.
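
The split itself can be sketched in a few lines. This is an illustration under stated assumptions: the 30 percent test fraction, the fixed seed, the document identifiers and the function name split_documents are all hypothetical, since the proportions of the actual split are not specified here.

```python
import random


def split_documents(doc_ids, test_fraction=0.3, seed=42):
    """Shuffle document IDs reproducibly and split them into (train, test).

    test_fraction and seed are illustrative defaults, not project settings.
    """
    ids = sorted(doc_ids)      # sort first so the result does not depend on input order
    rng = random.Random(seed)  # local generator; leaves global random state untouched
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


# Hypothetical identifiers for a collection of 100 policy documents.
documents = [f"doc_{i:03d}" for i in range(100)]
train, test = split_documents(documents)
print(len(train), len(test))  # 70 30
```

Fixing the seed keeps the split reproducible, so the manual assessment of the training set and the automated assessment of the test set always refer to the same two groups of documents.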

After these technical and algorithmic adaptations have been completed, the School of Electrical Engineering at the University of Belgrade and SeConS will measure the effectiveness of the AI method using the data from the manual exercise conducted in Serbia earlier this year, and will submit a report comparing the two. More importantly, we will be looking to see whether the accuracy of the AI-driven report is the same as, or superior to, that of the manually produced rapid integrated assessment report.
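
One plausible way to compare an automated mapping against a manual one is to treat each mapping as a set of (document, SDG target) pairs and compute precision, recall and F1, with the manual result as the reference. The choice of these metrics is an assumption made for this sketch, as are the function name and the sample data; the evaluation protocol of the actual study may differ.

```python
def compare_mappings(manual: set, automated: set) -> dict:
    """Score an automated mapping against a manual reference mapping.

    Each mapping is a set of (document_id, sdg_target) pairs.
    """
    agreed = len(manual & automated)  # pairs found by both approaches
    precision = agreed / len(automated) if automated else 0.0
    recall = agreed / len(manual) if manual else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: document names are invented for illustration.
manual = {("doc_a", "1.1"), ("doc_a", "3.2"), ("doc_b", "4.1"), ("doc_b", "5.2")}
automated = {("doc_a", "1.1"), ("doc_a", "3.2"), ("doc_b", "4.1"), ("doc_b", "2.1")}

scores = compare_mappings(manual, automated)
print(scores)
```

Precision shows how many of the automated links the experts also made, while recall shows how many of the experts' links the automated approach recovered.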

Despite all of the linguistic and technical challenges, this project could prove to be beneficial for data collection and analysis processes not only in Serbia, but also for neighboring countries, due to close linguistic ties within the sub-region.

We will discuss the results of this pilot exercise extensively with data holders, producers and users, including the Government and civil society partners, to obtain their valuable input to inform the way forward. The UN Country Team will use this additional feedback to see if and how automated policy data search could be used to save time and improve the accuracy of data analysis. Lessons learned will be applied to other activities in Serbia aimed at supporting the Government’s efforts to fulfil its Agenda 2030 priorities. The questions that we hope to answer in the follow-up consultations include: Can we use automated policy mapping for other processes beyond the initial SDG data mapping? How can we use it to map the progress towards SDG achievement and its linkages to the EU Acquis?

Whatever the answers to these questions may be, we will keep you updated. Watch this space and follow our progress on social media.