IGF 2020 WS #122&20 Data to Inclusion: Building datasets in African Languages

Time
Tuesday, 10th November, 2020 (08:40 UTC) - Tuesday, 10th November, 2020 (10:10 UTC)
Room
Room 2
About this Session
How can access to data in low-resource languages be strengthened? What policy measures can strengthen AI-driven innovation for reinforcing multilingualism online? The session will present projects from Africa and Asia on developing and using datasets in low-resource languages to strengthen access to information. It will be the springboard for the launch of the Open for Good Alliance by UNESCO, GIZ, IDRC and the Mozilla Foundation, among other founding partners.
Subtheme

Organizer 1: Bhanu Neupane, UNESCO
Organizer 2: Irmgarda Kasinskaite, UNESCO
Organizer 3: Prateek Sibal, UNESCO
Organizer 4: Philipp Olbrich, GIZ
Organizer 5: Naeem Uddin, Torwali Research Forum
Organizer 6: Jaewon Son, Korea Internet Governance Alliance
Organizer 7: Elliott Mann, Swinburne Law School

Speaker 1: Dorothy Gordon, Civil Society, African Group
Speaker 2: Philipp Olbrich, Government, Western European and Others Group (WEOG)

Additional Speakers

Workshop Speakers

Speaker 1: Dr. Joyce Nabende, Makerere University, Uganda, Civil Society, African Group
Speaker 2: Kathleen Siminyu, AI4D Network Africa, Kenya, Civil Society, African Group

Speaker 3: Roy Boney Jr., Cherokee Language Program, Western European and Others Group (WEOG)
Speaker 4: Subhashish Panigrahi, Civil Society, Asia-Pacific Group

Introductory remarks by UNESCO

Speaker list updated after combining the workshop proposal with WS #20, "Exploring the future of endangered languages in cyberspace".

Moderator

Bhanu Neupane, Intergovernmental Organization, Asia-Pacific Group

Online Moderator

Irmgarda Kasinskaite, Intergovernmental Organization, Intergovernmental Organization

Rapporteur

Prateek Sibal, Intergovernmental Organization, Intergovernmental Organization

Format

Round Table - Circle - 90 Min

Policy Question(s)

The workshop seeks to address the following key questions:

  1. Can the Internet be used to revitalize minority, indigenous and endangered languages?
  2. How can Machine Learning and AI improve the availability of datasets in minority, indigenous and endangered languages?
  3. What kind of policy frameworks can enable further action on strengthening minority, indigenous and endangered languages, and multilingualism more broadly, in underserved regions?
  4. How can stakeholders best raise awareness of the issue of endangered and data-poor languages?

This workshop will highlight the following issues:

  1. Digital language endangerment
  2. The acceleration of language extinction driven by the Internet
  3. The digital presence of endangered languages
  4. The low availability of resources for the development of technical solutions
  5. The lack of existing benchmarks and research in the development of digital solutions.

Additionally, this workshop will explore the following opportunities:

  1. The use of machine learning, natural language processing, and artificial intelligence to combat language endangerment.
  2. The use of open source technology to combat language endangerment.

SDGs

GOAL 4: Quality Education
GOAL 8: Decent Work and Economic Growth
GOAL 9: Industry, Innovation and Infrastructure
GOAL 10: Reduced Inequalities
GOAL 16: Peace, Justice and Strong Institutions
GOAL 17: Partnerships for the Goals

Description:

The ability to deal with human language is an essential attribute of all information and communication technologies. Although there are currently more than 7000 spoken languages, fewer than 100 of these are flourishing in the digital world with advanced language understanding and spoken language communication technologies.

In the case of low-resource, minority and endangered languages, there is a recognised need to develop solutions which ensure these languages still have a place on the Internet. In particular, there remain gaps in access to data for training statistical machine learning systems which could be leveraged for developing downstream applications. Such applications could provide for the digital inclusion of speakers of low-resource languages and hence their active participation in knowledge societies.

The UNESCO publication “Steering AI and Advanced ICTs for Knowledge Societies”, launched at IGF 2019, identified “strengthening cooperation between civil society and research institutes for solving problems facing local communities, for novel data collection models based on citizen science that can create data sets for AI that respect international norms for privacy and data protection” as an option for action to address the gaps in the availability of data for development and use of AI in endangered African languages (Hu, et al. 2019).

This workshop is proposed as a follow-up to the above recommendation and will extend beyond the focus on Africa to encompass a broader discussion on the impact of the Internet and technology on endangered languages.

The workshop would enable North-South and North-South-South collaboration at the IGF 2020 and would develop networks and an agenda for the workstream on AI, Data and Languages for IGF in Addis Ababa. It will further provide useful inputs for the International Decade of Indigenous Languages (2022-2032).

Expected Outcomes

  1. A greater understanding of stakeholder- and youth-specific roles in the digital safeguarding of endangered languages
  2. Outlined strategies for the next phase of dataset development in endangered languages, particularly in Africa
  3. An agenda for policy advocacy for language technologies and dataset development as part of the International Decade of Indigenous Languages, to be launched in 2022
  4. A framework for North-South and North-South-South collaboration

Discussion Facilitation: 

Beyond the presentations by the speakers, this workshop will include a large open floor component, where participants can raise questions and comments with the speakers and with other participants.

The moderator will seek to garner participation from a wide variety of attendees, with a particular focus on those from underrepresented regions and demographics, such as the Global South and youth respectively.

During the discussion time allocated in the latter half of the session, discussion will be guided by the aforementioned policy questions, and by the earlier presentations by the speakers.

The organisers anticipate that representatives from the following stakeholders will be in attendance:

  1. AI for Development Network – Africa
  2. Data Science for Social Impact – University of Pretoria Research Group
  3. Data Science Nigeria
  4. Masakhane – Machine Translation for African Languages
  5. Deep Learning Indaba – African Machine Learning Conference
  6. UNESCO Chair in Data Science and Analytics, University of Essex, United Kingdom
  7. UNESCO Chair in Artificial Intelligence, University College London, UK
  8. UNESCO Category 2 Centre – International Research Centre on Artificial Intelligence (IRCAI), Slovenia
  9. African Academy of Languages
  10. GIZ, Germany
  11. IDRC, Canada (TBC)
  12. Universal Labelling Project, USA (TBC)
  13. European Language Resources Association (ELRA) (TBC)
  14. Open for Good Alliance

Relevance to Internet Governance: Part of the importance of Internet Governance lies in how it evaluates the consequences of the Internet's rapid rise. Language endangerment should be seen as one such consequence.

The 2020 Los Pinos Declaration on the Decade of Indigenous Languages (2022-2032) called for the design of, and access to, sustainable, accessible, workable and affordable language technologies. Both the Declaration and UNESCO’s 2003 Recommendation concerning the Promotion and Use of Multilingualism and Universal Access to Cyberspace recognize the potential of digital technologies in supporting the use and preservation of low- or under-resourced languages.

This workshop will analyse the work needed to right the wrong created by the Internet, by focusing on the technologies and policy settings needed to revitalise endangered languages. For example, UNESCO’s International Conference Language Technologies for All (LT4All): Enabling Linguistic Diversity and Multilingualism Worldwide, organized in December 2019, underlined efforts to develop technologies ranging from spelling and grammar checkers to speech and speaker recognition, machine translation for text and audio, speech synthesis, and spoken dialogue, among others, as important areas for enabling linguistic diversity and multilingualism.

This workshop will also highlight the work remaining to extend these technologies to under-resourced languages. The current gap puts the users of many languages, including the vast majority of Indigenous languages, at a disadvantage, creating a digital divide and placing their languages in danger of digital extinction, if not complete extinction. Closing it will require a multistakeholder effort, which further links this workshop to Internet Governance.

Relevance to Theme: The proposed session is related to the selected thematic track of “Digital Inclusion.” Because the Internet frequently has little or nothing to offer in marginalized and endangered languages, and indeed oppresses them, these language groups are underserved and suppressed and lack a digital presence.

Particularly in Africa, UNESCO has been vocal about the need for enhancements in language resources to enable technology solutions which can assist people limited by their language to interact in cyberspace. A salient example, in the context of the COVID-19 crisis, is how investment in open solutions for language technologies could build long-term capacity to respond to public health crises: text analysis methods can be used to forewarn health authorities of an outbreak (Tsvetkov 2017). For instance, social media posts in endangered languages could be analysed for signs of a flu outbreak. This capacity simply does not exist at the moment, which is an issue this workshop seeks to address.

Online Participation

Usage of IGF Official Tool. Additional Tools proposed: UNESCO Teams, to facilitate participation of UNESCO field office networks in Africa

 

Agenda

  1. 8 mins – Introduction
  2. 32 mins – Panel discussion
  3. 37 mins – Open Floor Discussion
  4. 10 mins – Conclusion & Session Summary