IGF 2020 - Day 7 - WS20&122 Exploring the future of endangered languages in cyberspace

https://www.intgovforum.org/content/igf-2020-ws-12220-data-to-inclusion-buildin…

The following are the outputs of the real-time captioning taken during the virtual Fifteenth Annual Meeting of the Internet Governance Forum (IGF), from 2 to 17 November 2020. Although it is largely accurate, in some cases it may be incomplete or inaccurate due to inaudible passages or transcription errors. It is posted as an aid to understanding the proceedings at the event, but should not be treated as an authoritative record.

***

>> MODERATOR: Let's start the session. Hello, everyone, I'm Prateek Sibal, I'm a Technology Policy Researcher at UNESCO with interest in transnational governance of AI, ethics of information, and media economics.

On behalf of the organizers of the workshop that includes researchers and youth representatives from Pakistan, Korea, Australia, along with UNESCO, I welcome you to the session which will talk about AI inclusion and building datasets in African languages.

Of the 6,000 to 7,000 languages in use today, (garbled microphone audio) and English is a major language. For millions of people lack of access to information in their language can mean that their education and their work is a bit imaginary and it lacks cultural context of their surroundings.

So today we are joined by an incredible team of researchers who are working to strengthen access to information in low resource languages across the world. And we look forward to listening to their insights as we build towards bigger projects.

I would begin by introducing Dorothy Gordon, who will be giving introductory remarks for the session. Dorothy Gordon is the Chair of UNESCO's Information for All Program and Board Member of the UNESCO Institute for Information Technologies and Education. She has worked in the field of international development and technology for over 20 years and is recognized as a leading technology activist and specialist on policy, education, technology and society in Africa.

Dorothy, the floor is yours. We look forward to listening to you.

>> DOROTHY GORDON: Thank you, Prateek. And good morning, good afternoon. Welcome everyone to what I hope is going to be a seminal workshop on languages, data and multi-lingualism.

I'm really pleased to be here as the IFAP Chair because IFAP provides a platform for all stakeholders in the knowledge society to participate in international discussions on policy and guidelines for action in the area of access to information and knowledge.

We have been promoting diversity, linguistic diversity in technology and universal multi-lingualism, access to information, and knowledge in cyberspace as these can be determining factors in the development of a knowledge-based society. But we don't look at these issues in isolation. We look at them in the context of access to information and knowledge, preservation of information, especially digital content, media and information literacy, the use of information for development, and information ethics.

Over the past 10 years, we had many meetings on these topics. And these have been promoted by the Chair from the Russian Federation, the Vice-Chair of IFAP, and Chair of the Multi-Lingualism Working Group. And I encourage everyone to have a look at the Declaration on Linguistic and Cultural Diversity in Cyberspace.

I assume that anyone who is listening to this talk is already convinced about the value of language in terms of its transmission of heritage, as a main vector of communication, and the fact that its use in new technology determines the degree of access and participation in knowledge societies.

And we all know that the landscape or the information-scape is very skewed and we have very few languages which are currently dominant in cyberspace. And this has really been evident from March up to now as we see a deepening of the digital divide partly due to the fact that we have such a limited language environment online.

And it's particularly relevant for Africa as we expect that by 2050 we are going to have the world's largest population. And definitely we will have the world's majority of youth so I'm very happy to see the youth representatives on this platform.

And while we recognize the value of languages for our socialization, we should not forget that languages have been factors in domination and separation of individuals and nations for the purpose of changing their identity as well as their capacity for analysis.

And I am very concerned as an African because many parents on the continent, they do not speak our languages to their children in an effort they think to promote their access to jobs and to a more sophisticated lifestyle. And as a result, they are affecting their brain development because if you speak to your child in a very limited vocabulary, we see that this has huge impact on their cognitive functions.

Those first thousand days are very important for children, and so multi-lingualism must be a priority topic for all of us at both an intuitive and a rational level. And when you look at it, you see that for natural language processing there is really a tiny amount of linguistic resources and research that are looking at the African language context.

So I'm really pleased that today we have several researchers from African universities and Civil Society organizations that are involved with this workshop. And they are working towards the development of datasets in seven African languages across 22 countries reaching about 300 million speakers through the UNESCO project.

And, as you know, UNESCO has as a priority Africa. So we need to further scale these projects and support downstream innovation for people to develop innovative solutions for strengthening the access to information through AI-driven innovation.

We also need to ensure that we are not introducing new divides as we pay attention to certain languages and we leave other languages behind. One of the ways we can do this is by documenting our process in such a way that anyone who is excited by the idea of using these tools to get their language online will find it that much easier to do this. And, of course, I'm sure that everything that we do is being licensed openly.

So I was a bit worried when I saw a relatively male dominance of researchers, but I'm hoping that we are going to work to actively ensure a greater inclusion of women and young women as well at all levels of the research and not just as passive kind of yes, we call them to a meeting, but really as drivers for this whole process. And I'm pleased that at least we have two women researchers here who are already mentoring many people in this area.

I want to also encourage us -- in this age of online courses -- to ensure that African universities are given the opportunity to upgrade their personnel in computational linguistics. We can do this relatively easily. I know the group I work with at the University of Ghana really wants to get these courses going. And that way we deepen the resource pool. Then I have already discussed with you a bit about making sure that your methodology is open.

And in closing, let me just say that we are not really just talking about the creation of a dataset here, but we are talking about the creation of a multi-stakeholder network that actually understands what it takes to get this to happen and who can continue to broaden its base and get more people involved. Data scientists, Civil Society organizations, youth activists, linguists, even high-tech companies. We want all of them to get involved.

And in closing, just let me remind everyone, nothing for us without us. This was the key message from the road map for the Decade of Indigenous Languages. And I want to remind us that the Secretary-General's Roadmap for Digital Cooperation paid considerable attention to multi-lingualism.

I look forward to the outcomes of this workshop and the projects being discussed, and I will be extremely happy to discuss potential opportunities for bringing these initiatives to the attention of all Member States of IFAP. Thank you very much.

>> MODERATOR: Thank you, thank you so much, Dorothy, for emphasizing importantly the value of inclusion not only in terms of stakeholders but also communities who are affected and who have the possibility to be left behind online.

And on this note, I would actually invite our colleague Christian from GIZ who would share good news regarding openness and building an alliance. The floor is yours, Christian.

>> CHRISTIAN RESCH: Thank you very much for the introduction.

I'm very happy that I'm here today to speak on behalf of a group of people from like UNESCO and business, and I think as well as Joyce from Makerere University who will speak later. And what we would like to share today, this group of people gathered to form what we call the Open for Good Alliance of a number of partners including UNESCO, Makerere University, GIZ, IDRC from Canada, KNUST from Ghana as well as Mozilla, with whom we work closely, to share a commitment to make open training data available. Following on what Dorothy just said, I think it is also part of the Chair's commitment that much more than data is needed.

But the experience we made and I think the story where this came from was a bit that we work in this field and a lot of us work -- have a strong focus on languages, and also on languages in Africa is that there is a lot you can do. A workshop like this, you can do a webinar, and you find people who work on this all across the continent, but also all across Asia and many pockets of this world.

You can like call to talk to people and at the same time, it is a new topic. It is a new topic. There are a lot of open questions. There's a lot of learning to be done. I mean, I don't think we claim that we really know all of the answers. But what I think we together saw is that there is a need for coordination and a need for a platform and initiatives and an alliance to enable the coordination, to share what works well, to share what maybe does not always work that well, to discuss the problems there are.

And this kind of brought us together, and I think the idea started already awhile back to form this alliance. And this then now we are very happy about this formed Open for Good.

What I think is the very fair question there is why we need another alliance. I mean there are a couple of them around already. I think where we see the need or where we think makes Open for Good unique are two things. The one thing is the strong purpose and the commitment to localized data. And this literally follows on what Dorothy just said.

And the problem is that, as I said, we have brought forward, for example, work a lot in Rwanda and Uganda and they have languages that are spoken by millions of people that are currently left out. And I find the commitment, for example, of Joyce in Uganda wonderful that we need the training data to open up the technology to represent those languages in the 21st century, but currently this is just starting.

It isn't done yet, and there is a lot of work to do. And this is just an example which also goes to a lot of other spheres beyond languages.

And the other point, as open is in the name itself, but where we may be for data, what we think it is crucial is that these resources are openly available, that people can contribute to them, that people can also use them to develop, to bring the languages forward to use digital technology also to make information in their language accessible.

And this is a -- is something we hold very dear which we find crucially important. And this will go together and now we are kind of ready to like step a bit out of the shadow and into the light, launch it publicly. And this is something we would like, most wholeheartedly like to invite you to. We will have a launch later this month on November 25. It is a Wednesday at 3:00 p.m. Central European Time.

If you would like to get in touch or get further information, I'm happy to share them via mail if you leave a mail in the chat. I will -- or we will get back to you. Otherwise, I can point you to our Twitter account where we will share further information, which is @fair_forward.

But if you haven't followed the reasons why we would like you around, let me just end, reiterate we would be very glad to have you along. Two things, we would like to with the launch not just say that we are here but like start what we wanted to do, which is to start. During the launch we will focus on representativity and the prevention of discrimination through this data.

But we also are very curious to hear your insights, and we would like to see what you think about these topics and also would welcome you to see which options and which insights might be in there for you.

So I hope to see a lot of the people we see today again on November 25. More information will be shared then later. Thank you very much.

>> MODERATOR: Thanks, Christian. Thanks for your remarks and introducing the alliance.

I just want to quickly underline that the alliance is going to be a community-driven initiative where we are listening to the stakeholders and they would be driving the initiative forward in different parts of the world.

Without much further ado, we have already heard a lot about Joyce. I would like to give the floor to Joyce.

Dr. Joyce Nabende is a lecturer in the Department of Computer Science at Makerere University and the Head of the Makerere Artificial Intelligence and Data Science Research Lab. Joyce holds a PhD in computer science from Eindhoven University of Technology, the Netherlands. And currently Joyce leads a team of researchers that carry out research in the application of machine learning techniques in solving problems related to agriculture and transport. And she is an expert in natural language processing.

Joyce, we are looking forward to listening to your remarks. The floor is yours. Thank you.

>> JOYCE NABENDE: Thank you very much. Can everyone see my screen?

>> I suspect you may have shared the wrong screen.

>> JOYCE NABENDE: Oh, dear. Just a minute. I hope that is the correct screen.

>> That is better.

>> JOYCE NABENDE: Perfect. Thank you so much for the invitation to speak about the work that we are doing at Makerere University in trying to build datasets in African languages. So in this talk I will briefly speak about the work that we are doing at Makerere to drive and have more language resources for NLP.

Just a minute. So I will start by just giving a brief introduction about languages in general in Uganda, and then the context of our work from Makerere.

So as you can see from there, Uganda is a multi-lingual country with over 40 Indigenous languages and dialects. However, we have a gap that exists due to the lack of available datasets for many of the languages that we have in Uganda, especially on the internet. And most of these end up being low resourced and they lack the large multi-lingual or even parallel corpora that are necessary for building NLP applications.

Based on this demand from the African continent, there have been several growing initiatives to create NLP data. And I just list as examples some of the things that have been going on around Common Voice, and around Masakhane and also especially the AI4D challenge, whose focus has been mainly around building language resources in Uganda, Ghana and South Africa. So we are lucky that Uganda was one of the countries featured to help build our language resources.

And Makerere University participated as part of that challenge, and we have been able to build our datasets. And I will talk about that later as we move on.

When I made the slide, I said Uganda languages were missing but not anymore. Since September 2020 we have a big initiative that has happened for Uganda languages on the Common Voice platform. So we are making the list sets and we are trying to move there.

So I just highlight several of the NLP projects that we have been carrying out around Makerere University. We like to move NLP to the researchers that we have in the Makerere University and in the Makerere AI Lab.

One of the projects we have been doing is looking at keyword spotting models and building automatic speech recognition models from radio recordings. So if we look at the majority of the people in our rural areas, they will try to communicate and put up their views best on radio. And that is where they will go. And they have the hope that the government and people in policy, the people who are making decisions, will hear them.

So part of our work has really been around trying to mine what people are saying on radio and trying to understand, for example, around pest and disease surveillance, but also understand trends, perceptions, mentions, especially now around the COVID-19 pandemic.

And that is a very good project because you actually get to sift through the so many radio recordings that you have in so many different languages. It looks like a very easy thing to do, but it is not an easy thing to do, especially if you do not have data. And I will talk about the efforts that we have made in the area.

Another piece of work that has been going on has been around building natural language processing for low resource languages. And there are several different downstream machine learning activities and tasks and several applications that have been done around several languages like Acholi, Luganda, Lumasaaba.

So one of our researchers has been trying to build models around machine transliteration from one language to another. We have also been able to obtain and build corpora, which are necessary for machine translation.

We have a PhD student who is also working on building grammatical frameworks for the language. Also we have another researcher who is looking at text-to-speech models, especially looking at spoken dialogue systems, again, around Luganda. And we have also been lucky to work which is important for question and answering and search engines and even information retrieval.

So we have been able to start with Luganda because Luganda is like the most widely spoken language here in Uganda and you will see that most of the work that we are driving around most of these projects has started with Luganda as our main language.

So as I said, the major limitation around all of these very brilliant ideas that we might have has been around data. So, for example, to build a grammatical framework, our student has struggled to get the resources. They are just not available. They are just not out there. For example, the work we are doing with radio, you can't just get the radio recordings and build open models and put the data out there.

There are so many steps that you have to go through for licensing, for having to openly put out this data. And these are many of the things that are driving us in Open for Good: how can we have this data, the local data, openly available so that any researcher who is interested in building NLP models can quickly pick up on that data and be able to build the application that they care about.

If you look at the available resources for Ugandan languages, of course we always start with the common one, the oldest, which is the Bible. But we also know that the Bible has many limitations. Right now the way people write, the way people speak, even when you listen to the spoken version of the Bible, it is quite different from what we are experiencing now. It is little dialects that are changing.

People usually code switch. We even have a version of English in Uganda which is called Uglish. So all of those are complications as you are trying to build both speech and text resources for our Ugandan languages. We also can get Luganda Wikipedia, which is also limited. And we try to partner with the Institute of African Languages here at Makerere University. They have their resources, they are a bit limited, but we are also moving forward with them because they are the language experts and it is very important to have this partnership together if you want to build up these NLP resources that you require.

So what efforts have we put in place to get to where we are now? As I said, we were lucky to be able to have data through the AI4D challenge, and we have been able to come up with agriculture keywords and we have had recordings of these keywords. And voices were spoken and have given us enough keywords that we launched a challenge, which is ending this month, for building a keyword spotting model for agriculture. So that has really been very, very helpful for us.

We are also in the process of finalizing with licensing with one of the radio stations who appreciate the value for building up this automatic speech recognition models because that is very important for them to even sift through these radio recordings that they have.

When did people speak, if they are, for example, interested in a particular topic? They can go through that once the model is deployed and get out the meaningful transcripts or the meaningful pattern of the radio recordings where these particular things are mentioned.

So we have also made an extra effort there with the Institute of Languages. We are also trying to create parallel work across several languages we care about here in Uganda.

Most importantly, in September this year, we have been able to launch Luganda on the Common Voice platform. Because if we want to build end-to-end speech recognition, we are driven by the fact that we need a lot of data, both speech and text. And so getting the radio recordings is good.

It has been really slow because after getting the radio recordings, you also have to transcribe the data. But you need, again, lots and lots of data.

So what we have done as the Makerere team in the AI Lab, we have been able to partner with GIZ and Mozilla and we have been able to add and launch Luganda as a language on the Common Voice platform.

Our efforts are under way, amidst COVID and the university opening up slowly, to try and drive as many contributions as possible from the community such that we can build up and have very good speech-to-text corpora that we can use for building the DeepSpeech models. And we hope that this can get started before the end of the year, that we increase the number of hours on the platform.

We have also been encouraged by the work that's going on, they have done work in pushing on the Common Voice platform. So that has been an inspiration as we try to do the work for Uganda.

And because we care about open AI training data, we also partnered up with GIZ to come up with and build open AI training videos that we put together. And we have a series of courses on open and unbiased data. And we have been focusing on how we can help people in the African continent to have access to open AI training data.

We have a series of five videos where we are talking about how to access open voice training data. And we are focusing on Mozilla because it is like the portal that we have experience with, especially at Makerere University, where you can easily go in, download, and build a simple DeepSpeech model.

We are also focusing on how to access open data. We have also been able to talk about open image training data, since much of our work in the AI lab has been around computer vision and building models for agriculture. So we also want to help people to kick start and get to the step of how do you access the data. When you access the data, then what? What can you do with this? So the videos can also help people to be able to do this.

But also because with that data that we are coming with and with the Open for Good Alliance, we want to make sure that people know where to put their data, how to prepare their data to put it out openly for people to have access to it.

So we have a video that is also talking about how to prepare open AI training data. Because that is important. Yes, you can have the data, but if the data is closed and no one has access to it, how do you put out the data so that it is fair and people can have access to it, it is findable and all those things that we care about in the FAIR principles?

And finally, we have also been able to talk about how to eliminate biases in AI training data. That is very, very important. Even as we are trying to run our own work with the NLP data and voice data on the Common Voice platform, it is so critical that we take care of so many things around bias.

For example, having as many female contributors as possible. Having diversity in terms of age. Such that we can have a very good balanced dataset. So these are the things that we feel like are important as we drive to make sure that we build NLP resources and datasets for the African language that we want to do it the correct way.

Such that we can have at the end of the day a very good model that is useable for everyone. So for fairness and equitability, efforts have to be put in place to develop NLP resources for the African language.

And I believe all the people on this call are interested about this, and we are all together to guide and provide support for everyone in their journey toward achieving this goal. Thank you very much.

>> MODERATOR: Thank you so much, Joyce, for this really illuminating talk. You touched upon different kinds of data resources available, from radio to the Bible.

And on that one we have a question in the Q&A on fonts and where to find characters related to different languages, if you could take the time to answer that.

But before that, I just have one quick question before we move to the next speaker. What are some of the practical applications you have already been able to look at or work on using, for instance, voice data or other kinds of datasets that you created, if you could share quickly with us?

>> JOYCE NABENDE: So for our voice data what we have been able to do is we have our project of mining the radio recordings. And so that dataset that we are trying to build up from the Common Voice platform is very important because now we are trying to build our end-to-end speech recognition models based on that data and trying to apply that on our radio recordings.

So we are in the process of collecting the data, but the simple model that we started with is just a simple keyword spotting model. And so the data that we have in the challenge, we have also been working with it inside the lab to try and look at, if I have a radio transcript and I run a simple keyword spotting model and I know that I'm looking for a particular keyword, for example, a mention of maybe a particular pest or a particular disease, can I flag out that transcript with the highest probability that the particular keyword has been mentioned?

So we started with the simple keyword model. And as soon as we have as much data as we require from the Common Voice platform, then we can build the speech models. So that is one of the applications that we are running with.
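As an illustration of the flagging idea described in this answer, the following is a minimal sketch that scores text transcripts against a keyword and keeps the ones above a threshold. It is not the Makerere lab's model, which works on audio; it assumes transcripts already exist as text, and it uses fuzzy string similarity as a stand-in for a model probability. The function names, the example keyword "armyworm" and the threshold of 0.8 are all hypothetical.

```python
from difflib import SequenceMatcher


def keyword_score(transcript: str, keyword: str) -> float:
    """Best fuzzy-match ratio (0.0 to 1.0) between the keyword and any
    same-length word window in the transcript. This is a similarity
    score standing in for a model's confidence, not a true probability."""
    words = transcript.lower().split()
    target = keyword.lower()
    n = len(target.split())  # window size in words
    best = 0.0
    for i in range(max(len(words) - n + 1, 0)):
        window = " ".join(words[i:i + n])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


def flag_transcripts(transcripts, keyword, threshold=0.8):
    """Return (score, transcript) pairs at or above the threshold,
    highest score first. The threshold is an assumed value."""
    scored = [(keyword_score(t, keyword), t) for t in transcripts]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


# Hypothetical transcripts for illustration only.
transcripts = [
    "the farmers reported armyworm in the maize",
    "today we discuss school fees",
]
flagged = flag_transcripts(transcripts, "armyworm")
```

Fuzzy matching is used here so that small spelling or transcription variants of a keyword can still be flagged; a real keyword spotter trained on recordings would replace `keyword_score`.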

>> MODERATOR: Thanks for sharing that. And I think this is a good entry point to Kathleen, who has been leading a lot of the work in Africa building communities, mentoring a lot of people, and for us is definitely a big inspiration.

So Kathleen Siminyu is an AI researcher who is Regional Coordinator of the Artificial Intelligence for Development Alliance. And Kathleen has been -- hi, Kathleen? Yeah.

And Kathleen has long experience of running -- working with AI communities on the African continent including the Nairobi Women in Machine Learning and Data Science community. We had the opportunity to actually organize a workshop with Kathleen and some of her colleagues at the Deep Learning Indaba on questions of AI and fairness, AI and bias, AI and gender equality. And it was a really enlightening experience to bring all these communities together.

Today we are learning -- we are here to learn from Kathleen about the participatory research environment that she is driving forward in Africa.

>> KATHLEEN SIMINYU: Thank you for the introduction. I hope you can hear me all right.

So my presentation today is titled Participatory Research for Building in Low Resourced Settings. And I will be using a case study of African languages, obviously, and particularly two projects which we shall speak about a little later in this presentation.

So starting with attempting to define what low resourcedness is. So with this table, I just tried to illustrate the fact that there is a huge imbalance between the number of speakers that speak some of the languages in the world and the available resources, particularly on digital platforms and with this table particularly on Wikipedia.

So I will draw your attention to, first of all, English, which has 1.2 billion plus speakers in the world. And then 6 million plus articles available in Wikipedia.

And then below the line that is under English, that is all African languages. And particularly -- well, African languages are low resource languages.

And I will draw your attention to Kiswahili, which is the largest most widely spoken language, I think, with 9 million plus speakers. But in comparison to English, there are only 59,000 Wikipedia articles available in this language.

Yeah. So why does this problem exist? It is actually a very complex situation. And what we decided to do is to reflect on the entire NLP process in particular because this is a field that we are in to understand what the problems are and what can be done.

And this is -- we created this I guess diagram to explain the relationship and the interaction between agents and the interactions that they have with the external resources that they require to create and participate in this cycle.

And not forgetting the stakeholders. So where the stakeholders are individuals who are impacted by the artifacts created from this process or, indeed, in this case, the low resource nature of the artifact.

So starting with the content creators, and they are the primary group in the cycle really. In NLP, we speak a lot about datasets but really these are the work of authors and writers and musicians and the like.

And, of course, we find that there is an abundance of digital content in widely spoken languages such as English and French. And the reason why this is not the same for African languages, the reasons are actually varied. They range from a lack of tooling for specific languages, such as keyboards. And this actually goes back to the question that Joyce had been asked because you will find that there are African languages that have special diacritics and those require special keyboards. But if those do not exist, it is already hard for me as a writer to create any content in my language.

In addition, there is just a lack of availability of online platforms that can be used for content creation or rather collection, and then there is also a lack of an audience.

Particularly for political and historical reasons. Ngũgĩ wa Thiong'o described in his book being punished in school for speaking in a language other than English. So he is a Kenyan writer.

And he described a time as a writer in precolonial times when they would be jailed for publishing books and articles in African languages. So this is just further to the fact that content creators really have not had an audience or had an environment within which they could create in these languages.

Second, we move to the translators. And really just following up from the challenges with content creation, if there is no audience to consume the content, it means there is also no audience or demand for translation. And that being one factor.

The translators would also require additional tools for their job such as dictionaries. And you find that in many low resource languages no one has taken the time to sit down and formally create tools such as this.

And then we move on to the curators. These are individuals who are putting together datasets. In an ideal situation, they should be individuals that have a knowledge of the languages and they would require access to web crawlers and to know the language IDs of relevant languages. So in many contexts you may imagine that language IDs is a solved problem, but in many low resource languages we start to realize that there may be either a duplication or an exclusion of low resource languages and this also causes a challenge to the group of individuals.

And then we move to the language technologist. These are individuals who largely deal with model creation. They are the ones who build the computational tools. And they will require access to machine learning tool kits.

Again, if we go back to the fact that we are working in low resource environments, a lot of these tool kits need a large amount of compute to run. So that presents a challenge. And also the fact that preprocessing needs to be done. For well-represented languages on digital platforms, the preprocessing steps are a no-brainer and make the engineers' work very easy. But then in the low resource context you almost have to reinvent the wheel and think universally about how you are going to prepare your data to train any sort of model.

And then we have the evaluator. This is an individual who is tasked with evaluating the quality of results from the tools that language technologists have made. Ideally, an evaluator should understand the language that they are evaluating. In practice, when it comes to low resource languages the evaluators will often not understand the languages and will instead resort to relying exclusively on automatic metrics as a measure of evaluation. For example, the BLEU score, which is very commonly used in machine translation tasks.

Whereas if now you are in a high resource language, whoever the evaluator is, in addition to relying on an evaluation metric, is also able to give feedback to the language technologists based on their human evaluation of some of the output. And this feedback will help the language technologists to improve the model. So we find that particularly in low resource language work this feedback loop is excluded.
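To make concrete what an automatic metric like BLEU measures, here is a minimal sketch of a simplified sentence-level BLEU: clipped n-gram precision up to 4-grams, combined by geometric mean, with a brevity penalty. It assumes whitespace tokenization, a single reference, and no smoothing, so it is an illustration of the idea rather than the implementation used by any particular toolkit or by the speaker. The function names `ngrams` and `bleu` are chosen here for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference, no smoothing."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each n-gram count by how often it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)
```

The score only counts overlapping word sequences. It cannot tell whether a translation that scores low is actually wrong or just phrased differently, which is why feedback from an evaluator who understands the language matters.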

And in an ideal world, if all languages were high resource we would have or we would find that each agent in the cycle has the tools to carry out their role and the process is cyclical such that if one agent is able to do their role, then they are able to provide the next agents the material to then effectively do their role.

But then the reality in a low resource scenario is there are breakdowns in very many points of the cycle. And in each segment of the process, if there is a breakdown it has consequences downstream, making the entire process very much constrained.

So we conclude that low-resourced-ness is more than just about the data, it is about the society. And really, who is society? That is you and I and everyone around us, right?

So what is broken? What are some of the common failures? We find that African -- a lot of not including African languages in NLP research. So African languages are largely not included in NLP research, and not involving native speakers in the development is a problem. A lot of NLP work is centered on English-centered data only. So this brings a disconnect in terms of contextual understanding if we are always just not looking from within and looking to build resources based on oral tradition, I would say, but rather looking to translate or copy and adapt from external. And then finally, not involving native speakers in the evaluation process.

I'm now going to go into some of the ways in which we are tackling this problem.

Both looking to make it a participatory research. So using this as a means to ensure that everyone who should be in the room and should be contributing to this work is actually in the room and able to contribute to this work.

So hypothesis one, that participatory research can facilitate low-resourced machine learning development. And the project that best speaks to our efforts in this area is Masakhane. So Masakhane is a grassroots NLP community for Africa by Africans. And it started initially for questions on machine translation.

Which you will consequently see a lot of the learnings related to Masakhane are based on, but then it has grown to involve a wide variety of other tasks.

What is Masakhane? The efforts are open source in that all models, data, code, even down to meeting notes, every time we have a meeting are freely available on GitHub. It is continent wide. At last count, we had researchers contributing from 30 African countries now. Many of us have never actually met in person and really only know each other by our avatars or our digital pictures. It is largely distributed and entirely online thus far.

So again, I mentioned that Masakhane began as a movement for machine translation. And these being some of the biggest wins. So far, there's 47 translation models for 35 African languages that have been published. And again, all of this work is open sourced and freely available.

And also evaluation work that we have most recently undertaken with the trained machine translation models is the translating of COVID-19 surveys and TED talks.

So we went through a process of doing those translations and then doing an evaluation iteration of having individuals post edit for nine languages. And when this presentation is shared, you can have access to the full list of languages trained.

My slide stopped moving. Okay. Again, Masakhane is research focused so we have seen great success with papers published in the past year. Particularly, Masakhane published two papers with I think about 20 collaborators each on those papers, which is pretty cool, in my opinion.

So the most recent one has been published to EMNLP Findings and also to the African NLP Workshop which was held earlier this year in collaboration -- sorry, co-located with the online ICLR.

And there are papers on really a variety of other topics from researchers dedicated to Masakhane. And again, the full list will be available in case that is something that may be of interest to you.

And participatory research can certainly help the development of good high-quality African language datasets. And the work to test this has been to facilitate or to run a language fellowship program which has led to the creation of a number of datasets. So you see the languages represented here at the bottom.

And really, I guess what happened to bring about the language fellowship program is we got an opportunity to have some work funded. And to put that into context, Masakhane is largely run on the goodwill of individuals and on the energy to work on stuff in their free time.

So when the opportunity came to fund some work and to really engage researchers full time on a problem, really the common shared problem that none of us can escape is a lack of data. And so this is what we chose to do with funding that became available on the premise that when data curators originate from the states where languages are spoken, creative and culturally grounded forms of dataset creation can take place.

So through the work with the fellowship and the various fellows, we've created the datasets for machine translation, text to speech, sentiment analysis, and document classification.

The slide gives an outline of the languages involved and the countries where the teams originate from. And we have seen a variety of methods and processes used down from engaging with translation communities or engaging with community groups like church groups and hiring translation, let me call them parties, where individuals sit down and create datasets from scratch.

It has been a very eye-opening process to see the creative means by which people go about creating datasets such as this.

In addition to the datasets that are resulting from this process, other outputs that we have created through collaboration with partners and really just using each dataset creation cycle as a case study is IP and copyright guidelines which are pretty important considering a large number of the fellows are creating datasets from data that is already existing on the internet. So they are doing a lot of scraping of websites, newspapers, books, scraping of text and data. And then data protection and privacy guidelines, again, because data sources do include social media platforms and just other data sources which include personally identifiable features of individuals.

And this work has been funded and made possible through very many partners who I'm going to try and acknowledge now. So there is the Knowledge for All Foundation. There's IDRC and Swedish Sida. So IDRC is the Canadian development agency and Sida is the Swedish one. Of course, UNESCO. And GIZ through the Fair Forward Initiative. There's Strathmore Center for Intellectual Property and Information Technology, the Data Science Working Group, a research group at the University of Pretoria, and Zindi.

And I will leave you with Masakhane. We build together. Masakhane is a Zulu word that means we build together. And that's it. Thank you very much.

>> MODERATOR: Thank you so much, Kathleen. I personally loved how you presented the framework within the five points -- the content creators, the translators, the curators, the language technologists, and the evaluators. It really gives us, all of us an entry point to look at the challenges, the map with the fire and the heat that you showed on how to address each of these challenges.

And I think this is an important take-away as we work towards the outcomes of this and also to take forward the fellowship program to go beyond the content curators which is super important to have native language speakers doing that but also to address challenges in translation and language technology.

Thank you so much for this insight. And, of course, always looking forward to working with you and learning from you.

Moving on, we have Roy Boney Jr. who serves as Cherokee Nation Language Programs Manager. He is an enrolled citizen of Cherokee Nation. He has been working in Cherokee language revitalization programs for nearly 20 years, with 14 of them directly for the Cherokee Nation.

He has served as a cultural and animation specialist for the Cherokee Nation Charter School where he created books, short films, and digital media content in Cherokee language.

We are really looking forward to hearing Roy here now. He has first-hand experience in generational losses in Cherokee language transmission, something that our previous speakers have also touched upon on how a lot of people can be excluded or there can be forms of colonization which still exists and how we need to battle those.

Roy, the floor is yours. We are looking forward to listening to you. Thank you.

>> ROY BONEY: Thank you. I'm glad to be here today. Let me get my screen sharing going here.

Can everyone see that? All right. I will talk about the Cherokee language today, or (Speaking non-English language), as we say it in our language. We are a very small language compared to a lot of other languages in the world. We have about 400,000 Cherokees total. We are represented by three different Cherokee tribes in the United States.

So I work for the Cherokee Nation, so that's the largest of the three tribes. Overall with all three, we have about 400,000 citizens. And out of that, we have done a survey in the last two years or so and we've discovered there are only 2,000 people that are fluent in the Cherokee language.

So we're at a very critical point in our language. So we have engaged quite heavily technology to ensure that we can revitalize our language. We don't like to say preserve because that kind of implies that it's going to die. We're going to do what we can to make sure bring it back and revitalize it and bring life to it.

So I'll give you an overview about a quick history and then jump into how we are using technology. We have our own writing system. This was developed by Sequoyah in the 1820s. This is a chart of his own handwriting showing what it looks like. We have 86 characters in our language. It is about 200 years old; next year is the anniversary of it. So we're going to have a big celebration among our communities about this.

But there are 86 characters. So over the decades from its introduction to the community, we adapted to different types of technologies. So back in the 19th century, they adapted from the cursive handwriting to the movable type to the printing press. And so from that we started doing a lot of printing.

As someone else had mentioned earlier, you know, we did a lot of religious materials from missionaries. So there is the New Testament is in Cherokee. Lots of other religious materials like hymn books and things like that. We also have traditional materials from our traditional communities in the Cherokee language as well.

We still print newspapers in Cherokee. So this is an issue of the Cherokee Phoenix. That's our Cherokee newspaper that has been in circulation since 1828. There have been a few interruptions, but for the most part it has been a resource for our people for a long time now.

Our newspaper is online as well. You can get a paper version. You can go to the website and read it in English or in Cherokee. We engage with our translators to record these stories in Cherokee and they write it as well so they can follow along.

And that serves as a very good tool for second language learners of Cherokee. They can see our language as written and hear a natural speaker saying these words.

We engage with technology pretty heavily. We have a language technology program in our tribe. So what our people do there, we do a lot of localization and we've worked with Microsoft and Google and Apple and a bunch of the other tech companies to make sure that their products support our language. We also engage, we use the internet quite a lot.

We have a TV series called Osiyo TV. And that means hello in our language. And all of the episodes are put online. And they interview our cultural elders and artists and other people doing important work for our tribes.

But we have language lessons as part of our TV series. So you can watch all this stuff on YouTube now and on our website here.

We also engage with the university so we have a telecourse in Cherokee language that people can watch on the TV air waves here. You can also take this class online for college credit.

And again, we work pretty closely with our Cherokee first language speakers. I'm a second language learner so I'm not fluent myself. I'm still learning but engaged in this process. And we want to make sure that anybody that wants to access our language can get to it whether it's online or through TV or radio.

So we have a weekly radio show as well that's hosted by a Cherokee speaker, Dennis Sixkiller, and he interviews our various elders throughout the Cherokee community. As of this year, we have about 950 episodes of this radio show. And you can listen to all of these episodes on our Cherokee Nation website. And we have them being podcast as well.

So this is another resource where you can listen to -- they are about an hour long, an hour-long interview with a Cherokee speaker all in Cherokee. You can listen and learn and hear about the way they grew up and various things like that.

So again, we're leveraging the technology to do this. It is broadcast on the radio, but if you miss it you can also listen any time you want.

And we are also engaged in research, documenting things like tones and all of that kind of stuff. I'm not a linguist so that is outside of my purview. For the radio show, for example, we have a university linguist who is working with our translators and they're going through these interviews and marking things like the grammar of the language and the tone and all this and they're documenting this. And it's just providing very valuable research to us as learners.

And we intend to share all this stuff with our learners as well because we have a master apprentice program, too, where we pair young learners with elder master speakers for a two-year intensive project. So they get engaged with the language heavily and they can use the language shows and TV shows and listen to Cherokee and learn while they're working with an elder speaker.

I mentioned we do a lot of localization. Part of that -- we had our syllabary encoded with Unicode. It took a while for the communities to understand what that really meant.

Prior to that, we had a lot of different fonts floating around that were not Unicode compatible so that caused an issue with communication in the communities. You know, if you didn't have the exact same font you couldn't read what someone had written. So when we got past the barrier of getting our language into the Unicode standard, that was a huge leap forward for us in terms of technology because it opened the door to not only using our language on the internet but having this data that we could transport easily to another source, you know, and we would have the same code points for the characters.

So we are ensuring that we are sharing the correct information as it goes to another device or another system. And we can use this in databases.
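The point about stable code points can be shown with a short sketch. It uses the three code points that spell the syllabary word for "Cherokee" (Tsalagi), written as escapes so the example does not depend on any installed font. The helper names `describe` and `is_cherokee` are chosen here for illustration, and the range check covers only the main Cherokee block (U+13A0 to U+13FF), which is a simplifying assumption.

```python
import unicodedata

# "Tsalagi" in the Cherokee syllabary, given as explicit code points.
word = "\u13E3\u13B3\u13A9"

def describe(text):
    """List each character's code point and its official Unicode name."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

def is_cherokee(text):
    """True if every character lies in the main Cherokee block."""
    return all(0x13A0 <= ord(ch) <= 0x13FF for ch in text)

# The same code points survive a round trip through UTF-8 bytes, which is
# what lets the text move between devices, systems, and databases.
encoded = word.encode("utf-8")
decoded = encoded.decode("utf-8")

for code_point, name in describe(decoded):
    print(code_point, name)
```

Because the identity of each character lives in the code point rather than in a particular font file, two people with different Cherokee fonts installed still read the same text.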

And we operate a Cherokee language immersion school as well for children from prekindergarten to the sixth grade. And they can have user names in our language, they can have passwords in our language. So that provides an extra level of security as well using our writing system for that type of system.

So building on that, since we are -- since we do have our language encoded or our writing system encoded in Unicode, we can do things like we are trying to get into a lot more of machine learning, using datasets to generate verbs.

So this is our test. We created a Cherokee verb builder. Our language is synthetic so we can break it down by the prefixes and suffixes and stems and all this, but without it being encoded in Unicode we would not be able to do this type of work easily. So we're very glad that we have this ability to do this.

And it is only in recent years that we kind of moved in this direction. I know I heard earlier mentioned, too, a lot of times communities want to try to translate something that exists already into their language. So we are moving away from that and trying to create new content directly in Cherokee for our communities. And using this technology has helped because our translators, we have staff translators at Cherokee Nation and they create such a lot of content of translation like the newspaper I shared earlier, they translate that.

They need a place to store all this data so we're creating the databases, a language corpus, to hold all this so we can share this data with our communities as well for the learners and the other Cherokee speakers out there.

And the great thing about this is since most of our communities have smart devices, you know, they have some tablet or cell phone now, if they can get an Android device, our language is supported on Android and iOS, on Windows, and on Chromebooks. It's kind of any mobile device you can think of now our language has support in it.

So we have elders in our communities using our language. They're texting, e-mailing, posting things on Facebook in Cherokee. So they are communicating with their grandchildren and their children. Despite our language endangerment, we are having a very interesting period of revitalization, again through the use of technology.

We also have a tourism department. And so what they do is they try to engage the public around this that might not be familiar with the tribe or the language. And we do videos like this, we do a Cherokee word of the week. There is a couple of hundred of these videos on YouTube.

Again, we use our writing system and the English phonetics. And these videos can be shared with anyone. Again, a lot of learners access these.

And we have our own YouTube channel for the tribe itself which has lots of -- we have interviews with Cherokee speakers that we film. We also have online language classes that are archived. We have quite a large number of Cherokee language online classes. And especially with the pandemic now, we've moved online pretty heavily.

And on that front as well, we are in the process now of developing our own custom virtual classroom. So it is similar to Zoom, you know, what we are doing now, but it is going to be the interface will be in Cherokee for our students.

You know, so without, you know, Unicode technology, we wouldn't have been able to do that kind of thing either. So we're in the middle of doing that development, too, at the moment. So on our Cherokee Nation website, we have all our language resources easily accessible to the public.

We develop a lot of things in the technology realm for fonts and keyboards. We work with professional font developers like at Google. Google released a font for us in Cherokee. And we also engaged with Microsoft to do a UI font for Windows in Cherokee. And then we have professionals like at Adobe that have made fonts and things.

So we try to have the resources to create content in Cherokee as well. So we have been training a few people in our community, too, some of the younger people that have the know-how. How do you make a keyboard in Cherokee? You know, that's a skill that a lot of these kids pick up on pretty quickly because they play a lot of games, that kind of thing, and they understand the technology a lot more than our elders do.

So they are creating these new ways to use our language on these devices. And I should mention we also are doing animations. We have done animations for quite a long time.

They have been very small-scale short films, but we recently have begun this animated series called Inagei, which means in the woods. And it is a full professional level Cherokee language cartoon.

Coinciding with this, again because of COVID, we are making a language app using these characters to teach language as well. So they will have a various set of games inside the app. We are in the development process right now with this.

We have engaged with an app developer to do this process. But again, because our language has been encoded in Unicode, it has really helped us bridge this gap where we can send our datasets to the developers and they can input our language into it.

And on top of that, we're doing recordings as well so you can hear the language being spoken in the game. So for us, as I mentioned, we are a very small language community, but we are trying the best we can to leverage the technology.

So hearing the stories from everyone else, we are in a similar situation, and so I like to hear or see how people are using the datasets as well.

Because we are creating -- we had a grant prior to COVID where we have got all of these universities and other institutions in the United States that have Cherokee language documents in their archives. We digitized several thousand copies of all these documents that are written in Cherokee.

And so we're building a database where we are inputting all of this data in the Cherokee language and we're having it translated so in the future we will have a very large, centralized database of Cherokee language terminology. And it will include things like the transcripts from the scripts of this animation, for example. So we're going to try to include it all in one central area for access to our citizens.

So we have our own department at the tribe now. Prior to -- we recently were reorganized as a government internally, so our Cherokee language has its own governmental department now.

We used to reside under education because we just did classes for the community and things. But now with the departmental level structure, that will help us quite a lot more to engage with our community and hopefully be really successful in our language revitalization efforts.

If you have questions, there is our catch-all e-mail. It will be routed to the proper person in our organization. I will wrap up here. But thank you.

>> MODERATOR: Thank you, Roy, so much. I think one of the beautiful aspects of this webinar/discussion that we are having today is that we have so many diverse perspectives and so many key points.

For instance, in the beginning Dorothy mentioned about how younger people may not be engaging so much with their languages. But seeing the programs that you have implemented from animation to, you know, radio shows to pairing youngsters with people who have mastered the language, there is a lot of societal aspect, as Kathleen was mentioning, in actually revitalizing and spreading language forward.

On that note, I would invite Subhashish to talk about his experience. Subhashish is from India. And he is a steward in the openness movement. And he works at the intersection of openness, digital human rights, diversity and storytelling.

For over a decade, he has led strategy programs in diversity and inclusion with a rights-based approach in many international Civil Society organizations, including the Wikimedia Foundation, Center for Internet and Society, Mozilla, Internet Society, and Engaged Media.

As a National Geographic Explorer and documentary filmmaker, he has documented many Indigenous and endangered languages under the ambit of Project OpenSpeaks.

So potentially there are some linkages with what Joyce is working on in Uganda and what Subhashish is probably going to talk about. The floor is open, Subhashish, looking forward to listening to you.

>> SUBHASHISH PANIGRAHI: Thank you very much. And thank you, all of you organizers, for giving me this opportunity. So I come from a country that speaks more than 780 languages, out of which 22 are officially recognized. That means that the provinces and the governments are encouraged to use those languages for governance. And that is very good news.

But what happens in reality is that the constitutional rights and the policies don't always end up in implementation. And that has affected my country for a very long time. And as I have worked with many Civil Societies across Asia-Pacific, I have seen that in other countries as well.

Where our countries are seeing sort of a digital shift right now and e-governance is being emphasized a lot in the developing countries in the region. Whereas there is little emphasis on the socio-economic divide that exists in these countries.

Last year, I was working on a fellowship called the Digital Identity Fellowship. And one of the areas that I looked at is the use of languages in governance. And particularly providing critical information to people in their native languages.

What I found out through interviews for a documentary is that a lot of members of different Indigenous language communities don't really have any understanding about the -- when they comply to the government program and they share their permission for private data collection, they mostly comply and they don't provide consent because they don't understand and don't have the literacy of the legalities and digital rights that they have.

And because that becomes much more difficult, but also a very critical element right now that all of us that work in this domain of languages being used for governance, we need to look at how these stakeholders pay more importance and provide literacy in different languages around the areas of digital rights and human rights and also the Indigenous rights to begin with, which varies from community to community.

Moving on to the next area that I personally have worked for a long time, which is openness. And when I say openness, it sort of cuts across many different practical areas including open source software to open licensing and open content and open culture.

And something that is more philosophical, and that is embedded in many Indigenous communities, which basically means that knowledge is free and universal, and everybody should have equal and equitable access to that. And knowledge is something that is collaboratively created. And there is always a distributive ownership and there is always a feedback loop. And these foundational aspects are common in many Indigenous practices and many of the openness areas that we work with.

So when it comes to languages and the use of languages, particularly on the internet and digital domains, we see a digital divide. And I think it was Dorothy who kind of highlighted or it was probably a -- but if you look at the internet, it is highly divided and it is not democratic yet. It is not equitable. And the digital divide that we see also lies in many different areas starting from oral knowledge and written knowledge and oral content and written content and how much participation people have on the content that's available on the internet. And I think Dorothy spoke about that in the beginning.

And I think there are projects that are driven by communities. And those projects are certainly working towards reducing that digital divide and the gaps that we see, particularly from a rights perspective starting from Wikipedia. And Wikipedia, as we know, is itself run by volunteers and it's written by volunteers. And it is written, and there are many oral languages. My own country has so many oral languages that aren't even documented and don't have a writing system so they can't be there on Wikipedia.

But those tools are useful when it comes to building machine translation and AI and other similar tools. So a lot of language communities are certainly working towards creating resources and they're creating open datasets, but a lot of community members also don't have the literacy about openness in the first place.

And when it comes to government programs, they certainly don't have a very clear stand. Sometimes the data that is released by many governments is open data. But the content, that is the documentation and other supporting literature, that isn't available in open licenses. So that means that the governments don't have a clear idea when they release data. So that kind of creates more problems.

So I think many governments have to sort of standardize these things. Look at what is really basic and fundamental and what is really important for the end users. If, say, a documentary filmmaker like myself is documenting a particular language, then their first and foremost duty is to make that documentary available under an open license so that the community itself won't have to deal with the copyright restrictions and license.

And that isn't the case for much academic research. A lot of researchers themselves don't have a very clear idea and that is not part of their studies as well that they need to release something after the end of their research. And their research items have to be openly licensed so that the community members themselves can access the research outcomes freely and openly.

So I think it is really important that we underline these aspects in our own work, whatever stakeholder we represent. And I think documenting our own work is also very important so that other people from different countries or other communities that might be doing similar work can use the documentation for their own research and for their own work.

So going to the next aspect of my presentation. Is social media being used? And social media has, you know, a very -- it lies basically in the spectrum of our rights. And it has a lot of good things and bad things that contributes to the society. When it comes to the rights, you know, a lot of social media companies collect a lot of personal data.

And the privacy aspect is always in question when it comes to large social media companies. They can have surveillance that potentially increase in societies. And social media is also used for propagating fake news and disinformation and misinformation, which is really important.

But what is really important right now is the algorithmic decision making that social media companies use. They harvest people's search history and the patterns and use that data to kind of provide them ads. And those ads are now multi-lingual.

So there is a lot of use of native languages and human languages on social media that is now used for algorithmic decision making. But on the other side, that is also used for spreading the languages because social media is a very pervasive technology. And it's very decentralized and democratic in that sense; a lot of communities use that to promote their own languages.

Communities use social media for engaging with other people that are geographically dispersed. And that sort of is very important for us to kind of have that distinction that how much literacy people have, how much literacy end users have when it comes to using social media and what kind of literacy should be provided on social media. So people know when they give their consent for something rather than just using them and probably being victims of big corporations harvesting that data and using that data against them.

So I would probably summarize what I shared that the future has to be more open and collaborative. And we need to focus a lot on open educational resources and resources that are available to everyone in a free license.

So those resources could include content. It could include educational resources. It could include datasets. But the resources have to be open and there is no sort of negotiation there.

Whoever is producing that resource has to understand that the work that they produce is not just a one-time effort, but that could be used by other communities elsewhere.

And there has to be a lot of effort to kind of ensure that the content that is used for machine translation and other training, the training for the datasets, should focus on using content that is not just, say, religious or cultural but also looks at social justice and other feminist issues.

And there is definitely content in every language around areas that are probably minority, but it is important that those -- that there is a diversity of content that is used for machine training.

And, yeah, so I will probably end there and leave some space for more question and answer and also discussions around this.

>> MODERATOR: Thank you, Subhashish, for bringing this rights context to the discussion today and also talking about literacy and media and information literacy. More importantly, also how it is not just about researchers harnessing data but also giving it back to the community to be able to use and benefit from those resources.

On this note, I would pass on the floor to Elliott who would moderate the conversation. If you have any questions, please let us know. And the floor is yours, Elliott.

>> ELLIOTT MANN: Thank you. We will now enter into a short period of open discussion time. Feel free to ask questions in the Q&A box in Zoom, or otherwise just in the usual chat.

And as participants, you can raise your hand if you want to speak. I think, Jaewon, you had a question first off.

>> JAEWON SON: Thank you for giving me the floor. I'm one of the organizers today in capacity building.

I would like to just touch upon like the use and social perspective. So just like Dorothy mentioned, I believe a language is endangered if not being passed on to the younger generation. And you don't want to lose the ability to speak the language as well. But without like the proper support from the society, community, and family and education, they wouldn't be able to like maintain their native tongue.

While Africa has had a lot of foreign investment in recent years, creating more jobs, for local people to get the job they have to learn more kind of primary languages like Swahili or English. And I think this has put a lot of pressure on the survival of the endangered languages, which also endangers their cultural heritage and the traditional knowledge passed on through the language.

So in this context, just like Kathleen mentioned, I think it has to be more encouraged to have more AI kind of fellowships and to encourage more AI researchers focusing on the endangered languages, for them to have more capacity for that research, and also to develop a multi-stakeholder network that strengthens the research on language technology based on AI technology for African languages, just like the Open for Good Alliance that Prateek and Christian mentioned. Thank you.

>> ELLIOTT MANN: That is a good question: how we can best involve young people and get people involved in the research and those sorts of technologies for AI.

Do any of the panelists want to take a crack at that one?

>> KATHLEEN SIMINYU: I would like to chime in. Not really in response to that question, but just to add the fact that, like Jaewon has mentioned, we are increasingly finding that young people in Africa don't speak their mother tongue because individuals are now growing up in the cities. And whereas in the villages you use your mother tongue a lot in the home context or in more informal situations, now you find that English and Kiswahili are used, which are more regional and not ethnically attributed to any group.

So when there is a desire to learn their mother tongue, there just aren't available pedagogical resources. And I think that's something that we overlook. If I wanted to learn French today, I could download Duolingo and start doing exercises and that would be really easy.

But to start learning my mother tongue today, the process is not as simple. And I'm always saying the technology for this stuff to be created exists. The applications are just not there.

So I think another relationship that could be formed is multi-disciplinary relationships between AI researchers who are collecting this data in their work, but then also, you know, web developers, mobile developers and pedagogical experts who can start building those applications for this to be made available. Thank you.

>> ELLIOTT MANN: Exactly, I like that part as well. I have spent some time in Indonesia where it is very much like that where you have the national language and then you have different regional languages as well.

And many young people these days growing up are only speaking Indonesian, the national language, and not learning their local dialects. And it can have a big impact even if they want to talk to their family or other people back in their regional areas. One area that affects young people especially is learning these different languages, particularly where there isn't basic access through a mobile phone or something.

I do have a couple of policy questions here, which I think we might have a bit of time to discuss. The first part I've got here is whether machine learning techniques and the internet can be used to strengthen multi-lingualism and the availability of information and knowledge in low resource languages.

I suspect everybody here would agree that that is absolutely the case. But I'm personally interested in how everyone here sees -- you know, these technologies have grown incredibly quickly over the last few years. I mean where do you see the next steps as being going forward?

>> ROY BONEY: I would like to see more text-to-speech increase, so it is much easier for a low resource language to engage with.

We've tried it with Cherokee, and it required such a huge amount of data to even start so it's daunting at that point, but maybe it will get better in the future.

>> ELLIOTT MANN: Certainly, that seems like a very good area for innovation. Joyce, did you have something to add?

>> JOYCE NABENDE: I think, as I mentioned earlier, I think we really need to push and have more ways to push out the datasets that we are creating, so they are really available.

I was just reading yesterday around Kiswahili, someone wrote an article and said that we are to blame for the poor translations that Google Translator is giving us in Kiswahili, right, because we are not making an effort as much as possible to push out as much information as possible in Kiswahili such that it can be crawled to make the translations better.

So I think the onus is on us that as we try to create and view as many resources as possible in our local languages that we can push them out as much as possible and that they can contribute around on the internet and make this research available for all the different languages.

>> DOROTHY GORDON: Let me just come in and say that I think it is really important that we not only look at it vis-a-vis the colonial language, but making sure that we use machine learning techniques to help us translate between our own languages. Because that is what actually happens a lot on a daily basis.

For the -- even though the urbanized youth may not master our languages, the bulk of the population conducts most of their business in our languages, they move around our countries, and they need to be able to communicate between African languages.

>> SUBHASHISH PANIGRAHI: If I can add something here.

I work closely with a community in India and their language has a writing system that is standardized in Unicode. So that is good news, they have a Wikipedia that was created two years back.

And many of the community members now use Twitter to create sentence pairs. So it's mostly English and their language. And they are doing that in a collaborative mode very ethnic mode and it's all being kind of saved in a place to use the dataset in the future to create a text-to-speech or a machine translation tool.

But right now they don't even know what tool they are going to use and what tools are going to be available in the future, what is the best open source tool that they could use. But they are just creating these sentence pairs.

I think many communities can start from that basic baby step that they can take that they can create things that are most needed for a larger project. And that could start with a little resource with people that don't probably have much understanding about AI or machine learning techniques but that can contribute to a larger project in the future.

So I think that kind of collaboration. And many community members don't also have enough access to the technique. But they could -- they all have smartphones with them, and with that they can just tweet. And that is pretty simple, and that is something that many elders can do as well. Elders that just have got a smartphone but don't know probably much about the internet, and they can just easily contribute and type two sentences a day.

>> ELLIOTT MANN: Thanks. So I think we're just running out of time, I think we will end on this one question here.

How can documentary linguists collaborate in a concrete manner with AI practitioners to enhance less resourced African languages?

Dorothy, I saw this was a point you raised earlier in chat and certainly it's something that I have seen raised as a sort of theme throughout this session.

How do people feel about the linkages between linguists and AI practitioners in the area? We could leave it as a comment.

>> DOROTHY GORDON: Because you already mentioned I had said something, I want to talk again.

But I definitely feel we could do better here working with the two groups to actually have a better understanding of each other's field. And that is why I keep pushing for computational linguistics being taught at the university level throughout Africa and bringing together experts from both computer science and from linguistics.

>> ELLIOTT MANN: Certainly. I think that would be a good pathway out of that. Prateek, I'll hand it over. Oh, Kathleen? Yes?

>> KATHLEEN SIMINYU: Yeah, I'd also like to chime in.

I don't know what we can do to tangibly incentivize more coordination between academics. I have seen very interesting work, collaboration between linguists and computer scientists where the computer scientist builds some speech recognition models and the linguists use them in language documentation.

So they would go out in the field, collect the data, stories from native speakers. And then as an initial pass for transcription use the speech recognition toolkits and then quickly edit, which made their work much faster, more efficient. And it is a symbiotic relationship because on the other end of things the transcribed recordings then act as additional data to make the speech recognition models much better and perform better in future. So that is one example.

Another is going back to the point of there being a lack of pedagogical resources for language learning in some of the languages. Again, I have come across, particularly for the languages I'm working on, linguists who have been working for years and years to build pedagogical resources.

And they have unpublished dictionaries and lexicons and all sorts of things which could be useful to computer scientists. The two, I don't know, silos in academia are just not talking or are not connected.

So a lot of us are out here trying to scrape data from the internet but we don't realize that there are places in academia where the data exists. And they on the other end don't have readily available means to make that openly available.

If I got access to a dictionary like that, thinking about how I would publish the data on a web portal, I have the technical skills to make that possible whereas they do not. Many times it has been as simple as making an introduction and making each party aware of what the other is trying to do.
