IGF 2019 WS #30 Let there be data – Exploring data as a public good

Organizer 1: Government, Western European and Others Group (WEOG)
Organizer 2: Technical Community, African Group

Speaker 1: Renata Avila, Civil Society, Latin American and Caribbean Group (GRULAC)
Speaker 2: Alex Klepel, Private Sector, Western European and Others Group (WEOG)
Speaker 3: Audace Niyonkuru, Technical Community, African Group
Speaker 4: Baratang Miya, Civil Society, African Group
Speaker 5: Said Ngoga Rutabayiro, Government, African Group

Format

Break-out Group Discussions - Flexible Seating - 90 Min

Policy Question(s)

• How can we support the development of digital public goods such as common data infrastructures to train artificial intelligences, e.g. for voice recognition technology in underrepresented languages?

• How can we develop sustainable governance models for data commons based on a multi-stakeholder approach?

• Which role can data commons play as an instrument of innovation policy and means to stimulate supply and demand for innovative technological solutions?

SDGs

GOAL 1: No Poverty
GOAL 8: Decent Work and Economic Growth
GOAL 9: Industry, Innovation and Infrastructure
GOAL 10: Reduced Inequalities
GOAL 11: Sustainable Cities and Communities
GOAL 12: Responsible Production and Consumption
GOAL 17: Partnerships for the Goals

Description: Data is mostly seen as a tool: for decision-making, micro-targeted advertising, surveillance, and in some cases for social good, e.g. to increase transparency. However, data is now also an infrastructure critical to social and economic development. The availability of high-quality data is crucial for training artificial intelligence, and the lack of it is one of the main barriers to the development of local AI-based solutions, especially in the global South, where resources to acquire data are scarce. Both the availability of training data and AI-based solutions as such can play a major role in addressing current inequalities regarding access to knowledge, services and the diversity of cultural expressions. Voice interaction is a good example of an impact-driven AI-based solution: it has the potential to give millions of people access to information and services they do not yet have, preserve cultural heritage, make technology more inclusive and ultimately foster local value creation as well as digital sovereignty. In this session, we would like to explore different initiatives aiming at creating data commons and digital public goods to learn from their successes and challenges. We will discuss various governance models and ecosystem approaches, such as community governance and multi-stakeholder models, with the aim of democratizing the potential of artificial intelligence for all.

Expected Outcomes:
- shared lessons learned and good practices for the development of digital public goods, especially data commons
- mapping of different governance models for data commons and the respective roles government, private sector and civil society play in such an ecosystem
- discussion of the economic impact data commons can have as a means to stimulate the development of and demand for innovative AI-based solutions amongst stakeholders

The session will consist of a short series of initial inputs from each of the speakers (5-7 minutes each), which will be followed by an interactive round of discussions in smaller groups (potentially in a "world café format") of approximately 40 minutes, each of them hosted by one of the speakers. The results of these "breakout sessions" will be brought together in the last 15-20 minutes of the workshop. At the beginning of the session, we will also use Slido or a similar tool to collect open questions and comments of participants, which then will be addressed during the workshop.

Relevance to Theme: Today, applications that use artificial intelligence or automated decision-making are mostly developed by Western companies and in China. A big part of the world, notably people living in the global South, is excluded both from the development of these applications and from being represented in the data used to train artificial intelligence. One example is voice recognition technology: in local languages, this technology has the potential to give underrepresented groups access to information, services and the diversity of cultural expressions. It is essential for an inclusive and diverse information society, and will play a major role in human-machine interaction in the future. However, for economic reasons, corporations are focusing on mainstream languages such as English and Chinese, leaving the majority of people in the global South underserved and excluded. By discussing means to develop (local) data pools as commons, we are focusing on the open provision of training data as a crucial precondition for (local) developers to build inclusive AI-based applications and thereby close the digital divide we see today in the development and use of artificial intelligence.

Relevance to Internet Governance: The development of inclusive and ethical AI-based applications requires both a normative framework and shared resources that enable more people to build applications relevant to their local context. For instance, voice recognition technology in local languages often lacks a business case to justify investments in collecting data and training models, even if the potential for digital inclusion is staggering. Building data commons thus removes the need for a single stakeholder to bear high investment costs and bases the development of locally relevant AI applications on a multi-stakeholder model with shared responsibilities. It is these governance models for data commons, and the respective roles governments, private sector and civil society can play within them, that we would like to discuss during the session.

Online Participation

At the beginning of the session, we will use Slido or a similar tool to collect comments and questions from remote participants, which will then be addressed during the workshop. During the breakout sessions, each discussion group will use a laptop to ensure that remote participants can follow and take part in the discussion. During the wrap-up phase of the workshop, we will use the same tool to ensure that remote participants can share their perspective with the bigger group.

Proposed Additional Tools: We will use Slido or a similar tool, which enables polls and the rating of questions and comments from the audience.

1. Key Policy Questions and Expectations

• How can we support the development of digital public goods such as common data infrastructures to train artificial intelligences, e.g. for voice recognition technology in underrepresented languages?

• How can we develop sustainable governance models for data commons based on a multi-stakeholder approach?

• Which role can data commons play as an instrument of innovation policy and means to stimulate supply and demand for innovative technological solutions?

2. Summary of Issues Discussed

The group discussed the areas of (1) institutions and data governance structures needed to govern and maintain the commons successfully, (2) incentive structures and community engagement mechanisms for the collection of open data (supply) and how to build an ecosystem around them to stimulate the use of these datasets (demand), and (3) private vs. personal data ownership and the rights of the data holder.

In the discussion, the group tended towards a data governance model in the sense of "commons" as opposed to "public goods". There was a controversy around data ownership: One participant held the view that all data are intangible assets. But if data is the new oil, we have to study what oil actually did to people. Other participants held the view that data should not be a commodity at all, but rather a common infrastructure. Also, sharing data means giving something away, so benefits need to be returned to the communities who are the source of the data (which is seldom the case). The key to collecting high-quality data and using it effectively is to have more data commons, along with capacity building so that we are able to use them.

3. Policy Recommendations or Suggestions for the Way Forward

The group discussed two key policy questions regarding data governance: (1) Whether to aspire to data as a commons, in the sense that a community decides on all governance questions and collectively maintains the data, or to data as a public good that is maintained by the state. There is a need to clarify non-profit vs. for-profit uses of data. Background: One participant held the view that all data are intangible assets. Individuals can give data in exchange for a service. Companies transform data into money through analysis and offer customers (you) the product. (2) How value can be created from data commons and data as public goods. While open data is in theory available to all, creating value from it requires economic and technical means that are unequally distributed. To level the playing field, it is not enough to invest in data collection. (Policy) solutions are needed to democratize the tools needed to extract value from data, e.g. skills building and investment in high-value public datasets. At the same time, building an ecosystem around a public good/commons should follow potential use cases from the beginning.

4. Other Initiatives Addressing the Session Issues

Data commons initiatives mentioned included the collection of open voice data through Mozilla's Common Voice project and the collection of accessibility information through the OpenStreetMap-based wheelmap.org. On the policy level, the example of a systematic judicial policy on open data in Brazil was mentioned, as well as proposals to regulate data sharing for SMEs at the EU level.

5. Making Progress for Tackled Issues

  • Capacity building to build demand for data commons
  • Strengthen data user communities, e.g. journalism, science: not everyone can become a data scientist
  • “Crisis as driver of change” approach: Create an ecosystem to solve concrete global problems, like climate, and build commons around concrete use cases with a high level of interest from different stakeholders
  • Tackle the disconnection between data and the subject in data collection: While raw data is often directly associated with a person, the whole dataset is conceptualized as already new intellectual work with different principles

6. Estimated Participation

80 participants, of whom around 40% were women

7. Reflection to Gender Issues

Gender was discussed in terms of biases in existing and newly collected data sets. Even if data is crowdsourced, biases will prevail. One concrete example: open voice data is heavily biased towards male speakers.