IGF 2023 WS #217 Large Language Models on the Web: Anticipating the challenge

Time
Thursday, 12th October, 2023 (01:30 UTC) - Thursday, 12th October, 2023 (03:00 UTC)
Room
WS 3 – Annex Hall 2
Subtheme

Artificial Intelligence (AI) & Emerging Technologies
Chat GPT, Generative AI, and Machine Learning

Organizer 1: Sara Berger, IBM
Organizer 2: Diogo Cortiz da Silva, Brazilian Network Information Center - NIC.br
Organizer 3: Yuki Arase, Osaka University
Organizer 4: Reinaldo Ferraz, NIC.br
Organizer 5: Ana Eliza Duarte, NIC.br

Speaker 1: Vagner Santana, Private Sector, Western European and Others Group (WEOG)
Speaker 2: Yuki Arase, Civil Society, Asia-Pacific Group
Speaker 3: Rafael Evangelista, Civil Society, Latin American and Caribbean Group (GRULAC)
Speaker 4: Emily Bender, Civil Society, Western European and Others Group (WEOG)
Speaker 5: Dominique Hazaël-Massieux, Technical Community, Western European and Others Group (WEOG)

Moderator

Diogo Cortiz da Silva, Technical Community, Latin American and Caribbean Group (GRULAC)

Online Moderator

Ana Eliza Duarte, Technical Community, Latin American and Caribbean Group (GRULAC)

Rapporteur

Matheus Petroni, Civil Society, Latin American and Caribbean Group (GRULAC)

Format

Round Table - 90 Min

Policy Question(s)

A. What are the limits of scraping web data to train LLMs, and what measures should be implemented within a governance framework to ensure privacy, prevent copyright infringement, and effectively manage content creator consent?
B. What are the potential risks and governance complexities associated with incorporating LLMs into search engines as chatbot interfaces, and how should different regions (e.g., the Global South) respond to the impacts on web traffic and, consequently, the digital economy?
C. What are the technical and governance approaches to detect AI-generated content posted on the Web, restrain the dissemination of sensitive content, and provide means of accountability?

What will participants gain from attending this session? The workshop will introduce a technical and governance debate about LLMs, focusing on the Web. Although there are other spaces for discussion on AI governance, this session focuses on raising concerns about the complexity of ethics and governance when incorporating LLMs into the Web ecosystem. Through an interdisciplinary and diverse approach (speakers come from different backgrounds, stakeholder groups, and regions, and include a person with a disability), the panel will provide the technical knowledge needed to understand how LLMs have the potential to change the Web, as well as some possible governance consequences: the challenge of privacy and consent in data collection, fundamental changes in how users search for information, potential impacts on the digital and physical economy, and how to manage content production using AI. The workshop will stimulate the audience to think critically about and question LLMs' emerging governance concerns, especially in the Web context.

Description:

One of the leading generative AI approaches is the so-called Large Language Models (LLMs), complex models capable of understanding and generating text in a coherent and contextualized way. Chatbots powered by this technology are becoming popular and disrupting different areas by offering a general-purpose conversational interface. LLMs can improve the user experience, accessibility, and search functionalities on the Web. However, their integration raises governance and ethical concerns. This workshop will focus on three perspectives: data collection for model training, LLM integration into search engines, and the content production process.

A language model is only as good as the quantity and quality of the data that feed it. One of the strategies for creating a training dataset is the large-scale and indiscriminate scraping of content on the Web. This raises questions about ethics and governance, such as user consent, privacy, copyright violations, misinformation, and the reinforcement of social and cultural bubbles.

LLMs are also changing the way people search for web content. Some users are adopting LLM-powered chatbots as primary information sources due to their ease of use and direct answers. Search engines are also incorporating LLMs and conversational interfaces, potentially reducing access to the original content. This shift can impact web traffic and disrupt the dynamics of the digital economy. There are also concerns about biases, cultural under-representation of some regions, and possible manipulation of information.

Generative AI also changes how people produce content. While LLMs may increase productivity in some circumstances, they can facilitate the production of false content and misinformation. In this sense, it seems reasonable to discuss strategies that facilitate transparency, explainability, and accountability of AI-generated content.

The impact of LLMs on the Web will be transformative, and this workshop will provide a space to anticipate technical, ethical, and governance matters from an interdisciplinary perspective.

Expected Outcomes

Our workshop will introduce the theme to the IGF agenda and inform the audience about the challenges of integrating LLMs into web platforms. As this is an emerging topic, the workshop will contribute to anticipating potential use cases, risks, governance challenges, and possible mitigation approaches. We will inform the audience about new trends, potentially critical uses of integrating LLMs with Web services, and the associated ethical and governance challenges. We also plan to compile a list of recommendations from speakers and participants, offering a multi-stakeholder perspective on the impacts of LLMs on the Web to guide the local policy agenda.

Hybrid Format: The session will be structured into three main segments: an introduction, a discussion, and an interaction with the audience. During the first segment, each speaker will have 5 minutes to present their perspective on the topic, providing a multistakeholder view. The second segment will feature the speakers sharing their opinions on the policy questions. The final segment will involve interaction between the audience and the speakers through a Q&A session. The online moderator will collect questions from the audience and share them with the onsite moderator, who will then distribute them among the speakers. To ensure a smooth session, we plan to hold an online meeting with all the speakers one week prior to the IGF event. This will allow us to align the speakers' interventions based on their participation format, whether online or in person.

Key Takeaways

Open source is an important instrument for democratizing AI solutions, but open-sourcing the model alone is not enough if communities around the world do not have information about the dataset used for training.

Local populations (e.g., in the Global South) should be involved in all stages of the development of AI products that will reach their markets.

Call to Action

It should be a priority to develop standards/watermarks to track synthetic content generated by AI.

It is strategic to establish an economic framework that remunerates content creators in the era of generative artificial intelligence.

Session Report

Speakers: Vagner Santana, Yuki Arase, Rafael Evangelista, Emily Bender, Dominique Hazaël-Massieux, Ryan Budish
Moderator: Diogo Cortiz da Silva
Online Moderator: Ana Eliza Duarte
Rapporteur: Matheus Petroni Braz

Report

Diogo Cortiz (researcher at the Web Technology Study Center [Ceweb.br] and professor at the Pontifical Catholic University of São Paulo [PUC-SP]) opened the session by introducing the theme of the discussion, large language models on the Web, focusing on some of the technical aspects of how generative AI could impact the Web.

The moderator outlined three main dimensions to structure the workshop: the first concerns data mining of web content, the second what happens when generative AI chatbots are incorporated into search engines, and the third the Web as the main platform where AI-generated content is posted.

Each dimension has its own policy question to guide the discussion:

  1. What are the limits of scraping web data to train LLMs and what measures should be implemented within a governance framework to ensure privacy, prevent copyright infringement, and effectively manage content creator consent?
  2. What are the potential risks and governance complexities associated with incorporating LLMs into search engines as chatbot interfaces, and how should different regions (e.g., the Global South) respond to the impacts on web traffic and, consequently, the digital economy?
  3. What are the technical and governance approaches to detect AI-generated content posted on the Web, restrain the dissemination of sensitive content and provide means of accountability?

Emily Bender, professor at the University of Washington, responded to the first question with concerns about how easy it has become, at a global scale, to grab data from the web and claim it as one's own. She advocated for a consensual approach to the technology, emphasizing that data should be collected in a meaningful, opt-in manner: data collection needs to be intentional rather than a massive, undocumented process. She also argued that this kind of collection is not representative, since the internet does not provide a neutral viewpoint of the world. In response to the second question, she said that the risks of incorporating large language models (LLMs) into search engines are significantly high because LLMs cannot be trusted as sources of information; rather, they should be seen as sources of organized words. Specifically regarding the Global South, she shared a tweet with a graphic underscoring how the data used to train LLMs is concentrated in the Global North, resulting in a significant lack of representation and potential misinterpretation of the Global South within these models. She went on to highlight that the outputs of LLMs could pollute the information ecosystem with synthetic media that does not represent the truth but can appear authentic due to the polished structure of the content. Addressing the third question, she concluded her initial remarks by advocating for watermarking synthetic media at the moment of its creation, since this cannot be done effectively at a later time in the current scenario, and by emphasizing the importance of policies to ensure a less polluted information space.
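To make "watermarking at the moment of creation" concrete, the sketch below illustrates one family of techniques discussed in the research literature: biasing text generation toward a pseudo-random "green" subset of the vocabulary so that a detector can later check whether a text is statistically marked. This is a toy illustration only; the vocabulary, function names, and parameters are hypothetical and do not describe any speaker's or vendor's actual implementation.

```python
# Toy sketch (not any speaker's or vendor's method) of "green list" text
# watermarking at generation time: the generator prefers a pseudo-random
# subset of the vocabulary derived from the previous token, and a detector
# later measures how often that preference shows up.
import hashlib
import random

VOCAB = ["the", "model", "web", "data", "content", "open", "trust", "media"]  # hypothetical toy vocabulary

def green_list(prev_token: str, fraction: float = 0.5) -> set:
    """Derive a pseudo-random 'green' subset of the vocabulary from the previous token."""
    seed = int(hashlib.sha256(prev_token.encode("utf-8")).hexdigest(), 16)
    rng = random.Random(seed)
    size = max(1, int(len(VOCAB) * fraction))
    return set(rng.sample(VOCAB, size))

def generate(length: int = 50) -> list:
    """Toy 'generator': a real system would bias model logits; here we simply pick green tokens."""
    tokens = ["the"]
    rng = random.Random(42)
    for _ in range(length):
        tokens.append(rng.choice(sorted(green_list(tokens[-1]))))
    return tokens

def green_rate(tokens: list) -> float:
    """Detector: fraction of tokens that fall in the green list of their predecessor."""
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    return hits / max(1, len(tokens) - 1)

if __name__ == "__main__":
    watermarked = generate()
    unmarked = [random.choice(VOCAB) for _ in range(50)]
    print("watermarked green rate:", green_rate(watermarked))  # close to 1.0
    print("unmarked green rate:", green_rate(unmarked))        # close to the 0.5 chance baseline
```

In practice, the robustness of such marks to paraphrasing and the question of who gets to run the detector remain open; the point of the proposal is that none of this is possible unless the mark is embedded when the content is generated.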

Yuki Arase, professor at Osaka University, started by affirming her alignment with Emily's remarks. For the first question, she highlighted how messages are being massively used to train LLMs and raised concerns about biases and hate speech, as well as about how unbalanced the training data are relative to the much greater diversity of people's characteristics around the world. For the second question, she noted that as generative AI tools become more popular and easier to access, the chances also increase that people take their results for granted, without checking sources or the veracity of the information. She indicated that providing a way to link generated content to its data sources could help address this. For the third question, she reinforced the need for training data with local perspectives and in a variety of languages.

Vagner Santana, researcher at IBM, introduced his remarks on the first question with a storyline of the Web and how the next generation, Web3+, could respond to LLM black boxes and be "retrained" with its own data in the future. There is even a risk of people creating pages just to poison LLM training data. Regarding the second question, he shared his view on the possible replacement of humans by LLMs, arguing that we will need to rethink the ways content creators are remunerated in this new era. For the third question, he defended accountability for generated content and an understanding of how the technology works as a first step, as well as a concept of moral entanglement extended to the content created by these technologies, in an attempt to minimize misuse. He also affirmed that the greater the distance between creation and use, the more impersonal the technology becomes. As final takeaways, he reinforced his view that we should study how technology is used, including in different contexts and repurposed applications, and put responsible innovation into practice.

Ryan Budish, public policy manager at Meta, started his remarks by highlighting some recent applications launched by Meta, such as translation for a huge variety of languages around the globe and automatic speech translation. He also highlighted how making some LLMs available to researchers is generating interesting results and new applications. In his view, the main question is not whether the technology is good or bad, but what it can be used for. He spoke about the immense amount of data needed to train these models if we want better results, justifying the use of web-available data alongside other kinds of databases. Regarding the first question, he talked about privacy and gave examples of how this is applied in Meta's new generative AI products, such as excluding websites that commonly share personal information and not using users' private information. For the second question, he discussed Meta's efforts on open-source AI and how this could improve AI models for everyone, including economic and competitive benefits. Regarding the third question, he spoke about Meta's vision for watermarks and the challenges of creating technical solutions for this kind of application, giving examples of how Meta is mitigating possible misuse of these tools by bad actors. In conclusion, he shared a vision of what the governance of this technology should look like, supporting principled, risk-based, technology-neutral approaches to the regulation of AI.

Dominique Hazaël-Massieux, W3C, replied to the first question by differentiating search engines from LLMs, particularly regarding the black-box nature of the latter, where there is no direct link to the source of an answer. In his view, it is fundamental to think about permission strategies for data scraped from the web to train LLMs, something more robust than the robots.txt mechanism that merely blocks some content from being crawled. Regarding privacy, he highlighted another important difference between search engines and LLMs: with the former, the "right to be forgotten" makes it possible to remove specific content, something that is not feasible with current LLMs. Regarding the second question, he endorsed Emily's remarks about not treating LLMs as a source of reliable or verifiable information. He invited the audience to think about which stakeholders should be involved in policymaking for generative AI, balancing approaches so as not to harm innovation while ensuring important regulatory measures. Concluding with the third question, he agreed that detection is a challenging technical problem, even more so for hybrid content where AI provided a first version of, or corrections to, previously existing content. He also defended the idea that LLMs did not create the problem of misinformation and fake news, but could make it more scalable and worse. In conclusion, he pointed to collaboration between technologists, researchers, and regulatory bodies to discuss the quality of generated content and possible governance directions.
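As an illustration of why the current opt-out mechanism can be seen as weak, the sketch below shows how a scraper typically consults robots.txt before collecting a page for a training corpus: consent is inferred from the absence of a disallow rule rather than granted explicitly. The user agent "ExampleLLMBot" is a hypothetical crawler name used only for illustration, not a real product.

```python
# Minimal sketch of today's de facto consent check for web scraping:
# a crawler consults robots.txt before fetching a page for training data.
# "ExampleLLMBot" is a hypothetical user agent used only for illustration.
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def may_collect(page_url: str, user_agent: str = "ExampleLLMBot") -> bool:
    """Return True if the site's robots.txt does not disallow this agent for the page."""
    robots = RobotFileParser()
    robots.set_url(urljoin(page_url, "/robots.txt"))
    robots.read()  # download and parse the site's robots.txt
    return robots.can_fetch(user_agent, page_url)

if __name__ == "__main__":
    # Note the default: anything not explicitly disallowed is treated as collectable,
    # which is the opt-out weakness raised in the discussion above.
    print(may_collect("https://example.org/article"))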

Rafael Evangelista, from the Brazilian Internet Steering Committee and the University of Campinas, São Paulo, started his remarks by discussing the proliferation of low-quality online content, largely driven by the financial incentives of the digital advertising ecosystem. He discussed how this poses significant threats to democracy, as exemplified in Brazil's 2018 elections, where misleading content was amplified through instant messaging groups and monetized through advertising. For the speaker, economic disparities and currency fluctuations in the Global South drive people, including young professionals, to produce subpar content for income, affecting alternative media and lowering content quality. He also affirmed that the rise of LLMs raises even more concerns about the spread of low-quality information, and suggested a reevaluation of compensation structures to redirect wealth from tech corporations toward supporting high-quality journalism and fair compensation for content creators. This extends to the use of open-access scientific journals for LLM training, recognizing collective knowledge as a commons rather than restricting access or compensating individuals. To conclude, the speaker affirmed that shifting from individual compensation to collective knowledge production through public digital infrastructures is vital to addressing global North-South disparities and minimizing the potential misuse of LLMs in weakly regulated markets.