Dr Chijioke Okorie’s reflections on the 3rd Edition of Hundzula Retreat

In February 2024 (from 6th to 9th February), I attended the third edition of the Hundzula: Natural Language Processing and Linguistics Retreat, which was held at Nelson Mandela University in Port Elizabeth, South Africa. The Hundzula Retreat is an annual gathering of researchers working in Natural Language Processing and Linguistics. My participation was made possible by a grant from Meta to support the work of the Data Science Law Lab.

In this post, I share my reflections on the Hundzula experience. It was a unique one, both for me and for the other participants, as I was the only legal researcher/lawyer in the midst of data scientists and computational linguists.

African NLP can help preserve our cultural heritage but gatekeepers can derail progress

Many of the presenters made the point that, for African languages, developing multimodal language datasets is imperative if we want to preserve cultural artefacts. The experiences shared by researchers in both Natural Language Processing and Linguistics resonated with the notion that AI can revolutionize cultural heritage preservation. However, community leaders and government agencies in charge of language promotion and preservation can engage in unnecessary gatekeeping, which may derail progress. Our discussions unpacked questions such as: Who provides (or decides) acceptable, standard terminologies for new objects? How can annotators and linguists help with ensuring clean datasets? How do we factor dialects into dataset creation? In addressing these issues, questions arose about the constitutional and statutory role of agencies such as the Pan South African Language Board (PanSALB), the manner in which PanSALB has exercised its powers, and how PanSALB can help promote work in African NLP.

A community is working on NLP… how are they getting things done?

One of the striking things about the experiences shared by these NLP and linguistics researchers is how openness and community drive the work they do and have led to the advances made in African NLP so far. Researchers reach out to native language speakers to crowdsource textual and audio data and to review collated data for accuracy. When datasets are produced, they are released openly to the community and the public under various open licensing arrangements, including Creative Commons licences.

Licensing corpora for African NLP

With financial support from Meta, I hosted a dinner at which we had a fishbowl-style discussion. I shared perspectives from law regarding corpora creation, use and reuse, especially in the unique contexts of South African languages. Property-based regimes such as copyright and contracts may pose data access restrictions. Textual materials could be categorized as literary works and protected by copyright law, such that researchers may risk copyright infringement liability if they use such materials without permission/licence from the relevant copyright owner(s). Likewise, individuals and organisations may use contracts (including terms of service) to restrict use of their corpora.

Apart from property-based regimes, other regimes that are based on liability for non-compliance (e.g., data protection laws like the Protection of Personal Information Act (POPIA)) may limit access to available corpora. Some of the questions we considered include:

In terms of copyright law, are exclusive rights being engaged in corpus creation and (re)use?
Would copyright exceptions be applicable? Fair dealing for research purposes? Fair use (as provided for in the Copyright Amendment Bill)?
Would copyright protection of a DIY/custom-made corpus attract scrutiny under POPIA?

All these were part of our pre-dinner discussion primer. Slides available here.

African NLP researchers are both licensors and licensees

Per the fishbowl style of engagement, dinner was about sharing day-to-day experiences working on African languages. The experiences shared showed that by crowdsourcing and co-creating datasets, African NLP researchers may have done enough to be considered authors (and first owners) of original literary works (databases). This makes them potential licensors. They may decide to assert and exercise their copyright in those works. But in almost all cases, these datasets are shared openly for everyone to use and reuse for any purpose.

On the other hand, when African NLP researchers use existing datasets or when they need access to existing datasets, they are potential licensees.

We went on to unpack the considerations that motivate these researchers, as potential licensors, to select and decide on suitable licences, their understanding of key licence terms, and their preferences (and reasons for those preferences) when it comes to open licences.

Next steps…

The work here is ongoing as we continue to explore ways to ensure that law enables (and regulates) work on African NLP. It was incredibly enriching and insightful for me to take the Data Science Law Lab to data scientists working in/on Africa and to forge explicit and implicit partnerships to help take their work forward. When NLP and AI work is done by Africans for Africa, we all benefit tremendously.

Stay tuned as we journey on this path.
