A dataset to finetune large language models
Let's unlock the potential of large language models in the legal industry: we propose creating an openly available Legal-Instruct dataset.
This challenge proposes the creation of an openly available Legal-Instruct dataset to finetune large language models (LLMs) for the legal field. The dataset would contain legal instructions in a structured and organized format, allowing LLMs to better understand legal language and tasks. We believe that such a dataset could significantly improve the legal industry's ability to leverage the capabilities of LLMs, ultimately resulting in better legal outcomes for all.
💡 Idea behind the challenge:
Instruction finetuning of large language models (LLMs) has shown huge potential (see InstructGPT by OpenAI[^1]). It has been further improved by synthetically generated instructions using the Self-Instruct[^2] method (see, for example, the Stanford Alpaca model). Through instruction tuning, even smaller LLMs become able to solve complex tasks. Most LLMs have some legal texts in their training data, but they often mix different legal sources in their outputs. This is a particular issue for Switzerland: German texts about the Swiss legal system are often mixed with legal information from Austria and Germany, and the same problem arises for French and Italian. Relying on synthetic data alone is therefore not a viable option in the legal domain (quite apart from API usage restrictions, for example by OpenAI), because those mixtures would be present in the generated data as well.
Creating a legal-instruct dataset based on Swiss data is essential for finetuning LLMs that are used in the (Swiss) legal industry. This dataset would contain legal instructions in a structured and organized format (following the Alpaca format), if possible in German, French and Italian. It would allow LLMs to learn the language and nuances of our legal system and of legal tasks, and thereby to perform those tasks more accurately.
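To make the Alpaca format mentioned above more concrete, here is a minimal sketch of what a single record and its training prompt could look like. The `instruction` / `input` / `output` fields and the prompt wording are modelled on the Stanford Alpaca repository. The `language` field, the helper names and the placeholder texts are assumptions made for illustration and are not part of any agreed schema for this challenge.

```python
import json

# Prompt templates modelled on the Stanford Alpaca repository
# (one for records with an input, one for records without).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def make_record(instruction, output, input_text="", language="de"):
    """Build one Alpaca-style record.

    `language` is a hypothetical extra field (de/fr/it), not part of Alpaca.
    """
    return {
        "instruction": instruction,
        "input": input_text,
        "output": output,
        "language": language,
    }


def build_prompt(record):
    """Render the prompt the model would see for this record."""
    template = PROMPT_WITH_INPUT if record["input"] else PROMPT_NO_INPUT
    # str.format ignores the keys the template does not use (output, language).
    return template.format(**record)


def to_jsonl_line(record):
    """Serialise a record as one JSON Lines row, keeping umlauts and accents readable."""
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    # Placeholder texts only; real records would come from the data collection.
    example = make_record(
        instruction="Summarise the following contract clause in one sentence.",
        input_text="<contract clause>",
        output="<one-sentence summary>",
        language="de",
    )
    print(build_prompt(example))
    print(to_jsonl_line(example))
```

Storing one record per line (JSON Lines) keeps the file easy to append to during data collection and easy to load later, but any tabular format with the same fields would work just as well.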
Moreover, it is important that this dataset is openly available and allows for commercial use (Alpaca, for example, is intended only for academic research, and commercial use is prohibited). An openly available and usable dataset could be adopted broadly, so that every open-source LLM could benefit from it. This would benefit not only legal professionals (as many models could use the training data freely) but also companies developing LLMs for the legal industry. An openly available legal-instruct dataset would also encourage collaboration and innovation in the field, ultimately leading to better outcomes for all.
🎯 Goal of the challenge
- Create a legal-instruct dataset and release it on Hugging Face under an open license.
- Finetune an LLM with the created dataset.
- Writing clauses and legal texts
- Legal CoT (Chain of Thought) prompting
- QnA (simple or retrieval augmented)
This will have to be discussed at the hackathon.
- [x] Simple platform to collect data (already in progress).
- [ ] Data collection (including possibilities of automatically creating instructions from existing data).
- [ ] Define a (very) simple benchmark to determine which outputs are considered good or bad.
- [ ] Select a suitable LLM to finetune (Camel 5B, Dolly 2.0, Open Assistant (Pythia base), T0pp or Flan-UL2).
- [ ] Finetune an LLM (LoRA finetuning, as this can probably be achieved in the limited time of the hackathon and does not need expensive hardware).
- [ ] Review of the model.
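As a starting point for the "(very) simple benchmark" item in the list above, the following sketch shows one possible rule-based check. It marks an output as bad if it is empty, if it cites German or Austrian codes (the mixing problem described earlier), or if it misses terms a reviewer expects. The marker list, the `judge` and `pass_rate` helpers and the pass criteria are all assumptions for illustration; the actual criteria would have to be agreed on at the hackathon.

```python
import re
from dataclasses import dataclass

# Hypothetical marker list: abbreviations that signal a non-Swiss source.
# BGB is the German civil code, ABGB the Austrian one.
FOREIGN_MARKERS = ("BGB", "ABGB")


def find_markers(text, markers):
    """Return the markers that occur in `text` as whole words (case-sensitive)."""
    return [m for m in markers if re.search(rf"\b{re.escape(m)}\b", text)]


@dataclass
class Verdict:
    good: bool
    reasons: list


def judge(output, required_terms=()):
    """Apply the simple rules to one model output."""
    reasons = []
    if not output.strip():
        reasons.append("empty output")
    foreign = find_markers(output, FOREIGN_MARKERS)
    if foreign:
        reasons.append("cites non-Swiss sources: " + ", ".join(foreign))
    # Terms a reviewer expects in a good answer, compared case-insensitively.
    missing = [t for t in required_terms if t.lower() not in output.lower()]
    if missing:
        reasons.append("missing expected terms: " + ", ".join(missing))
    return Verdict(good=not reasons, reasons=reasons)


def pass_rate(cases):
    """Share of good outputs; `cases` is a list of (output, required_terms) pairs."""
    if not cases:
        return 0.0
    verdicts = [judge(output, terms) for output, terms in cases]
    return sum(v.good for v in verdicts) / len(verdicts)
```

A check like this only catches obvious failures. It could serve as a quick filter before the manual review of the model, not as a replacement for it.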
General (multilingual) Instruction Datasets:
Dataset for Dolly 2.0 from Databricks by Argilla on Hugging Face
OpenAssistant Conversations Dataset (OASST1) by Open Assistant on Hugging Face
[^1]: Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, et al. ‘Training Language Models to Follow Instructions with Human Feedback’. arXiv [cs.CL], 2022. http://arxiv.org/abs/2203.02155. See also the OpenAI Blog (last visited on Monday, 17 April 2023).
[^2]: Wang, Yizhong, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. ‘Self-Instruct: Aligning Language Model with Self Generated Instructions’. arXiv [cs.CL], 2022. http://arxiv.org/abs/2212.10560.