Abdeljalil El Majjodi - Data team
Abdelaziz Bounhar - Data team
AL Atlas: Moroccan Darija Pretraining
We present a comprehensive dataset for Moroccan Darija, addressing the lack of resources for this widely spoken dialect. We detail our collection methodology, analyze the data, and show performance improvements in both masked and causal language models trained on this dataset.