BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction Paper • 2503.19658 • Published 28 days ago • 2
AnnoPage Dataset: Dataset of Non-Textual Elements in Documents with Fine-Grained Categorization Paper • 2503.22526 • Published 25 days ago • 2
TextBite: A Historical Czech Document Dataset for Logical Page Segmentation Paper • 2503.16664 • Published Mar 20 • 2
Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages Paper • 2502.10140 • Published Feb 14 • 9
🇸🇰 Do you speak Slovak? Help build better AI for Slovak! We only need 90 more annotations to include Slovak in the next Hugging Face FineWeb2-C dataset (data-is-better-together/fineweb-c) release! Your contribution will help create better language models for 5+ million Slovak speakers. Annotate here: data-is-better-together/fineweb-c. Read more about why we're doing it: https://huggingface.co/blog/davanstrien/fineweb2-community