jbnayahu committed
Commit a6a0e86 · unverified · 1 Parent(s): 3ee3ca7

Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com>

Files changed (1)
  1. src/about.py +1 -1
src/about.py CHANGED
@@ -82,7 +82,7 @@ table th:nth-of-type(3) {
  | Bill Summarization | <pre><p><b>BillSUM</b></p>[Dataset](https://huggingface.co/datasets/FiscalNote/billsum), [Paper](https://aclanthology.org/D19-5406/), [Unitxt Card](https://www.unitxt.ai/en/stable/catalog/catalog.cards.billsum.html)</pre> | <p>Summarization of US Congressional and California state bills.</p>The data consists of three parts: US training bills, US test bills, and California test bills. The US bills were collected from the Govinfo service provided by the United States Government Publishing Office (GPO) under a CC0-1.0 license. The California bills from the 2015-2016 session are available from the legislature’s website. |
  | Post Summarization | <pre><p><b>TL;DR</b></p>[Dataset](https://huggingface.co/datasets/webis/tldr-17), [Paper](https://aclanthology.org/W17-4508/), [Unitxt Card](https://www.unitxt.ai/en/latest/catalog/catalog.cards.tldr.html)</pre> | <p>Summarization dataset.</p>A large Reddit crawl, taking advantage of the common practice of appending a “TL;DR” to long posts. |
  | RAG Response Generation | <pre><p><b>ClapNQ</b></p>[Dataset](https://huggingface.co/datasets/PrimeQA/clapnq), [Paper](https://arxiv.org/abs/2404.02103), [Unitxt Card](https://www.unitxt.ai/en/latest/catalog/catalog.cards.rag.response_generation.clapnq.html)</pre> | <p>A benchmark for Long-form Question Answering.</p>CLAP NQ includes long answers with grounded gold passages from Natural Questions (NQ) and a corpus to perform either retrieval, generation, or the full RAG pipeline. The CLAP NQ answers are concise, 3x smaller than the full passage, and cohesive, with multiple pieces of the passage that are not contiguous. CLAP NQ is created from the subset of Natural Questions (NQ) that have a long answer but no short answer. NQ consists of ~380k examples. There are ~30k questions with long answers but no short answers, excluding tables and lists. To increase the likelihood of longer answers, we only explored ones that have more than 5 sentences in the passage. The subset that was annotated consists of ~12k examples. All examples where cohesion of non-consecutive sentences was required for the answer were annotated a second time. The final dataset is made up of all data that went through two rounds of annotation. (We provide the single-round annotations as well; they are training data only.) An equal number of unanswerable questions has also been added from the original NQ train/dev sets. |
- | QA Finance | <pre><p><b>FinQA</b></p>[Dataset](https://huggingface.co/datasets/ibm/finqa), [Paper](https://arxiv.org/abs/2109.00122), [Unitxt Card](https://www.unitxt.ai/en/latest/catalog/catalog.cards.fin_qa.html)</pre> | <p>A large-scale dataset with 2.8k financial reports for 8k Q&A pairs to study numerical reasoning with structured and unstructured evidence.</p>The FinQA dataset is designed to facilitate research and development in the area of question answering (QA) using financial texts. It consists of a subset of QA pairs from a larger dataset, originally created through a collaboration between researchers from the University of Pennsylvania, J.P. Morgan, and Amazon.The original dataset includes 8,281 QA pairs built against publicly available earnings reports of S&P 500 companies from 1999 to 2019 (FinQA: A Dataset of Numerical Reasoning over Financial Data.). This subset, specifically curated by Aiera, consists of 91 QA pairs. Each entry in the dataset includes a context, a question, and an answer, with each component manually verified for accuracy and formatting consistency. |
+ | QA Finance | <pre><p><b>FinQA</b></p>[Dataset](https://huggingface.co/datasets/ibm/finqa), [Paper](https://arxiv.org/abs/2109.00122), [Unitxt Card](https://www.unitxt.ai/en/latest/catalog/catalog.cards.fin_qa.html)</pre> | <p>A large-scale dataset with 2.8k financial reports for 8k Q&A pairs to study numerical reasoning with structured and unstructured evidence.</p>The FinQA dataset is designed to facilitate research and development in the area of question answering (QA) using financial texts. It consists of a subset of QA pairs from a larger dataset, originally created through a collaboration between researchers from the University of Pennsylvania, J.P. Morgan, and Amazon. The dataset includes 8,281 QA pairs built against publicly available earnings reports of S&P 500 companies from 1999 to 2019. |

  ## Reproducibility
  BlueBench is powered by the <a href="https://www.unitxt.ai">unitxt</a> library. To reproduce our results, start by installing Unitxt in a clean Python 3.10 virtual environment, along with the required dependencies:
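
The install and run commands that follow this sentence in src/about.py fall outside this hunk and are not shown here. As a rough illustration only, the sketch below shows what driving a single BlueBench task through Unitxt might look like. The `cards.fin_qa` identifier is taken from the Unitxt catalog link in the table above; the recipe parameters (`template_card_index`, `max_test_instances`) and the copy-the-references placeholder predictions are assumptions for illustration, not the repository's actual reproduction script.

```python
# Minimal sketch, assuming the public Unitxt API (load_dataset / evaluate) and that
# "cards.fin_qa" matches the catalog card linked in the table above.
# Install first, e.g.:  pip install unitxt
from unitxt import evaluate, load_dataset

# Build a small, ready-to-prompt slice of the FinQA task from a Unitxt recipe string.
# template_card_index / max_test_instances are assumed recipe parameters, used here
# only to keep the example tiny.
dataset = load_dataset("card=cards.fin_qa,template_card_index=0,max_test_instances=5")
test_set = dataset["test"]

# Each instance carries the rendered prompt in "source"; a real run would send that
# to the model under evaluation. Copying the gold references here merely exercises
# the evaluation pipeline end to end.
predictions = [instance["references"][0] for instance in test_set]

results = evaluate(predictions=predictions, data=test_set)
print(results)  # aggregate scores; exact structure depends on the installed unitxt version
```

In an actual reproduction run, the `predictions` list would come from the evaluated model, and the full dependency set from the repository's instructions should be installed first.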