---
language:
- en
tags:
- charboundary
- sentence-boundary-detection
- paragraph-detection
- legal-text
- legal-nlp
- text-segmentation
- cpu
- document-processing
- rag
license: mit
library_name: charboundary
pipeline_tag: text-classification
datasets:
- alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries
- alea-institute/kl3m-data-snapshot-20250324
metrics:
- accuracy
- f1
- precision
- recall
- throughput
papers:
- https://arxiv.org/abs/2504.04131
---
# CharBoundary Small Model
This is the small model for the [CharBoundary](https://github.com/alea-institute/charboundary) library (v0.5.0),
a fast character-based sentence and paragraph boundary detection system optimized for legal text.
## Model Details
- **Size**: small
- **Model Size**: 3.0 MB (SKOPS compressed)
- **Memory Usage**: 1026 MB at runtime
- **Training Data**: ~50,000 samples of legal text from the [KL3M dataset](https://huggingface.co/datasets/alea-institute/kl3m-data-snapshot-20250324)
- **Model Type**: Random Forest (32 trees, max depth 16)
- **Format**: scikit-learn model (serialized with skops)
- **Task**: Character-level boundary detection for text segmentation
- **License**: MIT
- **Throughput**: ~748K characters/second
## Usage
> **Important:** When loading models from Hugging Face Hub, you must set `trust_model=True` to allow loading custom class types.
>
> **Security Note:** The ONNX model variants are recommended in security-sensitive environments as they don't require bypassing skops security measures with `trust_model=True`. See the [ONNX versions](https://huggingface.co/alea-institute/charboundary-small-onnx) for a safer alternative.
```python
# pip install charboundary
from huggingface_hub import hf_hub_download
from charboundary import TextSegmenter
# Download the model
model_path = hf_hub_download(repo_id="alea-institute/charboundary-small", filename="model.pkl")
# Load the model (trust_model=True is required when loading from external sources)
segmenter = TextSegmenter.load(model_path, trust_model=True)
# Use the model
text = "This is a test sentence. Here's another one!"
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# Output: ['This is a test sentence.', " Here's another one!"]
# Segment to spans
sentence_spans = segmenter.get_sentence_spans(text)
print(sentence_spans)
# Output: [(0, 24), (24, 44)]
```
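The example output above suggests that each span is a half-open `(start, end)` pair of character offsets into the original string. The following minimal sketch illustrates that reading with plain slicing. It reuses the values printed above instead of calling the library, so the half-open interpretation is an assumption drawn from that output rather than a documented guarantee.

```python
# Values copied from the example output above; no model is loaded here.
text = "This is a test sentence. Here's another one!"
sentence_spans = [(0, 24), (24, 44)]

# Slicing the original text with each span recovers the sentences,
# including the leading whitespace that the segmenter preserves.
sentences = [text[start:end] for start, end in sentence_spans]
print(sentences)

# Because the spans are contiguous, joining the slices rebuilds the input.
print("".join(sentences) == text)

# Strip whitespace only if downstream code does not need exact offsets.
cleaned = [sentence.strip() for sentence in sentences]
print(cleaned)
```

Keeping the spans rather than the stripped strings can be useful in retrieval pipelines, since the offsets still point back into the source document.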
## Performance
The model is a character-based random forest classifier. Its feature window and overall evaluation metrics are:
- Window Size: 5 characters before and 3 characters after each potential boundary
- Accuracy: 0.9970
- F1 Score: 0.7730
- Precision: 0.7460
- Recall: 0.9870
### Dataset-specific Performance
| Dataset | Precision | F1 | Recall |
|---------|-----------|-------|--------|
| ALEA SBD Benchmark | 0.624 | 0.718 | 0.845 |
| SCOTUS | 0.926 | 0.773 | 0.664 |
| Cyber Crime | 0.939 | 0.837 | 0.755 |
| BVA | 0.937 | 0.870 | 0.812 |
| Intellectual Property | 0.927 | 0.883 | 0.843 |
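
As a quick sanity check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it for each row; the helper name `f1_score` is our own, and small differences from the reported values are expected because the table entries are rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "ALEA SBD Benchmark": (0.624, 0.845, 0.718),
    "SCOTUS": (0.926, 0.664, 0.773),
    "Cyber Crime": (0.939, 0.755, 0.837),
    "BVA": (0.937, 0.812, 0.870),
    "Intellectual Property": (0.927, 0.843, 0.883),
}

for name, (precision, recall, reported) in rows.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed F1 = {computed:.3f}, reported = {reported:.3f}")
```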
## Available Models
CharBoundary comes in three sizes, balancing accuracy and efficiency:
| Model | Format | Size (MB) | Memory (MB) | Throughput (chars/sec) | F1 Score |
|-------|--------|-----------|-------------|------------------------|----------|
| Small | [SKOPS](https://huggingface.co/alea-institute/charboundary-small) / [ONNX](https://huggingface.co/alea-institute/charboundary-small-onnx) | 3.0 / 0.5 | 1,026 | ~748K | 0.773 |
| Medium | [SKOPS](https://huggingface.co/alea-institute/charboundary-medium) / [ONNX](https://huggingface.co/alea-institute/charboundary-medium-onnx) | 13.0 / 2.6 | 1,897 | ~587K | 0.779 |
| Large | [SKOPS](https://huggingface.co/alea-institute/charboundary-large) / [ONNX](https://huggingface.co/alea-institute/charboundary-large-onnx) | 60.0 / 13.0 | 5,734 | ~518K | 0.782 |
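
The table can also guide model selection. The following sketch picks the highest-F1 model that fits a memory budget and estimates processing time from the approximate throughput figures. The dictionary `MODELS` and both helper functions are hypothetical and simply restate the table; real throughput will vary with hardware and input.

```python
# Figures copied from the table above (throughput values are approximate).
MODELS = {
    "small": {"memory_mb": 1026, "chars_per_sec": 748_000, "f1": 0.773},
    "medium": {"memory_mb": 1897, "chars_per_sec": 587_000, "f1": 0.779},
    "large": {"memory_mb": 5734, "chars_per_sec": 518_000, "f1": 0.782},
}


def best_model_within(memory_budget_mb):
    """Return the highest-F1 model whose runtime memory fits the budget, or None."""
    fitting = [name for name, spec in MODELS.items()
               if spec["memory_mb"] <= memory_budget_mb]
    if not fitting:
        return None
    return max(fitting, key=lambda name: MODELS[name]["f1"])


def estimated_seconds(num_chars, model):
    """Rough processing time for num_chars characters at the listed throughput."""
    return num_chars / MODELS[model]["chars_per_sec"]


choice = best_model_within(2048)
print(choice)
print(f"{estimated_seconds(1_000_000_000, choice):.0f} seconds for 1B characters")
```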
## Paper and Citation
This model is part of the research presented in the following paper:
```bibtex
@article{bommarito2025precise,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}
```
For more details on the model architecture, training, and evaluation, please see:
- [Paper: "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"](https://arxiv.org/abs/2504.04131)
- [CharBoundary GitHub repository](https://github.com/alea-institute/charboundary)
- [Annotated training data](https://huggingface.co/datasets/alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries)
## Contact
This model is developed and maintained by the [ALEA Institute](https://aleainstitute.ai).
For technical support, collaboration opportunities, or general inquiries:
- GitHub: https://github.com/alea-institute/kl3m-model-research
- Email: hello@aleainstitute.ai
- Website: https://aleainstitute.ai
You can also create an issue on this repository.
![ALEA Institute logo](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)