charboundary-small / README.md

Update README for small model

4650249 verified 4 months ago

5.25 kB

	---
	language:
	- en
	tags:
	- charboundary
	- sentence-boundary-detection
	- paragraph-detection
	- legal-text
	- legal-nlp
	- text-segmentation
	- cpu
	- document-processing
	- rag
	license: mit
	library_name: charboundary
	pipeline_tag: text-classification
	datasets:
	- alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries
	- alea-institute/kl3m-data-snapshot-20250324
	metrics:
	- accuracy
	- f1
	- precision
	- recall
	- throughput
	papers:
	- https://arxiv.org/abs/2504.04131
	---

	# CharBoundary small Model

	This is the small model for the [CharBoundary](https://github.com/alea-institute/charboundary) library (v0.5.0),
	a fast character-based sentence and paragraph boundary detection system optimized for legal text.

	## Model Details

	- Size: small
	- Model Size: 3.0 MB (SKOPS compressed)
	- Memory Usage: 1026 MB at runtime
	- Training Data: Legal text with ~50,000 samples from [KL3M dataset](https://huggingface.co/datasets/alea-institute/kl3m-data-snapshot-20250324)
	- Model Type: Random Forest (32 trees, max depth 16)
	- Format: scikit-learn model (serialized with skops)
	- Task: Character-level boundary detection for text segmentation
	- License: MIT
	- Throughput: ~748K characters/second

	## Usage

	> Important: When loading models from Hugging Face Hub, you must set `trust_model=True` to allow loading custom class types.
	>
	> Security Note: The ONNX model variants are recommended in security-sensitive environments as they don't require bypassing skops security measures with `trust_model=True`. See the [ONNX versions](https://huggingface.co/alea-institute/charboundary-small-onnx) for a safer alternative.

	```python
	# pip install charboundary
	from huggingface_hub import hf_hub_download
	from charboundary import TextSegmenter

	# Download the model
	model_path = hf_hub_download(repo_id="alea-institute/charboundary-small", filename="model.pkl")

	# Load the model (trust_model=True is required when loading from external sources)
	segmenter = TextSegmenter.load(model_path, trust_model=True)

	# Use the model
	text = "This is a test sentence. Here's another one!"
	sentences = segmenter.segment_to_sentences(text)
	print(sentences)
	# Output: ['This is a test sentence.', " Here's another one!"]

	# Segment to spans
	sentence_spans = segmenter.get_sentence_spans(text)
	print(sentence_spans)
	# Output: [(0, 24), (24, 44)]
	```

	## Performance

	The model uses a character-based random forest classifier with the following configuration:
	- Window Size: 5 characters before, 3 characters after potential boundary
	- Accuracy: 0.9970
	- F1 Score: 0.7730
	- Precision: 0.7460
	- Recall: 0.9870

	### Dataset-specific Performance

	\| Dataset \| Precision \| F1 \| Recall \|
	\|---------\|-----------\|-------\|--------\|
	\| ALEA SBD Benchmark \| 0.624 \| 0.718 \| 0.845 \|
	\| SCOTUS \| 0.926 \| 0.773 \| 0.664 \|
	\| Cyber Crime \| 0.939 \| 0.837 \| 0.755 \|
	\| BVA \| 0.937 \| 0.870 \| 0.812 \|
	\| Intellectual Property \| 0.927 \| 0.883 \| 0.843 \|

	## Available Models

	CharBoundary comes in three sizes, balancing accuracy and efficiency:

	\| Model \| Format \| Size (MB) \| Memory (MB) \| Throughput (chars/sec) \| F1 Score \|
	\|-------\|--------\|-----------\|-------------\|------------------------\|----------\|
	\| Small \| [SKOPS](https://huggingface.co/alea-institute/charboundary-small) / [ONNX](https://huggingface.co/alea-institute/charboundary-small-onnx) \| 3.0 / 0.5 \| 1,026 \| ~748K \| 0.773 \|
	\| Medium \| [SKOPS](https://huggingface.co/alea-institute/charboundary-medium) / [ONNX](https://huggingface.co/alea-institute/charboundary-medium-onnx) \| 13.0 / 2.6 \| 1,897 \| ~587K \| 0.779 \|
	\| Large \| [SKOPS](https://huggingface.co/alea-institute/charboundary-large) / [ONNX](https://huggingface.co/alea-institute/charboundary-large-onnx) \| 60.0 / 13.0 \| 5,734 \| ~518K \| 0.782 \|

	## Paper and Citation

	This model is part of the research presented in the following paper:

	```
	@article{bommarito2025precise,
	title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
	author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
	journal={arXiv preprint arXiv:2504.04131},
	year={2025}
	}
	```

	For more details on the model architecture, training, and evaluation, please see:
	- [Paper: "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"](https://arxiv.org/abs/2504.04131)
	- [CharBoundary GitHub repository](https://github.com/alea-institute/charboundary)
	- [Annotated training data](https://huggingface.co/datasets/alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries)

	## Contact

	This model is developed and maintained by the [ALEA Institute](https://aleainstitute.ai).

	For technical support, collaboration opportunities, or general inquiries:

	- GitHub: https://github.com/alea-institute/kl3m-model-research
	- Email: hello@aleainstitute.ai
	- Website: https://aleainstitute.ai

	For any questions, please contact [ALEA Institute](https://aleainstitute.ai) at [hello@aleainstitute.ai](mailto:hello@aleainstitute.ai) or
	create an issue on this repository or [GitHub](https://github.com/alea-institute/kl3m-model-research).

	![https://aleainstitute.ai](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)