Update README.md
README.md
CHANGED
@@ -1,3 +1,62 @@
- ---
- license: apache-2.0
-
+ ---
+ license: apache-2.0
+ language:
+ - ur
+ - en
+ tags:
+ - nlp
+ - urdunlp
+ - algorithm
+ ---
+ Brown Hierarchical Word Clustering Model Card for Urdu and English
+ Model Description
+ This implementation of the Brown hierarchical word clustering algorithm groups words into clusters based on their distributional similarity in text. The algorithm creates a binary tree of word clusters, where words that appear in similar contexts are grouped together. This version has been applied to both Urdu and English text data.
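To make "distributional similarity" concrete, the sketch below scores a word-to-cluster assignment by the mutual information between the clusters of adjacent tokens, which is the quantity Brown clustering tries to keep high as it merges clusters. It is an illustration only, not the code in `wcluster`; the function name, toy corpus, and cluster assignments are all made up for this example.

```python
import math
from collections import Counter


def cluster_mutual_information(tokens, word_to_cluster):
    """Mutual information (in bits) between the clusters of adjacent tokens."""
    clusters = [word_to_cluster[w] for w in tokens]
    bigrams = Counter(zip(clusters, clusters[1:]))
    total = sum(bigrams.values())

    # Marginal counts of each cluster as the left / right element of a bigram.
    left, right = Counter(), Counter()
    for (a, b), n in bigrams.items():
        left[a] += n
        right[b] += n

    mi = 0.0
    for (a, b), n in bigrams.items():
        p_ab = n / total
        mi += p_ab * math.log2(p_ab / ((left[a] / total) * (right[b] / total)))
    return mi


# Toy corpus (hypothetical): nouns and verbs appear in interchangeable slots.
tokens = "the cat sat the dog sat the cat ran the dog ran".split()

# Grouping words that share contexts ...
good = {"the": 0, "cat": 1, "dog": 1, "sat": 2, "ran": 2}
# ... versus mixing a noun with a verb in each cluster.
bad = {"the": 0, "cat": 1, "sat": 1, "dog": 2, "ran": 2}

mi_good = cluster_mutual_information(tokens, good)
mi_bad = cluster_mutual_information(tokens, bad)
print(f"context-based grouping: {mi_good:.3f} bits")
print(f"mixed grouping:         {mi_bad:.3f} bits")
```

The context-based grouping scores higher because knowing the cluster of one token says more about the cluster of the next.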
+
+ Model Details
+ Developed by: Percy Liang (original implementation)
+ Model type: Unsupervised hierarchical word clustering algorithm
+ Languages: Urdu and English
+ Version: 1.3
+ Last updated: 2012-07-24
+ License: Free for research and education purposes with attribution
+ Intended Uses
+ Creating word classes for Urdu and English language models
+ Reducing vocabulary size in multilingual NLP applications
+ Discovering semantic relationships between words in Urdu and English
+ Feature engineering for downstream NLP tasks in these languages
+ Cross-lingual applications and research
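One common way to use the clusters for feature engineering is to take prefixes of a word's bit-string path, since shorter prefixes correspond to coarser groups higher in the binary tree. The sketch below assumes each word already has a path string. The helper name, the feature naming scheme, and the prefix lengths are illustrative choices, not part of this implementation.

```python
def prefix_features(bit_path, lengths=(4, 6, 10, 20)):
    """Return one feature string per prefix length.

    Paths shorter than a requested length are used whole, so a shallow
    cluster simply repeats its full path at the longer lengths.
    """
    return [f"brown{n}={bit_path[:n]}" for n in lengths]


# Hypothetical path for some word; real paths come from the clustering output.
features = prefix_features("0110100")
print(features)
```

Mapping every word to a short prefix of its path is also a simple way to reduce vocabulary size: all words sharing that prefix collapse to one class symbol.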
+ How to Use
+ # Compile the code
+ make
+
+ # Cluster words from Urdu or English text
+ ./wcluster --text your_urdu_or_english_text.txt --c 50
+
+ # Output will be in your_urdu_or_english_text-c50-p1.out/paths
+
+ To visualize the clusters:
+
+ ./cluster-viewer/build-viewer.sh your_urdu_or_english_text-c50-p1.out/paths
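Once the `paths` file exists, it can be loaded for downstream use. The sketch below assumes each line holds three tab-separated fields (bit-string path, word, count); check this against your own output before relying on it. The function name and the sample rows are made up for illustration.

```python
from collections import defaultdict


def load_paths(lines):
    """Parse lines of the assumed form '<bit path>\\t<word>\\t<count>'."""
    word_to_path = {}
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        path, word, count = line.split("\t")
        word_to_path[word] = path
        clusters[path].append((word, int(count)))
    return word_to_path, dict(clusters)


# Made-up sample rows standing in for a real paths file.
sample = [
    "0010\t\u06a9\u062a\u0627\u0628\t12\n",  # Urdu word for "book"
    "0010\t\u0642\u0644\u0645\t7\n",         # Urdu word for "pen"
    "110\tthe\t40\n",
]
word_to_path, clusters = load_paths(sample)
print(word_to_path["the"], len(clusters["0010"]))
```

For a real run, pass an open file instead of the sample list, for example `open("your_urdu_or_english_text-c50-p1.out/paths", encoding="utf-8")`, so that Urdu text is decoded correctly.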
+
+ Training Data
+ This is an algorithm implementation that has been applied to both Urdu and English text. Users can provide their own text data in either language for clustering.
+
+ Performance and Limitations
+ Time complexity: O(N*C²), where N is the number of word types and C is the number of clusters
+ Works best with sufficient text data to capture word distributions
+ No built-in support for multi-word expressions
+ Limited to distributional similarity (doesn't capture all semantic relationships)
+ May require language-specific preprocessing for optimal results with Urdu text
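As one possible form of the Urdu preprocessing mentioned above, the sketch below normalizes Unicode, maps two Arabic letter variants to their usual Urdu code points, drops combining diacritics, and splits punctuation into separate whitespace-delimited tokens. Every step is an assumption about what may help on a given corpus, not a requirement of the tool, and the names `CHAR_MAP`, `PUNCT`, and `preprocess_line` are hypothetical.

```python
import re
import unicodedata

# Arabic yeh and kaf are often typed in place of the Urdu forms (assumed mapping).
CHAR_MAP = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})

# Urdu full stop, Arabic comma, Arabic question mark, plus common ASCII marks.
PUNCT = re.compile(r"([\u06d4\u060c\u061f.,!?])")


def preprocess_line(text):
    """Return a normalized line with whitespace-separated tokens."""
    text = unicodedata.normalize("NFC", text).translate(CHAR_MAP)
    # Drop combining marks (category Mn), e.g. zabar, zer, pesh.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Pad punctuation with spaces so each mark becomes its own token.
    text = PUNCT.sub(r" \1 ", text)
    return " ".join(text.split())


print(preprocess_line("\u0643\u0650\u062a\u0627\u0628\u06d4"))
```

Removing diacritics merges words that differ only in vowel marks, which reduces sparsity but also loses distinctions, so it is worth checking on a sample of your data first.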
+ References
+ Brown, P. F., deSouza, P. V., Mercer, R. L., Della Pietra, V. J., & Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4), 467-479.
+ Liang, P. (2005). Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology.
+ Citation
+ If you use this implementation in your research, please cite:
+
+
+ @misc{liang2012brown,
+   author = {Percy Liang and Sajjad Rasool},
+   title = {Brown Hierarchical Word Clustering Algorithm for Urdu and English},
+   year = {2012},
+   howpublished = {\url{https://github.com/percyliang/brown-cluster}}
+ }