ReySajju742 committed on
Commit 5bc14cb · verified · 1 Parent(s): b5aeec1

Update README.md

Files changed (1)
  1. README.md +62 -3
README.md CHANGED
@@ -1,3 +1,62 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - ur
+ - en
+ tags:
+ - nlp
+ - urdunlp
+ - algorithm
+ ---
+ Brown Hierarchical Word Clustering Model Card for Urdu and English
+ Model Description
+ This implementation of the Brown hierarchical word clustering algorithm groups words into clusters based on their distributional similarity in text. The algorithm creates a binary tree of word clusters, where words that appear in similar contexts are grouped together. This version has been applied to both Urdu and English text data.
+
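To make "distributional similarity" concrete: Brown et al. (1992) score a clustering by the average mutual information (AMI) between the clusters of adjacent words, and merge clusters so as to lose as little of it as possible. The sketch below only computes that score for a fixed assignment on a toy corpus; it is not the `wcluster` implementation, and the function name, the toy sentence, and both cluster assignments are made up for illustration.

```python
import math
from collections import Counter


def average_mutual_information(tokens, cluster_of):
    """AMI (in bits) between the clusters of adjacent tokens.

    tokens: list of words in corpus order.
    cluster_of: dict mapping each word to a cluster id (hypothetical assignment).
    """
    classes = [cluster_of[t] for t in tokens]
    bigrams = Counter(zip(classes, classes[1:]))
    total = sum(bigrams.values())

    # Marginal counts of a cluster appearing on the left / right of a bigram.
    left, right = Counter(), Counter()
    for (a, b), n in bigrams.items():
        left[a] += n
        right[b] += n

    ami = 0.0
    for (a, b), n in bigrams.items():
        p_ab = n / total
        p_a = left[a] / total
        p_b = right[b] / total
        ami += p_ab * math.log2(p_ab / (p_a * p_b))
    return ami


# Toy corpus: nouns and verbs occur in clearly different contexts.
tokens = "the cat sat the dog sat the cat ran the dog ran".split()

# Grouping words that share contexts (cat/dog, sat/ran) ...
by_context = {"the": 0, "cat": 1, "dog": 1, "sat": 2, "ran": 2}
# ... versus an arbitrary grouping with the same number of clusters.
arbitrary = {"the": 0, "cat": 1, "dog": 2, "sat": 1, "ran": 2}

ami_context = average_mutual_information(tokens, by_context)
ami_arbitrary = average_mutual_information(tokens, arbitrary)
```

On this toy input the context-based grouping scores higher than the arbitrary one, which is the behaviour the greedy merging relies on.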
+ Model Details
+ Developed by: Percy Liang (original implementation)
+ Model type: Unsupervised hierarchical word clustering algorithm
+ Languages: Urdu and English
+ Version: 1.3
+ Last updated: 2012-07-24
+ License: Free for research and education purposes with attribution
+ Intended Uses
+ Creating word classes for Urdu and English language models
+ Reducing vocabulary size in multilingual NLP applications
+ Discovering semantic relationships between words in Urdu and English
+ Feature engineering for downstream NLP tasks in these languages
+ Cross-lingual applications and research
+ How to Use
+ # Compile the code
+ make
+
+ # Cluster words from Urdu or English text
+ ./wcluster --text your_urdu_or_english_text.txt --c 50
+
+ # Output will be in your_urdu_or_english_text-c50-p1.out/paths
+
+ To visualize the clusters:
+
+ ./cluster-viewer/build-viewer.sh your_urdu_or_english_text-c50-p1.out/paths
+
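Once `paths` has been written, a common next step is to turn each word's cluster into features for a downstream model. The sketch below assumes each line of `paths` has the form `<bit string><TAB><word><TAB><count>`; check this against your own output before relying on it. The helper names, the `brownN=` feature format, and the prefix lengths are hypothetical choices, not part of `wcluster`.

```python
def parse_paths(lines):
    """Map each word to its cluster bit string.

    Assumes tab-separated lines: bit string, word, count.
    """
    word_to_path = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        path, word, _count = line.split("\t")
        word_to_path[word] = path
    return word_to_path


def prefix_features(path, lengths=(2, 4, 6)):
    """Bit-string prefixes as features; shorter prefixes are coarser clusters.

    Prefix lengths longer than the path are skipped.
    """
    return [f"brown{n}={path[:n]}" for n in lengths if len(path) >= n]


# Made-up sample in the assumed format (Urdu and English words mixed).
sample = [
    "0010\t\u06a9\u062a\u0627\u0628\t12\n",  # Urdu "kitab" (book)
    "0011\t\u0642\u0644\u0645\t7\n",          # Urdu "qalam" (pen)
    "110\tbook\t9\n",
]
paths = parse_paths(sample)
features = prefix_features(paths["book"])
```

To read a real run, pass an open file instead of the sample list, for example `parse_paths(open("your_urdu_or_english_text-c50-p1.out/paths", encoding="utf-8"))`.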
+ Training Data
+ This is an algorithm implementation that has been applied to both Urdu and English text. Users can provide their own text data in either language for clustering.
+
+ Performance and Limitations
+ Time complexity: O(N*C²), where N is the number of word types and C is the number of clusters
+ Works best with sufficient text data to capture word distributions
+ No built-in support for multi-word expressions
+ Limited to distributional similarity (doesn't capture all semantic relationships)
+ May require language-specific preprocessing for optimal results with Urdu text
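As one possible reading of "language-specific preprocessing" for Urdu, the sketch below normalizes Unicode, drops optional diacritics, and separates punctuation so that spelling variants of the same word collapse into one type before clustering. These particular steps are assumptions, not something the original implementation prescribes, and the function name is hypothetical; whether removing diacritics helps depends on your corpus.

```python
import unicodedata


def preprocess_line(line):
    """Return one whitespace-tokenized line ready to write to the input file."""
    # NFC first, so letters such as alef-with-madda stay single code points
    # and are not damaged by the mark removal below.
    text = unicodedata.normalize("NFC", line)

    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Mn":
            continue  # drop remaining combining marks (zabar, zer, pesh, ...)
        if category.startswith("P"):
            out.append(f" {ch} ")  # make punctuation its own token
        else:
            out.append(ch)

    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(out).split())


# "yeh kitab hai." with a zer on "kitab", a double space, and the Urdu full stop.
raw = "\u06cc\u06c1  \u06a9\u0650\u062a\u0627\u0628 \u06c1\u06d2\u06d4\n"
clean = preprocess_line(raw)
```

Apply the function to every line of the corpus and write the results to the file passed to `--text`.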
+ References
+ Brown, P. F., deSouza, P. V., Mercer, R. L., Della Pietra, V. J., & Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4), 467-479.
+ Liang, P. (2005). Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology.
+ Citation
+ If you use this implementation in your research, please cite:
+
+ @misc{liang2012brown,
+ author = {Percy Liang and Sajjad Rasool},
+ title = {Brown Hierarchical Word Clustering Algorithm for Urdu and English},
+ year = {2012},
+ howpublished = {\url{https://github.com/percyliang/brown-cluster}}
+ }