ReySajju742/urdu-nlp


ReySajju742 committed on May 6

Commit 5bc14cb · verified · 1 Parent(s): b5aeec1

Update README.md

Files changed (1)

README.md +62 -3


@@ -1,3 +1,62 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+- ur
+- en
+tags:
+- nlp
+- urdunlp
+- algorithm
+---
+Brown Hierarchical Word Clustering Model Card for Urdu and English
+Model Description
+This implementation of the Brown hierarchical word clustering algorithm groups words into clusters based on their distributional similarity in text. The algorithm creates a binary tree of word clusters, where words that appear in similar contexts are grouped together. This version has been applied to both Urdu and English text data.
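As a rough illustration of the binary tree described above: each word's position can be written as a bit string giving its path from the root, and truncating that string to its first `depth` bits yields a coarser clustering. The sketch below uses invented words and bit strings (`toy_paths`, `cut_tree`, and `shared_depth` are hypothetical names, not part of the implementation, and the data is not output from a real run).

```python
from collections import defaultdict

# Hypothetical word -> tree-path mapping; the bit strings are invented for illustration.
toy_paths = {
    "lahore": "0010",
    "karachi": "0011",
    "monday": "0100",
    "friday": "0101",
    "run": "110",
}


def cut_tree(paths, depth):
    """Group words by the first `depth` bits of their path (a coarser clustering)."""
    groups = defaultdict(list)
    for word, bits in paths.items():
        groups[bits[:depth]].append(word)
    return {prefix: sorted(words) for prefix, words in groups.items()}


def shared_depth(a, b):
    """Length of the common prefix of two paths, i.e. the depth of their lowest common ancestor."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


print(cut_tree(toy_paths, 2))
print(shared_depth(toy_paths["lahore"], toy_paths["karachi"]))
```

Words whose paths share a longer prefix sit closer together in the tree, which is why prefixes of several lengths are often used together as features.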
+Model Details
+Developed by: Percy Liang (original implementation)
+Model type: Unsupervised hierarchical word clustering algorithm
+Languages: Urdu and English
+Version: 1.3
+Last updated: 2012-07-24
+License: Free for research and education purposes with attribution
+Intended Uses
+Creating word classes for Urdu and English language models
+Reducing vocabulary size in multilingual NLP applications
+Discovering semantic relationships between words in Urdu and English
+Feature engineering for downstream NLP tasks in these languages
+Cross-lingual applications and research
+How to Use
+# Compile the code
+make
+# Cluster words from Urdu or English text
+./wcluster --text your_urdu_or_english_text.txt --c 50
+# Output will be in your_urdu_or_english_text-c50-p1.out/paths
+To visualize the clusters:
+./cluster-viewer/build-viewer.sh your_urdu_or_english_text-c50-p1.out/paths
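A minimal sketch for reading the resulting `paths` file in Python. It assumes each line holds three tab-separated fields (bit-string path, word, count), which is how the upstream brown-cluster README describes the output; check this against your own file before relying on it. The helper names `load_paths` and `clusters_by_path` are hypothetical.

```python
from collections import defaultdict


def load_paths(path_file):
    """Read a wcluster `paths` file into {word: (bit_string, count)}.

    Assumes tab-separated lines of the form: bit string, word, count.
    Lines that do not have exactly three fields are skipped.
    """
    table = {}
    # Read as UTF-8 so that Urdu tokens are decoded correctly.
    with open(path_file, encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue
            bits, word, count = parts
            table[word] = (bits, int(count))
    return table


def clusters_by_path(table):
    """Invert the table: {bit_string: [words, most frequent first]}."""
    groups = defaultdict(list)
    for word, (bits, count) in table.items():
        groups[bits].append((count, word))
    return {
        bits: [w for _, w in sorted(members, key=lambda m: (-m[0], m[1]))]
        for bits, members in groups.items()
    }
```

For example, `clusters_by_path(load_paths("your_urdu_or_english_text-c50-p1.out/paths"))` would give one word list per leaf cluster.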
+Training Data
+This is an algorithm implementation that has been applied to both Urdu and English text. Users can provide their own text data in either language for clustering.
+Performance and Limitations
+Time complexity: O(N*C²), where N is the number of word types and C is the number of clusters
+Works best with sufficient text data to capture word distributions
+No built-in support for multi-word expressions
+Limited to distributional similarity (doesn't capture all semantic relationships)
+May require language-specific preprocessing for optimal results with Urdu text
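One possible shape for the language-specific preprocessing mentioned in the last point, as a sketch under stated assumptions: the input is expected to be whitespace-separated tokens, Unicode is normalized to NFC, short-vowel diacritics are optionally removed, and punctuation is split off into separate tokens. Whether stripping diacritics helps depends on the corpus, and the function name `preprocess_line` is hypothetical.

```python
import re
import unicodedata

# Punctuation to split off as separate tokens: Urdu full stop (U+06D4),
# Arabic comma (U+060C), question mark (U+061F), semicolon (U+061B),
# plus common ASCII punctuation.
_PUNCT = re.compile("([\u06d4\u060c\u061f\u061b!?.,:;()])")


def preprocess_line(line, strip_diacritics=True):
    """Return one line of Urdu or English text as space-separated tokens."""
    # NFC first, so that letters with a composed form (e.g. alef with madda) stay intact.
    text = unicodedata.normalize("NFC", line)
    if strip_diacritics:
        # Drop nonspacing combining marks (category Mn) such as zabar, zer, pesh.
        # Assumption: this is lossy for marks that have no composed form.
        text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    text = _PUNCT.sub(r" \1 ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())
```

Applying this line by line to a corpus and writing the result to a new file gives text that can be passed to `--text`.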
+References
+Brown, P. F., deSouza, P. V., Mercer, R. L., Della Pietra, V. J., & Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4), 467-479.
+Liang, P. (2005). Semi-supervised learning for natural language processing. Master's thesis, Massachusetts Institute of Technology.
+Citation
+If you use this implementation in your research, please cite:
+@misc{liang2012brown,
+  author = {Percy Liang and Sajjad Rasool},
+  title = {Brown Hierarchical Word Clustering Algorithm for Urdu and English},
+  year = {2012},
+  howpublished = {\url{https://github.com/percyliang/brown-cluster}}
+}