|
Implementation of the Brown hierarchical word clustering algorithm. |
|
Percy Liang |
|
Release 1.3 |
|
2012.07.24 |
|
|
|
Input: a sequence of words separated by whitespace (see input.txt for an example). |
|
Output: for each word type, its cluster (see output.txt for an example). |
|
In particular, each line is: |
|
<cluster represented as a bit string> <word> <number of times word occurs in input> |
|
|
|
Runs in $O(N C^2)$, where $N$ is the number of word types and $C$ |
|
is the number of clusters. |
|
|
|
References: |
|
|
|
Brown, et al.: Class-Based n-gram Models of Natural Language |
|
http: |
|
|
|
Liang: Semi-supervised learning for natural language processing |
|
http: |
|
|
|
Compile: |
|
|
|
make |
|
|
|
Run: |
|
|
|
# Clusters input.txt into 50 clusters: |
|
./wcluster --text input.txt --c 50 |
|
# Output in input-c50-p1.out/paths |
|
|
|
============================================================ |
|
Change Log |
|
|
|
1.3: compatibility updates for newer versions of g++ (courtesy of Chris Dyer). |
|
1.2: make compatible with MacOS (replaced timespec with timeval and changed order of linking). |
|
1.1: Removed deprecated operators so it works with GCC 4.3. |
|
|
|
============================================================ |
|
(C) Copyright 2007-2012, Percy Liang |
|
|
|
http: |
|
|
|
Permission is granted for anyone to copy, use, or modify these programs and |
|
accompanying documents for purposes of research or education, provided this |
|
copyright notice is retained, and note is made of any changes that have been |
|
made. |
|
|
|
These programs and documents are distributed without any warranty, express or |
|
implied. As the programs were written for research purposes only, they have |
|
not been tested to the degree that would be advisable in any important |
|
application. All use of these programs is entirely at the user's own risk. |
|
|