Due Sunday, 2006-10-29
Download
these two Python modules:
utils.py and
text.py.
Your program should import text, which imports utils.
In this assignment you'll implement and try out the HAL (Hyperspace Analogue to Language) model of statistical word meaning developed by Curt Burgess and his students.
The model starts with a large corpus of text in some language. For each of the words in the corpus that occurs more than some minimal number of times, it records the frequencies that other words in the corpus tend to occur before and after the word within a given window size. It does this by storing a vector of co-occurrences for each word that is as long as the number of different words. Together these co-occurrence vectors make up the rows of a square matrix of co-occurrences. For each appearance of a word in the corpus, these co-occurrences are updated. Only the context before the word is recorded because the context after the word will appear in the column in the matrix that corresponds to that word. The amount that is added for each co-occurrence decreases with the distance; it is the window size minus the distance between the words (where adjacent words have a distance of 0), and it is 0 for distances greater than the window size.
The table below illustrates the values that would be recorded if the entire corpus consisted of the sentence it's the end of the glazed donut drought, if the window size were 5, and if the minimum occurrence frequency of the recorded words were set to 1.
its the end of glazed donut drought its - 0 0 0 0 0 0 the 7 - 4 5 0 0 0 end 4 5 - 0 0 0 0 of 3 4 5 - 0 0 0 glazed 1 7 3 4 - 0 0 donut 0 5 2 3 5 - 0 drought 0 3 1 2 4 5 -
Next a raw "meaning" vector is created for each word by concatenating its row vector and its column vector, thus including both the preceding and following contexts of the words. For example, in the example above, the raw meaning vector for the becomes [7, 0, 4, 5, 0, 0, 0, 0, 0, 5, 4, 7, 5, 3] The vectors at this point are not directly comparable because words that are more frequent will tend to have larger values, so we need to normalize the vectors by making them all the same vector length (it doesn't matter what length). With length 1.0, the vector for the becomes [.479, 0, .273, .342, 0, 0, 0, 0, 0, .342, .273, .479, .342, .205].
The normalized meaning vectors can now be compared. For this we can use, for example, Euclidian distance. Given our somewhat limited corpus of 8 word tokens and 7 word types, we don't get very interesting results. For example, the distance between end and its, 0.957, is smaller than the distance between end and donut, 1.160.
To implement and test HAL, we need some data.
For this, you use the procedures in text.py, which
convert HTML to lists of paragraph-length lists of words.
The following gets the HTML from three short stories in the Project Gutenberg collection, given the URLs, and converts them to lists of lists of word strings, breaking the sublists at paragraph boundaries.
>>> text_segs = process_webpages('http://www.gutenberg.org/files/46/46-h/46-h.htm',
'http://www.gutenberg.org/dirs/etext05/magi10h.htm',
'http://www.gutenberg.org/dirs/etext97/usher10h.htm')
Reading http://www.gutenberg.org/files/46/46-h/46-h.htm
Processing http://www.gutenberg.org/files/46/46-h/46-h.htm
Reading http://www.gutenberg.org/dirs/etext05/magi10h.htm
Processing http://www.gutenberg.org/dirs/etext05/magi10h.htm
Reading http://www.gutenberg.org/dirs/etext97/usher10h.htm
Processing http://www.gutenberg.org/dirs/etext97/usher10h.htm
Of course this will only work if you're connected to the Internet.
If you're going to use the same pages over and over, you may want to download the files.
In that case use process_files with the local file paths instead of
process_webpages.
process_webpages and process_files get rid of most of the non-content parts of the files and know where to look for the content in files from the Gutenberg Project and from BBC News ('http://news.bbc.co.uk/'). Don't try process_webpages on Wikipedia articles because these require the user agent to be declared when they are accessed.
Next you need to extract all of the words that occur at least a certain number of times
in the text segments.
The procedure find_words returns a pair, with the first element giving the total number of word tokens and word types, and the second returning all of the unique words occurring above the threshold (which defaults to 10).
Any string containing a number is excluded from this list.
These words (481 of them in the example) are those that we will be creating meaning vectors for.
>>> words = find_words(text_segs) >>> words[0] (42403, 6206) >>> len(words[1]) 481
Everything else you have to implement.
In a separate module called hal_<your_name>.py, import text.py.
In my version, I created a class for the whole matrix and a class for each
word (though you don't have to do it this way).
The following initializes the matrix with a row and column for each word.
It also records where each word is, so that words' rows and vectors
can be looked up by their strings and the system will know which words are not
in the matrix at all (like 'python' in this example).
>>> memory = Memory(words[1])
>>> memory.get_position('make')
273
>>> memory.get_position('python')
-1
Next we need to record the co-occurrence statistics in a list of text segments, yielding something like what appears in the table above.
This goes through all of the words in the matrix, for each finds each of its occurrences in each of the text segments (you can use the utils function indices to help with this), and records all of the co-occurrences within the window around the word, ignoring anything that is beyond the text segment.
>>> memory.record(text_segs)
Recorded 100 words
Recorded 200 words
Recorded 300 words
Recorded 400 words
>>> memory.get_word('make')
[0, 0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]
Finally we need to make the meaning vector for each word by concatenating its row vector and column vector in the matrix and then normalize this (you can use the utils procedure normalize for this).
>>> memory.make_meanings()
>>> memory.get_word('make').meaning
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.17383091216625271, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.010864432010390794, 0.0, 0.02172886402078, ...]
Now we can compare the meaning vectors for words to see how well the algorithm works.
We would expect word pairs like 'the'/'a' and 'man'/'woman' to be relatively similar, for example.
(My compare function returns the distance between the vectors, so smaller values represent similarity.)
Of course how successful this is depends on the amount of data that went into the computation
(in this case not much).
>>> memory.compare('the', 'a')
0.59463531833697791
>>> memory.compare('man', 'woman')
0.74723470321935426
>>> memory.compare('the', 'woman')
1.2482614838422446
>>> memory.compare('believe', 'think')
0.73242270749006644
>>> memory.compare('believe', 'man')
1.2256424973901725
Submit your hal_<your_name>.py file and a trace of the program running (including an indication of where your data came from).