Semantics Model from class
This is a simulation of the simple random vector accumulation model of word meaning we invented in class. It uses the Number Generators module that we use with Minerva, and a Reading Tools module that contains some text processing routines. It is currently set up to learn from the Wikipedia text input, which is a short corpus of sports, biology, and literature articles taken from Wikipedia.
The model takes as input a list of sentences, one sentence per line (with formatting stripped and all lowercase). For each new word it encounters, it generates a random environmental vector representing the word's physical attributes. Then, for each word in a sentence, it updates that word's memory representation by simply adding in the environmental vectors of the other words in the sentence.
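The update rule described above can be sketched on a toy corpus as follows. This is only an illustration, not the code in Semantics.f95: the program name, the array names (Env, Mem, Corpus), the dimensionality, and the two example sentences are all made up here. It also draws the environmental vectors from a uniform distribution using the intrinsic random_number, as a stand-in for whatever the Number Generators module actually provides.

```fortran
! Toy sketch of random vector accumulation (hypothetical names throughout;
! this is NOT the code from Semantics.f95).
program Toy_Accumulation
  implicit none
  integer, parameter :: DIMS = 8       ! vector dimensionality (assumed, kept tiny)
  integer, parameter :: N_WORDS = 4    ! vocabulary size of the toy corpus
  integer, parameter :: N_SENT = 2     ! number of sentences
  integer, parameter :: SENT_LEN = 3   ! words per sentence (fixed for brevity)
  character(len=8) :: Vocab(N_WORDS)
  integer :: Corpus(SENT_LEN, N_SENT)  ! each sentence as a column of word indices
  real :: Env(DIMS, N_WORDS)           ! random environmental vectors, one per word
  real :: Mem(DIMS, N_WORDS)           ! accumulated memory vectors, one per word
  integer :: s, i, j

  Vocab(1) = 'dog'
  Vocab(2) = 'cat'
  Vocab(3) = 'chases'
  Vocab(4) = 'sees'
  Corpus(:, 1) = (/ 1, 3, 2 /)         ! "dog chases cat"
  Corpus(:, 2) = (/ 2, 4, 1 /)         ! "cat sees dog"

  ! Assumption: uniform values on [-1, 1) stand in for the course's generator.
  call random_seed()
  call random_number(Env)
  Env = 2.0 * Env - 1.0
  Mem = 0.0

  ! For each word in each sentence, add the environmental vectors of the
  ! OTHER words in that sentence to the word's memory vector.
  do s = 1, N_SENT
    do i = 1, SENT_LEN
      do j = 1, SENT_LEN
        if (j /= i) then
          Mem(:, Corpus(i, s)) = Mem(:, Corpus(i, s)) + Env(:, Corpus(j, s))
        end if
      end do
    end do
  end do

  ! Print each word's label followed by its memory vector.
  do i = 1, N_WORDS
    write(*, '(A8, 8F7.2)') Vocab(i), Mem(:, i)
  end do
end program Toy_Accumulation
```

Saved under a name of your choosing (say, Toy_Accumulation.f95), this should compile on its own with f95, since it uses only intrinsic routines. Note that a word's own environmental vector never enters its memory vector; only the vectors of the words it co-occurs with do.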
Copy the following files into your directory on Libra:
Semantics.f95
Reading_Tools.f95
Number_Generators.f95
wiki.txt
Elman_Corpus.txt
stoplist
Neighbors.f95
To compile on Libra, type:
f95 Number_Generators.f95 Semantics.f95 Reading_Tools.f95 -o Semantics
And then you can run by typing: ./Semantics
The output is matrix.mat, the word-by-dimension memory matrix (in binary), and word_labels.txt, the list of text labels that go with each row in the memory matrix.
To train on the synthetic Elman corpus (a simple artificial language) rather than the Wiki corpus, you need to change the name of the input file in Semantics.f95 in the subroutine Read_Stoplist. Also, put a ! in front of the line just above this so the stoplist isn't loaded; otherwise most of the 21 words in the Elman language will be blocked as function words. If you're training on any real corpus of English, make sure the stoplist is loaded. To switch off the stoplist, just comment it out:
do i = 1, STOP_WORDS
! read(1,*) Stoplist(i)
enddo
And to switch back on, remove the comment mark:
do i = 1, STOP_WORDS
read(1,*) Stoplist(i)
enddo
Once you have trained the model, you can query the memory matrix by compiling the nearest neighbors program:
f95 Neighbors.f95 -o Neighbors
and then run it. For example, "./Neighbors macbeth 30" will retrieve the 30 nearest neighbors (the most similar memory representations) to the memory representation for Macbeth (remember to use lowercase) and will display the cosine for each.
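For reference, the cosine between two memory vectors a and b is their dot product divided by the product of their lengths. The sketch below shows one way a nearest-neighbor query of this kind could work. It is not the code in Neighbors.f95: the program name, the labels, and the three-dimensional vectors are toy values invented for illustration, and skipping the probe word itself is an assumption about the desired behavior.

```fortran
! Toy sketch of a cosine nearest-neighbor query (hypothetical names and
! made-up data; this is NOT the code from Neighbors.f95).
program Toy_Neighbors
  implicit none
  integer, parameter :: DIMS = 3       ! toy dimensionality
  integer, parameter :: N_WORDS = 4    ! toy vocabulary size
  integer, parameter :: K = 2          ! number of neighbors to report
  character(len=8) :: Labels(N_WORDS)
  real :: Mem(DIMS, N_WORDS)           ! one memory vector per column
  real :: Cosines(N_WORDS)
  logical :: Available(N_WORDS)        ! words not yet reported
  integer :: probe, i, best(1)

  Labels(1) = 'macbeth'
  Labels(2) = 'hamlet'
  Labels(3) = 'enzyme'
  Labels(4) = 'goalie'
  Mem(:, 1) = (/ 1.0, 2.0, 0.0 /)      ! invented values, for illustration only
  Mem(:, 2) = (/ 2.0, 3.0, 1.0 /)
  Mem(:, 3) = (/ -1.0, 0.5, 2.0 /)
  Mem(:, 4) = (/ 0.0, -2.0, 1.0 /)

  probe = 1                            ! query word: 'macbeth'
  do i = 1, N_WORDS
    Cosines(i) = Cosine(Mem(:, probe), Mem(:, i))
  end do

  ! Repeatedly pick the highest remaining cosine, skipping the probe itself
  ! (assumption: a word is not reported as its own neighbor).
  Available = .true.
  Available(probe) = .false.
  do i = 1, K
    best = maxloc(Cosines, mask=Available)
    write(*, '(A8, F8.4)') Labels(best(1)), Cosines(best(1))
    Available(best(1)) = .false.
  end do

contains

  ! Cosine of the angle between a and b. Assumes neither vector is all
  ! zeros; a zero vector would cause a division by zero here.
  function Cosine(a, b) result(c)
    real, intent(in) :: a(:), b(:)
    real :: c
    c = dot_product(a, b) / sqrt(dot_product(a, a) * dot_product(b, b))
  end function Cosine

end program Toy_Neighbors
```

Because the cosine normalizes by vector length, a word that has occurred in many sentences (and so has a long memory vector) is not automatically more similar to everything else; only the direction of the vector matters.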
Note: There is some processor-specific problem with Libra that won't let this code run; it runs on all other systems I've tried it on. I'll figure out the problem and let you know. Until then, you can download the memory matrix and word labels learned from the Wiki corpus. These are used by the program Neighbors to query near neighbors after learning, and that will run on Libra (as will our Minerva and TODAM simulations).