Semantics Model from class

 

This is a simulation of the simple random vector accumulation model of word meaning we invented in class. It uses the Number Generators module that we use with Minerva, and a Reading Tools module that contains some text processing routines. It is currently set up to learn from the Wikipedia text input, which is a short corpus of sports, biology, and literature articles taken from Wikipedia.

 

The model takes as input a list of sentences, one sentence per line (with formatting stripped and all lowercase). For each new word it encounters, it generates a random environmental vector representing the word's physical attributes. For each word in a sentence, it updates the word's memory representation by adding the sum of the environmental vectors for the other words in the sentence.
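The update rule described above can be sketched in a few lines. This is a Python illustration only, not the code in Semantics.f95; the function name train, the dimensionality of 100, the seed, and the use of Gaussian random elements are all assumptions made for the example.

```python
import random

def train(sentences, dim=100, seed=0):
    """Accumulate, for each word, the environmental vectors of its sentence-mates."""
    rng = random.Random(seed)
    env = {}     # word -> fixed random environmental vector (assumed Gaussian)
    memory = {}  # word -> accumulated memory vector, starts at zero
    for sentence in sentences:
        words = sentence.split()
        # Give every new word an environmental vector and an empty memory vector.
        for w in words:
            if w not in env:
                env[w] = [rng.gauss(0.0, 1.0) for _ in range(dim)]
                memory[w] = [0.0] * dim
        # Add the environmental vectors of the other positions in the sentence.
        for i, w in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    for k in range(dim):
                        memory[w][k] += env[other][k]
    return env, memory

env, memory = train(["macbeth kills duncan", "duncan trusts macbeth"], dim=5)
```

Words that share many sentence-mates end up with similar memory vectors, which is what the nearest neighbors query relies on.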

 

Copy the following files into your directory on Libra:

Semantics.f95

Reading_Tools.f95

Number_Generators.f95

wiki.txt

Elman_Corpus.txt

stoplist

Neighbors.f95

 

To compile on Libra, type:

f95 Number_Generators.f95 Semantics.f95 Reading_Tools.f95 -o Semantics

 

And then you can run by typing:  ./Semantics

 

The output is two files: matrix.mat, the word-by-dimension memory matrix (in binary), and word_labels.txt, the list of text labels that go with the rows of the memory matrix.

 

To train on the synthetic Elman corpus (a simple artificial language) rather than the Wiki corpus, you need to change the name of the input file in Semantics.f95 in the subroutine Read_Stoplist. Also, put a ! in front of the line just above this so the stoplist isn't loaded; otherwise most of the 21 words in the Elman language will be blocked as function words. If you're training on any real corpus of English, make sure the stoplist is loaded. To switch off the stoplist, just comment out the read statement:

 

do i = 1, STOP_WORDS

   ! read(1,*) Stoplist(i)

enddo

 

And to switch back on, remove the comment mark:

 

do i = 1, STOP_WORDS

   read(1,*) Stoplist(i)

enddo
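As a rough illustration of why the stoplist matters, here is a Python sketch of stoplist filtering. It is not the routine in Reading_Tools.f95; the function name remove_stop_words and the three stop words are hypothetical, chosen only for the example.

```python
def remove_stop_words(sentence, stoplist):
    """Return the words of a sentence that are not on the stoplist."""
    return [w for w in sentence.split() if w not in stoplist]

# Hypothetical stop words; the real list lives in the stoplist file.
stoplist = {"the", "of", "a"}

kept = remove_stop_words("the tragedy of macbeth", stoplist)
# With the stoplist loaded, only "tragedy" and "macbeth" remain.

unfiltered = remove_stop_words("the tragedy of macbeth", set())
# With an empty stoplist (the commented-out case), every word is kept.
```

With a small artificial vocabulary, filtering like this can remove most of the words the model is supposed to learn, which is why the stoplist should be off for the Elman corpus.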

 

 

Once you have trained the model, you can query the memory matrix by compiling the nearest neighbors program:

 

f95 Neighbors.f95 -o Neighbors

 

and run it:  "./Neighbors macbeth 30"  will retrieve the 30 nearest neighbors (the most similar memory representations) to the memory representation for Macbeth (remember to use lowercase) and will display the cosine for each.
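The query itself amounts to ranking words by the cosine between memory vectors. The Python sketch below shows the idea; it is not the code in Neighbors.f95, and the function names, the toy vectors, and the choice to leave the target word out of its own neighbor list are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def nearest_neighbors(memory, target, n):
    """Return the n words most similar to target as (word, cosine) pairs."""
    scores = [(cosine(memory[target], vec), word)
              for word, vec in memory.items() if word != target]
    scores.sort(reverse=True)  # highest cosine first
    return [(word, c) for c, word in scores[:n]]

# Toy memory matrix with made-up two-dimensional vectors.
toy_memory = {
    "macbeth": [1.0, 0.0],
    "hamlet": [1.0, 0.1],
    "goalie": [0.0, 1.0],
}
neighbors = nearest_neighbors(toy_memory, "macbeth", 2)
```

Here hamlet ranks first because its vector points in nearly the same direction as the one for macbeth, while goalie is orthogonal to it and has a cosine of zero.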

 

 

Note: There is some processor-specific problem with Libra that won't let this code run. It runs on all other systems I've tried it on. I'll figure out the problem and let you know. Until then, you can download the memory matrix and word labels learned from the Wiki corpus. These are used by the program Neighbors to query near neighbors after learning, and that will run on Libra (as will our Minerva and TODAM simulations).