Introduction

Major engineering goals of NLP:

-- enable people to interact with computers in natural language, as they
   do on Star Trek and in many SciFi books, TV shows, and movies.
   Current example: airline reservation systems.
   "Conversational Agents"

-- translate written material (books, manuals, legal and business
   documents) from one language to another. (EC need; international
   commerce)
   "Machine Translation"

-- find information to satisfy requests expressed in NL by selecting the
   most relevant documents/paragraphs from a large collection.
   "Information Retrieval"

-- automatically summarize the content of a text document or collection.
   "Information Extraction" --> fill the fields of a database
   "Text Summarization" --> select or generate a text summary

A more modest goal: word completion/disambiguation for mobile devices
and for users with disabilities.

Scientific goal: understand how people learn and use language;
understand the general principles underlying linguistic communication
(whether by people or machines).

Cf. the Human Genome Project (science): finding genetic correlates of
common diseases; creating more bug-resistant and travel-worthy crops
(engineering). The engineering and science goals are synergistic, not
antagonistic.

-------------------------------------------------------------------------

Relationship with AI

- Can a computer "understand" NL?

Diagram of "cognitive model". Words and phrases refer to something, but
what? Not real-world elements but cognitive elements (which may in turn
have real-world counterparts). Therefore it IS possible: there must be a
world model in the computer.

The SHRDLU system -- a breakthrough in NLP. How did SHRDLU work? Inside
the computer was a model of the "blocks world". Each block had
properties (color, shape, size, location).
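The "world model" idea can be sketched in a few lines of Python (shown in modern Python 3 syntax; the names and data are made up for illustration -- this is not SHRDLU's actual representation). Each block is a record of properties, and a noun phrase such as "the green pyramid" denotes whichever model element matches:

```python
# Hypothetical miniature world model: each block is a record of
# properties, just as described above (color, shape, size, location).
world = {
    "b1": {"color": "red",   "shape": "cube",    "size": 2, "location": (0, 0)},
    "b2": {"color": "green", "shape": "pyramid", "size": 1, "location": (1, 0)},
    "b3": {"color": "red",   "shape": "pyramid", "size": 1, "location": (2, 0)},
}

def referent(**properties):
    """Return the names of all blocks matching the given properties --
    i.e., the model elements a noun phrase could refer to."""
    return [name for name, block in world.items()
            if all(block[p] == v for p, v in properties.items())]
```

So "the green pyramid" picks out `referent(color="green", shape="pyramid")`, a unique block, while "a red block" (`referent(color="red")`) is ambiguous between two.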
Inside the computer was also a set of rules for how to accomplish
various goals (in terms of more primitive actions such as "pickup" and
"movehand"), and an inference-based problem solver to create the
necessary sequence of actions to carry out a command. Therefore noun
phrases could actually refer to something, and verbs such as "pick up"
and "move" also referred to something the computer could understand.

This work gave rise to a whole strain of research throughout the 1970's
and 1980's of building more and more sophisticated model-based NL
"agents" and question-answering systems. However, these hit a plateau
due to the "knowledge bottleneck", which led to brittleness of systems.
It is still possible to build interesting special-purpose systems using
the techniques developed during that time; but the rise of the internet
and more powerful computers led to a shift of interest toward
corpus-based research, which is robust and statistical in nature, and
which is not aimed at the kind of model-driven understanding exhibited
by the earlier systems.

-------------------------------------------------------------------------

Some concepts/terminology in Chapter 1:
- over 125,000 words in North American English
- phoneme (phonetics vs. phonology)
- bigrams (the importance of N-grams)
- corpus
- robust models and systems

Go over course syllabus

Introducing the Brown corpus

*************************************************************

Go over NLP diagram from NLTK Chapter 1.

Note: the bi-directional use of language-related data repositories shown
in the diagram is not necessarily accurate.
Note: the pipeline architecture does not really work. Why??
(It's not easy to wreck a nice beach.)

Corpus-based concepts:

Words - what is a word?
  Word tokens
  Word types
  Frequency or count
  Relative (or normalized) frequency

Assignment 1

Introducing Python

The new hot scripting language, with strong support for strings and
regular expressions.
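As a first small taste of Python, the token/type and frequency distinctions above can be illustrated with a made-up sentence (Python 3 syntax, where `/` is true division):

```python
# Word tokens vs. word types, frequency, and relative frequency.
text = "the cat sat on the mat because the mat was warm"

tokens = text.split()               # word tokens: every occurrence
types = set(tokens)                 # word types: distinct words only
n_tokens = len(tokens)              # 11 tokens
n_types = len(types)                # 8 types ('the' and 'mat' repeat)

count_the = tokens.count("the")     # frequency (count) of 'the': 3
rel_freq_the = count_the / n_tokens # relative (normalized) frequency: 3/11
```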
-- variables are untyped
-- strings, lists, maps (dictionaries) and sets are built-in types with
   rich capabilities supported by syntactic structures
-- indentation is used for block structure
-- using Python from the command line (use of raw_input())
-- using IDLE

Listener-level operation: expressions, assignment, while loop, defining
a function, loading modules and .pth files

>>> f = open("startbrown.txt")
>>> f.readline()
' FORM C (TAGGED VERSION) OF THE BROWN CORPUS\n'

Example of counting the frequencies of words in the startbrown corpus
and displaying the 20 most frequent word types:

##########################################################################
# Written by C. Hafner, Spring 2007 for NLP class
# demonstration program for computing frequency counts of the first
# 1,000 lines of the tagged Brown corpus.
# Each line has 3 data elements: word, tag, location

counts = {}
brown = open(r'C:\Python25\NLPclass\startbrown.txt')  # raw string: keep backslashes literal
for line in brown.readlines()[4:]:
    fields = line.split()       # default is to split on all whitespace
    if not fields:
        continue                # skip blank lines
    word = fields[0]
    if word[0] == "*":
        word = word[1:]         # remove leading * from word
    if word not in counts:
        counts[word] = 0
    counts[word] += 1

# we are done counting, now print the 20 most frequent words
# we need to reverse each item in the dictionary to sort by count, not word
def reversepair(p):
    return (p[1], p[0])

# counts.items() returns a list of 2-element sequences (pairs)
# illustrates mapping and keyword arguments
result = sorted(map(reversepair, counts.items()), reverse=True)
for w in result[0:20]:
    print w[1], "occurs", w[0], "times."
# needed to keep the command window open if you double-click to run
raw_input("Press any key to exit")
#############################################################################

REGULAR EXPRESSIONS: the re package

Basic methods:
  re.findall - puts substrings matching the pattern into a list
  re.split   - splits the string on occurrences of the pattern
  re.sub     - replaces all instances of the pattern

>>> re.findall('f.+f', "foofoofoofoofoof")
['foofoofoofoofoof']
>>> re.findall('f.+?f', "foofoofoofoofoofoofoof")    # non-greedy version
['foof', 'foof', 'foof', 'foof']
>>> re.sub('f.+?f', "X", "foofoofoofoofoofoofoof")
'XooXooXooX'
>>> re.split('f.+?f', "foofoofoofoofoofoofoof")
['', 'oo', 'oo', 'oo', '']

  patterns and wild cards
  optional and repeating elements
  greedy vs. non-greedy
  alternatives
  character classes, ranges, and complementation
  the | operator
  anchors
  groups - groups in the pattern control what gets returned;
           use (?: ) to limit () to scoping only
  special characters: \n, \s, \b and use of r'xxx' raw strings
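A short demonstration of the last three items (Python 3 syntax, made-up example strings): groups control what findall returns, (?: ) groups without capturing, and r'...' raw strings keep backslashes literal.

```python
import re

# With capturing groups, findall returns the group contents, not the
# whole match:
dates = "due 2024-01-15, exam 2024-03-02"
pairs = re.findall(r"(\d{4})-(\d{2})-\d{2}", dates)
# pairs == [('2024', '01'), ('2024', '03')]

# (?: ) scopes the alternation without adding a capture group, so the
# whole match is returned:
words = re.findall(r"(?:foo|bar)\w*", "foofy barbell bazooka")
# words == ['foofy', 'barbell']

# In a raw string, \b stays a word-boundary anchor instead of becoming
# a backspace character:
hits = re.findall(r"\bfoo\b", "foo food foo")
# hits == ['foo', 'foo']
```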