Introduction

Major engineering goals of NLP:

-- enable people to interact with computers in natural language, as they
   do on Star Trek and in many SciFi books, TV shows, and movies.
   Current example: airline reservation systems.
   "Conversational Agents"

-- translate written material (books, manuals, legal and business
   documents) from one language to another. (EC need; international
   commerce)
   "Machine Translation"

-- find information to satisfy requests expressed in NL by selecting the
   most relevant documents/paragraphs from a large collection.
   "Information Retrieval"

-- automatically summarize the content of a text document or collection.
   "Information Extraction" --> fill the fields of a database
   "Text Summarization" --> select or generate a text summary

A more modest goal: word completion/disambiguation for mobile devices
and for users with disabilities.

Scientific goal: understand how people learn and use language;
understand the general principles underlying linguistic communication
(whether by people or machines).

Cf. the Human Genome Project (science): finding genetic correlates of
common diseases; creating more bug-resistant and travel-worthy crops
(engineering). The engineering and science goals are synergistic, not
antagonistic.

-------------------------------------------------------------------------

Relationship with AI

- Can a computer "understand" NL?

Diagram of "cognitive model". Words and phrases refer to something, but
what? Not real-world elements but cognitive elements (which may in turn
have real-world counterparts). Therefore it IS possible: there must be a
world model in the computer.

The SHRDLU system -- a breakthrough in NLP. How did SHRDLU work? Inside
the computer was a model of the "blocks world". Each block had
properties (color, shape, size, location).
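The "world model" idea can be sketched in a few lines of Python (shown in modern Python 3 syntax; the names and data are made up for illustration -- this is not SHRDLU's actual representation). Each block is a record of properties, and a noun phrase such as "the green pyramid" denotes whichever model element matches:

```python
# Hypothetical miniature world model: each block is a record of
# properties, just as described above (color, shape, size, location).
world = {
    "b1": {"color": "red",   "shape": "cube",    "size": 2, "location": (0, 0)},
    "b2": {"color": "green", "shape": "pyramid", "size": 1, "location": (1, 0)},
    "b3": {"color": "red",   "shape": "pyramid", "size": 1, "location": (2, 0)},
}

def referent(**properties):
    """Return the names of all blocks matching the given properties --
    i.e., the model elements a noun phrase could refer to."""
    return [name for name, block in world.items()
            if all(block[p] == v for p, v in properties.items())]
```

So "the green pyramid" picks out `referent(color="green", shape="pyramid")`, a unique block, while "a red block" (`referent(color="red")`) is ambiguous between two.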
Inside the computer was also a set of rules for how to accomplish
various goals (in terms of more primitive actions such as "pickup" and
"movehand"), and an inference-based problem solver to create the
necessary sequence of actions to carry out a command. Therefore noun
phrases could actually refer to something, and verbs such as "pick up"
and "move" also referred to something the computer could understand.

This work gave rise to a whole strain of research throughout the 1970's
and 1980's of building more and more sophisticated model-based NL
"agents" and question-answering systems. However, these hit a plateau
due to the "knowledge bottleneck", which led to brittleness of systems.
It is still possible to build interesting special-purpose systems using
the techniques developed during that time; but the rise of the internet
and more powerful computers led to a shift of interest toward
corpus-based research, which is robust and statistical in nature, and
which is not aimed at the kind of model-driven understanding exhibited
by the earlier systems.

-------------------------------------------------------------------------

Some concepts/terminology in Chapter 1:
- over 125,000 words in North American English
- phoneme (phonetics vs. phonology)
- bigrams (the importance of N-grams)
- corpus
- robust models and systems

Go over course syllabus

Introducing the Brown corpus

*************************************************************

Go over NLP diagram from NLTK Chapter 1.

Note: the bi-directional use of language-related data repositories shown
in the diagram is not necessarily accurate.
Note: the pipeline architecture does not really work. Why??
(It's not easy to wreck a nice beach.)

Corpus-based concepts:

Words - what is a word?
  Word tokens
  Word types
  Frequency or count
  Relative (or normalized) frequency

Assignment 1

Introducing Python

The new hot scripting language, with strong support for strings and
regular expressions.
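As a first small taste of Python, the token/type and frequency distinctions above can be illustrated with a made-up sentence (Python 3 syntax, where `/` is true division):

```python
# Word tokens vs. word types, frequency, and relative frequency.
text = "the cat sat on the mat because the mat was warm"

tokens = text.split()               # word tokens: every occurrence
types = set(tokens)                 # word types: distinct words only
n_tokens = len(tokens)              # 11 tokens
n_types = len(types)                # 8 types ('the' and 'mat' repeat)

count_the = tokens.count("the")     # frequency (count) of 'the': 3
rel_freq_the = count_the / n_tokens # relative (normalized) frequency: 3/11
```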
-- variables are untyped
-- strings, lists, maps (dictionaries) and sets are built-in types with
   rich capabilities supported by syntactic structures
-- indentation is used for block structure
-- using Python from the command line (use of raw_input())
-- using IDLE

Listener-level operation: expressions, assignment, while loop, defining
a function, loading modules and .pth files

>>> f = open("startbrown.txt")
>>> f.readline()
' FORM C (TAGGED VERSION) OF THE BROWN CORPUS\n'

Example of counting the frequencies of words in the startbrown corpus
and displaying the 20 most frequent word types:

##########################################################################
# Written by C. Hafner, Spring 2007 for NLP class
# demonstration program for computing frequency counts of the first
# 1,000 lines of the tagged Brown corpus.
# Each line has 3 data elements: word, tag, location

counts = {}
brown = open(r'C:\Python25\NLPclass\startbrown.txt')  # raw string: keep backslashes literal
for line in brown.readlines()[4:]:
    fields = line.split()       # default is to split on all whitespace
    if not fields:
        continue                # skip blank lines
    word = fields[0]
    if word[0] == "*":
        word = word[1:]         # remove leading * from word
    if word not in counts:
        counts[word] = 0
    counts[word] += 1

# we are done counting, now print the 20 most frequent words
# we need to reverse each item in the dictionary to sort by count, not word
def reversepair(p):
    return (p[1], p[0])

# counts.items() returns a list of 2-element sequences (pairs)
# illustrates mapping and keyword arguments
result = sorted(map(reversepair, counts.items()), reverse=True)
for w in result[0:20]:
    print w[1], "occurs", w[0], "times."
# needed to keep the command window open if you double-click to run
raw_input("Press any key to exit")
#############################################################################

REGULAR EXPRESSIONS: the re package

Basic methods:
  re.findall - puts substrings matching the pattern into a list
  re.split   - splits the string on occurrences of the pattern
  re.sub     - replaces all instances of the pattern

>>> re.findall('f.+f', "foofoofoofoofoof")
['foofoofoofoofoof']
>>> re.findall('f.+?f', "foofoofoofoofoofoofoof")    # non-greedy version
['foof', 'foof', 'foof', 'foof']
>>> re.sub('f.+?f', "X", "foofoofoofoofoofoofoof")
'XooXooXooX'
>>> re.split('f.+?f', "foofoofoofoofoofoofoof")
['', 'oo', 'oo', 'oo', '']

  patterns and wild cards
  optional and repeating elements
  greedy vs. non-greedy
  alternatives
  character classes, ranges, and complementation
  the | operator
  anchors
  groups - groups in the pattern control what gets returned;
           use (?: ) to limit () to scoping only
  special characters: \n, \s, \b and use of r'xxx' raw strings
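A short demonstration of the last three items (Python 3 syntax, made-up example strings): groups control what findall returns, (?: ) groups without capturing, and r'...' raw strings keep backslashes literal.

```python
import re

# With capturing groups, findall returns the group contents, not the
# whole match:
dates = "due 2024-01-15, exam 2024-03-02"
pairs = re.findall(r"(\d{4})-(\d{2})-\d{2}", dates)
# pairs == [('2024', '01'), ('2024', '03')]

# (?: ) scopes the alternation without adding a capture group, so the
# whole match is returned:
words = re.findall(r"(?:foo|bar)\w*", "foofy barbell bazooka")
# words == ['foofy', 'barbell']

# In a raw string, \b stays a word-boundary anchor instead of becoming
# a backspace character:
hits = re.findall(r"\bfoo\b", "foo food foo")
# hits == ['foo', 'foo']
```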