Your semester project

Each student will do a semester project for the course. It must involve developing software or using an existing AI application. Your project should focus on machine learning or natural language processing. (A project related to uncertainty is an option, but it would be difficult.)

Everyone will have to deal with one challenge: you will be working on your project before the relevant topics have been covered in class. So part of your effort, and you will get credit for this, is to learn the relevant aspects of the area you're working on as you go along. The major applications I have listed for the two topics have many tutorial documents and examples to guide you.

You will develop your project in three stages (see the Schedule for the due dates):

In the week after you hand in V1, and again after V2, I will schedule one-on-one meetings with each of you to discuss your project and give you advice to help you create a good project.

Google docs and the format of your project writeup

I have found that Google docs are very useful for developing your projects. Sharing is straightforward - share your doc with me via my course Google account, cs4100sp11@gmail.com. One advantage of Google docs is that you can see my comments the moment I write them - no need to wait to get a hardcopy handed back. You can hand in a hardcopy of your project, but only if it has such complex mathematics that doing it in Google docs would be tedious. I have used LaTeX or LaTeX-based apps to produce equations or equation images, which can be inserted into a Google doc.

Guidelines for content and formatting:

Project topic suggestions

Whatever you choose to do, you should explore tutorials, papers, books, etc. I also urge you to join mailing lists for your topic or system so you can look for answers to questions and ask questions yourself. For both topics below, you should get started quickly. Do not wait for me to get to the related material in the textbook. Some of it comes near the end of the course. I am here to help you get started.

Machine Learning

The UC Irvine Machine Learning Repository has many sample datasets for you to use. You can experiment with a variety of machine learning algorithms applied to a dataset you choose. The most popular datasets include the famous Iris set, wine, breast cancer, poker hand, car evaluation, and forest fires. You can experiment with the two basic forms of machine learning, supervised and unsupervised. Your guide in all this is the data mining book by Witten (about the Weka system), which is on reserve in the library. See the Resources page.
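To make the difference between the two forms concrete, here is a minimal sketch in Python. It is not how Weka does it, and the data points are made up for illustration (they are not real Iris values). The supervised part uses the labels to classify a new point with a nearest-neighbor rule; the unsupervised part ignores the labels and groups the same points with a simple k-means loop. The function names and starting centers are my own choices.

```python
import math

# Hypothetical toy data: (petal length, petal width) in cm. These values are
# made up for illustration and are NOT taken from the real Iris dataset.
labeled = [
    ((1.4, 0.2), "small"),
    ((1.5, 0.3), "small"),
    ((1.3, 0.2), "small"),
    ((4.7, 1.4), "large"),
    ((4.5, 1.5), "large"),
    ((5.0, 1.7), "large"),
]


def nearest_neighbor(train, point):
    """Supervised: predict the label of the closest labeled example."""
    _, label = min(train, key=lambda pair: math.dist(pair[0], point))
    return label


def k_means(points, centers, iterations=10):
    """Unsupervised: group unlabeled points around the given starting centers."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its closest current center.
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return clusters


# Supervised: the labels are used to classify a new, unseen point.
prediction = nearest_neighbor(labeled, (1.6, 0.25))

# Unsupervised: the labels are thrown away and only the features are clustered.
unlabeled = [features for features, _ in labeled]
clusters = k_means(unlabeled, centers=[unlabeled[0], unlabeled[3]])

print(prediction)
print(clusters)
```

With a real dataset you would read the instances from a file instead of typing them in, and you would hold out some labeled instances to test the classifier.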

Typical projects - almost all should use the Weka system. You should get a thorough understanding of the statistics and performance measures and visualizations that Weka provides.
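Among the performance measures Weka reports for a classifier are accuracy, precision, recall, and the F-measure, all derived from the confusion matrix. The sketch below computes them by hand in Python so you can see what the numbers mean. The labels are hypothetical, and the function name is mine, not part of Weka.

```python
# Hypothetical actual and predicted class labels for eight test instances.
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
predicted = ["yes", "no", "yes", "no", "yes", "no", "no", "yes"]


def evaluate(actual, predicted, positive):
    """Return confusion-matrix counts and summary measures for one class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # true positive
        elif a != positive and p == positive:
            fp += 1  # false positive
        elif a == positive and p != positive:
            fn += 1  # false negative
        else:
            tn += 1  # true negative
    accuracy = (tp + tn) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": accuracy, "precision": precision,
        "recall": recall, "f_measure": f_measure,
    }


results = evaluate(actual, predicted, positive="yes")
for name, value in results.items():
    print(name, value)
```

Precision asks "of the instances I called yes, how many really were?", while recall asks "of the real yes instances, how many did I find?". A good writeup discusses both, not accuracy alone.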

Natural Language

A major site that has free corpora drawn from a variety of topic areas is the American National Corpus (ANC). The Open ANC corpora are what you want.

Many text sites advertise "free books" but most come with strings attached. Project Gutenberg is a legitimate site that has over 33,000 high-quality books. You can download them as plain text, suitable for natural language work.
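As a rough idea of what you can do once you have plain text, here is a small Python sketch that splits text into words and counts word and word-pair (bigram) frequencies. The sample sentence stands in for a downloaded book, and the regular expression is a deliberately crude tokenizer of my own; toolkits such as the Natural Language Toolkit offer better ones.

```python
import re
from collections import Counter

# Stand-in text. In a project this would be the contents of a downloaded
# plain-text file, e.g. open("book.txt", encoding="utf-8").read(), where
# "book.txt" is a hypothetical filename.
text = "The cat sat on the mat. The dog sat on the log."


def tokenize(text):
    """Lowercase the text and return its runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


tokens = tokenize(text)

# How often each word occurs.
frequencies = Counter(tokens)

# How often each pair of adjacent words occurs.
bigrams = Counter(zip(tokens, tokens[1:]))

print(frequencies.most_common(3))
print(bigrams.most_common(2))
```

Note that the tokenizer ignores sentence boundaries, so pairs such as ("mat", "the") are counted as bigrams. Whether that matters depends on the analysis you are doing.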

There are many standard analyses you can do which can be the basis for a good project. They include.

Demonstrating the use of GATE or the Natural Language Toolkit for some of the above problems will make a good project.

The Stanford Natural Language site has many software tools.

Some systems furnish APIs in addition to running out of the box. This will allow you to learn about AI programming, hands-on.

Some projects can use a mix of natural language processing and machine learning, e.g., separating sports stories from national news from international news stories.
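One common way to approach this kind of story classification is a naive Bayes classifier over word counts. I am naming that technique as an example only; it is not the required method. The Python sketch below shows the idea on a tiny, made-up training set. A real project would use full stories and a held-out test set, and all function names here are my own.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical miniature training set; a real project would use full stories.
training = [
    ("the team won the match in overtime", "sports"),
    ("the striker scored a late goal", "sports"),
    ("congress passed the budget bill", "national"),
    ("the senate debated the tax bill", "national"),
    ("the treaty was signed by foreign ministers", "international"),
    ("foreign leaders met at the summit", "international"),
]


def tokenize(text):
    """Lowercase the text and return its runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per class, stories per class, and the overall vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocabulary = set()
    for text, label in examples:
        words = tokenize(text)
        word_counts[label].update(words)
        class_counts[label] += 1
        vocabulary.update(words)
    return word_counts, class_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-probability for the text."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Start from the log prior: how common this class is in training.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(training)
label = classify("the goal came late in the match", model)
print(label)
```

The natural language part is turning each story into words; the machine learning part is estimating the word probabilities for each class from labeled examples.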

Another project might apply machine learning to uncertainty problems, using Bayesian approaches.



