Lab 2 Details: Building DBs using random file access -- COM1101 Winter 2001

Updated 28 December 2000

Professor Futrelle
College of Computer Science, Northeastern U., Boston, MA

In this project, covering Labs 2 and 3, you will be creating a table of records in a file on disk for a simple relational database and accessing the records.

This page has more material than the usual machine problem assignment, because it introduces you to a number of topics that apply to all such assignments in this course. You will need to read this page carefully in order to do the lab. The Lab 2 assignment itself, what you are to do and hand in, is here.

BEFORE YOU START

I cannot emphasize too strongly that when working on your machine problems, you need to proceed a step at a time. Write a tiny program to check out something you're designing. Compile and run it. When that works, add the next item you want to check out, possibly commenting out your first experiment. Compile that and run it. Proceed in this way to gradually build up a collection of pieces of your program that work. Depending on what style you've chosen, you'll finally converge on the large program you ultimately need to write or you'll have a collection of working pieces that you can then assemble to form your program. The personal rule I follow for medium scale programs is never to write more than three or four lines of new material before compiling, running and checking that the results are right. A lot of different ideas and tactics are discussed below. None of them are complex, but there are enough different items that I'm sure you'll appreciate how important it is to work with each separately, one at time, as you develop your program.

EXTRA CREDIT

In conjunction with the previous paragraph, I'll "put my money where my mouth is", to use a hackneyed phrase, by giving up to 5 points of extra credit if you do the following: For each experiment you do as you develop your code, record the critical portions of your code as well as your output as well as comments on how and why you designed the code as you did and why the output shows that it's working (or not!). Put these in a file of type .txt which you'll hand in. In addition, you're welcome to include a few failed experiments in which the code wouldn't compile or the output was not quite right, with appropriate comments. You can include all or portions of any DB files you create simply by opening them in your editor and pasting the contents into your text file to show what's getting written to your DB files. The reason for doing things this way is that you typically throw away your various experiments as you develop your program. Recording them in the text file captures "snapshots" of your development process. The entire process takes some time but is not difficult, so everyone can earn extra credit if they choose to!

For those of you who have not developed code in Professor Futrelle's class before this

I do not use any of the CCS tools developed by Professor Rasala and others. Instead, I have you write all code yourself. This means no graphics, but on the other hand it gives you the tools you need to sit down and write C++ code on any system at any time. For example, essentially all the code for this course compiles and runs equally well on unix systems. For instructions on creating an "empty console project" see the following pages. (This is what you will have done in Lab 1.)

An instant introduction to relational databases

A relational database is a collection of "tables", each one a file. Each table has rows and columns. An example would be a database for CDs that has three columns, the first is the artist, a string such as "Tracy Chapman", the second column is the album, such as "Telling Stories" and the third column could be the price, a real (float) number, such as 13.99. Each row would be a separate triple of artist, album and price. The columns are also called "fields" and each row is called a "record".

To store these on disk in an efficient fashion that allows rapid access, each field is set to be a specific length so that each record has a fixed total length. We might allow 50 characters for the artist, 80 for the album and a real number for the price represented by six characters, including the decimal sign, for a total of 136 characters. We usually describe the space requirements in bytes; each character is normally one 8-bit byte. If any string or price has too few characters it is filled out with blanks. Hopefully, no string will be too long and no price will be greater than $999.99. If they are, we typically would truncate the strings, chopping off any characters beyond the length limit. If the price overflows we would want to warn the person or program that constructs the database that a serious problem has arisen, not truncate a price! (One of the major Y2K problems was that only two characters had been set aside for the year in many databases, so after 1999, represented by "99", the next year was "00" which the programs interpreted as 1900. Wrong!)

As an illustration, assume the above example with very restricted lengths, so we can show the records easily, where the numbers 0....9 are repeated below the records to show the positions exactly. We'll assume that the artist and album are 8 characters each and the price is 6 characters, as before, for a total of 22 characters, 0 through 21. Here's a sample record obeying this:

Jim     Singin  12.99
0123456789012345678901

Note that there's one blank of padding after the 12.99. When records are written to disk they are simply streamed out, one after the other. So two 22 character records in a table might look like this:

Jim     Singin  12.99 Ming    Dreamin 13.98 
01234567890123456789012345678901234567890123

The two records total 44 characters, 0 through 43. Remember, the numbers 0...9, etc. have been added just so you can see the positions of the various fields in the records. They do not appear in the database at all.

If we know we want to access the n-th record in the above database, we know it will start at n*22, so the 0-th record starts at 0, the 1-st record starts at 22, etc. Most file handling systems on computers allow reading and writing to start at any specific location the program chooses. So if we start reading the file at location 22 and read a total of 22 characters, we will have read "Ming Dreamin 13.98 " (note that the final blank is included). For the more realistic database we introduced first, the n-th record would begin at byte n*136 in the file.

"Random access" to files

A file is just a long sequence (stream) of bytes. The ability to read and write at any position in this stream is contained in the two C++ functions, seekg() and seekp() ("g" for "get" and "p" for "put"). Each takes an integer argument for the position (0 for first byte in the file). So typical uses would be,

   inFile.seekg(streampos);
   outFile.seekp(streampos);

Below, we explain how to use these.

How to write database records

C++ offers a variety of ways of writing the records out in the proper format. Assume for the moment that we have a 136 character string, rec1, that we want to write out as the 17th record to a file stream, dbFile. We simply execute the following two commands, assuming REC_LENGTH is 136:

   dbFile.seekp(17*REC_LENGTH);
   dbFile << rec1;
After the write, the file is set at position 18*REC_LENGTH, so you could continue adding records in this way. Be sure to close the file when you're done writing your records.

It's a bit messier to write out the individual pieces of the record, three fields in our example, because you have to assure the correct padding. You also have to assure that the numbers, in our example, a float, are formatted properly. To do the formatting it's useful to take advantage of the I/O routines available in C++. There are two ways to do this, one by writing to a string as if it's a stream (by including and using <sstream>) and the other by writing the fields out one at a time. Because input will require picking the fields apart, right down to specific byte positions, we'll need to use the sstream approach, which we now detail, for output.

The capability we need is called stringstream and is accessed by the include just mentioned, <sstream>. We'll give a brief example here. Further details can be found on the page the example is taken from this page from U. Mich.

Construct an ostringstream object:
        
        ostringsteam outstr;

Now, use the << operator to add a number to the string stream

        outstr << 5.3;

To obtain the string from the stream call the member function str().

        outstr.str();

Using this approach, you can do all the output you need to the string stream and then extract the string from it and write it to the file position you've set it to. Remember that you'll need to pad any strings that are too short (you should write a function to do this) as well as output the real number in a controlled format, using the techniques in Sec. 5.2 of your textbook. You may also need to pad your numbers with blanks.

One of your goals will be to format a real number (a float) to occupy a certain number of characters and to print in fixed point notation, not scientific notation with an exponent, a power of ten. This can be done either with the stream functions described in your text starting on page 241 or with I/O manipulators, described starting on page 244. The only settings you'll need are for fixed and for the width parameter. An example for some output stream outstream (a file of stringstream) would be:

outstream << setw(6) << fixed << setprecision(2) << 9.99;
which would produce the value positioned in the following way with two decimal places of precision, a total width of six characters and the default right justification (positions shown as before):
  9.99
012345

You'll need to include <iomanip> for the manipulators and use the namespace designator ios:: if you use the stream functions, as shown in the text.

To format strings to occupy a field of fixed width, you'll either need to truncate them if they're too long or pad them with blanks if they're too short. The string functions you need are listed in Display 10.11 on pg. 650. The important length() function you need is omitted there. For a string s, use s.length() to return the length as an integer. Write yourself a function pad(string &s, int n) that contains the appropriate for loop and adds blanks with the assignment statement:   s += ' ';

How to read database records

To read database records, a clean approach is to read in the exact number of characters you want, which can be done using the following function, similar to getline:

// Prototype
istream & getBytes( istream & in, string & str, int count );

// Function definition
istream & getBytes( istream & in, string & str, int count )
{
  char ch;
  str = "";     // empty string, will build one char at-a-time
  int num = 0;

  while( in.get( ch ) && num < count )
    {str += ch;
    num++;
    }

  return in;
}

An example of the use of this function is:

void main()
{
  string inString;

  // open and position the file, etc., with stream fin.
  // [omitted]
  
  getBytes(fin, inString, 136); 
  cout << inString << endl; // will show what you read
}

Once you've read in the entire record to a string, you should create substrings of it, and where necessary, read from them using an input string stream (see the U. Mich notes). Substring creation is described in your textbook, pgs. 919-920. There is a direct way to convert strings to numbers, also described in the textbook, pg. 919. But note that these functions require cstrings, so you'll need to use c_str() on pg. 650.

Other strategies I found useful or neccessary for this lab

I used the following includes:

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <iomanip>

I used c_str() to exract the C-string from a string variable to give the form that the file-open commands need (text pg. 650).

If you want to read in just part of a record, you could define a function that will seek to the proper position in the field of a record and then read only the number of bytes for that field, e.g.,

void getFileField(istream & in, string &str, int pos, int count)

I used constants such as FLOAT_LEN for the field and record sizes and float precision.

And of course, I defined functions such as the following to write fields to a string stream. These functions can use the known widths (global constants) as well as the I/O manipulators, to write the correct number of characters.

void addInt(stringstream& s, int oneInt)

I used simple arrays to keep the values I'd need for the various fields. For example, for the floats I used:

float floats[] = {1.1, 22.22, 333.333};

After reading in an entire record with getBytes(), as as a string, I used the substr() function (pg. 650) to extract the portion I needed and then used the str() function to convert it to a string stream and then read from that into a variable of the correct type. So if the string fieldBuff holds the field I want, e.g, "       547", then the following would get that string into an integer variable, intField, using the string stream s:

	s.str(fieldBuff);
	s >> intField;

Warning: The string stream must be reset before you read from it again, using s.clear();. This is because a string stream is a stream, just like a file stream and must be dealt with accordingly. It is much more than just another type of string.