3.3   Designing Data Types


This section under major construction.


Designing data types

Theme = limit what you, the programmer, can do. Emphasize encapsulation and immutability.

Designing APIs.

Often the most important and most challenging step in building software is designing the APIs. In many ways, designing good programs is more challenging that writing the code itself. Takes practice, careful deliberation, and many iterations.

Encapsulation.

Designing reusable data types requires great care. The API is a contract between the client and the implementation that specifies the set of permissible operations and what is their required behavior. Encapsulation is the process of separating and compartmentalizing the client from the implementation by hiding how the data type is implemented. This enable you to change the "how" as the program evolves. "A nut and bolt are compatible if they have the same diameter and threads per inch." This is the interface. However, "they may be made of carbon steel, steel, bronze, nylon, titanium, whatever."

Encapsulation enables us to:

For example, although we've been using the String data type since our first program, we have not specified how Java implements it. This is perfectly fine because we only need to know what operations it supports in the API, not the underlying representation.

Some mottos.

Zip codes and time bombs.

There are many examples of programmers making assumptions about the internal representation of a data type, and later having to pay a big price to fix their mistakes. One of the lessons we should learn from this is to use encapsulated data types. If we must change the data representation, the effects are localized to one ADT and we don't need to search through millions of lines of code to find all of the places where we assumed one particular data representation.

Encapsulation in Java. Java provides language support for information hiding. When we declare an instance variable (or method) as private, this means that the client (code written in another module) cannot directly access that instance variable (or method). The client can only access the API through the public methods and constructors. Programmer can modify the implementation of private methods (or use different instance variables) with the comfort that no client will be directly affected.

Program Counter.java implements a counter, e.g., for an electronic voting machine. It encapsulates a single integer to ensure that it can only get incremented by one at at time and to ensure that it never goes negative. The goal of data abstraction is to restrict which operations you can perform. Can ensure that data type value always remains in a consistent state. Can add logging capability to hit(), e.g., to print timestamp of each vote. In the 2000 presidential election, Al Gore received negative 16,022 votes on an electronic voting machine in Volusia County, Florida. The counter variable was not properly encapsulated in the voting machine software!

non-encapsulated data type                   encapsulated data type
--------------------------------------       --------------------------------------
public class Counter {                       public class Counter { 
   public final String name;                    private final String name;
   public int count;                            private int count; 
   public Counter(String name) {                public Counter(String name) {
      count = 0;                                   count = 0;
      this.name = name;                            this.name = name;
   }                                            }
   public void hit() { count++;      }          public void hit() { count++;      }
   public int  get() { return count; }          public int  get() { return count; }
}                                            }

client                                       client
--------------------------------------       ---------------------------------------
Counter c = new Counter("Volusia");          Counter c = new Counter("Volusia");
c.count = -16022;   // legal                 c.count = -16022;   // compile-time error

Data representation changes in scientific applications.

Simple example: represent a point using Cartesian or polar coordinates. Polynomials (coefficents vs. point-value), matrices (sparse vs. dense).

Immutability.

An immutable data type is a data type such that the value of an object never changes once constructed. Examples: Complex and String. When you pass a String to a method, you don't have to worry about that method changing the sequence of characters in the String. On the other hand, when you pass an array to a method, the method is free to change the elements of the array.

Immutable data types have numerous advantages. they are easier to use, harder to misuse, easier to debug code that uses immutable types, easier to guarantee that the class variables remain in a consistent state (since they never change after construction), no need for copy constructor, are thread-safe, work well as keys in symbol table, don't need to be defensively copied when used as an instance variable in another class. Disadvantage: separate object for each value.

Josh Block, a Java API architect, advises that "Classes should be immutable unless there's a very good reason to make them mutable....If a class cannot be made immutable, you should still limit its mutability as much as possible."

Give example where function changes value of some Complex object, which leaves the invoking function with a variable whose value it cannot rely upon.

mutable          immutable
------------------------------------
Counter          Complex
MovingCharge     Charge
Draw             String
array            Vector
java.util.Date   primitive types
Picture          wrapper types

Design-by-contract.

Language mechanism that enables you to verify assumptions about your program as it is running. For example, if you have a data type that represents a particle, you might assert that its mass is positive and its spped is less than the speed of light. Or if you have a method to add two vectors of the same length, you might assert that the length of the resulting vector also has the same length. Design-by-contract model of programming. Transition to exceptions and assertions.

Date.

A culminating example that illustrates use of final, assert, exceptions, immutability, and private helper methods. Would make a nice anatomy diagram. Also could be a set-up for Comparable in the next section.

Application: commercial transaction system. Want to know 3 (business days) from today.

Exercise: add a method daysUntil() and use this to implement compareTo() in next section. Add a method day() that returns a String representing the day of the week (e.g., Sunday, Monday).

Program Date.java is an immutable data type that represents a date. Internally, it uses three integers for the month, day, and year. It includes a constructor, toString(), next(), isAfter().

By accessing the data only through the interface, our data type can perform range checking (e.g., no such thing as January 0 or February 30). For simplicity, we ignore leap years. (See Exercise XYZ.) By declaring all instance variables as final, we ensure that they can only be initialized in the constructor. This makes it easy to check the invariant that the values are always in a consistent state. The time invested in performing routine error-checking more than pays off as we start writing more complex programs. Using encapsulation enables us to isolate to a few places where an object can change state.

Vectors.

As second example of a mathematical abstraction, we now consider the vector data type. Like Complex, the basic definition of the vector abstraction is familiar because it has played a central role in applied mathematics for over 100 years. The field of mathematics known as linear algebra is concerned with properties of vectors. Linear algebra is a rich and successful theory with numerous applications that plays an important role in all fields of social and natural science. Vector

A spatial vector is an abstract entity that has a magnitude and a direction. Vectors provide a natural way to describe properties of the physical world, such as force, velocity, momentum, or acceleration. One standard way to specify a vector is as an arrow from the origin to a point in a Cartesian coordinate system: the direction is the ray from the origin to the point and the magnitude is the length of the arrow (distance from the origin to the point). To specify the vector it suffices to specify the point.

The concept extends to any number of dimensions: an ordered list of N real numbers (the coordinates of an N-dimensional point) suffices to specify a vector in N-dimensional space. By convention, we use a boldface letter to refer to a vector and numbers or indexed variable names (the same letter in italics) separated by commas within square brackets to denote its value. For example, we might use x to denote (x[0], x[1], ..., x[N-1]) and y to denote (y[0], y[1], ..., y[N-1]). In Java, this decision implies that we can represent each Vectors data type value as an array of double values.

The basic operations on vectors are to add two vectors, multiply a vector by a scalar, compute the dot product of two vectors, and to com- pute the magnitude and direction, as follows:

The result of addition and scalar multiplication and the direction are vectors, but the magnitude and the dot product are scalar quantities (real numbers). For example, if x = (0, 3, 4, 0), and y = (0, -3, 1, -4), then x + y = (0, 0, 5, -4), 3x = (0, 9, 12, 0), x dot y = -5, |x| = 5, and x / |x| = (0, .6, .8, 0). The direction vector is a unit vector: its magnitude is 1. These basic mathematical definitions lead immediately to an API:

Vector API
Program Vector.java is an immutable class that implements this API. Note that the instance variable is not a primitive type, but rather an array.

The this reference. The magnitude() and direction() methods in Vector make use of the name this. Java provides the this keyword to give us a way to refer to an instance method to the object whose name was used to invoke this method. You can use this in code in the same way you use any other name. For example, some Java programmers would prefer to code plus() in Complex as follows:

public Complex plus(Complex b) { 
   Complex a = this; 
   return new Complex(a.re + b.re, a.im + b.im); 
} 
to emphasize that a and b play an equal role in the operation. Some Java programmers always use this to refer to instance variables. Their scope is so large (the whole class) that this policy is easy to defend. However, it does result in a surfeit of this keywords, so we take the opposite tack and use this sparingly in our code. If a method has no parameters, or parameters whose names do not conflict with instance variable names, we normally refer directly to instance variables instead of using this.

Similarity search.

Reference: Gauging Similarity via N-Grams. CompareAll.java performs similarity search. It relies on the data type Document.java that uses hashCode().

Q + A

Q. What happens if I try to acccess a private instance variable or method from another file?

A. A compile-time error that says the given instance variable or method has private access in the given class.

Q. Does Java have other access modifiers besides public and private? Yes, although we restrict attention in this book to public and private. Here's a table.

Access modifier  Class  Package  Subclass  Any Class
private            Y       N         N         N 
[no specifier]     Y       Y         N         N 
protected          Y       Y         Y         N 
public             Y       Y         Y         Y  

Q. The instance variable x in type Complex is declared with access modifier private. How come in the plus method I can access both this.x and b.x?

A. The granularity of private access is at the class level, not the instance level. Declaring an instance variable as private means that it is not directly accessible from any other class. Methods within the Complex class can access (read or write) the instance variables of any instance in that class. It might be nice to have a more restrictive access modifier, say superprivate, that would make the granularity at the instance level so that only the invoking object can access its instance variables.

Q. How would I add a constructor to the Complex data type that takes r and θ as arguments?

A. It's a problem since you already have one constructor that takes two real arguments. Instead, you should probably create two methods createRect(x, y) and createPolar(r, theta) that create and return new objects. This style will also help remind the client to pass in arguments of the correct form.

Q. Is it possible to defeat Java's type safety features?

A. If there are no bugs in the Java implementation, an untrusted program should not have access to the private data of a trusted program. This guarantee assumes that "the computer faithfully executes its specified instruction set." If a cosmic ray strikes the computer's memory, it can flip a bit, thereby invalidating this axiom. Appel and Govindavajhala showed how to craft a program so that when executed, any single bit error would yield a 70% chance of the program gaining unfettered access to the JVM. They also demonstrated that such errors could be induced by shining a 50 watt spotlight bulb at the memory for one minute. Smartcards are particularly susceptible since the attacker has physical access to the computer.

Also reflection can defeat the type safety features, but this requires that the user permit such actions.

Exercises

  1. Modify Date to handle leap years. Hint: change 28 to 29, add a helper method isLeapYear(), and add one line to next().
  2. Add a method daysUntil() to Date that takes a Date b as an argument and returns the number of days between b and the invoking date.
  3. Create an implementation Date2.java that represents a date a single integer that counts the number of days since January 1, 1970. Compare to Date.java.
  4. Represent a point in time by using an int to store the number of seconds since January 1, 1970. When will programs that use this representation face a time bomb? How should you proceed when that happens?
  5. Create a data type GeographicCoordinate that represents a geographic coordinate either in (degrees, minutes, seconds, sign) or in floating point.
  6. Create a data type Location for dealing with locations on Earth using spherical coordinates (latitude/longitude). Include methods to generate a random location on the surface of the Earth, parse a location "25.344 N, 63.5532 W" and compute the great circle distance between two locations.
  7. Create a Rectangle ADT that represents a rectangle. Represent a rectangle by two points. Include a constructor, a toString method, a method for computing the area, and a method for drawing using our graphics library.
  8. Repeat the previous exercise, but this time represent a Rectangle as the lower left endpoint and the width and height.
  9. Repeat the previous exercise, but this time represent a Rectangle as the center and the width and height.
  10. Particle. Create a data type for a 3D particle with position (rx, ry, rz), mass (m), and velocity (vx, vy, vz). Include a method to return its kinetic energy = 1/2 m (vx^2 + vy^2 + vz^2).

    Alternate representionation: position (rx, ry, rz), mass (m), and momentum (px, py, pz). In this case vx = px / mass, vy = py / mass, vz = pz / mass. kinetic energy = 1/2 (px^2 + py^2 + pz^2) / mass.

  11. Generating random numbers. Different methods to generate a random number from the standard Gaussian distribution. Here, encapsulation enables us to replace one version with another that is more accurate or efficient. Trigonometric method is simple, but may be slow due to calling several transcendental functions. More importantly, it suffers from numerical stability problems when x1 is close to 0. Better method is alternate form of Box-Muller method. reference. Both methods require two values from a uniform distribution and produce two values from the Gaussian distribution with mean 0 and standard deviation 1. Can save work by remembering the second value for the next call. (This is how it is implemented in java.util.Random.) Their implementation is the polar method of Box-Muller, saving the second random number for a subsequent call. (See Knuth, ACP, Section 3.4.1 Algorithm C.)
  12. LAX aiport shutdown. On September 14, 2004 Los Angeles airport was shut down due to software breakdown of a radio system used by air traffic controllers to communicate with pilots. The program used a Windows API function call GetTickCount() which returns the number of milliseconds since the system was last rebooted. The value is returned as a 32 bit integer, so after approximately 49.7 days it "wraps around." The software developers were aware of the bug, and instituted a policy that a technician would reboot the machine every month so that it would never exceed 31 days of uptime. Oops. LA Times blamed the technician, but the developers are more to blame for shoddy design.

Creative Exercises

  1. International bank account (depositDollars, depositEuros, withdrawDollars, withdrawEuros). How to represent money (cents, BigDecimal, precision, etc.)
  2. Turtle graphics. Turtle location starting at origin (forward, rotate, draw) Use polar and Cartesian coordinates.
  3. Polar representation of points. Point.java and PointPolar.java implement the following point interface using rectangular and polar coordinates, respectively.
    Point()
    Point(double, double)
    double x() 
    double y() 
    double r()
    double theta()
    double distance(Point)
    public String toString()
    
  4. Spherical coordinates. Represent a point in 3D space using Cartesian coordinates (x, y, z) or spherical coordinates (r, θ φ). To convert from one to the other, use

    spherical coordinates
  5. Colors. Can represent in RGB, CMYK, or HSV formats. Natural to have different implementations of same interface.
  6. Time. Implement an ADT that represents the time of day, say with seconds, minutes, and hours. Representation 1: three fields for hours, minutes, and seconds. Representation 2: number of seconds since 12 midnight. Include compareTo, plus.
  7. ZIP codes. Implement an ADT that represents a USPS ZIP code. Support both the original 5 digit format and the newer (but optional) ZIP+4 format.
  8. VIN number. Read in a VIN number and report back all relevant information.
  9. Vector fields. Write a data type for force vectors in two dimensions. You should be able to add vectors using the usual rules from physics using x sin(theta) and x cos (theta)....

  10. Complex numbers. In Section 3.2, we considered program Complex.java, which is an ADT for complex numbers. Internally, it stores a complex number c = x + iy as its real and imaginary parts (x, y). This is called the rectangular representation. Write a new implementation that represents a complex number as c = r (cos θ + i sin θ). This is known as the polar representation, and we refer to r as the modulus and θ as the angle. This representation is especially useful if we are multiplying or dividing complex numbers. Program Complex.java implements the same complex number data type as Complex.java, but stores the polar representation (r, θ) instead of the the rectangular representation (x, y). Since it has exactly the same public methods and public constructor as the version with rectangular coordinates, we can use it interchangeably with the other version of Complex.java. The main difference to the client is in different performance properties. For example, implementing an exponentiation method becomes much easier with the polar representation, especially if the power is a negative integer or if the power itself is complex! Other operations become more cumbersome to implement and expensive to compute, e.g., addition and subtraction. There is one minor caveat that makes these two implementations not completely interchangeable: floating point precision may influence the two implementations in different ways. For example, the rectangular implementation represents 3 + 4i exactly, whereas the polar implementation must store an approximation to the angle θ since it is an irrational number. Such difference should not effect a well-behaved client, but we should be aware of any places where an encapsulated data type leaks information.

  11. Statistics. Encapsulation enables us to easily replace one implementation with another. This is desirable when we anticipate creating an improved implementation with better performance or accuracy properties. We consider a simple, but representative example below. Our goal is to store statistics for a set of real values, say the sample mean x bar and the sample variance s2.

    mean and variance

    We want our data type to support the following methods: add, size, mean, stddev, variance. OnePass.java stores n, the sum, and the sum of squares. TwoPass.java stores n, the sum, and all of the data values in an array. The one pass algorithm is faster and uses substantially less space, but is susceptible to roundoff error. StableOnePass.java is a well-engineered alternative.

    stable one pass algorithm for sample mean and variance

    It is numerically stable and does not require an array to store the elements.

    public void add(double x) {
       n++;
       s = s + 1.0 * (n-1) / n * (x - m) * (x - m);
       m = m + (x - m) / n;
    }
    
    public double mean()     { return m;            }
    public double variance() { return s / (n - 1);  }
    

  12. Encapsulation. Why does the following break encapsulation, even though all instance variables are declared private.
    public class Appointment {
        private Date date;
        private String customer;
        public Appointment(Date date) {
            // check that date is in some legal range
            this.date = date;
        }
        public Date getDate() { return date; }
    
    Answer: The reason is that the class Date is mutable. The method setDate(seconds) changes the value of the invoking date to the number of milliseconds since January 1, 1970, 00:00:00 GMT. This has the unfortunate consequence that when the function d = getDate() returns the date, the client program can invoke d.setDate() and change the date in an Appointment object type, perhaps setting it to an illegal value for a member of Appointment. Must not let references to mutable objects escape since caller can then modify its state. One solution is to create a defensive copy of the Date before returning it using new Date(date.getTime()); also need to do a defensive copy when storing it via this.date = new Date(date.getTime()). Many programmers regard the mutability of Date as a design flaw. (GregorianCalendar is a more modern Java library for storing dates; but it is mutable too.)
  13. Sparse vector. Create a data type for sparse vectors. Represent a sparse vector by an array of indices (of nonzeros) and a parallel array of the corresponding nonzero values. Assume the indices are in ascending order. Implement the dot product operation.

  14. Genome. Implement a data type to store the genome of an organism. Biologists often abstract away the genome to a sequence of nucleotides (A, C, G, or T). The data type should support the method addNucleotide, nucleotideAt(int i), and doSomeComputation. Perhaps change to addCodon. Advantages of encapsulation: can check that only legal nucleotides are added, can change to more time or memory efficient implementation without affecting client.
    • StringGenome.java has one instance variable of type String. It implements addNucleotide with string concatenation. Each method call takes time proportional to the size of the current genome. Not practical spacewise either for large genomes since nucleotide is stored as a 16-bit char.
    • Genome.java implements a genome as an array of characters. The size of the array is doubled when the array fills up. The method addNucleotide is now constant time. Space consumption is still 16 bits per nucleotide.
    • CompactGenome.java implements a genome as boolean array. We need to use two bits per nucleotide since there are 4 different nucleotides. As in the previous implementation, we use a dynamic array with repeated doubling. Now, each nucleotide consumes 2 bits of storage (instead of 16).
  15. Given a Genome Include a method to return the reverse-complemented genome. Include a method for testing for equality.
  16. Add methods to Date to check which season (Spring, Summer, Fall, Winter) or astrological sign (Pisces, Libra, ...) a given date lies. Be careful about events that span December to January.
  17. Copy constructor. Only needed if data type is mutable. Otherwise, assignment statement works as desired.
    public Counter(Counter x) {
        count = x.count;
    }
    
    Counter counter1 = new Counter();
    counter1.hit();
    counter1.hit();
    counter1.hit();
    Counter counter2 = new Counter(counter1);
    counter2.hit();
    counter2.hit();
    StdOut.println(counter1.get() + " " + counter2.get());    // 3 5
    


Engineering.

Engineering involves the design, implementation, and analysis of a useful solution to a practical problem. The design is limited by resource constraints including time, money, usability, flexibility for future modification, maintainability, safety, esthetics, marketing, and ethics. The analysis models how well the design will perform and whether it will be the project constraints. Prototyping, testing, simulation are all components of a proper analysis. Designing quality software is similar in many ways to designing bridges. "A good scientist is a person with original ideas. A good engineer is a person who makes a design that works with as few original ideas as possible." -- Freeman Dyson. When programming, use the principle of least surprise.

Software engineering.

Programming is a form of engineering: apply computer science to the benefit of humankind. Unlike other forms of engineering (which have been around for hundreds of years), software engineering is very young (~30 years) and no widespread consensus on best practice and engineering standards. Nevertheless, some key ideas have emerged (modular programming, data abstraction, encapsulation). OOP is one of the main tools for building complex programs. We will also consider other software engineering aspects that complement OOP including specification, design, coding, and testing. a computer program to solve a problem. Testing, unit tests (testing smallest compilable element in isolation), integration tests (testing several interacting modules), functional tests tests, regression testing... Test characteristics: functional (does it do what it should do, and not do what it should not), performance, usability, security, etc.) Test detail: unit test (one xxx), integration (several interacting modules), system (the whole thing). Test access: black box (treat module as a black box), white box (use internal structure to determine test cases). Each test should return true or false. Specification is often the biggest source of errors. Compile-time, run-time, and logic errors. Compiler error messages can be cryptic, but usually can be resolved after a few minutes. On the other hand logic errors are the bane of programming. Can take hours to days to discover the error, even if it was just a simple typo. We should be thankful of all the errors that the compiler discovers, and it is a blessing to use a language and compiler that facilitate discovering errors at compile time.

In practice, companies hire employees to solely perform testing. Ratio of testers to programmers can range from 10:1 to 1:10.

Fred Brooks hypothesizes that the number of lines of code a programmer writes per day is about the same, regardless of which language they use (Java, C, machine language). One of main goals of using a higher level programming language (e.g., Java) and OOP is so that we can write 10 lines of Java code that replaces 1,000 lines of machine language code. Another goal is to write programs that are easier to debug than machine language programs. OOP helps address both goals.

Software life-cycle. Analysis and specification, design, construction, testing, operation, maintenance. Waterfall model: do all of the tasks in sequence, reaching decisive milestones along the way. (Inflexible, especially if design is wrong.) Prototype model: build quick prototype, evaluate, and redesign if needed. (harder to manage and throw away previous designs)

Testing. Design good test cases (especially at boundaries), test coverage (use computer to generate test inputs), unit tests, regression testing. Program should always either complete successfully or output a useful error message. Bugs in computer-aided design (CAD) program nuclear reactor control system, guided missile, spacecraft environmental control can be disastrous. Validation = checking that the software does what the customer specified (producing the right product). Verification = checking that the software has been written according to design spec (producing the product right).

Often, it's convenient to write the tests before the code. If you can't write the tests, you might need to change the interface or specs. If you still can't, you need to spend more time thinking before you begin to write code.

Bugs. CMU study of 30,000 programs: 5-6 defects per 1,000 lines of code in production software. New programs: 100 defects per 1,000 lines of code. $60 million bug (misused break statement in C program) in ATT long distance system - 114 switching nodes crashed on January 15, 1990 collection of software bugs.