Data Types =========== What is a type? Type = set of values + set of operations on these values that possess certain properties Topics to be covered: Data types supported in modern programming languages simple and compound types Type declaration Type inference and type checking Type equivalence, type compatibility, type conversion and coercion Strongly-typed, Weakly-typed and untyped languages Static Vs Dynamic type checking SIMPLE TYPES ------------- Simple Types Predefined: int, float, double, etc in C int, bool, real, etc. in SML All other types are constructed, starting from predefined (aka primitive) types Enumerated : "enum colors {red, green, blue};" in C "datatype colors = red | green | blue" in SML Note: datatype is a keyword used in SML to introduce new types COMPOUND TYPES -------------- Types contruted from other types using TYPE CONSTRUCTORS 1. Cartesian Product: --------------------- Let I represent the integer type (equivalently, the set of integers) and R represent reals. Then the cross product I x R, defined in the usual manner of product of sets, represents the cartesian product type obtained from I and R. i.e., I x R = {(i, r) | i in I, r in R} Products types correspond to "tuples" in SML. They are not supported in typical imperative languages, except with labels (see below). The above product type is denoted int * real in SML. Example (SML): # let v = (2,3.0); val v = (2,3.0) : int * real - type mytype = int * real Note: type is a keyword used to introduce new names (abbreviations) for types that are already known to SML. Datatype is for introducing new types unknown to SML. Note: Just as the cartesian product operator on sets is nonassociative, the `*' type constructor is not associative. Example: # let t = (2,3,4.0); val t = (2,3,4.0) : int * int * real # let s = ((2,3), 4.0); val s = ((2,3),4.0) : (int * int) * real # let u = (2, (3,4.0)); val u = (2,(3,4.0)) : int * (int * real) - t = s; stdIn:21.1-21.6 Error: operator and operand don't agree [equality type required] operator domain: ''Z * ''Z operand: (int * int * real) * ((int * int) * real) in expression: t = s NOTE: compiler complains that the types of arguments to equality operator must be the same, but it is not so in this case. You will get the same message if you try to compare s = u or t = u. NOTE: The equality operator has the type t * t -> bool for any type t. Since we say "for all values of t," t denotes a variable in the mathematical sense. It is called a type variable. SML uses a naming convention by which type variable names start with "'" or "''" and consist of the usual charaters used in identifiers. Elements of a tuple can be extracted using the # operator: - #2(u); val it = (3,4.0) : int * real - #3(t); val it = 4.0 : real - #3(u); stdIn:24.1-24.6 Error: operator and operand don't agree [record labels] operator domain: {3:'Y; 'Z} operand: int * (int * real) in expression: (fn {3=3,...} => 3) u NOTE: the error message says that u does not have a third component. - #2(#2(u)); /* second component of the second component of u */ val it = 4.0 : real Labelled products: ------------------ In pure cartesian products, the components of a tupe do not have names. Instead, they are identified by numbers. In LABELED PRODUCTs, each component of a tuple is given a name. Labeled products are also called RECORDS, which is a language-neutral term. "Struct" is a term that is specific to C and C++. "struct t { int a; float b; char * c;};" in C "type t = { a:int, b:real, c:string};" in SML In SML, components of a labled tuple value v can be accessed using the syntax #(v), just as we accessed the component of an unlabeled tuple using #(v) # let v = {a=3, b=1.0, c="six"}; val v = {a=3,b=1.0,c="six"} : {a:int, b:real, c:string} - #a(v); val it = 3 : int 2. Function Types: ------------------- "->" is the function type constructor T1 -> T2 is a function type; it corresponds to a function that takes arguments of type T1 and returns a value of type T2 Standard ML supports functions as first class values. They can be created and manipulated by other functions. In imperative languages such as C/C++, we can pass pointers to functions, but this does not offer the same level of flexibility. For example, there is no way for a C-function to dynamically create and return a pointer to a function; rather, it can return a pointer to an EXISTING function. For this reason, our discussion of function types will be focussed mainly on SML. Examples in SML: ---------------- # let f x = x*x; val f = fn : int -> int # let g x y = x*(y:real); val g = fn : real -> real -> real NOTE: g is different from h given below. g takes two arguments, which can be supplied one at a time (see example two lines below). h takes only one argument, which is itself a tuple consisting of two components. # let h (x, y) = x * (y:real); val h = fn : real * real -> real # let v = g 3.0; val v = fn : real -> real NOTE: when g is given one argument, it returns a new function value. The type of g is real -> real -> real. Since the '->' operator is right-associative, we read the type as real -> (real -> real). In other words, g, when given an argument of type real, returns a value of type (real -> real). # let u = v 2.0; val u = 6.0 : real NOTE: when a real argument is given to v, it consumes it and produces an output value of type real. v is called a "closure" -- it represents a function for which some of the arguments have been provided, but its evaluation cannot proceed unless additional arguments are provided. The closure "remembers" the arguments supplied so far, so that when the remaining arguements are provided at a later time, evaluation can proceed. 3. Union types -------------- Union types correspond to set unions, just like product types corresponded to cartesian products. Unions can be tagged or untagged. C/C++ support only untagged unions: union v { int ival; float fval; char cval; }; In untagged unions, there is no way to ensure that the component of the right type is always accessed. For instance, an integer value may be stored in the above union. Due to a programming error, the fval field may be accessed at a later time. Clearly, the fval field does not contain a valid value at this point, so you will get some garbage. With tagged unions, the compiler can perform checks at runtime to ensure that the right components are accessed. Tagged unions are NOT supported in C/C++. Other languages such as Pascal support tagged unions using VARIANT RECORDs RECORD CASE b: BOOLEAN OF TRUE: i: INTEGER; | FALSE: r: REAL END END Tagged union is also called a "discriminated union," while untagged union is called an "undiscriminated union." Tagged unions are supported in OCAML using datatype declarations. # type tt = Floatval of float | Intval of int;; type tt = Floatval of float | Intval of int # let v = Floatval (2.0);; val v : tt = Floatval 2. # let u = Intval (3);; val u : tt = Intval 3 # let add (x, y) = match (x, y) with (Intval x1, Intval x2) -> Intval(x1+x2) | (Floatval x1, Floatval x2) -> Floatval(x1+.x2);; Warning 8: this pattern-matching is not exhaustive. Here is an example of a value that is not matched: (Floatval _, Intval _) val add : tt * tt -> tt = # add (u, v);; Exception: Match_failure ("//toplevel//", 14, 3). # let w = Intval(3);; val w : tt = Intval 3 # add(u,w);; - : tt = Intval 6 NOTE: we can redefine add as follows so as to permit addition of reals and ints. We add two new rules, and in each rule, convert the integer type to real type explicitly. Note that the warning about "nonexhaustive match" is now gone. # let add (x, y) = match (x, y) with (Intval x1, Intval x2) -> Intval(x1 + x2) | (Floatval x1, Floatval x2) -> Floatval(x1 +. x2) | (Intval x1, Floatval y1) -> Floatval(float_of_int(x1) +. y1) | (Floatval x1, Intval y1) -> Floatval(x1 +. float_of_int(y1));; val add : tt * tt -> tt = 4. Array types --------------- Array construction is denoted by array(, ). Thus, a C-declaration int a[5]; defines a variable a of type "array(0-4, int)" A declaration union tt b[6][7]; declares a variable b of type "array(0-4, array(0-6, union tt))" Sometimes we may not consider the range as part of an array variable's type. This corresponds to treating "array" as type constructor that takes just a single argument, namely, the element type. 5. Pointer types ------------------ A pointer type will be denoted using the syntax ptr() where denote the types of the object pointed by a pointer type. The C-declaration char *s; defines a variable s of type ptr(char), whereas a declaration int (*f)(int s, float v) defines a (function) pointer of type ptr(int*float -> int) Recursive types --------------- Recursive type: a type defined in terms of itself. Example in C: struct IntList { int hd; intList tl; } Does not work: this defn corresponds to an infinite list there is no end, because there is no way to capture the case when the tail has the value "nil". Need to express that tail can be nil or be a list Try: variant records: TYPE charlist = RECORD CASE IsEmpty: BOOLEAN OF TRUE: /* empty list */ | FALSE: data: CHAR; next: charlist; END END Still, languages that support variant records dont permit this declaration. Problem: Cannot predict amount of storage needed. Soloution in typical imperative languages: Use pointer types to implement recursive type, as follows: struct IntList { int hd; IntList *tl; } Now, tl can be: a NULL pointer (i.e., nil or empty list) or point to a nonempty list value. Now, IntList structure occupies only a fixed amount of storage Direct definition of recursive types IS supported in OCAML using datatype declarations. # type intBtree = LEAF of int | NODE of int * intBtree * intBtree;; type intBtree = LEAF of int | NODE of int * intBtree * intBtree We are defining a binary tree type inductively: Base case: a binary tree with one node, called a LEAF Induction case: construct a binary tree by constructing a new node that sores an integer value, and has two other binary trees as children We may construct values of this type as follows: - let l = LEAF(1);; val l = LEAF 1 : intBtree - let r = LEAF(3);; val r = LEAF 3 : intBtree - let n = NODE(2, l, r);; val n = NODE (2,LEAF 1,LEAF 3) : intBtree Types can be mutually recursive. Consider the following type declaration: # type expr = PLUS of expr * expr | PROD of expr * expr | FUN of (string * exprs) | IVAL of int and exprs= EMPTY | LIST of expr * exprs;; type expr = PLUS of expr * expr | PROD of expr * expr | FUN of (string * exprs) | IVAL of int and exprs = EMPTY | LIST of expr * expr The key word "and" is used for mutually recursive type definitions. ("and" is also used for mutually recursive function defitions.) We could also have defined expressions using the predefine list type: type expr=PLUS of expr*expr | PROD of expr*expr | FUN of string * expr list;; type expr = PLUS of expr * expr | PROD of expr * expr | FUN of string * expr lis Examples: The expression "3 + (4 * 5)" can be represented as a value of the above datatype expr as follows: let v3 = IVAL(3);; let v5 = IVAL(5);; let v4 = IVAL(4);; let pr = PROD(v5, v4);; let pl = PLUS(v3, pr);; The following picture illustrates the structure of the value "pl" and how it is constructed from other values. pl ------> PLUS / \ / \ v3 ---> IVAL PROD <----- pr | /\ | / \ 3 /->IVAL IVAL <--- v4 / | | v5 | | 5 4 Similarly, "f(2,4,1)" can be represented as: let a1 = EMPTY;; let a2 = ARG(IVAL(4), a1);; let a3 = ARG(IVAL(2), a2);; let fv = FUN("f", a3);; Note the use of "expr list" to refer to a list that consists of elements of type "expr." Polymorphism: ------------- Ability of a function to take arguments of multiple types. The primary use of polymorphism is code reuse. Functions that call polymorphic functions can use the same piece of code to operate on different types of data. NOTE: Polymorphism will be discussed through the course. It is a very important concept. Overloading (aka adhoc polymorphism): Same function NAME is used to represent different functions, but -- implementations may be different -- arguments may have different types Example: operator '+' is overloaded in most languages so that they can be used to add integers or reals. But implementation of integer addition differs from float addition. Arguments for integer addition or ints, for float addition, they are floats Arbitrary function names can be overloaded in C++, but not in C. All virtual functions are in fact overloaded functions. NOTE: Overloading is called aka adhoc polymorphism mainly because of the historical reason, not because it is special in today's use. Actually, overloading is used quite often nowadays. NOTE: In C++, we have virtual and nonvirtual functions. In Java, all the functions in classes are in fact virtual. Parametric polymorphism: -- same function works for arugments of different types -- same code is reused for arguments of different types. -- allows reuse of "client" code ((i.e., code that calls a polymorphic function) as well Overloading: -- due to differences in impl. of overloaded functions, there is no code reuse in their implementation -- but client code is reused Parametric polymorphism in C++ template Type min(const C* a, int size, C minval) { for (int i = 0; i < size; i++) if (a[i] < minval) minval = a[i]; return minval; } The above example shows how the same code can be used for computing the minimum value from arrays of any type. The only requirement is that the type support the comparison operator "<" and the assignment operator "=". The above function definition is parameterized wrt to the type "class C", and hence the term "parametric polymorphism". The constraint regarding this type, namely that it support the operations "<" and "=", is implicit from the body of the template function. Unlike C++, C does not support templates. Templates were not supported initially in Java, but added later on. ML also supports parametric polymorphism. e.g. val len l = match l with x::xs -> 1+len(xs) | [] -> 0; ML will say: val len 'a list -> int This function takes a list of any type and return the length of that list. It is parametrically polymorphic since it applies to list whose elements may be of different types. (The type parameter 'a captures this fact.) Code reuse with parametric polymorphism and overloading: It is clear that with parametric ploymorphism, the same function body is reused to deal with different datatypes. The basic property of a parametrically polymorphic function is that it does not need to "look below" a certain level of some data structure. For instance, the "min" function above did not need to look inside of each array element. Similarly, one can think of "length" and "append" functions that can operate on linked lists of all types, without looking at the specific element type. With oveloading, there is no reuse of the overloaded function, since there is in fact a different function body corresponding to each argument type. However, client code that calls a overloaded function can be reused. For instance, let C be a class, with subclasses C1,...,Cn. Also let "f" be a virtual method of class C (and hence its subclasses). We can now write client code that can apply the function uniformly to elements of an array, each of which is a pointer to an object of type C1,...,Cn. void g(int size, C *a[]) { for (int i = 0; i < size; i++) a[i]->f(...); } Now, the body of function g (which is a client of the function f) can be reused for arrays that contain objects of type C1 or C2 or ... or Cn, or even a mixture of these types. In SML: ------- parameterized types are defined as follows: type () = type ('a, 'b) pairList = ('a * 'b) list; We can also define datatype declarations for parameterized data types. Define Btree: - datatype ('a,'b) Btree = LEAF of 'a = | NODE of 'b * ('a,'b) Btree * ('a,'b) Btree; datatype ('a,'b) Btree = LEAF of 'a | NODE of 'b * ('a,'b) Btree * ('a,'b) Btree - type intBTree = (int, int) Btree; type intBTree = (int,int) Btree The type intBTree is structurally identical to the iTree type defined below: - datatype iTree = LEAF of int | NODE of int * iTree * iTree; datatype iTree = LEAF of int | NODE of int * iTree * iTree Example functions and their types ---------------------------------- # let leftmost l = match l with LEAF(x) -> x | NODE(y, l, r) -> leftmost(l); val leftmost = fn : ('a,'b) Btree -> 'a Computes the leftmost element of a Btree. The return type would be the same as the type of the argument of the LEAF constructor, and hence the type ('a,'b) Btree -> 'a for the function. # let discriminants l = match l with LEAF(x) -> [] | NODE(y, l, r) -> (discriminants l) @ (discriminants r) Note: "@" is the operator in ML for concatenating two lists to produce a new list. The above function operates by traversing the list recursively, and collecting the discriminant values in all nodes of the tree into a single list. The traversal order is left-to-right, i.e., an inorder traversal of the tree. The return value is a list of elements, each of which has the same type as the first argument of NODE, i.e., type 'b. Hence the function has the type ('a,'b) Btree -> 'b list NOTE: An example to show the result of function "discriminants": 6 / \ 4 8 <------NODE(8, LEAF(7),LEAF(9)) / \ / \ 3 5 7 9 <-----LEAF(9) The value of discrimiants on this tree is [4,6,8]. # let append l1 l2 = match l1 with x::xs -> x::(append xs l2) | [] -> l2 val append = fn : 'a list * 'a list -> 'a list This function concatenates two lists. (Performs the same function as the "@" operator). It does not care about the type of elements in the list -- it does not even look at their values. Thus, the function is polymorphic w.r.t. to the type of elements in the list. However, we have to ensure that all elements of a list have the same type. Since elements from both argument lists appear in the output list, this means that the element types in both argument lists must be the same. This is captured by the type 'a list * 'a list -> 'a list If we make a slight modification to append, so that in the second rule, it discards l2 rather than returning l2. # let f l1 l2 = match l1 with x::xs -> x::(f xs l2) | [] -> [] val f = fn : 'a list * 'b -> 'a list In this case, the function would completely ignores the second argument. Hence the second argument type does not matter at all, leading to the type: 'a list * 'b -> 'a list SML Operators that restrict polymorphism: Arithmetic, relational, boolean, string, type conversion operators SML Operators that allow polymorphism tuple, projection, list, equality ("=" and "<>") Type equivalence ------------------- 1. Structural equivalence: two types are equivalent if they are defined by identical type expressions. -- Issues: what components are counted? array range? record labels? Typically: array ranges are not considered as part of the type, but record labels are considered part of the type. 2. Name equivalence: two types are equal if they have the same name. 3. Declaration equivalence: two types are equivalent if their declarations lead back to the same original type expression by a series of redeclarations. Structural equivalence is the lest restrictive notion of type equivalence among the above three, while name equivalence is the most restrictive. Declaration equivalence is in between these two notions. Pure name equivalence is not very useful: TYPE t1 = ARRAY [1..10] of INTEGER; TYPE t2 = t1; t1 and t2 are not equivalent. In v1: ARRAY [1..10] OF INTEGER; v2: ARRAY [1..10] OF INTEGER; v1 and v2 have different types! NOTE: Here, the type "ARRAY [1..10] OF INTEGER" doesn't have a name. But they are assigned random and temporary names. Since the temporary names are distinct, v1 and v2 have different types. NOTE: In name equivalence, two types are equivalent unless we get one from redefinition of the other. Using the notion of structural equivalence, t1 and t2 are equivalent; v1 and v2 have equivalent types. Using declaration equivalence, t1 and t2 are equivalent, but the types of v1 and v2 are treated to be distinct. Declaration equivalence: types whose declarations lead back to the same original type expression by a series of redeclarations (of the form type t1 = t2, or typedef TE tn). Pascal, Modula use decl equivalence In C, decl equiv used for structs and unions, structual equivalence for other types such as pointers and arrays. struct { int a ; float b ; } x ; struct { int a ; float b ; } y ; then x and y are structure equivalent but not declaration equivalent. typedef int* intp ; typedef int** intpp ; intpp v1 ; intp *v2 ; then v1 and v2 are structure equivalent. Note: -- Typedef in C is used to give new name to existing types. -- int* v1; int* v2; v1 and v2 are not declaration equivalent. -- struct X {int a;}; typedef X Y; X v1 ; Y v2; ----> v1, v2 are declaration equivalent. X* v1; Y* v2; ----> v1, v2 are not declaration equivalent. Type compatibility : Weaker notion than type equivalence. Type compatibility rules are different for different operators. We give an example here, corresponding to the assignment operator. v = expr is meaningful if type of is compatible with type of v. Compatibility in this case is based on a subtype relationship, i.e., if the type of expr is a subtype of the type of v, then there is compatibility. Example: in most languages, assigning integer value to a real variable is permitted, since integer is a subtype of real. in OO-languages, an object of a derived type can be assigned to an object of the base type. (Actually, this holds for pointers to objects, rather than the objects themselves.) Procedure parameter passing uses the same notion of comptibility as assignment. (There may be exceptions to this in some languages.) This is because a procedure call can be though of as proceding in two steps: -- assignment of actual parameter expressions to the formal parameters of the procedure -- execution of the procedure body Note: formal parameters are the parameter names that appear in the function declaration. The actual parameters are the expressions that appear at the point of function call. Type checking -------------- Static (compile time) : Benefits - no run-time overhead - programs safer/more robust Dynamic (run-time) : Disadvantages -- runtime overhead -- maintaining type info at runtime -- performing type checks at runtime Benefits - more flexible / more expressive e.g., C++ allows casting of subclass to superclass (always type-safe). superclass to subclass (not necessarily type-safe). Since C++ is statically typed, there is no way to check if the second form of casting is safe at runtime. In contrast, Java uses a combination of static and dynamic type-checking. This enables Java to check the potentially unsafe casts at runtime to see if they are OK. Strongly typed language: programs in such languages will execute without producing uncaught type errors at runtime e.g., no invalid memory access -- no seg fault -- array index out of range -- access of null pointer Type checking relies on type compatibility and type inference rules. Type inference rules are used to infer types of expressions. e.g., type of (a+b)+c is inferred from type of a, b and c and the inference rule for operator + Type inference rules typically operate on a bottom-up fashion. Given type of e1 and w2, the fule specifies the type of e1 op e2 for each operator op. e.g., +:float /\ / \ / \ / \ +:float c:float /\ / \ / \ / \ a:int b:float In OCAML, type inference rules cature bottom-up as well as top-down flow of type info. Top-down: let f x y : float = x+.y f:float /\ / \ / \ / \ x:float y:float Here types of x and y are inferred from return type of f (float). Most of the time OCAML programs don't require type declaration. Type conversion EXPLICIT: Functions are used to perform conversion. strtol, atoi, itoa in C; real and int etc. IMPLICIT conversion (coercion) e.g., if a is real and b is int then type of a+b is real Before doing the addition, b must be converted to a real value. This conversion is done automatically. Casting (as in C) Invisible conversion: in unions If you store a value in the integer field of a union, but then access a float field, then you will get unpredictable results: it would be the hardware representation of an integer interpreted as a floating point number.