Jan 25, 2010

Type system

Types can be primitive, and just hold single pieces of information such as an integer, a floating point, or a character or they can be more complicated composite objects which store multiple pieces of information--some combination of data and, at times, even functionality. In computer science, a type system may be loosely defined as providing a way to associate one (or more) type(s) with each value that can be used in a program. Just like spoken languages, computer languages have grammars that give the rules to construct phrases and sentences.

Type checking

The process of verifying and enforcing the constraints of types – type checking – may occur either at compile-time (a static check) or run-time (a dynamic check). If a language specification requires its typing rules strongly (ie, more or less allowing only those automatic type conversions which do not lose information), one can refer to the process as strongly typed, if not, as weakly typed. The terms are not used in a strict sense.

Static typing

A programming language is said to use static typing when type checking is performed during compile-time as opposed to run-time. In static typing, types are associated with variables not values. Statically typed languages include Ada, AS3, C, C++, C#, F#, JADE, Java, Fortran, Haskell, ML, Pascal, Perl (with respect to distinguishing scalars, arrays, hashes and subroutines) and Scala. Static typing is a limited form of program verification (see type safety): accordingly, it allows many type errors to be caught early in the development cycle. Static type checkers evaluate only the type information that can be determined at compile time, but are able to verify that the checked conditions hold for all possible executions of the program, which eliminates the need to repeat type checks every time the program is executed. Program execution may also be made more efficient (i.e. faster or taking reduced memory) by omitting runtime type checks and enabling other optimizations.

Because they evaluate type information during compilation, and therefore lack type information that is only available at run-time, static type checkers are conservative. They will reject some programs that may be well-behaved at run-time, but that cannot be statically determined to be well-typed. For example, even if an expression always evaluates to true at run-time, a program containing the code

if then 42 else

will be rejected as ill-typed, because a static analysis cannot determine that the else branch won't be taken.[1] The conservative behaviour of static type checkers is advantageous when evaluates to false infrequently: A static type checker can detect type errors in rarely used code paths. Without static type checking, even code coverage tests with 100% code coverage may be unable to find such type errors. Code coverage tests may fail to detect such type errors because the combination of all places where values are created and all places where a certain value is used must be taken into account.

The most widely used statically typed languages are not formally type safe. They have "loopholes" in the programming language specification enabling programmers to write code that circumvents the verification performed by a static type checker and so address a wider range of problems. For example, most C-style languages have type punning, and Haskell has such features as unsafePerformIO: such operations may be unsafe at runtime, in that they can cause unwanted behaviour due to incorrect typing of values when the program runs.

Dynamic typing

A programming language is said to be dynamically typed, when the majority of its type checking is performed at run-time as opposed to at compile-time. In dynamic typing, types are associated with values not variables. Dynamically typed languages include Erlang, Groovy, JavaScript, Lisp, Lua, Objective-C, Perl (with respect to user-defined types but not built-in types), PHP, Prolog, Python, Ruby, Smalltalk and Tcl. Compared to static typing, dynamic typing can be more flexible (e.g. by allowing programs to generate types and functionality based on run-time data), though at the expense of fewer a priori guarantees. This is because a dynamically typed language accepts and attempts to execute some programs which may be ruled as invalid by a static type checker. The term 'dynamic language' means something different ('runtime dynamism') and a dynamic language is not necessarily dynamically typed.

Dynamic typing may result in runtime type errors—that is, at runtime, a value may have an unexpected type, and an operation nonsensical for that type is applied. This operation may occur long after the place where the programming mistake was made—that is, the place where the wrong type of data passed into a place it should not have. This may make the bug difficult to locate.

Dynamically typed language systems, compared to their statically typed cousins, make fewer "compile-time" checks on the source code (but will check, for example, that the program is syntactically correct). Run-time checks can potentially be more sophisticated, since they can use dynamic information as well as any information that was present during compilation. On the other hand, runtime checks only assert that conditions hold in a particular execution of the program, and these checks are repeated for every execution of the program.

Development in dynamically typed languages is often supported by programming practices such as unit testing. Testing is a key practice in professional software development, and is particularly important in dynamically typed languages. In practice, the testing done to ensure correct program operation can detect a much wider range of errors than static type-checking, but conversely cannot search as comprehensively for the errors that both testing and static type checking are able to detect. Testing can be incorporated into the software build cycle, in which case it can be thought of as a "compile-time" check, in that the program user will not have to manually run such tests.

Combinations of dynamic and static typing

The presence of static typing in a programming language does not necessarily imply the absence of all dynamic typing mechanisms. For example, Java, and various other object-oriented languages, while using static typing, require for certain operations (downcasting) the support of runtime type tests, a form of dynamic typing. See programming language for more discussion of the interactions between static and dynamic typing.

Static and dynamic type checking in practice

The choice between static and dynamic typing requires trade-offs.

Static typing can find type errors reliably at compile time. This should increase the reliability of the delivered program. However, programmers disagree over how commonly type errors occur, and thus what proportion of those bugs which are written would be caught by static typing. Static typing advocates believe programs are more reliable when they have been well type-checked, while dynamic typing advocates point to distributed code that has proven reliable and to small bug databases. The value of static typing, then, presumably increases as the strength of the type system is increased. Advocates of dependently typed languages such as Dependent ML and Epigram have suggested that almost all bugs can be considered type errors, if the types used in a program are properly declared by the programmer or correctly inferred by the compiler.[3]

Static typing usually results in compiled code that executes more quickly. When the compiler knows the exact data types that are in use, it can produce optimized machine code. Further, compilers for statically typed languages can find assembler shortcuts more easily. Some dynamically typed languages such as Common Lisp allow optional type declarations for optimization for this very reason. Static typing makes this pervasive. See optimization.

By contrast, dynamic typing may allow compilers to run more quickly and allow interpreters to dynamically load new code, since changes to source code in dynamically typed languages may result in less checking to perform and less code to revisit. This too may reduce the edit-compile-test-debug cycle.

Statically typed languages which lack type inference (such as Java and C) require that programmers declare the types they intend a method or function to use. This can serve as additional documentation for the program, which the compiler will not permit the programmer to ignore or permit to drift out of synchronization. However, a language can be statically typed without requiring type declarations (examples include Haskell, Scala and C#3.0), so this is not a necessary consequence of static typing.

Dynamic typing allows constructs that some static type checking would reject as illegal. For example, eval functions, which execute arbitrary data as code, become possible (however, the typing within that evaluated code might remain static). Furthermore, dynamic typing better accommodates transitional code and prototyping, such as allowing a placeholder data structure (mock object) to be transparently used in place of a full-fledged data structure (usually for the purposes of experimentation and testing). Recent enhancements to statically typed languages (e.g. Haskell Generalized algebraic data types) have allowed eval functions to be written in a statically type checked way.[4]

Dynamic typing typically makes metaprogramming more effective and easier to use. For example, C++ templates are typically more cumbersome to write than the equivalent Ruby or Python code.[citation needed] More advanced run-time constructs such as metaclasses and introspection are often more difficult to use in statically typed languages.

Strong and weak typing

One definition of strongly typed involves preventing success for an operation on arguments which have the wrong type. A C cast gone wrong exemplifies the problem of absent strong typing; if a programmer casts a value from one type to another in C, not only must the compiler allow the code at compile time, but the runtime must allow it as well. This may permit more compact and faster C code, but it can make debugging more difficult.

Some observers use the term memory-safe language (or just safe language) to describe languages that do not allow undefined operations to occur. For example, a memory-safe language will check array bounds, or else statically guarantee (i.e., at compile time before execution) that array accesses out of the array boundaries will cause compile-time and perhaps runtime errors.

Weak typing means that a language implicitly converts (or casts) types when used. Revisiting the previous example, we have:

var x := 5;    // (1)  (x is an integer)
var y := "37"; // (2) (y is a string)
x + y; // (3) (?)

In a weakly typed language, the result of this operation is not clear. Some languages, such as Visual Basic, would produce runnable code producing the result 42: the system would convert the string "37" into the number 37 to forcibly make sense of the operation. Other languages like JavaScript would produce the result "537": the system would convert the number 5 to the string "5" and then concatenate the two. In both Visual Basic and JavaScript, the resulting type is determined by rules that take both operands into consideration. In some languages, such as AppleScript, the type of the resulting value is determined by the type of the left-most operand only.

Safely and unsafely typed systems

A third way of categorizing the type system of a programming language uses the safety of typed operations and conversions. Computer scientists consider a language "type-safe" if it does not allow operations or conversions which lead to erroneous conditions.

var x := 5;     // (1)
var y := "37"; // (2)
var z := x + y; // (3)

In languages like Visual Basic variable z in the example acquires the value 42. While the programmer may or may not have intended this, the language defines the result specifically, and the program does not crash or assign an ill-defined value to z. In this respect, such languages are type-safe; however, if the value of y was a string that could not be converted to a number (eg "hello world"), the results would be undefined. Such languages are type-safe (in that they will not crash) but can easily produce undesirable results.

Now let us look at the same example in C:

int x = 5;
char y[] = "37";
char* z = x + y;

In this example z will point to a memory address five characters beyond y, equivalent to three characters after the terminating zero character of the string pointed to by y. The content of that location is undefined, and might lie outside addressable memory. The mere computation of such a pointer may result in undefined behavior (including the program crashing) according to C standards, and in typical systems dereferencing z at this point could cause the program to crash. We have a well-typed, but not memory-safe program — a condition that cannot occur in a type-safe language.

For more on type system refer to Type System

There was an error in this gadget