The problems with FORTRAN 66 Hollerith data are well-known. The KARxxx routines largely removed them, but once Hollerith support is no longer available, FORTRAN 77 CHARACTER data will have to be used. In the view of the author, the definition of CHARACTER data in the 1977 FORTRAN Standard was very poorly done, and has done significant harm to FORTRAN software portability. This is a strong statement, and it bears some explanation.

First of all, the Hollerith data type is dropped from the 1977 Standard. This means that a very large body of existing FORTRAN software which uses character data, even in an at-present widely portable fashion, may require extensive changes to run with a FORTRAN 77 compiler, unless manufacturers can be pressed to continue support of character data stored in Hollerith constants and variables.

The 1977 Standard prohibits all storage equivalencing, either via COMMON and EQUIVALENCE statements, or by FUNCTION or SUBROUTINE argument associations, between CHARACTER data and all other FORTRAN data types. This is in sharp contrast to the usual lax implementations of FORTRAN for all other data types. The prohibition was necessary to enable FORTRAN 77 to support character strings of indefinite length, so that declarations of the form

    SUBROUTINE A (B)
    CHARACTER B*(*)

could be permitted, allowing CHARACTER variables to inherit a string length from a calling program. This forces a compiler to generate code to pass to a called routine the address of a string descriptor containing size information as well as the actual address of the character data. Also, on word-addressed machines, CHARACTER data may begin in the middle of a word, so storage equivalencing could be problematic.

Second, standardized library support of character data in the form of useful utility routines is non-existent in the 1977 Standard, apart from the ICHAR and CHAR functions for converting between INTEGER and CHARACTER form.

Third, null character strings, that is, strings of zero length, are not permitted. Null strings are in fact quite useful, and indeed, even necessary in some applications. In particular, a null string cannot be simulated by any string of non-zero length.

Fourth, the 1977 Standard does not specify the character set to be used. The fact that many manufacturers employ their private versions of character sets, each with its own special character repertoire and collating sequence, only continues to perpetrate additional machine dependence upon FORTRAN users.

Fifth, the 1977 Standard, in allowing declarations of the form CHARACTER*n, did not specify what minimum 'n' should be supported by a standard-conforming compiler. One might hope that this would not be less than the number of characters that could reside in the host machine's (possibly virtual) address space. At the least, one might conclude that an assignment of the form "A='long string'" spanning the permitted 19 continuation lines would be permitted. Alas, few compilers permit even this much. String length limitations of 128, 256, and 512 are common, and only a few compilers (e.g. ElXsi and DEC-20) set the limit at the machine address space size.

Interestingly, the 1977 Standard clearly states that a CHARACTER*n argument passed to a subprogram can be legally received as an array of n CHARACTER*1 values, and vice versa. Since none of the compilers seem to put a limit on array sizes, it is odd that they do so on string lengths. The reason of course is the peculiar requirement of the Standard that the LEN() function be able to return the declared length of its argument string; no such function is provided for obtaining the declared dimension of an array. Most implementations therefore represent a CHARACTER variable by a string descriptor containing a length field and an address field, and both of these have fixed sizes allotted to them.
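
The descriptor representation just described can be sketched in C. This is only an illustration under stated assumptions, not the layout of any particular compiler: the names str_desc, str_desc_make, and str_desc_len are hypothetical, and the 8-bit length field is chosen simply to show how a narrow field caps the string length regardless of how large the address field is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: a full-width address field paired with
   a deliberately narrow 8-bit length field (an assumption made
   for illustration only). */
struct str_desc {
    const char *addr;   /* address of the first character */
    uint8_t     len;    /* declared length; cannot exceed 255 */
};

/* Largest length this assumed descriptor can record. */
#define STR_DESC_MAX_LEN UINT8_MAX

/* Fill in a descriptor for n characters starting at p.
   Returns 0 on success, -1 if n does not fit in the length field. */
int str_desc_make(struct str_desc *d, const char *p, size_t n)
{
    assert(d != NULL && p != NULL);
    if (n > STR_DESC_MAX_LEN)
        return -1;      /* string longer than the field can express */
    d->addr = p;
    d->len = (uint8_t)n;
    return 0;
}

/* Rough analogue of LEN(): read the declared length back
   from the descriptor. */
size_t str_desc_len(const struct str_desc *d)
{
    assert(d != NULL);
    return d->len;
}
```

With this layout, a request for 255 characters succeeds and one for 256 is rejected, even though the address field could locate a far longer run of characters.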
It seems foolish that although most architectures now require 24 or more bits for the address field, only 7, 8, 9 or 16 should be allocated for the length field to "save storage".

Sixth, although the 1977 Standard removed many of the unreasonable restrictions on where expressions could be permitted in FORTRAN source code, it introduced a new one in the form of prohibiting taking a substring of a constant or an expression!

If one examines string support and typical use thereof in languages like PL/1 and C, two characteristics become evident. First of all, strings whose length can vary dynamically (up to some compile time limit set by the user, not by the compiler) are supported, and the null string is legal. Having varying length strings without a null string is like having integers without a zero; how else can something be initialized to empty? FORTRAN 77, Pascal, Modula/2, and Ada all make the mistake of requiring fixed length strings, and in Pascal and Ada, because of their strong typing, strings of different lengths have different types, and are therefore not conformant.

Second, individual characters can be processed as equivalents of small integers equal to their position in the host character set. Thus, in C, one can convert a lower case letter to upper case by adding the expression 'A' - 'a', without having to know precisely what the equivalent integers are. In addition, printable representations of commonly used non-printable characters, such as backspace, tab, newline, carriage return, formfeed, and so on, are provided, so that one can easily construct strings which span lines or contain control characters.

The integer equivalents make it possible to index arrays by character values, making for efficient lookup. C in particular makes good use of this in its standard library for determining whether characters are letters, digits, printable, upper case or lower case, etc.
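
Both uses of characters as small integers can be shown in a few lines of C. The following is a minimal sketch, not the standard library's implementation: the names to_upper_ascii, is_digit_tab, init_digit_table, and is_digit_lookup are hypothetical, and the case conversion assumes a character set such as ASCII in which the lower case letters are contiguous.

```c
#include <assert.h>
#include <limits.h>

/* Convert a lower case letter to upper case by adding 'A' - 'a'.
   Assumes 'a'..'z' are contiguous (true for ASCII); any other
   character is returned unchanged. */
int to_upper_ascii(int c)
{
    if (c >= 'a' && c <= 'z')
        return c + ('A' - 'a');
    return c;
}

/* Lookup table indexed directly by character value:
   entry is 1 for a decimal digit, 0 otherwise. */
static unsigned char is_digit_tab[UCHAR_MAX + 1];

/* Mark the digit entries. C guarantees '0'..'9' are contiguous. */
void init_digit_table(void)
{
    for (int c = '0'; c <= '9'; c++)
        is_digit_tab[c] = 1;
}

/* Classify a character with a single array access.
   The argument must be a non-negative character ordinal. */
int is_digit_lookup(int c)
{
    assert(c >= 0 && c <= UCHAR_MAX);
    return is_digit_tab[c];
}
```

After calling init_digit_table() once, is_digit_lookup('7') yields 1 and is_digit_lookup('x') yields 0, with no comparisons at lookup time. The table approach works only because every character ordinal is a valid, non-negative array index.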

FORTRAN 77 has the ICHAR() function, which is supposed to return an integer ordinal greater than or equal to 0, representing the position of the argument character in the host character set. However, the Standard defines only letters, digits, and thirteen special characters, a total of only 49. This means that a processor is free to implement whatever it likes for arguments to ICHAR() which are not among these. It could even legally raise a fatal error in such a case. Most implementations do in fact return an integer for all possible characters which can be stored in the host CHARACTER storage unit, but that integer is not guaranteed to be non-negative.

On older architectures with 36 (IBM 70xx, Univac 1108), 48 (Burroughs), or 60 (CDC) bit words, 6-bit characters were common, and an even number fit into the host words. This only permits 64 different characters, which is not enough to have both letter cases. The ISO/ASCII character set has 128 different character values and can represent both upper case and lower case letters. On the 36-bit DEC-10 and -20 machines, these are stored as five 7-bit characters per word, with one unused bit. On the 36-bit Univac 11xx machines, newer compilers store four 9-bit characters per word, with the two high order bits of each character unused.

Most newer architectures are based on an address unit of an 8-bit byte, or have a word size which is a multiple of this (e.g. 64-bit Cray words). The EBCDIC character set used by IBM has 256 characters to make complete use of the byte storage unit. With the ASCII character set, however, only 7 bits are required, and something has to be done about the extra bit in an 8-bit byte. Prime makes it one and treats the byte as an unsigned integer, so their ASCII ordinals go from 128 to 255 (a violation of the ANSI and ISO Standards, I believe). Other machines ignore it, and still others use it as a sign bit. In the latter case, ICHAR() can return values 0 .. 127 when the high bit is zero, then -128 .. -1 when it is set.

In summary then, one cannot be sure in FORTRAN 77 whether CHARACTER data can be used to access every bit in memory (it cannot on the DEC-10 and -20, or on any machine which ignores the high-order bits), or whether ICHAR() can be used to obtain an integer which can confidently be used as an array index.
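
The sign-bit hazard can be modeled in C. This is a sketch under stated assumptions rather than a description of any actual compiler: ichar_signed is a hypothetical stand-in for an ICHAR() that treats the high bit of an 8-bit byte as a sign bit, and ichar_index is a hypothetical helper that maps whatever ordinal comes back into the range 0 .. 255 so that it is safe to use as a subscript.

```c
#include <assert.h>

/* Model of an ICHAR() on a machine that uses the high bit of an
   8-bit byte as a sign bit: 0..127 map to themselves, 128..255
   map to -128..-1. Written with explicit arithmetic so the result
   does not depend on implementation-defined conversions. */
int ichar_signed(unsigned char byte)
{
    return byte < 128 ? (int)byte : (int)byte - 256;
}

/* Normalize an ordinal in -128..255 to a non-negative value in
   0..255, suitable for indexing a 256-entry table. */
int ichar_index(int ordinal)
{
    assert(ordinal >= -128 && ordinal <= 255);
    return ordinal < 0 ? ordinal + 256 : ordinal;
}
```

Under this model a byte with all bits set yields -1 from ichar_signed, which would index before the start of a table, while ichar_index recovers 255. Portable code that indexes by character value needs some such normalization step, since the raw ordinal cannot be trusted.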