Catalog files are used by the sgml widget's parser to create system identifiers for external entities. This chapter describes the format of catalog files and the syntax of the generated system identifiers.
The entity manager generates a system identifier for every external entity using catalog entry files in the format defined by ``SGML Open Technical Resolution TR9401:1995''. The entity manager will give an error if it is unable to generate a system identifier for an external entity. Normally if the external identifier for an entity includes a system identifier then the entity manager will use that as the effective system identifier for the entity; this behaviour can be changed using OVERRIDE or SYSTEM entries in a catalog entry file.
A catalog entry file contains a sequence of entries in one of the following forms:
PUBLIC pubid sysid
This specifies that sysid should be used as the effective system identifier if the public identifier is pubid. Here sysid is a system identifier and pubid is a public identifier, both as defined in ISO 8879.
ENTITY name sysid
This specifies that sysid should be used as the effective system identifier if the entity is a general entity whose name is name.
ENTITY %name sysid
This specifies that sysid should be used as the effective system identifier if the entity is a parameter entity whose name is name. Note that there is no space between the % and the name.
DOCTYPE name sysid
This specifies that sysid should be used as the effective system identifier if the entity is an entity declared in a document type declaration whose document type name is name.
LINKTYPE name sysid
This specifies that sysid should be used as the effective system identifier if the entity is an entity declared in a link type declaration whose link type name is name.
NOTATION name sysid
This specifies that sysid should be used as the effective system identifier for a notation whose name is name. This is an extension to the SGML Open format. This is relevant only with the -n option.
OVERRIDE bool
bool may be YES or NO. This sets the overriding mode for entries up to the next occurrence of OVERRIDE or the end of the catalog entry file. At the beginning of a catalog entry file the overriding mode will be NO. A PUBLIC, ENTITY, DOCTYPE, LINKTYPE or NOTATION entry with an overriding mode of YES will be used whether or not the external identifier has an explicit system identifier; those with an overriding mode of NO will be ignored if the external identifier has an explicit system identifier. This is an extension to the SGML Open format.
SYSTEM sysid1 sysid2
This specifies that sysid2 should be used as the effective system identifier if the system identifier specified in the external identifier was sysid1. This is an extension to the SGML Open format. sysid2 should always be quoted to ensure that it is not misinterpreted when parsed by a system that does not support this extension.
SGMLDECL sysid
This specifies that if the document does not contain an SGML declaration, the SGML declaration in sysid should be implied.
DOCUMENT sysid
This specifies that the document entity is sysid. This entry is used only with the -C option.
CATALOG sysid
This specifies that sysid is the system identifier of an additional catalog entry file to be read after this one. Multiple CATALOG entries are allowed and will be read in order. This is an extension to the SGML Open format.
BASE sysid
This specifies that relative storage object identifiers in system identifiers in the catalog entry file following this entry should be resolved using the first storage object identifier in sysid as the base, instead of the storage object identifiers of the storage objects comprising the catalog entry file. This is an extension to the SGML Open format. This extension is proposed in ``Using SGML Open Catalogs and MIME to Exchange SGML Documents''.
DELEGATE pubid-prefix sysid
This specifies that entities with a public identifier that has pubid-prefix as a prefix should be resolved using a catalog whose system identifier is sysid. For more details, see ``A Proposal for Delegating SGML Open Catalogs''. This is an extension to the SGML Open format.
The delimiters (quotation marks) can be omitted from the sysid provided it does not contain any white space. Comments, delimited by -- as in SGML, are allowed between parameters.
The environment variable SGML_CATALOG_FILES contains a list of catalog entry files. The list is separated by colons under Unix and by semi-colons under MS-DOS and Windows. These will be searched after any catalog entry files specified using the -m option, and after the catalog entry file called catalog in the same place as the document entity. If this environment variable is not set, then a system dependent list of catalog entry files will be used. In fact catalog entry files are not restricted to being files: the name of a catalog entry file is interpreted as a system identifier.
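As an illustration of the entry forms and the search list described above, the following Tcl sketch writes a small catalog entry file and puts it at the front of SGML_CATALOG_FILES. The public identifier, the entity names, and all file names are invented for the example; only the OVERRIDE, PUBLIC and ENTITY keywords and the environment variable are taken from the text.

```tcl
# Hypothetical catalog content: identifiers and file names are made up.
set catalogText {-- use the catalog's system identifiers even when the document gives one --
OVERRIDE YES
PUBLIC "-//Example//DTD Sample//EN" "sample.dtd"
ENTITY logo "logo.gif"
ENTITY %common "common.ent"
}

# Write the catalog entry file into the current directory.
set catalogFile [file join [pwd] example.cat]
set fh [open $catalogFile w]
puts -nonewline $fh $catalogText
close $fh

# Colons separate the list under Unix, semi-colons under MS-DOS and Windows.
set sep [expr {$::tcl_platform(platform) eq "windows" ? ";" : ":"}]
if {[info exists ::env(SGML_CATALOG_FILES)] && $::env(SGML_CATALOG_FILES) ne ""} {
    set ::env(SGML_CATALOG_FILES) "$catalogFile$sep$::env(SGML_CATALOG_FILES)"
} else {
    set ::env(SGML_CATALOG_FILES) $catalogFile
}
```

Note that setting the variable where it was previously unset replaces the system dependent default list rather than extending it.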
A match in one catalog entry file will take precedence over any match in a later catalog entry file. A more specific matching entry in one catalog entry file will take priority over a less specific matching entry in the same catalog entry file. For this purpose, the order of specificity is (most specific first):
There are two kinds of system identifier: formal system identifiers
and simple system identifiers. A system identifier that does not
start with < will always be interpreted as a simple
system identifier. A simple system identifier will always be
interpreted either as a filename or as a URL.
Formal system identifiers are based on the System Identifier facility
defined in ``ISO/IEC 10744 (HyTime) Technical Corrigendum 1'', Annex D.
A system identifier that is a formal system
identifier consists of a sequence of one or more storage object
specifications. The objects specified by the storage object
specifications are concatenated to form the entity. A storage object
specification consists of an SGML start-tag in the reference concrete
syntax followed by character data content. The generic identifier of
the start-tag is the name of a storage manager. The content is a
storage object identifier which identifies the storage object in a
manner dependent on the storage manager. The start-tag can also
specify attributes giving additional information about the storage
object. Numeric character references are recognized in storage object
identifiers and attribute value literals in the start-tag. Record
ends are ignored in the storage object identifier as with SGML. A
system identifier will be interpreted as a formal system identifier if
it starts with a < followed by a storage manager name,
followed by either > or white-space; otherwise it will be
interpreted as a simple system identifier. A storage object
identifier extends until the end of the system identifier or until the
first occurrence of < followed by a storage manager
name, followed by either > or white-space.
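The recognition rule above can be sketched in a few lines of Tcl. This is only an approximation under stated assumptions: the procedure name isFormalSysid is hypothetical, the list of storage manager names is passed in by the caller, and names are compared case-insensitively, which the text does not state explicitly.

```tcl
# Return 1 if sysid would be treated as a formal system identifier:
# it must start with "<", then a known storage manager name,
# then either ">" or white space. Otherwise return 0.
proc isFormalSysid {sysid managers} {
    if {![regexp {^<([A-Za-z][A-Za-z0-9.-]*)[>\s]} $sysid -> name]} {
        return 0
    }
    # Assumption: storage manager names are matched without regard to case.
    return [expr {[lsearch -exact -nocase $managers $name] >= 0}]
}

set managers {osfile osfd url neutral literal}
puts [isFormalSysid "<osfile>foo.sgm" $managers]            ;# prints 1
puts [isFormalSysid "<osfile base=\"/tmp/x\">foo" $managers] ;# prints 1
puts [isFormalSysid "foo.sgm" $managers]                    ;# prints 0
puts [isFormalSysid "<nosuch>foo.sgm" $managers]            ;# prints 0
```

An identifier that fails this test would be handled as a simple system identifier.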
The following storage managers are available:
osfile
The storage object identifier is a filename. If the filename is
relative it is resolved using a base filename. Normally the base
filename is the name of the file in which the storage object
identifier was specified, but this can be changed using the base
attribute. The filename will be searched for first in the directory
of the base filename. If it is not found there, then it will be
searched for in directories specified with the -D option in the
order in which they were specified on the command line, and then in
the list of directories specified by the environment variable
SGML_SEARCH_PATH. The list is separated by colons under Unix
and by semi-colons under MS-DOS.
osfd
The storage object identifier is an integer specifying a file
descriptor. Thus a system identifier of <osfd>0 will refer to
the standard input.
channel
The storage object identifier is a Tcl channel identifier that refers
to an open channel. Thus a system identifier like
<channel>file4 will refer to the channel that was
previously opened and for which the channel identifier
file4 was returned. The channel has to be registered in the same Tcl
interpreter that owns the sgml widget.
The storage object identifier is a Tcl command. On input, the command is called repeatedly to retrieve the storage object. It must return a non-empty string on each invocation, or an empty string when the complete storage object has been delivered to the sgml widget. After an empty string has been returned, the sgml widget will assume that the complete storage object has been read and will not call the command again.
On output, the Tcl command will be called repeatedly with a single string argument that should be stored by the command. The last invocation of the command (after the complete storage object has been transferred) will receive an empty string so that the Tcl command can perform cleanup operations, close open files, etc.
On output, no byte order mark is prepended and no data encoding is performed by the SGML widget. The arguments are passed as Tcl strings in UTF-8 format.
url
The storage object identifier is a URL. Only the http scheme is currently supported, and not on all systems.
neutral
The storage manager is the storage manager of the storage object in which
the system identifier was specified (the underlying storage
manager). However, if the underlying storage manager does not support
named storage objects (i.e. it is osfd), then the storage manager
will be osfile. The storage object identifier is treated as a
relative, hierarchical name separated by slashes (/) and will
be transformed as appropriate for the underlying storage manager.
literal
The bit combinations of the storage object identifier are the contents of the storage object.
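The search order described above for relative filenames can be sketched as follows. This is not the widget's own code: the procedure name findOsfile and its parameters are hypothetical, and the directories given with the -D option are simply passed in as a list.

```tcl
# Resolve a filename the way the text describes for relative names:
# 1. the directory of the base filename,
# 2. the directories given with -D, in command-line order,
# 3. the directories listed in SGML_SEARCH_PATH.
# Returns the first existing path, or an empty string if none is found.
proc findOsfile {name baseFile searchDirs} {
    if {[file pathtype $name] ne "relative"} {
        return $name
    }
    set dirs [list [file dirname $baseFile]]
    foreach dir $searchDirs {
        lappend dirs $dir
    }
    if {[info exists ::env(SGML_SEARCH_PATH)]} {
        # Colons under Unix, semi-colons under MS-DOS and Windows.
        set sep [expr {$::tcl_platform(platform) eq "windows" ? ";" : ":"}]
        foreach dir [split $::env(SGML_SEARCH_PATH) $sep] {
            if {$dir ne ""} {
                lappend dirs $dir
            }
        }
    }
    foreach dir $dirs {
        set candidate [file join $dir $name]
        if {[file exists $candidate]} {
            return $candidate
        }
    }
    return ""
}
```

A call such as `findOsfile chapter1.sgm /docs/book.sgm {/usr/share/sgml}` (all paths invented) would try /docs first and /usr/share/sgml second.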
In addition, user-defined storage managers can be used to extend the
range of possible storage objects. See the section on user-defined
storage managers for additional details.
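For the storage manager described above whose storage object identifier is a Tcl command, the calling convention can be illustrated with a small sketch. The procedure names nextChunk and storeChunk, the global variables, and the sample data are all hypothetical, and the loop at the end merely stands in for the sgml widget, which would do the calling in a real application.

```tcl
# Input side: hand out one chunk per call, then an empty string.
set ::pendingChunks [list "<doc>\n" "<p>Hello</p>\n" "</doc>\n"]
proc nextChunk {} {
    if {[llength $::pendingChunks] == 0} {
        return ""   ;# signals that the storage object is complete
    }
    set chunk [lindex $::pendingChunks 0]
    set ::pendingChunks [lrange $::pendingChunks 1 end]
    return $chunk
}

# Output side: store each string; an empty string marks the last call.
set ::collected ""
set ::outputDone 0
proc storeChunk {data} {
    if {$data eq ""} {
        set ::outputDone 1   ;# place for cleanup, closing files, etc.
        return
    }
    append ::collected $data
}

# Stand-in for the widget: drain the input command into the output command.
while {[set chunk [nextChunk]] ne ""} {
    storeChunk $chunk
}
storeChunk ""
```

Because an empty return value ends the transfer, an input command written this way must never return an empty chunk in the middle of the data.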
Attributes have to be provided within the start tag that specifies the storage manager. The following attributes are supported:
This describes how records are delimited in the storage object:
The default is find except for NDATA entities for which the default is asis. This attribute is not applicable to the literal storage manager.
When records are recognized in a storage object, a record start is inserted at the beginning of each record, and a record end at the end of each record. If there is a partial record (a record that doesn't end with the record terminator) at the end of the entity, then a record start will be inserted before it but no record end will be inserted after it.
The attribute name and = can be omitted for this attribute.
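The record handling described above, including the partial record at the end of an entity, can be sketched like this. The procedure name markRecords is hypothetical, the record terminator is passed in explicitly, and the visible markers <RS> and <RE> are used only so that the result is easy to read; a real parser would insert the record start and record end characters instead.

```tcl
# Insert a record start before each record and a record end after each
# complete record. A trailing partial record gets a record start only.
proc markRecords {data {terminator "\n"} {rs "<RS>"} {re "<RE>"}} {
    set out ""
    set pos 0
    set tlen [string length $terminator]
    set dlen [string length $data]
    while {$pos < $dlen} {
        set end [string first $terminator $data $pos]
        if {$end < 0} {
            # Partial record: no terminator, so no record end is added.
            append out $rs [string range $data $pos end]
            break
        }
        append out $rs [string range $data $pos [expr {$end - 1}]] $re
        set pos [expr {$end + $tlen}]
    }
    return $out
}

puts [markRecords "one\ntwo\n"]   ;# prints <RS>one<RE><RS>two<RE>
puts [markRecords "one\ntwo"]     ;# prints <RS>one<RE><RS>two
```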
The default is zapeof except for NDATA entities, entities
declared in storage objects with zapeof=nozapeof and
storage objects with records=asis. This attribute is not
applicable to the literal storage manager.
The attribute name and = can be omitted for this
attribute.
The encoding attribute specifies the encoding of the storage object. This attribute is used when the encoding is independent of the document character set. The value must be the name of an encoding. This attribute is not applicable to the literal storage manager.
This specifies whether line boundaries should be tracked for this object: a value of track specifies that they should; a value of notrack specifies that they should not. The default value is track. Keeping track of where line boundaries occur in a storage object requires approximately one byte of storage per line and it may be desirable to disable this for very large storage objects.
The attribute name and = can be omitted for this attribute.
When the storage object identifier specified in the content of the storage object specification is relative, this specifies the base storage object identifier relative to which that storage object identifier should be resolved. When not specified a storage object identifier is interpreted relative to the storage object in which it is specified, provided that this has the same storage manager. This applies both to system identifiers specified in SGML documents and to system identifiers specified in the catalog entry files.
The value is a single character that will be recognized in storage object identifiers (both in the content of storage object specifications and in the value of base attributes) as a storage manager character reference delimiter when followed by a digit. A storage manager character reference is like an SGML numeric character reference except that the number is interpreted as a character number in the inherent character set of the storage manager rather than the document character set. The default is for no character to be recognized as a storage manager character reference delimiter. Numeric character references cannot be used to prevent recognition of storage manager character reference delimiters.
This applies only to the neutral storage manager. It specifies whether the storage object identifier should be folded to the customary case of the underlying storage manager if storage object identifiers for the underlying storage manager are case sensitive. The following values are allowed:
The default value is fold. The attribute name and
= can be omitted for this attribute.
For example, on Unix filenames are case-sensitive and the customary
case is lower-case. So if the underlying storage manager were
osfile and the system was a Unix system, then
<neutral>FOO.SGM would be equivalent to
<osfile>foo.sgm.
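The transformation in this example can be sketched as a small Tcl procedure. It assumes that the underlying storage manager is osfile and that the customary case is lower-case, as in the Unix example; the procedure name neutralToOsfile and its boolean fold parameter are hypothetical.

```tcl
# Turn a neutral, slash-separated name into a native relative filename,
# optionally folding it to lower case first.
proc neutralToOsfile {id {fold 1}} {
    if {$id eq ""} {
        return ""
    }
    if {$fold} {
        set id [string tolower $id]
    }
    # Rebuild the name with the platform's own path conventions.
    return [file join {*}[split $id /]]
}

puts [neutralToOsfile FOO.SGM]     ;# prints foo.sgm
puts [neutralToOsfile FOO.SGM 0]   ;# prints FOO.SGM
```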
A simple system identifier is interpreted as a storage object identifier with a storage manager that depends on where the system identifier was specified: if it was specified in a storage object whose storage manager was url or if the system identifier looks like an absolute URL in a supported scheme, the storage manager will be url; otherwise the storage manager will be osfile. The storage manager attributes are defaulted as for a formal system identifier. Numeric character references are not recognized in simple system identifiers.
Encodings can be specified, for example, as the value of the encoding attribute in system identifiers. For interoperability with SP-based systems, the environment variable SP_ENCODING can be set to specify an encoding.
Encoding names are case insensitive. The following named encodings are available:
Each character is represented by a variable number of bytes according to UCS Transformation Format 8 defined in Annex P to be added by the first proposed drafted amendment (PDAM 1) to ISO/IEC 10646-1:1993.
This is ISO/IEC 10646 with the UCS-2 transformation format. Each character is represented by 2 bytes. No special treatment is given to the byte order mark character.
Each character is represented by 2 bytes. The bytes representing the entire storage object may be preceded by a pair of bytes representing the byte order mark character (0xFEFF). The bytes representing each character are in the system byte order, unless the byte order mark character is present, in which case the order of its bytes determines the byte order. When the storage object is read, any byte order mark character is discarded.
This is equivalent to the ``Extended UNIX Code Packed Format for Japanese'' Internet charset. Each character is encoded by a variable length sequence of octets.
This is ASCII and KSC 5601 encoded with the EUC encoding as defined by KS C 5861-1992.
This is ASCII and GB 2312-80 encoded with the EUC encoding. It is equivalent to the CN-GB MIME charset defined in RFC 1922.
This is equivalent to the Shift_JIS Internet charset. Each character is encoded by a variable length sequence of octets. This is Microsoft's standard encoding for Japanese.
This is equivalent to the CN-Big5 MIME charset defined in RFC 1922.
n can be any single digit other than 0. Each character in the repertoire of ISO 8859-n is represented by a single byte.
On input, this uses XML's rules to determine the encoding. On output, this uses UTF-8.
Specify this encoding when a storage object is encoded using your system's default Windows character set. This uses the so-called ANSI code page.
This uses the unicode encoding if the storage object starts with a byte order mark and otherwise the windows encoding. If you are working with Unicode, this is probably the best value for SP_ENCODING.
Specify this encoding when a storage object (file) uses the OEM code page. The OEM code-page for a particular machine is the code-page used by FAT file-systems on that machine and is the default code-page for MS-DOS consoles.
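The byte order mark handling described above for the 2-byte encoding with an optional byte order mark can be illustrated with a short Tcl sketch. It is a simplified model, not the widget's decoder: the procedure name decodeUnicode is hypothetical and the input is assumed to be a binary string with an even number of bytes.

```tcl
# Decode 2-byte characters. A leading byte order mark (0xFEFF) selects
# the byte order and is discarded; without one, the system byte order
# is used.
proc decodeUnicode {bytes} {
    set bigEndian [expr {$::tcl_platform(byteOrder) eq "bigEndian"}]
    # Read the first two bytes as a big-endian unsigned 16-bit value.
    if {[binary scan $bytes Su first] == 1} {
        if {$first == 0xFEFF} {
            set bigEndian 1
            set bytes [string range $bytes 2 end]
        } elseif {$first == 0xFFFE} {
            # The mark appears byte-swapped, so the data is little-endian.
            set bigEndian 0
            set bytes [string range $bytes 2 end]
        }
    }
    set codes {}
    binary scan $bytes [expr {$bigEndian ? "Su*" : "su*"}] codes
    set text ""
    foreach code $codes {
        append text [format %c $code]
    }
    return $text
}

puts [decodeUnicode [binary format H* FEFF00410042]]   ;# prints AB
puts [decodeUnicode [binary format H* FFFE41004200]]   ;# prints AB
```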
The range of predefined storage managers can be extended by user-written storage managers to implement new classes of storage objects.
For example, an ftp storage manager could be defined to
retrieve documents from an ftp server. After implementing an ftp
storage manager with a name of FTP, entities could be loaded from an
ftp server by using a formal system identifier like
<FTP>ftp.epc.de/sample1.xml.
To be able to use formal system identifiers that refer to user defined storage managers, the following steps have to be performed:
User defined storage managers are implemented as Tcl scripts that are called by the parser when an entity with the appropriate formal system identifier is loaded. Basically, a storage manager is a Tcl procedure that is called with a varying number of arguments.
proc xyz { cmd arg1 args } {
# storage manager implementation
}
The first argument cmd that is passed to a storage manager
procedure is a command keyword that specifies the required
operation. The number and the semantics of the remaining arguments
depend on the command keyword cmd. The following commands must
be supported by every storage manager:
Typically, a storage manager will use a switch construct to handle the different command keywords.
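Such a dispatch might look like the sketch below. The keywords open, read and close are placeholders chosen only for illustration and are not the command keywords the sgml widget actually sends; the return values are likewise invented. Only the procedure signature follows the skeleton shown earlier.

```tcl
# Skeleton of a storage manager procedure that dispatches on cmd.
# NOTE: "open", "read" and "close" are hypothetical keywords.
proc xyz { cmd arg1 args } {
    switch -exact -- $cmd {
        open {
            # Placeholder: prepare access to the storage object named arg1.
            return "opened $arg1"
        }
        read {
            # Placeholder: deliver data for the storage object.
            return ""
        }
        close {
            # Placeholder: release any resources.
            return
        }
        default {
            return -code error "xyz: unsupported command \"$cmd\""
        }
    }
}
```

Reporting an error in the default branch makes an unexpected keyword visible instead of silently ignoring it.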
Once a storage manager has been implemented, it must be registered with the sgml widget before it can be used.
It is possible to replace the built-in storage managers (with the exception of osfile) by user defined storage managers.