SGML Reference Doc

Subject: SGML Reference Doc
From: Chet Ensign <DOCCOE -at- IBIVM -dot- IBMMAIL -dot- COM>
Date: Tue, 26 Apr 1994 12:58:54 EDT

Laura M. Myott writes:

>I have been searching the WWW for a reference document on SGML. For my TC
>internship this semester, I have to convert some documents from HTML to
>SGML and I have yet to find anything on the Web which gives the syntax
>for SGML documents.

Laura,

HTML *is* SGML. To be more specific, it is one implementation of SGML.
SGML is simply a meta-language for defining specific markup languages for
documents... although the word "document" is a very broad category here.
For our purposes, "document" does not equal "book".

I'll append a "what is SGML in 25 words or less" to the end of this
message. But to answer your first couple of questions, there's no SGML
spec that I know of anywhere on the internet. The ISO standard itself
is, of course, for sale from the ISO and, being a legal document, it is
not especially enlightening. *The* technical reference, as far as I'm
concerned, is "The SGML Handbook" by Dr. Charles Goldfarb, published by
Oxford University Press. It is the ISO standard, liberally reorganized and
annotated by Dr. Goldfarb, the primary author of SGML. It is $95, so maybe
you want to insist that your library stock a copy of it.

As to HTML docs, try these:

1> http://www-external.hal.com/connolly/html-design.html. It's Dan Connolly's
notebook on the design of a successor for HTML. It's got pointers to lots of
related stuff.

2> http://info.cern.ch/hypertext/WWW/MarkUp/HTML.dtd.html
and
http://info.cern.ch/hypertext/WWW/MarkUp/htmlplusdtd.txt

I have a recent draft of the HTML specs that came from this repository. It
explains all the HTML tags.

The rest of this message is a brief, general overview of SGML itself. Hope
this is useful to you.



Subject: Overview of SGML, from a posting to comp.text.sgml


SGML is not a markup language nor is it a product. You can not buy SGML. SGML
is an international standard (ISO 8879) which many vendors (Arbortext,
Avalanche, EBT, Exoterica, Frame, Interleaf, SoftQuad, WordPerfect...) are
using to develop or enhance text processing products.

Think of it this way. SGML does for text what relational theory and SQL did
for regular data over a decade ago (and we always have to give Tim Bray of
Open Text a nod when we say this -- he made the point in a wonderful
presentation at SGML '92). It provides us with a method -- a framework -- for
describing (and marking up) text that is independent of the program that
creates it.

Here now. Take the markup language of your choice -- Rich Text Format, LaTeX,
Script, you name it -- and what have you got? What you've got is a language
that tells a specific product how to make text look the way you want it to
look. Different word processor, different platform -- so sorry. Send us ASCII
instead. We don't understand all those formatting codes of yours.

But what is format, anyway? It is IMPLIED visual information about the
organization and structure that you have put the text into. In fact, one
simple way to define "document" is that it is text and graphics contained in
some sort of organizing framework -- chapter, division, subdivision,
subsubdivision, topic, or what-have-you. All those visual cues that the GUI word
processors make it so easy to paint on -- different fonts, indented margins,
pretty drop shadow boxes -- they are all meant to help us interpret, as we
read, what each chunk of text is supposed to be. "Oh, this big font here means
I've started a new section. Oh, that stuff inside the box must be a screen
shot."

SGML says -- why imply it? That's fine for the printout and the reader, but
why make the markup behind that display be as ambiguous as the display itself?
Ambiguous? Well, I mean, a lot of stuff gets printed italic in my manuals, but
the italicized stuff is not all the same kind of thing. Some of it is book
titles, some is system keywords, some of it is "hey, pay attention to me"
emphasized phrases. Why not have the underlying markup say "This is a book
title, this is a keyword, and this over here is the writer waving at you"?
Then, different output systems can display them whatever way is best. Make it
italic on paper, bold on my mainframe and ignore it completely when it goes to
comp.text.sgml.
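
As a rough sketch of that idea in Python (purely illustrative: the element
names booktitle, keyword and emph, the three output targets, and the use of
_underscores_ for italic and *asterisks* for bold are all made up for this
example and come from no real DTD or product):

```python
# Hypothetical element names and output targets, invented for illustration.
# Each target maps an element name to a template for displaying its text.
STYLES = {
    "paper":     {"booktitle": "_{}_", "keyword": "_{}_", "emph": "_{}_"},
    "mainframe": {"booktitle": "*{}*", "keyword": "*{}*", "emph": "*{}*"},
    "usenet":    {},  # this target ignores the markup completely
}


def render(pieces, target):
    """Join (element, text) pairs, styling each one for the given target."""
    style = STYLES[target]
    out = []
    for element, text in pieces:
        # Untagged text, or an element the target has no style for,
        # passes through unchanged.
        template = style.get(element, "{}")
        out.append(template.format(text))
    return "".join(out)


# The text is marked up by what each piece IS, not by how it should look.
sentence = [
    (None, "See "),
    ("booktitle", "The SGML Handbook"),
    (None, " for the "),
    ("keyword", "ELEMENT"),
    (None, " declaration."),
]

for target in STYLES:
    print(f"{target:10} {render(sentence, target)}")
```

Because the book title and the keyword carry different names, any one target
could later style them differently without anyone touching the text itself.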

This is called "generic coding." You say what a piece of text is instead of
what it is supposed to look like, by defining the elements that can go into
your documents and using their names as tags to show where each starts and
ends. Let's say you define paragraphs to be <para> and bulleted lists to be
<listbul> and items in the list to be <it>. Then, instead of having perfectly
plain, eminently readable, ever so exchangeable markup and text like:

****************************

\pard\plain \s254\sb120 \b\f18 {\f21 Chapter Divider Page\par }\pard\plain \f3
The chapter divider page can be completely derived by the system by extracting
the following information.\par
\pard\plain \s253\li360 \b\f3 chapter #\par
chapter title\par
graphic - logo\par
graphic - sequence marker\par
\pard\plain \f3 \par
}\pard\plain \f3 Chapter table of contents can be completely derived by the
system by extracting the following information.\par

****************************

You have something a bit less dense like:


****************************
<sect><title>Chapter Divider Page</title>
*** and note that the / means it's the end of the named element ***
<para>The chapter divider page can be completely derived by the system by
extracting the following information.</para>
<listbul>
<it>chapter #</it>
<it>chapter title</it>
<it>graphic - logo </it>
<it>graphic - sequence marker</it>
</listbul>
<para>Chapter table of contents can be completely derived by the system by
extracting the following information.</para>

****************************

Which I maintain is a big improvement.

Where do you define the elements? You write a data definition called a
Document Type Definition (DTD). The DTD does a lot of work in SGML, but the
lion's share of it is a collection of statements that each declare an element
that can occur in the document and what stuff -- text, other elements -- the
element can contain. It is equivalent to the data definition you write for a
database.
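
To make that concrete, here is a small Python sketch. It holds a DTD fragment
I wrote for the tags in the example above (an assumption on my part, not
anyone's production DTD) and pulls each element declaration out into a lookup
table. The regular expression handles only this simple declaration shape, not
the full SGML declaration syntax.

```python
import re

# A hypothetical DTD fragment for the tags used in the example above.
# Each statement declares one element and what it may contain.
# "- -" says neither the start tag nor the end tag may be omitted.
DTD = """
<!ELEMENT sect    - - (title, (para | listbul)+)>
<!ELEMENT title   - - (#PCDATA)>
<!ELEMENT para    - - (#PCDATA)>
<!ELEMENT listbul - - (it+)>
<!ELEMENT it      - - (#PCDATA)>
"""

# Matches only this simple declaration shape, not everything SGML allows.
DECL = re.compile(r"<!ELEMENT\s+(\w+)\s+[-O]\s+[-O]\s+(.+?)\s*>")


def content_models(dtd_text):
    """Map each declared element name to its content model string."""
    return {name: model for name, model in DECL.findall(dtd_text)}


models = content_models(DTD)
for name, model in models.items():
    print(f"{name:8} may contain {model}")
```

Read the first declaration as: a sect holds one title, followed by one or
more paragraphs or bulleted lists in any mix. #PCDATA means plain text.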

Now you may reply "This is all very well and good but so what? I don't have to
look at those codes anyway." And you're right. You don't. But your computer
sure does. And what your computer looks at and what it can do with it goes a
long way towards determining what you are going to be able to do with that
document you see.

That DTD gives systems a lot of control and leverage over documents. The fact
that an explicit declaration of the construction of your document exists means
that programs can be written to compare a document with its DTD and tell you
whether it is coded correctly or, if not, give you meaningful error messages
about what is tagged improperly. This gets referred to as "parsing."
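
A toy version of such a check, sketched in Python under some loud
assumptions: the RULES table is a hand-simplified stand-in for a DTD (it
records only which children an element may have and whether it may hold text,
ignoring order and repetition), and Python's XML parser stands in for a real
SGML parser, which works here only because the samples close every tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified rules: for each element, which child elements
# are allowed and whether it may hold text directly. A real DTD content
# model also constrains order and repetition; this sketch does not.
RULES = {
    "sect":    {"children": {"title", "para", "listbul"}, "text": False},
    "title":   {"children": set(), "text": True},
    "para":    {"children": set(), "text": True},
    "listbul": {"children": {"it"}, "text": False},
    "it":      {"children": set(), "text": True},
}


def check(element, path=""):
    """Return a list of error messages for element and its descendants."""
    here = f"{path}/{element.tag}"
    rule = RULES.get(element.tag)
    if rule is None:
        return [f"{here}: element is not declared"]
    errors = []
    # Text directly inside the element, before its first child.
    if not rule["text"] and (element.text or "").strip():
        errors.append(f"{here}: text is not allowed here")
    for child in element:
        if child.tag not in rule["children"]:
            errors.append(
                f"{here}: <{child.tag}> is not allowed inside <{element.tag}>"
            )
        else:
            errors.extend(check(child, here))
        # Text that follows a child still sits inside this element.
        if not rule["text"] and (child.tail or "").strip():
            errors.append(f"{here}: text is not allowed here")
    return errors


good = ("<sect><title>Chapter Divider Page</title>"
        "<listbul><it>chapter #</it></listbul></sect>")
bad = ("<sect><title>Chapter Divider Page</title>"
       "<it>chapter #</it></sect>")

print(check(ET.fromstring(good)))
print(check(ET.fromstring(bad)))
```

The second sample puts a list item directly inside the section, with no
list around it, and the check names the offending tag and where it sits.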

Have you ever run into the situation where you get a file from somebody else
and you merge it into the middle of the document you are writing and then
spend the rest of the day trying to figure out why, from that point on, your
margins are whacked and fonts keep going to Courier on the printout and so
on? I am a writer and, believe me, I have. In word processed files, the
underlying markup can get so complex, so intricate, so idiosyncratic, that
there is no way to check whether what you are getting is going to be
compatible with what you've done but to dump it in and see what happens. Most
of the time, it works. But when it doesn't, you get to spend a couple of hours
on the phone with Microsoft support and, well, you get the idea.

Parsing is one of the aspects of SGML that makes it very attractive to me. I'm
supposed to corral the creative proclivities of 20 writers inside my
department and another 20 or so scattered across the company and make it
possible for them all to work as a team on large, collaboratively produced
documents so that they come together like a jigsaw puzzle, with nice clean
clicks and snaps, not with gurgles and burps and downtime while the lead
writer calls Microsoft. The prospect of a system that controls the underlying
markup of these documents and error checks a contribution BEFORE that
contribution brings the document to its knees I view with a fervor normally
reserved for a long weekend in Aruba.

All this means nothing if there are no products on the market that support the
standard. But those products have started arriving. Companies like SoftQuad,
Arbortext and Datalogics have been putting together SGML-based publishing
systems for some time. Interleaf, Frame and others are just now releasing
versions of their composition systems that (to some degree or another) take
advantage of SGML. And many of the vendors recently banded together to form
SGML Open, a consortium intended to help them make SGML products work more
seamlessly together.

Best regards,

/chet

--
Chet Ensign
Information Builders, Inc. 212-736-6250 X4349
1250 Broadway
New York, NY 10001

internet: doccoe -at- ibivm -dot- ibmmail -dot- com
ibmmail: USUBUVMV -at- IBMMAIL
compuserve: 73163,1414

