Subject: Are You Converting Publications to Web Pages for a BIG Site? (long post)
From: Chet Ensign <Chet_Ensign%LDS -at- NOTES -dot- WORLDCOM -dot- COM>
Date: Wed, 21 Jun 1995 10:34:26 EDT

Stephanie Goble started a very thought-provoking thread when she asked...

>Is anyone else working for a company that is putting their entire catalog
>of public and private documents into a Web-accessible format?

...along with all the attendant issues she raised. Having read all the responses
and arranged them into the mental space I've allocated to the topic "HTML:
Electronic Publishing Heaven or Hell?" -- and having just accidentally hit the
'Send' button, sending an enormous chunk of collected material to the list (sorry,
everybody) -- it's time to put these thoughts into shape and add them to the
dialog.

HTML, HTTP, URLs and Web browsers have been the biggest boons to electronic
publishing since the computer itself. The Web has neatly resolved the problem
of delivering information to many different platforms and operating systems by
using a client/server model. The Web is so simple to implement that a whole new
generation of writers and publishers have jumped on board with next to no
technophobia. And the Web has changed what was a niche in the technical writing
field into a popular phenomenon. So in all that it has been very good.

But I am afraid that it is also leading us into a sweet trap. In these early
stages of its development, we either don't experience the problems or we can
easily shrug them off. But in truth, HTML, the server architecture itself, and
the proliferating field of browsers have the potential to complicate our lives
enormously.

I'll describe the developing landscape as I see it and throw out the one idea,
rough as it is, for traversing it safely. This is a long posting, but I hope
that the issues it raises help us build better systems for our BIG Web sites.
And the good ones are almost all going to be BIG.


Issue: HTML is a Moving Target

In the beginning, HTML was a nice, clean little language. It did not have to be
complicated, because it was mainly a mechanism to let physicists collaborate on
their research. The kinds of text they wanted to share were simple research
memos and papers, so the markup language could stay modest. In fact, it
*couldn't* be too difficult, since most researchers would be coding their papers
in plain text editors. The simpler the markup, the better.

Nobody expected this. Nobody expected HTML, the Web and Mosaic to become the
backbone for interactive advertising and electronic commerce.

But the standards bodies are responding to the challenge. The W3O has specified
HTML 2.0 and is working hard on HTML 3.0. The 3.0 level will specify some hotly
desired extensions, such as tables and stylesheets, but the final shape of those
schemes is still a year or two away. Some browser makers, in the meantime, not
satisfied to wait on an orderly process of development, have built their
browsers to handle codes that either have nothing at all to do with the HTML
specs, or that reflect their best guess at how the final specs will emerge.

And in the meantime, we are creating HTML today and we will continue to create
HTML tomorrow. Some of us have been creating HTML for a while. Our sites are
growing in size and scope, and now some of the larger sites are starting to
recognize that there's a minor problem here. Their HTML archives have reached
megabytes in size, and the HTML in them is all over the place: some of it is
HTML 2.0, some is in older versions of HTML, some uses proprietary codes, and
some goes back to the days when there were no guidelines at all and people were
coding to the 'whatever works' standard. The RFPs we receive now reflect this
problem; more and more of them explicitly ask how the markup will be kept valid
as HTML evolves.
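
To make that kind of audit concrete, here is a minimal sketch of my own (not
part of any project mentioned above) that tallies every tag found in an HTML
archive and flags anything outside a known HTML 2.0 list. The archive path and
the abbreviated tag list are assumptions for illustration only:

    import os
    import re
    from collections import Counter

    # Abbreviated list of tags from the HTML 2.0 spec; a real audit would
    # work from the full DTD. Treat this as a placeholder.
    HTML_2_0_TAGS = {
        "html", "head", "title", "body", "h1", "h2", "h3", "h4", "h5", "h6",
        "p", "a", "img", "ul", "ol", "li", "dl", "dt", "dd", "pre",
        "blockquote", "em", "strong", "code", "br", "hr", "address",
        "form", "input", "select", "option", "textarea", "meta",
    }

    TAG_RE = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)")

    def inventory(archive_root):
        """Count every start tag found under archive_root."""
        counts = Counter()
        for dirpath, _dirnames, filenames in os.walk(archive_root):
            for name in filenames:
                if not name.lower().endswith((".html", ".htm")):
                    continue
                with open(os.path.join(dirpath, name), errors="replace") as fh:
                    for match in TAG_RE.finditer(fh.read()):
                        counts[match.group(1).lower()] += 1
        return counts

    if __name__ == "__main__":
        for tag, n in inventory("htdocs").most_common():   # hypothetical root
            flag = "" if tag in HTML_2_0_TAGS else "  <-- not in HTML 2.0"
            print("%-12s %6d%s" % (tag, n, flag))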

Additional wrinkle: too many writers, frustrated by their inability to control
the final appearance of an HTML file, resort to 'tricks' -- tricky coding to make
text sit next to pictures, to indent text a particular way, or to force just the
right look. The tricks work ... for now. But, just as programmers who use
undocumented calls can't rely on their programs working with the next release of
Windows, there is no guarantee that these documents will behave nicely in
tomorrow's Web environment.

HTML is evolving and it is evolving very, very fast. We can only be sure of one
thing. By coding in today's HTML, we are creating somebody's upgrade problem
for tomorrow.


Issue: The Web server architecture is fragile

The organization of information on a Web server is basically a file system. It
is a tree of linked files. That is a pretty fragile architecture to substitute
for a robust document management system, yet that's what we've got. Most of us
already run into its problems on our network file servers. I don't think making
the same kind of system publicly available is going to magically cure its
limitations.

Like real trees, file systems grow -- especially if we don't have a tight gateway
and clear-cut policies governing who puts what where on the Web site. In short
order, any interesting Web site is going to turn into... well, into a web.
Which is what we want it to be, only we don't want to have to become spiders to
hold it together.

There are absolutely no link management controls in this environment. How do
you proof the thing? What are the implications, for example, once people start
to yank pieces out, or change them? Suppose this Web page is a step on the path
from point A to point B, and I don't like something about it, so I rewrite it,
forgetting as I do that there was a link to B in the original. I've done more
than break a link; I've potentially closed off a pathway to other information
at the site. This is made more complicated on the Web by the global scope of
the system -- the very feature that makes the Web so attractive in the first
place. Links and presence now matter outside the scope of our own organization.
Removing information potentially breaks links from other locations as well.
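
For what it's worth, even a crude proofing pass can catch the dangling links
inside your own tree. This is a minimal sketch, assuming a single document root
and relative or site-rooted href values; it makes no attempt to check links
arriving from other sites:

    import os
    import re

    HREF_RE = re.compile(r'href\s*=\s*"([^"#]+)', re.IGNORECASE)

    def broken_links(doc_root):
        """Yield (source file, target) pairs whose target file is missing."""
        for dirpath, _dirs, files in os.walk(doc_root):
            for name in files:
                if not name.lower().endswith((".html", ".htm")):
                    continue
                source = os.path.join(dirpath, name)
                with open(source, errors="replace") as fh:
                    text = fh.read()
                for target in HREF_RE.findall(text):
                    if target.startswith(("http:", "ftp:", "mailto:", "gopher:")):
                        continue            # off-site links: someone else's tree
                    if target.startswith("/"):
                        resolved = os.path.join(doc_root, target.lstrip("/"))
                    else:
                        resolved = os.path.join(dirpath, target)
                    if not os.path.exists(os.path.normpath(resolved)):
                        yield source, target

    if __name__ == "__main__":
        for source, target in broken_links("htdocs"):       # hypothetical root
            print("%s: dangling link to %s" % (source, target))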

There is also the issue of 'chunking.' I can imagine the reaction of the
writers here if I told them: "Oh, and by the way, from now on you have to
store every section of your document in a separate file." They would go bananas
on me, and rightly so. It is an arbitrary constraint meant to serve a
specialized need, and it should not be imposed on writers. Yet, on a Web server,
we do have to break information down into modest-sized files simply to optimise
the system for readers. I'm not suggesting that the Web should be different, or
that the designers of Web sites should not consider those parameters. I'm just
saying that it should not be an artificial constraint on writers.


Issue: Browsers are differentiating

Web browsers have become commercial products, and commercial products don't
make their money by being the same as everybody else. Thank goodness. They make
their money by being different and by pushing the limits of what is possible.
For the Web reader, this is a good thing. The material out there is going to
become richer, more interesting and more engaging. I don't know about you, but
I can't wait to start playing with VRML browsers.

For the Web site developer, however, a landscape of differentiating browsers
poses problems. We are already seeing Web sites labelled 'Coded for
Netscape' or some such. They look great in the latest version of Netscape and
so-so to downright scrambled in everything else. Believe me, this way lies
madness!

Of course, many people say, "Well hey, Netscape has the lion's share of the
market." True today; maybe not tomorrow. Windows 95 is coming out with...
Spyglass Mosaic. Talk about changing the balance of power! Then what do you do
with all those megs of hard-coded Netscape pages you've got?

The truth is, browsers will be different. As time progresses (and remember, we
are talking about an extremely foreshortened version of 'time' here... weeks
and months, not years!), browsers will be released with new features and unique
capabilities, and we, as Web site developers, will want to take advantage of
them -- but not at the expense of locking ourselves out of future capabilities.
One of the true beauties of the Internet and the Web is that they are based on
open standards. Open standards are what have made the Web the substantial
success that it is. Why on earth would we want to throw that away?!


Solution: Generate the code that best satisfies the browser

I see only one way out of the trap and I'm not even sure how to recommend we
accomplish it.

The conclusion I draw from all these thoughts is that HTML is a terrific
language for *delivering* information but a disastrous language for *storing*
it. The solution is to store information in a predictable, reliable form, one
where you can be sure that the same kinds of text content in the repository
will always be identified the same way, and then to generate the optimum HTML
when it is requested.

Some projects are already doing this. They have information in databases of one
type or another, with systems sitting between the Web and the database. When
the Web server passes a request for a URL to the system, the system uses it to
extract all the data components from the database (text, graphics, etc.) and
dynamically tags them with HTML before sending this virtual Web page back down
the pipeline.
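
As a rough sketch of what such a middle layer might do -- the record layout,
function name and sample content here are hypothetical, not a description of
any shipping system -- the handler pulls the stored components and wraps them
in HTML only at delivery time:

    # Sketch only: the record structure (title, sections, images) is an
    # assumption; a real repository might be a relational database, an
    # SGML store, or something else entirely.

    def render_page(record):
        """Build an HTML page from a neutral content record."""
        parts = ["<html><head><title>%s</title></head><body>" % record["title"],
                 "<h1>%s</h1>" % record["title"]]
        for section in record["sections"]:
            parts.append("<h2>%s</h2>" % section["heading"])
            for paragraph in section["paragraphs"]:
                parts.append("<p>%s</p>" % paragraph)
        for image in record.get("images", []):
            parts.append('<img src="%s" alt="%s">' % (image["src"], image["alt"]))
        parts.append("</body></html>")
        return "\n".join(parts)

    # One record, as the middle layer might fetch it for a single URL:
    record = {
        "title": "Warranty Terms",
        "sections": [
            {"heading": "Coverage",
             "paragraphs": ["Parts are covered for one year.",
                            "Labor is covered for ninety days."]},
        ],
    }
    print(render_page(record))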

This can solve the browser problem. Many browsers already identify themselves
in the HTTP request. If the system is coded so that it looks for the identity
of the browser first, then it can deliver whatever HTML works best for that
browser -- even down to inserting those little 'tricks' that make the final
presentation look nice.
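
A sketch of that branching, again under assumptions of my own -- the User-Agent
test and the two output flavors are invented for illustration, and a real
system would need a far richer capability lookup:

    def render_rows_netscape(rows):
        """Use table markup for a browser known to support it."""
        cells = "".join("<tr><td>%s</td><td>%s</td></tr>" % row for row in rows)
        return "<table border=1>%s</table>" % cells

    def render_rows_plain(rows):
        """Fall back to a definition list for everything else."""
        items = "".join("<dt>%s</dt><dd>%s</dd>" % row for row in rows)
        return "<dl>%s</dl>" % items

    def render_for(user_agent, rows):
        # Netscape announces itself as "Mozilla/..."; this single test is a
        # deliberate oversimplification.
        if user_agent.startswith("Mozilla"):
            return render_rows_netscape(rows)
        return render_rows_plain(rows)

    rows = [("Part", "X-100"), ("Price", "$24.95")]
    print(render_for("Mozilla/1.1N (Windows)", rows))
    print(render_for("Lynx/2.4", rows))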

This can solve the HTML problem, because as HTML evolves, you change your
processing code, not your data. You are spared having to choose between
lowest-common-denominator HTML and excluding the whole class of readers who
don't use your 'browser of choice.'

This can even solve the 'chunking' problem, because optimising the delivered
file size can again be handled by the system. Imagine that you are sending a
long legal treatise back down the pipeline and the 'section' of text is pretty
big. The system could simply be set up to send it in manageable chunks, with a
'More' anchor at the end of each chunk. If readers want to keep reading, they
just click the link and continue on.
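
A minimal sketch of that chunking step -- the size limit and the URL scheme
behind the 'More' link are assumptions of mine, not anything the Web itself
dictates:

    def chunk_paragraphs(paragraphs, limit=4000):
        """Group paragraphs into chunks of at most `limit` characters each."""
        chunks, current, size = [], [], 0
        for p in paragraphs:
            if current and size + len(p) > limit:
                chunks.append(current)
                current, size = [], 0
            current.append(p)
            size += len(p)
        if current:
            chunks.append(current)
        return chunks

    def render_chunk(doc_id, chunks, index):
        """Render one chunk, with a 'More' link if anything follows it."""
        body = "\n".join("<p>%s</p>" % p for p in chunks[index])
        if index + 1 < len(chunks):
            body += '\n<p><a href="/doc/%s?chunk=%d">More...</a></p>' % (
                doc_id, index + 1)
        return body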

All this takes programming and preparation and I can't tell you (alas) that I
have just the package in hand. Even if I did, that isn't why I wrote this. I
wrote this to encourage us all to share ideas. I'm sure that there are options
beyond what I know for accomplishing this and since we are going to be living
with this stuff for years, it is worth kicking a few of them around.

Best regards,

/chet

Chet Ensign
Director of Electronic Documentation
Logical Design Solutions
571 Central Avenue          http://www.lds.com
Murray Hill, NJ 07974       censign -at- lds -dot- com [email]
908-771-9221 [Phone]        908-771-0430 [FAX]

