Below is a very preliminary analysis I was asked to perform for someone
interested in converting unstructured FM docs to structured FM+SGML docs
that conform to the Docbook DTD. I am posting it on the TECHWR, FrameSGML,
and framers lists because I believe it may be of general interest. I invite
comments, particularly if you disagree with anything stated herein.
==============================================
I have examined the FM document you sent. It appears to be consistently
tagged. The paragraph and character tagging scheme is quite simple, and
reflects a relatively small number of document object types (e.g., body
text, bulleted list, datafile). The Docbook DTD defines approximately 120
different elements. My own opinion is that Docbook is the DTD from hell,
and should be avoided at all costs unless you are being forced to use it.
1. CONVERSION TO STRUCTURED FM+SGML DOCUMENTS
Obviously, any kind of automated conversion method to go from FM
unstructured to FM+SGML structured in conformance with a DTD requires that
the paragraph and character tags be unambiguously mappable to applicable
elements in the DTD. Furthermore, an automated conversion cannot properly
assign attribute values to elements in the converted docs (i.e., every
attribute would be initialized to its DTD-specified default value, if
any).
FM+SGML has a built-in capability to convert unstructured docs to
structured ones, using structure rule tables to map the various tagged
document objects in the unstructured doc to the corresponding SGML
elements. When there is a 1:1 relationship (as opposed to a 1 to many or
many to 1) of each tagged object to a corresponding SGML element, structure
rule tables can do a fairly good job. However, manual cleanup work is
inevitable to make the converted document fully conformant to the DTD/EDD,
and to apply the appropriate attribute values.
I conclude that your documents probably do not fit well with the above
conversion requirements, particularly for conversion to a DTD/EDD as
complex as Docbook. However, a more thorough analysis might show otherwise,
particularly if you decide to develop your own DTD/EDD whose structure
closely resembles that of your existing documents.
There is one additional requirement that must be met for unstructured to
structured conversions to be possible: The entire FM document must have a
single text flow.
Obviously, once you've converted an unstructured FM document to a
structured FM+SGML one, you never again want to revert to the
unstructured one for editing or anything else. After conversion, you should
discard the original (first verifying, of course, that everything was
properly converted).
2. VERSION CONTROL
You mention keeping the content of these documents (in .txt or .mif format)
in a CVS. Clearly, storing .txt or .mif is not the answer. Instead, you
should export the documents from FM+SGML to XML and store that.
XML has many new features (including Unicode) that make it superior to SGML
(and certainly cosmically better than ASCII text or MIF) for database
storage. Storage in this form has the added advantage of allowing you to
maintain revision/version control at any desired level of granularity,
because the proper kind of database repository can parse the document into
its individual components (i.e., elements and external entities (e.g.,
graphics)), maintain revision/version information on each component, and
retrieve any desired portion of any desired version.
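To illustrate what element-level granularity means, here is a minimal
Python sketch that fingerprints every element of an XML document and
reports which components differ between two versions. It is not how any
particular repository product works; the function names, the path
notation, and the sample documents are assumptions made for this example.

```python
import hashlib
import xml.etree.ElementTree as ET

def component_fingerprints(xml_text):
    """Return {element path: SHA-1 digest} for every element in the document."""
    root = ET.fromstring(xml_text)
    fingerprints = {}

    def walk(element, path):
        # Hash the serialized subtree (including any trailing text), so a
        # change anywhere below this element changes its digest.
        serialized = ET.tostring(element, encoding="utf-8")
        fingerprints[path] = hashlib.sha1(serialized).hexdigest()
        counts = {}
        for child in element:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, f"/{root.tag}")
    return fingerprints

def changed_components(old_xml, new_xml):
    """List the paths whose content is new or differs in the newer version."""
    old = component_fingerprints(old_xml)
    new = component_fingerprints(new_xml)
    return sorted(path for path in new if old.get(path) != new[path])

# Hypothetical document versions, for illustration only.
version_1 = "<chapter><title>Intro</title><para>Old text.</para></chapter>"
version_2 = "<chapter><title>Intro</title><para>New text.</para></chapter>"
changes = changed_components(version_1, version_2)
```

Here only the paragraph and its enclosing chapter are reported as changed;
the title is untouched, which is the kind of distinction a line-oriented
store of .txt or .mif files cannot make.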
A CVS/data repository that stores XML can become the sole source of
controlled documents for an entire enterprise. Information is retrieved
from the database by human and non-human queries. Middleware (e.g.,
OmniMark) is used to process the information extracted by these queries to
match the requirements specified by the users. XSL style sheets (a
companion specification to XML) can be created by the middleware to format
the information when it is viewed in an XML-aware browser.
3. ROUND-TRIPPING BETWEEN THE CVS AND FM+SGML
Ideally, you would originate, revise, and edit your structured documents in
the WYSIWYG environment of FM+SGML, export them as XML for storage in the
database repository, and check the documents (or any portion thereof )
directly out of the database into FM+SGML for incorporating changes, as
well as for printing them or converting them to PDF or other formats.
However, XML round-tripping is not possible because FM+SGML (including the
new 6.0 version) can export XML but cannot import it. Consequently, if you
export your documents as XML for storage in the database, you'll have to
use a middleware product like OmniMark to convert the XML document
instances to SGML before they can be imported into FM+SGML. This
conversion from XML to SGML also requires that Unicode characters with
code points above 127 (which includes all non-English characters) be
converted to their equivalent ISO character set entity references, since
FM+SGML cannot process Unicode input.
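The character conversion step can be sketched in Python as follows. This
is not OmniMark code. As a stand-in for the ISO entity sets it uses the
HTML 4 entity names from Python's standard html.entities module, on the
assumption that those names match the ISO ones for the characters you use;
a real conversion would have to use the entity sets your DTD actually
declares. The function name and sample string are hypothetical.

```python
from html.entities import codepoint2name

def to_entity_references(text):
    """Replace characters above code point 127 with named entity references.

    Returns the converted text plus a sorted list of characters that had no
    known entity name; those are left unchanged for manual review.
    """
    converted = []
    unmapped = set()
    for char in text:
        code = ord(char)
        if code <= 127:
            converted.append(char)                        # plain ASCII passes through
        elif code in codepoint2name:
            converted.append(f"&{codepoint2name[code]};")  # e.g. U+00E9 -> &eacute;
        else:
            converted.append(char)                        # no name known: flag it
            unmapped.add(char)
    return "".join(converted), sorted(unmapped)

# Hypothetical sample text, for illustration only.
converted, unmapped = to_entity_references("Café über 5 µm")
```

Any characters returned in the second value have no entry in the stand-in
table and would need an entity declared for them by hand.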
It is extremely unfortunate that FM+SGML (including the new version 6.0)
does not implement Unicode. If Unicode had been fully implemented, it
would have been possible to use multi-language Unicode fonts with FM+SGML,
which would have greatly facilitated language translations, including the
intermixing of two or more languages in the same document. The intermixed
languages would have been fully preserved on export to, or import from, XML.
4. LINK PROBLEMS
Another problem is links (i.e., cross-references and hypertext links).
FrameMaker implements cross-reference links using ID and IDREF attributes
which conform to the SGML standard. This is OK when all such links are
internal to the exported SGML document instance, but external
cross-references created in FM+SGML do not produce links that work when
the document is exported to SGML, because FM+SGML, on export, cannot
produce an IDREF attribute value that includes the location of the external
file (this is a limitation of SGML). To make it worse, neither the internal
nor the external cross-references work if the document is exported to XML,
because links in XML are implemented differently, as specified in the XLink
and XPointer companion specifications to XML. You could create
XML-conformant equivalents of the ID and IDREF attributes in the FM+SGML
EDD (and the corresponding DTD). However, these attribute values, unlike
FM cross-references, would have to be entered manually for the elements at
each end of each link, and the links would not work in FM+SGML.
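One way to gauge the size of the external cross-reference problem is to
scan an exported document instance for references whose target ID is not
defined in the same instance. The Python sketch below does this for a
well-formed instance. The attribute names "Id" and "Idref", the element
names, and the sample markup are assumptions made for this example; use
whatever names your EDD/DTD defines.

```python
import xml.etree.ElementTree as ET

def unresolved_references(xml_text, id_attr="Id", idref_attr="Idref"):
    """Return the sorted reference values that have no matching ID in the document."""
    root = ET.fromstring(xml_text)
    # Every ID value defined anywhere in this document instance.
    defined = {
        element.get(id_attr)
        for element in root.iter()
        if element.get(id_attr) is not None
    }
    # Every reference value that does not point at one of those IDs.
    missing = {
        element.get(idref_attr)
        for element in root.iter()
        if element.get(idref_attr) is not None
        and element.get(idref_attr) not in defined
    }
    return sorted(missing)

# Hypothetical exported instance, for illustration only.
sample = (
    '<doc>'
    '<section Id="s1"/>'
    '<xref Idref="s1"/>'
    '<xref Idref="other-file-s9"/>'
    '</doc>'
)
dangling = unresolved_references(sample)
```

Each value in the result is a link that would have to be repaired by some
other means after export.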
5. FORMATTING
You also mentioned that it would be nice to be able to preserve the "look
and feel" of the existing unstructured documents after they've been
converted to structured documents. This is where FM+SGML really shines. All
of the formatting specifications are defined in the EDD and its companion
template. Consequently, you can make the converted FM+SGML documents
closely resemble the formatting of the current documents. Also, when you
import an SGML document instance (including one converted from XML) into
FM+SGML, the formatting specified in the EDD is applied.
When you export an XML document instance, you can also produce a Cascading
Style Sheet (CSS) that is derived from the formatting specifications in the
EDD and its companion template. Thus, if you open an exported XML document
instance with a CSS in an XML-aware browser such as IE5, the formatting
(but not necessarily the layout) of the original document will be replicated.
CONCLUSION
As you can see, conversion to structured FM+SGML documents is not a trivial
undertaking, and the full utilization of all the benefits that can be
derived therefrom is made difficult by some of FM+SGML's current
limitations. The initial investment is high, but if your operation is large
enough, the savings possible in areas such as author productivity, document
quality assurance, revision control, information reuse, and information
repurposing will pay back those costs many times over.
====================
| Nullius in Verba |
====================
Dan Emory, Dan Emory & Associates
FrameMaker/FrameMaker+SGML Document Design & Database Publishing
Voice/Fax: 949-722-8971 E-Mail: danemory -at- primenet -dot- com
10044 Adams Ave. #208, Huntington Beach, CA 92646
---Subscribe to the "Free Framers" list by sending a message to
majordomo -at- omsys -dot- com with "subscribe framers" (no quotes) in the body.