Re: CaveXML>Misc. Discussion>Data model, Concepts


From: Martin Heller (heller_at_geo.unizh.ch)
Date: Mon Feb 12 2001 - 16:49:09 CET




>I sent this message a couple days ago but it seems to have been lost in
> all the chatter. At least one person confessed to not having seen it come
> across. I'll submit it again in case anyone cares.

> DSK

Hello,

no, your message was not lost or ignored. It's just that answering in
a foreign language takes some time :-)

I would like to take the opportunity to join the discussion. I have
been following it closely, but other things - less important, but
more urgent (like work) - and recovering from serious back problems
have kept me from participating directly.

I am glad that you agree with some of the issues that I raised with
Andreas, and that you, too, suggest discussing the concepts and
entities of a common data model.

I am as excited about the possibilities of XML as you are and hope
that - this time - the time is right to succeed in creating a common
exchange format. There seems to be a critical mass of people in this
group who combine a lot of knowledge. Unfortunately, previous
attempts failed despite great expertise, effort and enthusiasm. We
should try not to make the same mistakes again.

One reason for those failures was definitely that the problem is
complex, more complex than it seems at first sight.

There are conceptual differences between the existing data formats.
Some of them are subtle, but a few are quite fundamental.
The differences stem only partially from varying design priorities,
goals and decisions in software engineering. They mainly result from
the diverging styles the cave-surveyors prefer. The programs reflect
'cultural' diversity: there exist different survey and modeling
approaches. There is a lot of similarity between many 'American'
programs because there seems to be an 'American' school of thought.
American cavers survey and draw maps in a certain way that over time
went in a different direction than the European way, for instance.
Well, this is a simplification: there is definitely no European way.
If you try to handle data from different countries, you will find
major discrepancies. Cavers are individualists - and it shows. It
would have been very difficult to write a cave survey program for all
needs. Each existing program was created to best suit a particular
style: the way the local cavers wanted to survey and had always
surveyed. I have never tried to advocate Toporobot in other countries
because it supports well only the approach it was originally meant
for. Nevertheless, many people now use it worldwide and have adopted
the 'Toporobot' way. Others wanted to use it and complained bitterly
that it did not work for them. It is somewhat like using a German
spelling checker for English text, or a Java interpreter for assembly
language.

Over the years, I have written many programs and Perl scripts to
import data from various sources (many extinct programs or
spreadsheets, and later AGHPlan, CAD fuer Hoehlen, Compass, Smaps,
OnStation, Survex, CaveRender etc.), e.g. for Hoelloch (1979),
Flintridge-Mammoth (1982), Lechuguilla (1995), and various major
cave systems in Europe. The resulting scripts were typically not
general enough to support the entire feature set of the foreign
format, just sufficient for the particular transfer. Most often, the
scripts had to be adjusted for each data set, as the data format had
evolved and was always a little different. This will definitely
improve with XML.

But the main problem was not the syntax or the structure; it was
the substance. There were many incompatibilities between the data
models. Parsing was fairly easy; converting was not.

The main difficulty was the different forms of grouping: how the
primitive elements (stations, shots, fix points, sections etc.) were
structured. So, even though we can now write in a common alphabet
(e.g. XML), we still need to find a common vocabulary and grammar.

In this process we should not favor one particular dialect: the
'natural' format for one culture may be very clumsy for the other.

Calling another approach 'impossible' or 'bad software engineering'
demonstrates some lack of cross-cultural understanding :-)
Particularly if judgments are based on hearsay. Some of the
differences may be considerable, but it does not help to exaggerate
them by believing in myths. I have to admit that part of it is my
fault, for not having translated the Toporobot documentation into
English and thus allowing myths to grow. I will change that. Soon.
In my next
postings, I will try to explain the rationale behind 'series'. You
will see that 'series' and 'surveys' can co-exist peacefully :-)

The statement that Toporobot or CaveRender cannot handle
station-labels is definitely false. You can enter data in CaveRender
the way you are used to, but also in the 'Toporobot' way. It would
not have been possible to import data from Lechuguilla, Flintridge...
if Toporobot could indeed not handle 'station-labels', 'surveys' or
hierarchies as offered by Survex or Smaps. Such hierarchies just do
not correspond to the native organization of Toporobot and need
additional structuring into series. This structuring can be done
automatically by Toporobot, but is typically done semi-automatically
to improve the results. The users gradually include marked-up
comments in the original data to steer the structuring. Already
during this process, the data can be used in the original program as
well as in Toporobot and data can be exchanged back and forth.

This is not new. We have been doing this for many years, for instance
between Compass and Toporobot. In 1993 I introduced an exchange
format to facilitate such conversions. It was first nicknamed 'NSS'
(no sorted series) and later named TCD (the common denominator, or
toporobot cave data) format. This allowed structuring with minimal
mark-up and effort. All my Perl scripts translate foreign formats to
and from TCD. As all the structuring is done in Toporobot, the
scripts are not complex and quite small (a few hundred lines for
CaveRender, CfH, Compass etc.).

One helpful design decision in TCD was to unfold the hierarchies into
a flat file. While this may sound strange at first, it was a
necessity, because there is not one but many ways in which existing
cave data is ordered or hierarchically structured, e.g. by

time (discovery, surveying, editing chronology)
documentation (line, page, book, library)
station-label (name-space; naming conventions)
organization (surveyor, sketcher, editor, team, club, federation)
space (part, cave, cave system, karstological, hydrological, political regions)
maps (map-sheet, atlas)
morphology (passages, shafts, halls)
model units (series, polygons, polyhedral-meshes, NURBS)
toponymy (location-names)
versions (states, re-surveys, multiple measurements, corrections)
survey-methods (technique, instruments, units, error-estimates)
processing-instructions (loop closing order or hierarchy, model building hints)

And of course the geometrical data can be referenced by an unlimited
number of applications:
maps, models, trip-reports, bibliography, scientific studies (e.g.
phreatic passages, passages in the same strata, passages developed
during the same phase, at the same level, or with similar
cross-sections; where specimens were found and so on...)

Note: this list is not in any particular order, nor does it imply a
preferred hierarchy. No hierarchy is naturally privileged.

As a consequence, the data is grouped in multiple, intertwining
hierarchies. Attempts to represent this in one tree must lead to
complex and confusing solutions.

It seems more elegant to describe this in a relational model by
defining relatively simple primitives (stations, shots, sections,
series, trips, methods, folders, etc.) and relate and group them by
references or associations.

It is quite straightforward to flatten multiple hierarchies. The
principal hierarchies are represented by direct references to parent
objects (e.g. father, mother etc. as in a genealogical family graph)
and all ad-hoc hierarchies by associating objects to groups through
attribute tables. The original or preferred hierarchy (or
hierarchies), shown as a nested tree, can be generated when needed
with equal ease.
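
A minimal sketch of this flattening, in Python. It is only an
illustration: the row layout, the ids and the attribute names are
assumptions made up for the example and are not the actual TCD format.
Each object is one flat row with a direct reference to its parent (the
principal hierarchy), and a separate attribute table associates
objects with ad-hoc groups.

```python
from collections import defaultdict

# Flat rows: every object has an id and a direct reference to its parent.
# All ids and kinds here are illustrative, not taken from TCD.
objects = [
    {"id": "sys1", "kind": "cave-system", "parent": None},
    {"id": "cave1", "kind": "cave", "parent": "sys1"},
    {"id": "ser1", "kind": "series", "parent": "cave1"},
    {"id": "ser2", "kind": "series", "parent": "cave1"},
    {"id": "st1", "kind": "station", "parent": "ser1"},
    {"id": "st2", "kind": "station", "parent": "ser1"},
    {"id": "st3", "kind": "station", "parent": "ser2"},
]

# Attribute table for ad-hoc hierarchies: (object id, attribute, group value).
memberships = [
    ("st1", "trip", "1995-07-01"),
    ("st2", "trip", "1995-07-01"),
    ("st3", "trip", "1996-02-14"),
    ("st1", "morphology", "shaft"),
    ("st3", "morphology", "shaft"),
]


def nested_tree(rows, root=None):
    """Rebuild the principal hierarchy as nested dicts from parent references."""
    children = defaultdict(list)
    for row in rows:
        children[row["parent"]].append(row["id"])

    def build(node_id):
        return {child: build(child) for child in children[node_id]}

    return build(root)


def group_by(table, attribute):
    """Collect the objects associated with each group of one ad-hoc hierarchy."""
    groups = defaultdict(list)
    for obj_id, attr, value in table:
        if attr == attribute:
            groups[value].append(obj_id)
    return dict(groups)


tree = nested_tree(objects)
trips = group_by(memberships, "trip")
shapes = group_by(memberships, "morphology")
```

The same stations appear in the spatial tree, in the trip grouping and
in the morphology grouping without any of the three being nested
inside another, and the flat rows would fit in a spreadsheet or
database table as they are.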

Keeping a relational model for TCD proved to be highly beneficial. It
allowed data to be stored in any database or spreadsheet. Of course,
TCD was not meant to be edited manually; it was conceived to be used
for long-term archiving and as an intermediate file to facilitate
conversion.

In 1998 I started experiments re-formulating the TCD content in XML.
This made the format less cryptic and more extensible. I was able to
base this on a much earlier attempt: SGML was the native Toporobot
format in 1982. When I switched to home computers, I had to give
up this approach due to lack of appropriate tools and because users
were demanding a format more accessible by spreadsheets and
databases. But of course I gladly come back to it now :-)

Toporobot exports various XperiMentaL formats. All contain the same
content as TCD but present it in different views and styles. All
formats are still fine-tuned for Toporobot users. They are not suited
for general use, as TCD and the derived CDX only contain elements
that can currently be handled by Toporobot. But I plan to extend CDX
considerably and offer infrastructure in Toporobot in order to
change that. Everything that Toporobot cannot handle will be
considered a meta-comment. It will be ignored, but stored and
exported within the proper context.
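
One way such meta-comment handling could work is sketched below in
Python with the standard xml.etree.ElementTree module. This is an
assumption about the general technique, not a description of
Toporobot or CDX: the element names (cave, series, station,
waterSample) and the set of known tags are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of element names the importing program understands.
KNOWN_TAGS = {"cave", "series", "station"}


def split_known(root):
    """Return (handled, meta): understood elements vs. everything else.

    An element nested inside an unknown element is also treated as meta,
    because its context is not understood either.
    """
    handled, meta = [], []

    def walk(elem, inside_meta):
        is_meta = inside_meta or elem.tag not in KNOWN_TAGS
        (meta if is_meta else handled).append(elem)
        for child in elem:
            walk(child, is_meta)

    walk(root, False)
    return handled, meta


source = (
    '<cave id="c1">'
    '<series id="s1">'
    '<station id="st1"/>'
    '<waterSample ph="7.9"/>'
    '</series>'
    '</cave>'
)
root = ET.fromstring(source)
handled, meta = split_known(root)

# Processing only touches the handled elements ...
for elem in handled:
    if elem.tag == "station":
        elem.set("checked", "yes")

# ... but export serializes the whole tree, so the unknown element is
# written back at the position where it was found.
exported = ET.tostring(root, encoding="unicode")
```

Because the unknown element never leaves the tree, it is exported
inside the same series it arrived in, which is what keeps it "within
the proper context".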

As Andreas reported, I have been designing a version of CDX that
would be the recommended format to exchange data with Toporobot and
be the native Toporobot format for archiving. It will also be the
best way to describe its data model. I should soon be able to send
examples of XML, DTD, and Schema to interested parties. You will see
that I currently produce two styles. One is closer to the user and is
based on station-labels. The other is more abstract and refers to
objects by their id (e.g. station names are merely attributes). Not
only stations but each object has a required id. Both representations
describe the same data model and can easily be transformed into each
other. A user will probably prefer the first; a program can import
the second flavor more easily.
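
To illustrate how two such styles can describe the same data, here is
a small Python sketch that converts between a label-based and an
id-based form. The XML vocabulary (survey, station, shot, from, to,
length) and the generated ids are invented for the example; they are
not the CDX element names.

```python
import xml.etree.ElementTree as ET


def labels_to_ids(label_root):
    """Label style -> id style: stations become objects with a required id."""
    out = ET.Element("survey")
    ids = {}  # station label -> generated id
    shots = label_root.findall("shot")
    for shot in shots:
        for end in ("from", "to"):
            label = shot.get(end)
            if label not in ids:
                ids[label] = f"s{len(ids) + 1}"
                # The station name is merely an attribute of the object.
                ET.SubElement(out, "station", {"id": ids[label], "name": label})
    for n, shot in enumerate(shots, start=1):
        ET.SubElement(out, "shot", {
            "id": f"h{n}",
            "from": ids[shot.get("from")],
            "to": ids[shot.get("to")],
            "length": shot.get("length"),
        })
    return out


def ids_to_labels(id_root):
    """Id style -> label style: resolve station ids back to their names."""
    names = {st.get("id"): st.get("name") for st in id_root.findall("station")}
    out = ET.Element("survey")
    for shot in id_root.findall("shot"):
        ET.SubElement(out, "shot", {
            "from": names[shot.get("from")],
            "to": names[shot.get("to")],
            "length": shot.get("length"),
        })
    return out


label_style = ET.fromstring(
    '<survey>'
    '<shot from="A1" to="A2" length="12.5"/>'
    '<shot from="A2" to="A3" length="8.0"/>'
    '</survey>'
)
id_style = labels_to_ids(label_style)
round_trip = ids_to_labels(id_style)
```

The round trip returns the original shots unchanged, which is the
sense in which both flavors carry the same data model.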

I have deliberately refrained from using too many of the features
offered by XML. Many techniques, e.g. XSL and XLL, are still rapidly
evolving. Depending on them fundamentally would currently require
constant implementation changes. And my current goal is not a format
for eternity. I am prepared to totally redesign instead of extending
it when I have gained enough experience. While the format may be
temporary, I need it now to have the Toporobot data in a stable and
consistent format. This will simplify transforming it to any current
or upcoming format.

Similarly, it might be reasonable to develop different XML formats in
parallel first, one for each major 'culture'. In this process we can
learn a lot from each other about the data models and also help each
other master the emerging new technologies. Each initiative would
be targeted towards particular local needs and prerequisites and be
based on at least one existing program. Groups would influence each
other from the beginning and would borrow features or styles
whenever applicable. When the local formats mature, become more and
more comprehensive, and have comparable content, we can start merging
them into one standard. This should be possible, as the different
approaches are developed in collaboration and to complement each
other.

I believe strongly in CaveXML as an important long-term goal, but
think it would be more realistic to start with an ASCI (American
Speleologists Cave-data Interchange) and some ISO (Interregional
Speleological Organisations) Roman, Central European etc. formats.
Later we can attempt the Unicode (universally needed international
cavers organisation data exchange) format.

cheers

Martin



This archive was generated by hypermail 2b30 : Thu Mar 01 2001 - 18:00:01 CET