Types of data sets


From: Peter MATTHEWS (matthews_at_melbpc.org.au)
Date: Wed Jul 17 2002 - 03:41:50 CEST




At 16:50 08-07-02 -0400, Lev Bishop wrote:
>I just read the data model and it seems pretty good, general enough for
>most purposes I can think of, except that I think it doesn't make enough
>of a distinction between what I'll call "field" data and "raw" data. By
>"field" data, I mean the numbers, sketches, observations, etc, as actually
>recorded (in the notebook or whatever) in the field, whereas by "raw" data
>I mean the information which you want to feed to your network adjustment
>technique or whatever. The reason I think there should be both kinds of
>data is that it's a good idea to keep the original data (in as close to the
>original format as possible) but sometimes you need to 'fix it up' before
>processing it. For example, the sketch and your memory might strongly
>suggest that a leg has been reversed in the recording (a common
>situation), or you might decide that the explanation for a poor loop
>closure would be that the last shot of the loop actually closed to the
>next station along, or maybe some kind of automatic blunder-detection
>software flags a particular reading as dubious, or you may even just want
>to fudge the data so your streamway flows downhill through a choke... In
>the case of sketches you might want to go back and add detail from memory
>that you didn't sketch at the time (or maybe add some fictitious detail to
>make the survey look more interesting :-). In each case you really don't
>want to touch the original data as recorded in the cave.
>
>Ideally you may want to be able to store 1) scanned images from your
>notebook; 2) direct digital transcriptions of (1) (ie vector tracings of
>sketches, ascii versions of tables of numbers); 3) fixed-up versions of
>(2). Downloaded data from digital survey devices would count as (2). (1)
>and (2) are the "field" data; the "raw" data consists of (3). You want to
>be able to have multiple (3)'s associated with each (2) in case of
>different levels of fixing-up for different purposes. I guess in theory it
>might make sense to allow multiple (2)'s to associate with each (1), for
>the case that there can be disagreement over the transcription of a
>notebook (eg one person thinks that a muddy number is a '7' whereas
>another is convinced that it's a '1'), but perhaps that would be taking
>things too far.
>
>Just to be absolutely clear, I'm thinking that both (2) and (3) will have
>the same format, will store the same information (instrument readings,
>calibrations, sketches, etc) and in many cases may even be identical to
>each other. In the case that they are different, the conversion of (2) to
>(3) will necessarily involve a lot of human input, rather than automated
>conversion tools.
>
>Does that make sense?
>
>Lev

Yes, that makes a lot of sense to me. I see two issues here: (1) sorting out
the terms used to distinguish the status of a set of data, and (2) being
able to flag that status in a field, so that multiple sets of data with
differing status can be retained.

Below is a kind of summary, also including the later discussion and some of
my own responses to Lev's points, and using terms from the diagram. I have
also included a few posting refs - apologies if I left someone out. We need
to be clear which entity each of these items belongs to, because each entity
has a quite different set of fields to "describe" it, and those fields have
to end up in the right place in the final XML. So I have also put each item
into its entity context.

It looks like we have three fundamental types of measured input data which
people might need:

Type 1: the data recorded in the field, or images of that. (Shot, Fieldbook)
Type 2: a digital character-based transcription of the character-based
        field data as actually recorded, or as downloaded from an
        instrument. (Shot)
Type 3: an edited version of the Type 2 recorded data, for whatever reason.
        (could be a Shot or a Leg)

And in any given survey Segment there could be multiple sets of Type 3
created for a range of reasons, so a field or fields would be needed to
give the "why" or status for any particular set. One of these sets would be
the "final" set OK'd for processing into co-ordinates. However what was
sent for processing might be the "average" of *several* Shots, i.e. a "Leg"
in the diagram. A Leg could also be a suitable statistical fiction,
generated by a program after it had done all the loop closures (Lev 11-Jul
20:43 et al). This obviously needs further discussion, but we are jumping
ahead here, so perhaps we can leave it until we get to it, as we work
through the diagram.
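
As a rough illustration of the consolidation idea above, here is a sketch of
averaging several Shots into one Leg. The Shot and Leg classes, their field
names and units, and the choice of vector averaging are all assumptions made
for illustration; none of them come from the diagram. Averaging is done on
vectors so that bearings either side of north (say 359 and 1 degrees) do not
average to 180.

```python
import math
from dataclasses import dataclass


@dataclass
class Shot:
    distance: float     # metres (assumed unit)
    bearing: float      # degrees clockwise from north
    inclination: float  # degrees above horizontal


@dataclass
class Leg:
    distance: float
    bearing: float
    inclination: float
    n_shots: int        # how many Shots were consolidated


def to_vector(shot):
    """Convert a polar reading to (east, north, up) components."""
    b = math.radians(shot.bearing)
    i = math.radians(shot.inclination)
    horizontal = shot.distance * math.cos(i)
    return (horizontal * math.sin(b),
            horizontal * math.cos(b),
            shot.distance * math.sin(i))


def consolidate(shots):
    """Average repeated Shots of the same leg into a single Leg."""
    if not shots:
        raise ValueError("need at least one Shot")
    n = len(shots)
    vectors = [to_vector(s) for s in shots]
    east = sum(v[0] for v in vectors) / n
    north = sum(v[1] for v in vectors) / n
    up = sum(v[2] for v in vectors) / n
    distance = math.sqrt(east * east + north * north + up * up)
    bearing = math.degrees(math.atan2(east, north)) % 360.0
    inclination = math.degrees(math.atan2(up, math.hypot(east, north)))
    return Leg(distance, bearing, inclination, n)
```

A call such as consolidate([Shot(10.0, 359.0, 0.0), Shot(10.0, 1.0, 0.0)])
gives a Leg pointing very nearly due north. Whether a simple mean is the
right consolidation rule is exactly the kind of thing still to be discussed.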

The Type 1s are really the content of a Fieldbook: if images were being
used, they would probably be of a page rather than of a single Shot. The
Fieldbook entity could be stored as a list of references to images of its
pages (Lev 9-Jul-02 9:33, Richard 9-Jul 17:06), and any Shot can refer to
the relevant page via one of its fields.
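
The Fieldbook-to-Shot link described above might look something like the
following sketch. The class and field names (page_images, page) and the use
of a list index as the page reference are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Fieldbook:
    # References (file names or URLs) to scanned page images, in page order.
    page_images: List[str] = field(default_factory=list)


@dataclass
class Shot:
    from_station: str
    to_station: str
    # Index into Fieldbook.page_images; None if no scan exists for this Shot.
    page: Optional[int] = None


def page_image_for(shot, book):
    """Return the image reference for the page a Shot was recorded on."""
    if shot.page is None:
        return None
    return book.page_images[shot.page]
```

Storing the reference on the Shot keeps the images at page level, while still
letting anyone checking a dubious reading get back to the original notes.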

The resulting co-ordinates, the "processed form", are a different set of
data, namely a "Position" entity in the diagram, and they have a completely
different set of fields (Alexander 10-Jul 18:14, Ralph 10-Jul 21:38, Lev
11-Jul 20:22). You might want to store multiple Positions for the same
Station (Richard 12-Jul 6:24) for different processing runs. Anything
beyond processed form, e.g. a map, is a different entity again.
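
To make the "multiple Positions per Station" idea concrete, here is a minimal
sketch. The Position fields, the run label, and the PositionStore helper are
hypothetical names chosen for illustration, not entities from the diagram.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    run: str   # hypothetical label identifying the processing run
    x: float
    y: float
    z: float


class PositionStore:
    """Keeps every Position computed for a Station, one per processing run."""

    def __init__(self):
        self._by_station = defaultdict(list)

    def add(self, station, position):
        self._by_station[station].append(position)

    def for_station(self, station):
        """All Positions recorded for a Station, oldest first."""
        return list(self._by_station.get(station, []))

    def latest(self, station):
        """The most recently added Position, or None if there is none."""
        positions = self._by_station.get(station)
        return positions[-1] if positions else None
```

Keeping every run, rather than overwriting, means an earlier set of
co-ordinates can still be compared against a later one.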

Coming out of the discussion, here are some suggested terms for us to use
when referring to the various stages of data which we have identified.
These terms could also be the values used in a "status" field associated
with each Shot or Position.

Field data     Data and sketches recorded in the field.
Raw data       Field data when converted unchanged into character-based
               form, or downloaded from an instrument.
Edited data    Raw data which has had any kind of mistakes edited out of it,
               but no systematic instrument corrections applied.
Final data     The input data which is currently the accepted final version
               of unconsolidated Shot or other measured data, but has no
               systematic instrument corrections applied.
Leg data       The final single set of data ready for reduction processing.
               It may be the result of consolidation of several sets of
               final Shot data, and may or may not have its instrument
               corrections already applied. It may also be synthesised data
               generated from co-ordinates after loop adjustments.
Reduced data   The co-ordinate set resulting from leg data which has been
               reduced to co-ordinate form.
Adjusted data  The co-ordinate set resulting from leg data which has been
               reduced to co-ordinate form and includes loop adjustments.
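
As a sketch of how these terms could serve as values of a "status" field,
the following builds an XML element carrying one. The element names (Shot,
Leg, Position), the attribute name "status", the lower-case values, and the
mapping from status to entity are all assumptions for illustration; nothing
here is an agreed schema.

```python
import xml.etree.ElementTree as ET
from enum import Enum


class Status(Enum):
    FIELD = "field"
    RAW = "raw"
    EDITED = "edited"
    FINAL = "final"
    LEG = "leg"
    REDUCED = "reduced"
    ADJUSTED = "adjusted"


# Statuses that describe co-ordinates rather than measurements.
POSITION_STATUSES = {Status.REDUCED, Status.ADJUSTED}


def entity_for(status):
    """Return the (assumed) element name that carries a given status."""
    if status in POSITION_STATUSES:
        return "Position"
    if status is Status.LEG:
        return "Leg"
    return "Shot"


def make_element(status, **attrs):
    """Build an element tagged with its status; attribute names are illustrative."""
    elem = ET.Element(entity_for(status), status=status.value)
    for name, value in attrs.items():
        elem.set(name, str(value))
    return elem
```

For example, make_element(Status.EDITED, distance=12.3) yields a Shot element
with status="edited", and several such elements with different status values
could sit side by side for the same survey Segment.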

We'll now go through these suggested terms, one by one. So, starting with
the first one, is the definition for "Field data" acceptable?

(This is actually part of the 2nd item in my "Suggested Plan" on the
current task progress web page, namely, "We sort out any tricky non-entity
terms which we may need to use during discussions." We can go back to
considering whether that plan is acceptable after we finish the above.)

Peter



This archive was generated by hypermail 2b30 : Wed Jul 31 2002 - 23:00:00 CEST