Database linking issues
(Draft. 20 October 2006)
I have set out below some of my current thoughts, for what they are worth.
Particular attention is given to numerical identifiers as these are critical to
serious database design and management. I have concentrated on the technical and
conceptual issues. Obviously there are important practical issues of ownership,
management and copyright that require fuller consideration than is given here.
But on the assumption that we are all interested in seeing this body of work
survive and become more rather than less useful, now and in the future, “where
there is the will, there is the way”.
Firstly, it is fairly amazing, even by the standards of five years ago, the scale of
computing that you can now do in your own living room with basic entry-level
software and computer hardware you can buy for the price of a barely functional
second hand car. I myself only worked out a few years ago that my own computer
was just as happy with 60,000 records as 6,000 and could handle 300,000. It’s
also fairly amazing how much you can share on an internet site that comes as
part of your standard monthly fee – I have the equivalent of a quite
substantial book already on this site and am still using only about a fifth of
my web site space allocation and all for around US 50 cents per day for the
website and unlimited internet access.
There are, however, limits to what any individual or small team of collaborators
can do. Many of the key individuals developing and maintaining ship databases are
in their 60’s and above. Nobody can work at full capacity indefinitely.
Moreover, younger generations are less likely to be as patient about working
through the minutiae involved in fine-tuning and maintaining major databases.
Nor should they be, as there is a lot of inefficiency in existing access to
maritime information compared with the impact that the massed genealogists of
the world have had on access to historical personal information (even though
much of what I discuss here is also applicable to their databases and
transcriptions in other formats).
If we care about our databases being useful in the long-term we should attempt
within the next five to ten years to get them to a stage where they can no
longer ever be critically dependent on any one or two individuals or single
institution and can be used more easily and flexibly including for new
applications that may evolve and minority current applications that should
become more important.
For maritime research to get the full benefit of the work that has gone into
databases already and that which is likely to be done over the next five to ten
years it is necessary to develop ways that databases can be used in conjunction
to extend each other, to reinforce each other’s strengths and compensate for
each other’s weaknesses and thereby to achieve collectively what none can ever
provide alone. There is much more that can be done with existing databases
already as they are now but ultimately more will be achievable with some
movement towards common conventions and approaches.
Integration of some databases may well be a productive line of development to
explore but integration is not necessarily essential at all and is quite possibly
more of a fortuitous long term by-product than a critical or even particularly
desirable immediate goal or specific objective. If all historical ship
information were gathered into one database there would be horrendous resource
and maintenance implications and it would be extremely vulnerable.
The key thing is to develop ways for databases known – more or less as they are –
and others yet unknown to “talk to” each other and to enable information to be
read from a number together, just as one consults several books. This “consultation”
process can undoubtedly be mechanised in various ways with great advantages in
efficiency and accuracy – there is no merit in doing things the hard way
unnecessarily and only a percentage of users will be familiar with all the
possibly relevant resources. I am not familiar with the relevant software
myself but I gather that the necessary programming is quite basic. One form
this could take is something that works like a “google” internet search
directed to one or more “cocktails” of databases that would efficiently lead
people to screens of multiple online sources of information about a particular
ship (rather than undifferentiated information about every ship of the name) in
the original form with all site links visible as if the user had known of the
database and searched it individually (as distinct from “harvesting”
information in a piratical manner).
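As a rough illustration only, the “consultation” idea can be sketched in a few
lines of Python. Everything here is invented for the purpose: the three source
names, the ship entries, the official numbers and the example.org links are
hypothetical, and a real version would query live sites rather than lists held
in memory. The point is simply that searching by an identifier rather than a
name returns one particular ship, with each hit pointing back to the original
source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    official_number: str
    url: str  # link back to the record on the owning site


# Hypothetical "cocktail" of three sources; all names, numbers and URLs are invented.
SOURCES = {
    "register_index": [
        Entry("Hero", "32471", "https://example.org/register/32471"),
        Entry("Hero", "40122", "https://example.org/register/40122"),
    ],
    "wreck_list": [
        Entry("Hero", "32471", "https://example.org/wrecks/hero-1"),
    ],
    "voyage_log": [
        Entry("Hero", "40122", "https://example.org/voyages/77"),
        Entry("Active", "51009", "https://example.org/voyages/78"),
    ],
}


def consult(sources, official_number):
    """Return (source name, link) pairs for one particular ship.

    Only the link is returned, so the user is led to the original site
    rather than being handed a copy of its data.
    """
    hits = []
    for source_name, entries in sources.items():
        for entry in entries:
            if entry.official_number == official_number:
                hits.append((source_name, entry.url))
    return hits


for source_name, url in consult(SOURCES, "32471"):
    print(source_name, url)
```

A search on the name “Hero” alone would have returned four entries for two
different ships; the identifier separates them.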
While we need to be alert for any potential for trespass on owners’ rights I can see
no reason for objection to operating within the limits of what one may
legitimately do with paper and pencil or otherwise openly and honestly
negotiate. The technology, of course, should also be able to be developed to
give as well as receive under controlled personal arrangements.
Databases “talking to” each other are already possible within limits despite not having
been designed with that in mind. Obviously, the process could be made more
reliable and efficient with common practices and nomenclature. I am exploring
these aspects within my own index development and will expand upon them later.
In the meantime feel welcome to pose questions and discuss possibilities.
Fully exploiting the existing identifiers already embodied in primary and other
non-computerised
records is a quite separate issue conceptually from developing an external
system of standard unique identifiers for each ship as envisaged by the Global
Ship Number project (http://gsn.ncl.ac.uk/index.jsp)
or the hypothetical alternative of a mega-database extending its scope to all
vessels. An individual vessel will typically have several identifiers of
various types in various records that apply to it for some purpose at different
stages of its life. These will all have some national or local if not
international significance and are potentially useful to have included in one
data file or another and to use to track a particular ship through archived
official records.
It is the connections between these identifiers that provide the potential means
to connect up several files that contain different information for different
purposes but which share various of the identifiers in common. A mighty
international collective effort might eventually create the ultimate master
index of all such linkages but part of the necessary raw material is already
contained in any databases that provide some alternative identifiers in
parallel fields.
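A minimal sketch of this kind of connection, assuming two files held as simple
lists of records, might look as follows. The field names (official_number,
signal_code, port_number), the ships and all the identifier values are
hypothetical and chosen only to illustrate the principle: two records are
linked when they share at least one identifier of the same type.

```python
# Hypothetical identifier fields; a real file would use its own column names.
ID_FIELDS = ("official_number", "signal_code", "port_number")

# Two invented files holding different information about some of the same ships.
FILE_A = [
    {"name": "Hero", "official_number": "32471", "signal_code": "QBCD"},
    {"name": "Active", "signal_code": "JKLM"},
]
FILE_B = [
    {"name": "Hero", "signal_code": "QBCD", "tonnage": 420},
    {"name": "Active", "official_number": "51009", "tonnage": 310},
]


def identifier_set(record):
    """Collect the (field, value) pairs for whichever identifiers a record has."""
    return {(field, record[field]) for field in ID_FIELDS if record.get(field)}


def link(file_a, file_b):
    """Return (index in file_a, index in file_b) for records sharing an identifier."""
    index = {}
    for b_idx, record in enumerate(file_b):
        for key in identifier_set(record):
            index.setdefault(key, []).append(b_idx)

    links = []
    for a_idx, record in enumerate(file_a):
        matches = {b for key in identifier_set(record) for b in index.get(key, [])}
        for b_idx in sorted(matches):
            links.append((a_idx, b_idx))
    return links


print(link(FILE_A, FILE_B))
```

In this invented case the two “Hero” records are linked through the signal
code, while the two “Active” records are not linked at all because one file
holds only the signal code and the other only the official number. That is
exactly the gap that additional parallel identifier fields would close.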
The function of a centralised master index is potentially achievable in significant
part simply through the alternative identifier fields of major databases being
able to “talk” to each other in similar manner to a network within an
institution. It is only through such databases increasing their content of
alternative identifiers that any such mega index could be developed in any
case. It may be unnecessary to do more than to expand this aspect within
co-operating databases which in principle could be linked to operate as the
decentralised equivalent of a centralised index. There would be obvious risks
and disadvantages but also the attractive practical advantages of not being
critically dependent on any one individual or institution and of needing no
centralised administration or maintenance beyond simple agreed conventions.
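The decentralised equivalent can be sketched in the same hedged spirit. The
sketch below assumes records drawn from three co-operating databases (called
simply A, B and C), each carrying whatever identifiers it happens to hold; the
sources and the identifier values are all invented. It uses a standard
union-find grouping, which is my own choice of technique for illustration: any
two records sharing an identifier are placed in the same group, so that a
chain of shared identifiers connects records that have nothing directly in
common.

```python
# Invented records from three hypothetical databases.
RECORDS = [
    {"source": "A", "ids": {"official_number": "32471", "signal_code": "QBCD"}},
    {"source": "B", "ids": {"signal_code": "QBCD", "port_number": "P12"}},
    {"source": "C", "ids": {"port_number": "P12"}},
    {"source": "C", "ids": {"port_number": "P30"}},
]


def cluster(records):
    """Group records that are connected through any chain of shared identifiers."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Keep the lower index as representative so output order is stable.
            parent[max(root_i, root_j)] = min(root_i, root_j)

    first_seen = {}
    for i, record in enumerate(records):
        for key in record["ids"].items():
            if key in first_seen:
                union(i, first_seen[key])
            else:
                first_seen[key] = i

    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record["source"])
    return list(groups.values())


print(cluster(RECORDS))
```

Here the record in A and the first record in C share no identifier directly,
but both are tied to the record in B, so all three fall into one group without
any central index having been compiled.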
Everything that existing and new databases do which provides alternative
numerical identifiers in parallel fields will add to the raw material required
for eventual database linking. One should not be deterred by inability to record
all relevant identifiers for all ships. Anything that significantly adds to what
is already available is worth having. Some significant advance on this front will
help to demonstrate the practical advantages and make the case for
tackling the remaining gaps in coverage. Practical worked examples will speak
more eloquently than any amount of abstract argument, as I hope to be able to
demonstrate.
Port numbers and signal codes as identifiers
In addition to the obvious LR/IMO numbers and national official numbers, port
numbers and international signal codes will also be invaluable. These are the
principal identifiers in some data sources and may be the only ones (not
artificially created) available for some ships.
“Port numbers” are relevant in two senses of the term. Firstly, there are port
numbers that are identical in form and function to national official numbers
but which relate to a port (or province) within a nation state, for example
Italy. These have potential wider utility even though they involve more work (because
a ship may have two or three when it would otherwise have only a single
national number).
Secondly, “port numbers” in the sense of port registration records are important
and likely to
be the only available standard references for all ships of a nation. They are
especially relevant for those ships that did not survive late enough in the 19th
century to be allocated official numbers or other identifiers. They are basic
to local and regional studies in any case. They typically take the form of “
I have worked out a system for converting these identifiers into a wholly
numerical format for the purposes of my New Zealand and Australasian indexes
using modern postcodes to identify port and state with the advantage that the
regional codes are well known already from their independent existence (with
the extra advantage for acceptability that the post codes were not imposed on
Australia by a New Zealander). However, I am reworking this strategy to employ
international and city telephone codes as this approach can be applied world
wide.
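Purely to illustrate the telephone-code idea, and not as a statement of the
system actually under development, here is one way such a wholly numerical
identifier could be built. The assumptions are mine: that a port registration
can be reduced to a port, a year and a sequence number within that year, and
that the result is a fixed-width string of fourteen digits (three for the
country dialling code, three for the city code, four for the year, four for the
number). The dialling codes shown are the present-day ones for the three
cities; the registration itself is invented.

```python
# Present-day telephone codes: (country code, city/area code without leading zero).
DIAL_CODES = {
    "Auckland": (64, 9),
    "Wellington": (64, 4),
    "Sydney": (61, 2),
}


def encode(port, year, number):
    """Build a 14-digit identifier: CCC + AAA + YYYY + NNNN (hypothetical layout)."""
    country, area = DIAL_CODES[port]
    return f"{country:03d}{area:03d}{year:04d}{number:04d}"


def decode(code):
    """Split a 14-digit identifier back into (country, area, year, number)."""
    return int(code[0:3]), int(code[3:6]), int(code[6:10]), int(code[10:14])


# An invented registration: number 12 of 1865 at Auckland.
code = encode("Auckland", 1865, 12)
print(code)
print(decode(code))
```

The fixed widths matter: dialling codes vary in length, so simply running the
numbers together would be ambiguous, and the result is kept as a string of
digits because a leading zero would be lost if it were stored as a number.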
I am not yet sufficiently familiar with historical European port records to assess
how what works with British and American records is applicable to them (apart
from Swedish official numbers which are highly relevant). European ships’
international signal codes cannot meet all ship identifier needs because they
do not cover all registered ships but are likely to be able to provide for
those ships in medium and long-distance trade from around 1870 onward which
would provide for an important component of the whole. See signal codes. This could be used in
conjunction with something based on port registrations for other ships to cover
European ships in a manner inherent in the original primary documentation
rather than arbitrarily or externally imposed.
“Finishing off” the databases
Many apparently finished databases have significant omissions of information
that is readily available in primary records and may well have no procedure in
place for revising and updating them.
A common weakness I have identified in a number of cases concerns ships
allocated official numbers later in life, whose earlier records include no
reference to the official number. Not unnaturally, the official numbers tend to
be omitted from the database in these cases but they do need to be “retrofitted”
if all the relevant records for a ship are to be combined and avoidable
duplications prevented. This is a particular issue for British Empire
ships of the 1840’s and 1850’s – and some even back into the 18th
century – and for American ships of the 1850’s and 1860’s – both important
periods in the maritime history of these nations. It is a particular issue for
Canadian vessels many of which were sold to owners in the
Statistical considerations
It is desirable that decisions made about what information fields are included
in database indexes also have regard to potential statistical applications.
These are significant in their own right and, for practical reasons, can only be
pursued in conjunction with general indexation. They are also relevant to many
strategic questions of database development and assessment, for
example, as the means to assess how comprehensively a database has covered a
target nation or ship type or compares in coverage with another overlapping
database. Comprehensive records of ships of 100 tons and above would provide
the means to construct earlier statistical comparisons for the invaluable
series of ship statistics compiled by Lloyd’s Register from 1890 onward.
I will elaborate later as I explore this aspect further but in the meantime feel
welcome to correspond on the matter.
The following all have useful applications in establishing database coverage,
working out the potential overlap between databases, compiling basic summary
statistics and tracking ships through archived ship registration records: year
of construction, country of construction (and usefully Canadian province and
Australian colony/state of construction), ship type, hull material, country of
original registration, original name, original name within a particular
jurisdiction, signal code (together with nation) of European ships post-1870,
and of course any available numeric and alphanumeric identifiers (including
port numbers in both senses).
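As a small hedged example of what fields like these make possible, the sketch
below takes two invented databases, each recording an official number and a
year of construction, and works out how far they overlap and how their entries
are spread by decade. The field names and all the data are hypothetical.

```python
from collections import Counter

# Two invented databases with a shared identifier field and a year of construction.
DB_A = [
    {"official_number": "32471", "year_built": 1854},
    {"official_number": "40122", "year_built": 1861},
    {"official_number": "51009", "year_built": 1866},
]
DB_B = [
    {"official_number": "40122", "year_built": 1861},
    {"official_number": "60230", "year_built": 1872},
]


def overlap(db_a, db_b, field="official_number"):
    """Count ships found only in one database, only in the other, or in both."""
    ids_a = {r[field] for r in db_a if r.get(field)}
    ids_b = {r[field] for r in db_b if r.get(field)}
    return {
        "only_a": len(ids_a - ids_b),
        "both": len(ids_a & ids_b),
        "only_b": len(ids_b - ids_a),
    }


def by_decade(db):
    """Tally records by decade of construction, skipping records with no year."""
    return Counter((r["year_built"] // 10) * 10 for r in db if r.get("year_built"))


print(overlap(DB_A, DB_B))
print(sorted(by_decade(DB_A).items()))
```

Set against an independent total for a nation or ship type, decade tallies of
this kind would give a first indication of how comprehensively a database has
covered its target.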
Major statistical analyses will require tailor-made databases or “add-ons” to be
used in conjunction with existing ones. However, if any of the above items of
information can readily be incorporated within the resource limits of a
current or planned project, they will serve database management and development
purposes. They will also help to develop the resources available to provide
contexts for specific individual ships and categories of ship. This is a
necessary line of development in the understanding of maritime history that has
barely been touched in any scientifically systematic way but which has great
potential for advancing knowledge and stimulating new interest in maritime
history.
and in the long run …
“In the long run, we are all dead” as the economists quip.
Eventually all databases must be taken over by someone else or some institution,
or managed by someone else within the same institution, or else die with the
owner or manager.
All likely institutions to which personal databases could be donated are under
extreme pressure, both in the amount and type of their staff resources and from
owners and management to get feet through doors, in a more or less inevitable
trend towards “Disneyland-by-the-sea”. In these circumstances primary and
qualitative research cannot receive their full due in the absence of
lowest-common-denominator appeal, no matter how good the intentions or the
recognition of their strategic significance. There are existing classic examples
of work effectively blocked from
completion or even any practical use as a result of donation to an institution
that lacks the number or type of staff resources or will (or all three) to
complete or maintain it. The risk is endemic and not necessarily through any
fault or limitation of the institution management who must necessarily obey the
dictates of owners and reflect interests and trends in the wider society.
The logical response is to place databases in multiple institutions (and/or with
other individuals who can meet contractual obligations) on terms that encourage
them to use them to extend their own purposes but which do not permit them to
control or limit the further use and development by others of the original. A
logical and reasonable formula would be that they can do what they like with a donated
database on condition that they make the original available at actual
reasonable cost (i.e. not as a “cash cow”) to anyone who wants it, indefinitely
or for a specified period. Institutions could reasonably
be expected to agree to that if it is sufficiently useful to their independent
purposes. If they have no substantial related independent purposes
then perhaps that is a warning they are not the right institution.
An additional or alternative possible answer is to release key data fields into
the public domain, selecting those which are most important for building up a
universal record of ships of the last 200 years or so against which partial,
national and specialist databases can be at least assessed and the gaps
requiring further attention identified. A variety of proprietary products and indexes (either
proprietary or public domain) of where to find additional information could
then be piggybacked as “add-ons” onto such a public domain universal record. I
am working through the practicalities of this strategy during my own indexation
projects. If the public service element is insufficient inducement for others it
could even be developed as a joint commercial strategy even though hardly the
most obviously profitable enterprise. The internet is replete with successful
examples of giving away something substantial and valuable free and selling
extensions to it - classic examples are internet virus control software given
away to create a market for the “bells and whistles” version and Adobe giving
away free the software to read pdf files to create a
market for their file construction software.
Publication under copyright of an electronic database is possible within legal
parameters that obligate the publisher to deposit publications with a national
library and is thereby a stratagem for ensuring survival of a database where it
can be used in at least one place. A small book in a very limited edition with a
database on CD-ROM in a pocket in the back would qualify but publication with
some accompanying text on CD-ROM alone should itself qualify. I understand that
several dozen ordinary books have already been published locally on CD-ROM and
that the National Library has already set up the means to take account of the
limited life of CD-ROMs as a storage medium. Something similar must be taking
place in other countries.
To contact me email jloweresearch@ihug.co.nz