Database linking issues

(Draft. 20 October 2006)

 

 

I have set out below some of my current thoughts for what they are worth. Particular attention is given to numerical identifiers as these are critical to serious database design and management. I have concentrated on the technical and conceptual issues. Obviously there are important practical issues of ownership, management and copyright that require fuller consideration than is given here. However, on the assumption that we are all interested in seeing this body of work survive and become more rather than less useful now and in the future, “where there is the will, there is the way”.

 

 

Firstly, the scale of computing that you can now do in your own living room, with basic entry-level software and computer hardware you can buy for the price of a barely functional second-hand car, is fairly amazing even by the standards of five years ago. I myself only worked out a few years ago that my own computer was just as happy with 60,000 records as 6,000 and could handle 300,000. It is also fairly amazing how much you can share on an internet site that comes as part of your standard monthly fee. I have the equivalent of a quite substantial book already on this site and am still using only about a fifth of my web site space allocation, all for around US 50 cents per day for the website and unlimited internet access.

 

There are, however, limits to what any individual or small team of collaborators can do. Many of the key individuals developing and maintaining ship databases are in their 60s and above. Nobody can work at full capacity indefinitely. Moreover, younger generations are less likely to be as patient about working through the minutiae involved in fine-tuning and maintaining major databases. Nor should they be: existing access to maritime information is very inefficient compared with what the massed genealogists of the world have achieved for access to historical personal information (even though much of what I discuss here is also applicable to their databases and transcriptions in other formats).

 

If we care about our databases being useful in the long-term we should attempt within the next five to ten years to get them to a stage where they can no longer ever be critically dependent on any one or two individuals or single institution and can be used more easily and flexibly including for new applications that may evolve and minority current applications that should become more important.

 

For maritime research to get the full benefit of the work that has gone into databases already and that which is likely to be done over the next five to ten years it is necessary to develop ways that databases can be used in conjunction to extend each other, to reinforce each other’s strengths and compensate for each other’s weaknesses and thereby to achieve collectively what none can ever provide alone. There is much more that can be done with existing databases already as they are now but ultimately more will be achievable with some movement towards common conventions and approaches.

 

Integration of some databases may well be a productive line of development to explore, but integration is not essential and is quite possibly more of a fortuitous long-term by-product than a critical or even particularly desirable immediate goal. If all historical ship information were gathered into one database there would be horrendous resource and maintenance implications and it would be extremely vulnerable.

 

The key thing is to develop ways for databases known – more or less as they are – and others yet unknown to “talk to” each other and to enable information to be read from a number together, just as one consults several books. This “consultation” process can undoubtedly be mechanised in various ways with great advantages in efficiency and accuracy – there is no merit in doing things the hard way unnecessarily and only a percentage of users will be familiar with all the possibly relevant resources. I am not familiar with the relevant software myself but I gather that the necessary programming is quite basic. One form this could take is something that works like a “google” internet search directed to one or more “cocktails” of databases that would efficiently lead people to screens of multiple online sources of information about a particular ship (rather than undifferentiated information about every ship of the name) in the original form with all site links visible as if the user had known of the database and searched it individually (as distinct from “harvesting” information in a piratical manner).

 

While we need to be alert for any potential for trespass on owners’ rights I can see no reason for objection to operating within the limits of what one may legitimately do with paper and pencil or otherwise openly and honestly negotiate. The technology, of course, should also be able to be developed to give as well as receive under controlled personal arrangements.

 

Databases “talking to” each other are already possible within limits despite not having been designed with that in mind. Obviously, the process could be made more reliable and efficient with common practices and nomenclature. I am exploring these aspects within my own index development and will expand upon them later. In the meantime feel welcome to pose questions and discuss possibilities.

 

Fully exploiting the existing identifiers already embodied in primary and other non-computerised records is a quite separate issue conceptually from developing an external system of standard unique identifiers for each ship as envisaged by the Global Ship Number project (http://gsn.ncl.ac.uk/index.jsp) or the hypothetical alternative of a mega-database extending its scope to all vessels. An individual vessel will typically have several identifiers of various types in various records that apply to it for some purpose at different stages of its life. These will all have some national or local if not international significance and are potentially useful to have included in one data file or another and to use to track a particular ship through archived official records.

 

It is the connections between these identifiers that provide the potential means to connect up several files that contain different information for different purposes but which share various of the identifiers in common. A mighty international collective effort might eventually create the ultimate master index of all such linkages but part of the necessary raw material is already contained in any databases that provide some alternative identifiers in parallel fields. MIRAMAR and GSN already have three and Ted Finch’s official number index already provides a partial translator for British and American official numbers. My planned indexes will extend the principle.

 

The function of a centralised master index is potentially achievable in significant part simply through the alternative identifier fields of major databases being able to “talk” to each other in similar manner to a network within an institution. It is only through such databases increasing their content of alternative identifiers that any such mega index could be developed in any case. It may be unnecessary to do more than to expand this aspect within co-operating databases which in principle could be linked to operate as the decentralised equivalent of a centralised index. There would be obvious risks and disadvantages but also the attractive practical advantages of not being critically dependent on any one individual or institution and of needing no centralised administration or maintenance beyond simple agreed conventions.
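As a rough illustration of how parallel identifier fields could act as a decentralised index, the following is a minimal sketch in Python. It assumes each database can export its records as a source name plus a set of alternative identifiers. The field names (`british_on`, `us_on`, `lr_no`, `signal`) and all the values are hypothetical, invented for the example, and are not taken from MIRAMAR, GSN or any other real database. Records are grouped whenever they share any one identifier, directly or through a chain of other records.

```python
from collections import defaultdict

def link_records(records):
    """Group record sources that share at least one (identifier type, value) pair,
    either directly or through a chain of intermediate records."""
    parent = {}  # union-find forest over (type, value) identifier pairs

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    def identifiers(rec):
        # Ignore empty fields: a missing identifier links nothing.
        return [(kind, value) for kind, value in rec["ids"].items() if value]

    # All identifiers on one record refer to the same ship, so join them.
    for rec in records:
        ids = identifiers(rec)
        for other in ids[1:]:
            union(ids[0], other)

    groups = defaultdict(list)
    for rec in records:
        ids = identifiers(rec)
        if ids:
            groups[find(ids[0])].append(rec["source"])
    return sorted(sorted(group) for group in groups.values())

# Hypothetical sample data: no single database holds every identifier.
records = [
    {"source": "db_a", "ids": {"british_on": "12345", "lr_no": "5000001"}},
    {"source": "db_b", "ids": {"british_on": "12345", "us_on": "98765"}},
    {"source": "db_c", "ids": {"us_on": "98765", "signal": "ABCD"}},
    {"source": "db_d", "ids": {"british_on": "54321"}},
]

linked = link_records(records)
print(linked)
```

In this sample `db_a` and `db_c` have no identifier in common, yet they end up in the same group because `db_b` carries one identifier from each. That is the sense in which each added alternative identifier field contributes raw material for linking.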

 

Everything that existing and new databases do which provides alternative numerical identifiers in parallel fields will add to the raw material required for eventual database linking. One should not be deterred by inability to record all relevant identifiers for all ships. Anything that significantly adds to what is already available is worth having. Some significant advance on this front will help to demonstrate the practical advantages and make the case for tackling the remaining gaps in coverage. Practical worked examples will speak more eloquently than any amount of abstract argument as I hope to be able to demonstrate.

 

 

Port numbers and signal codes as identifiers

 

In addition to the obvious LR/IMO numbers and national official numbers, port numbers and international signal codes will also be invaluable. These are the principal identifiers in some data sources and may be the only ones (not artificially created) available for some ships.

 

“Port numbers” are relevant in two senses of the term. Firstly, there are port numbers that are identical in form and function to national official numbers but which relate to a port (or province) within a nation state, for example Italy. These have potential wider utility even though they involve more work (because a ship may have two or three when it would otherwise have only a single national number).

 

Secondly, “port numbers” in the sense of port registration records are important and likely to be the only available standard references for all ships of a nation. They are especially relevant for those ships that did not survive late enough in the 19th century to be allocated official numbers or other identifiers. They are basic to local and regional studies in any case. They typically take the form of “Wellington 4/1854” meaning the 4th registration in Wellington in 1854. The initial registration is likely to lead to subsequent registrations expressed in the same form, generally following changes in ownership but occasionally for some other reason. The (generally handwritten) primary records themselves should provide the forward links. The earliest for a particular ship can serve as a unique identifier in the absence of an official number. Such a strategy is self-correcting because all such port numbers are unique and can only lead backward or forward to others for the same ship. If one incorrectly identifies the second registration as the first it cannot identify the wrong ship, and the error will eventually correct itself as the earlier link is found.
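The port registration form and the idea of tracing back to the earliest registration can be sketched in Python as follows. This is only an illustration under stated assumptions: the `previous` mapping (each registration pointing to the one it superseded) is a hypothetical structure, and the Lyttelton and Auckland entries are invented to give “Wellington 4/1854” a chain to follow.

```python
import re
from typing import NamedTuple

class PortReg(NamedTuple):
    port: str
    seq: int   # running number of the registration within the year
    year: int

# Matches forms such as "Wellington 4/1854" or "Port Chalmers 3/1860".
_PATTERN = re.compile(r"^\s*(.+?)\s+(\d+)\s*/\s*(\d{4})\s*$")

def parse_port_reg(text):
    """Parse 'Port N/YYYY' into a PortReg, raising ValueError if it does not fit."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a port registration: {text!r}")
    return PortReg(match.group(1), int(match.group(2)), int(match.group(3)))

def earliest_registration(reg, previous):
    """Follow 'previous registration' links back to the earliest one known."""
    seen = {reg}
    while reg in previous:
        reg = previous[reg]
        if reg in seen:  # guard against a transcription error creating a loop
            raise ValueError("cycle in registration links")
        seen.add(reg)
    return reg

# Hypothetical chain: a ship first registered at Wellington, then re-registered twice.
previous = {
    parse_port_reg("Lyttelton 7/1861"): parse_port_reg("Wellington 4/1854"),
    parse_port_reg("Auckland 12/1866"): parse_port_reg("Lyttelton 7/1861"),
}

ship_key = earliest_registration(parse_port_reg("Auckland 12/1866"), previous)
print(ship_key)
```

If the Wellington link had not yet been transcribed, the same call would return the Lyttelton registration instead. That is still a registration of the same ship, and the key moves back to Wellington as soon as the missing link is added, which is the self-correcting behaviour described above.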

 

I have worked out a system for converting these identifiers into a wholly numerical format for the purposes of my New Zealand and Australasian indexes using modern postcodes to identify port and state with the advantage that the regional codes are well known already from their independent existence (with the extra advantage for acceptability that the post codes were not imposed on Australia by a New Zealander). However, I am reworking this strategy to employ international and city telephone codes as this approach can be applied world wide.
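The exact numerical layout is not set out here, so the following Python sketch shows only one way such a conversion might work. Everything about the layout is an assumption made for illustration: fixed widths of 3 digits for the international telephone code, 4 for the city code, 4 for the year and 4 for the registration sequence, and a small hypothetical lookup table of ports whose codes should be checked before any real use.

```python
# Hypothetical lookup table: port name -> (international code, city code).
PORT_CODES = {
    "Wellington": (64, 4),
    "Auckland": (64, 9),
}

def encode_port_reg(port, seq, year):
    """Encode a port registration as a 15-digit string (assumed fixed-width layout)."""
    country, city = PORT_CODES[port]
    if not (0 < seq <= 9999 and 1000 <= year <= 9999):
        raise ValueError("sequence or year out of range for this layout")
    return f"{country:03d}{city:04d}{year:04d}{seq:04d}"

def decode_port_reg(code):
    """Reverse encode_port_reg, returning (port, seq, year)."""
    if len(code) != 15 or not code.isdigit():
        raise ValueError(f"not a 15-digit code: {code!r}")
    country, city = int(code[0:3]), int(code[3:7])
    year, seq = int(code[7:11]), int(code[11:15])
    for port, codes in PORT_CODES.items():
        if codes == (country, city):
            return port, seq, year
    raise KeyError(f"unknown port code: {country}/{city}")

encoded = encode_port_reg("Wellington", 4, 1854)
print(encoded)                   # 064000418540004
print(decode_port_reg(encoded))  # ('Wellington', 4, 1854)
```

Placing the year before the sequence number means that, within one port, the codes sort in chronological order. Fixed widths keep the code unambiguous even though telephone codes vary in length.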

 

I am not yet sufficiently familiar with historical European port records to assess how what works with British and American records is applicable to them (apart from Swedish official numbers which are highly relevant). European ships’ international signal codes cannot meet all ship identifier needs because they do not cover all registered ships but are likely to be able to provide for those ships in medium and long-distance trade from around 1870 onward which would provide for an important component of the whole. See signal codes. This could be used in conjunction with something based on port registrations for other ships to cover European ships in a manner inherent in the original primary documentation rather than arbitrarily or externally imposed.

 

“Finishing off” the databases

 

Many apparently finished databases often have significant omissions of information that is readily available in primary records and may well have no procedure in place for revising and updating them.

 

A common weakness I have identified in a number of cases is that of records for ships allocated official numbers later in life, where the particular earlier record includes no reference to the official number. Not unnaturally, the official numbers tend to be omitted from the database in these cases, but they do need to be “retrofitted” if all the relevant records for a ship are to be combined and avoidable duplications prevented. This is a particular issue for British Empire ships of the 1840’s and 1850’s – and some even back into the 18th century – and for American ships of the 1850’s and 1860’s – both important periods in the maritime history of these nations. It is a particular issue for Canadian vessels, many of which were sold to owners in the United Kingdom where they were subsequently allocated official numbers with which the Canadian authorities had no direct connection and no interest or responsibility unless the ship returned to local registration.
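The “retrofitting” step can be pictured with a short Python sketch. It assumes, purely for illustration, that records for the same ship have already been tied together by some shared key (the hypothetical `ship_key` field, which might be the earliest port registration) and that an official number, where known, sits in a hypothetical `official_no` field. All values are invented.

```python
def retrofit_official_numbers(records):
    """Return copies of the records with a later-allocated official number
    filled in on earlier records of the same ship that lack one."""
    known = {}
    for rec in records:
        if rec.get("official_no"):
            # Keep the first number seen for each ship.
            known.setdefault(rec["ship_key"], rec["official_no"])

    result = []
    for rec in records:
        updated = dict(rec)  # leave the original transcription untouched
        if not updated.get("official_no") and updated["ship_key"] in known:
            updated["official_no"] = known[updated["ship_key"]]
            updated["retrofitted"] = True  # mark as inferred, not in the source
        result.append(updated)
    return result

# Hypothetical records: the 1852 entry predates the allocation of a number.
ship_records = [
    {"ship_key": "Quebec 15/1852", "year": 1852, "official_no": None},
    {"ship_key": "Quebec 15/1852", "year": 1858, "official_no": "23456"},
    {"ship_key": "Halifax 3/1849", "year": 1849, "official_no": None},
]

for rec in retrofit_official_numbers(ship_records):
    print(rec)
```

Marking retrofitted values separately keeps a clear distinction between what the primary record says and what has been inferred from a later record.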

 

Statistical considerations

 

It is desirable that decisions made about what information fields are included in database indexes also have regard to potential statistical applications. These are significant in their own right and can only be pursued in conjunction with general indexation for practical reasons. They are also relevant to many questions of database strategy and assessment, for example, as the means to assess how comprehensively a database has covered a target nation or ship type or how it compares in coverage with another overlapping database. Comprehensive records of ships of 100 tons and above would provide the means to construct earlier statistical comparisons for the invaluable series of ship statistics compiled by Lloyd’s Register from 1890 onward.

 

I will elaborate later as I explore this aspect further but in the meantime feel welcome to correspond on the matter.

 

The following all have useful applications in establishing database coverage, working out the potential overlap between databases, compiling basic summary statistics and tracking ships through archived ship registration records: year of construction, country of construction (and usefully Canadian province and Australian colony/state of construction), ship type, hull material, country of original registration, original name, original name within a particular jurisdiction, signal code (together with nation) of European ships post-1870, and of course any available numeric and alphanumeric identifiers (including port numbers in both senses).
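As a simple illustration of how such fields support coverage and overlap checks, here is a minimal Python sketch. The field names (`official_no`, `built`) and the sample records are hypothetical. The sketch assumes two databases share one identifier field, compares them on it, and tallies one of them by decade of construction.

```python
from collections import Counter

def overlap_report(db_a, db_b, key="official_no"):
    """Compare two databases on a shared identifier field."""
    ids_a = {rec[key] for rec in db_a if rec.get(key)}
    ids_b = {rec[key] for rec in db_b if rec.get(key)}
    shared = ids_a & ids_b
    return {
        "only_a": len(ids_a - ids_b),
        "only_b": len(ids_b - ids_a),
        "shared": len(shared),
        # Share of A's identified ships that B also holds.
        "a_in_b": len(shared) / len(ids_a) if ids_a else 0.0,
    }

def ships_by_decade(db, year_field="built"):
    """Count records per decade of construction, skipping unknown years."""
    return Counter(rec[year_field] // 10 * 10 for rec in db if rec.get(year_field))

# Hypothetical sample data.
db_a = [
    {"official_no": "100", "built": 1852},
    {"official_no": "101", "built": 1857},
    {"official_no": "102", "built": 1863},
    {"official_no": None, "built": 1848},
]
db_b = [
    {"official_no": "101", "built": 1857},
    {"official_no": "102", "built": 1863},
    {"official_no": "200", "built": 1871},
]

report = overlap_report(db_a, db_b)
decades = ships_by_decade(db_a)
print(report)
print(sorted(decades.items()))
```

The record without an official number drops out of the overlap comparison altogether, which is a practical reminder of why missing identifiers limit what can be measured.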

 

Major statistical analyses will require tailor-made databases or “add-ons” to be used in conjunction with existing ones. However, if any of the above items of information can be readily incorporated within the resource limits of a current or planned project they will serve database management and development purposes. They will also help to develop the resources available to provide contexts for specific individual ships and categories of ship. This is a necessary line of development in maritime history that has barely been touched in any scientifically systematic way but which has great potential for advancing knowledge and stimulating new interest in the subject.

 

 

and in the long run …

 

“In the long run, we are all dead” as the economists quip.

 

Eventually all databases must be taken over by someone else or some institution, or managed by someone else within the same institution, or else die with the owner or manager. All likely institutions to which personal databases could be donated are under extreme pressure in terms of their amount and type of staff resources and owner and management pressures to get feet through doors in a more or less inevitable trend towards “Disneyland-by-the-sea”. In these circumstances primary and qualitative research cannot receive their full due in the absence of lowest-common-denominator appeal, no matter how good the intentions or the recognition of their strategic significance. There are existing classic examples of work effectively blocked from completion or even any practical use as a result of donation to an institution that lacks the number or type of staff resources or will (or all three) to complete or maintain it. The risk is endemic and not necessarily through any fault or limitation of the institution management who must necessarily obey the dictates of owners and reflect interests and trends in the wider society.

 

The logical response is to place databases in multiple institutions (and/or with other individuals who can meet contractual obligations) on terms that encourage them to use them to extend their own purposes but which do not permit them to control or limit the further use and development by others of the original. A logical and reasonable formula would be that they can do what they like with a donated database on condition that they make the original available at actual reasonable cost (ie not as a “cash cow”) to anyone who wants it, indefinitely or for a specified period. Institutions could reasonably be expected to agree to that if it is sufficiently useful to their independent purposes. If they have no substantial related independent purposes then perhaps that is a warning they are not the right institution.

 

An additional or alternative possible answer is to release key data fields into the public domain, selecting those which are most important for building up a universal record of ships of the last 200 years or so against which partial, national and specialist databases can be at least assessed and the gaps requiring further attention identified. A variety of proprietary products and indexes (either proprietary or public domain) of where to find additional information could then be piggybacked as “add-ons” onto such a public domain universal record. I am working through the practicalities of this strategy during my own indexation projects. If the public service element is insufficient inducement for others it could even be developed as a joint commercial strategy, even though it is hardly the most obviously profitable enterprise. The internet is replete with successful examples of giving away something substantial and valuable free and selling extensions to it. Classic examples are internet virus control software given away to create a market for the “bells and whistles” version, and Adobe giving away free the software to read pdf files to create a market for their file construction software.

 

Publication under copyright of an electronic database is possible within legal parameters that obligate the publisher to deposit publications with a national library, and is thereby a stratagem for ensuring survival of a database where it can be used in at least one place. A small book in a very limited edition with a database on CD-ROM in a pocket in the back would qualify, but publication with some accompanying text on CD-ROM alone should itself qualify. I understand that several dozen ordinary books have already been published locally on CD-ROM and that the National Library has already set up the means to take account of the limited life of CD-ROMs as a storage medium. Something similar must be taking place in other countries.

 

 

To contact me email jloweresearch@ihug.co.nz

 
