Darwin Core Text Guide

Title: Darwin Core Text Guide
Date Issued: 2009-02-12
Date Modified: 2009-07-06
Abstract: Guidelines for implementing Darwin Core in Text files.
Contributors: Tim Robertson (GBIF), Markus Döring (GBIF), John Wieczorek (MVZ), Renato De Giovanni (CRIA), Dave Vieglais (KUNHM)
Legal: This document is governed by the standard legal, copyright, licensing provisions and disclaimers issued by the Taxonomic Databases Working Group.
Part of TDWG Standard: ***URL to DwC Standard*** will go here
Creator: Darwin Core Task Group
Identifier: http://rs.tdwg.org/dwc/2009-07-06/terms/guides/text/
Latest Version: http://rs.tdwg.org/dwc/terms/guides/text/
Replaces: http://rs.tdwg.org/dwc/2009-05-25/terms/guides/text/
Document Status: For Public Review.

1. Introduction

Audience: This document is targeted toward those who wish to use or share information based on the Darwin Core terms using text files. It provides technical details on how to construct these files and complementary metadata files that describe their content.

This document provides guidelines for formatting and sharing Darwin Core terms [TERMS] in fielded text formats, such as one or more comma separated value (CSV) files. Data conforming to the Simple Darwin Core [SIMPLEDWC] (that is, in CSV format with a first row of Darwin Core standard term names) can be shared in a single file, while a non-standard text file can be understood by using an [XML] metafile to describe its contents and formatting.

More complex structure can be shared in multiple related files. The description of content and relationships between files can be achieved using the metafile. This guideline makes recommendations for the simple case of a core file, upon which Darwin Core records are based, and extensions that are linked to records in that core file. Specifically, extension records have a many-to-one relationship with records in the core file. For example, a core file might contain specimen records, with one specimen per row in the file, while an extension file contains one or more identifications for those specimens, with one identification per row in the extension file, and with an identifier to the specimen for each identification row. This example would allow many identifications to be associated with each specimen.
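
The many-to-one relationship between extension rows and core rows can be sketched in Python as follows. This is only an illustration: the file contents, the column names, and the group_by_core_id helper are assumptions made for the example and are not defined by this guide.

```python
import csv
import io
from collections import defaultdict

# Hypothetical core file: one specimen per row, first column is the identifier.
CORE_CSV = "id,catalogNumber\ns1,123\ns2,124\n"
# Hypothetical extension file: one identification per row; the first column
# holds the identifier of the specimen in the core file.
EXT_CSV = (
    "coreId,scientificName\n"
    "s1,Buxbaumia piperi\n"
    "s1,Buxbaumia sp.\n"
    "s2,Cryptantha gypsophila\n"
)


def group_by_core_id(ext_text, core_id_index=0):
    """Group extension rows by the core identifier they reference."""
    groups = defaultdict(list)
    reader = csv.reader(io.StringIO(ext_text))
    next(reader)  # skip the header line
    for row in reader:
        groups[row[core_id_index]].append(row)
    return groups


identifications = group_by_core_id(EXT_CSV)
core_reader = csv.reader(io.StringIO(CORE_CSV))
next(core_reader)  # skip the header line
for core_row in core_reader:
    linked = identifications.get(core_row[0], [])
    print(core_row[0], "has", len(linked), "identification(s)")
```

Grouping the extension rows first means each core row can look up its identifications directly, without rescanning the extension file.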

1.1 Simple Example Metafile Content

A simple comma separated values (CSV) data file with the following content:
ID,Species,Count
123,"Cryptantha gypsophila Reveal & C.R. Broome",12
124,"Buxbaumia piperi",2 
can be described with the following Darwin Core metafile:
<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns:xs="http://www.w3.org/2001/XMLSchema"
	xsi:schemaLocation="http://rs.tdwg.org/dwc/text/ http://rs.tdwg.org/dwc/text/tdwg_dwc_text.xsd">
  <core rowType="http://rs.tdwg.org/dwc/xsd/simpledarwincore/SimpleDarwinRecord" ignoreHeaderLines="1">
    <files>
      <location>http://data.gbif.org/download/specimens.csv</location>
    </files>
    <field index="0" term="http://rs.tdwg.org/dwc/terms/catalogNumber" type="xs:integer"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName" type="xs:string"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/individualCount" type="xs:integer"/>
    <!-- A constant value has no index, but applies to all rows -->
    <field term="http://rs.tdwg.org/dwc/terms/datasetID" type="xs:string" default="urn:lsid:tim.lsid.tdwg.org:collections:1"/>
  </core>
</archive>
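
A consumer might read such a metafile along the following lines. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the column_terms helper and the shortened METAFILE string are assumptions for illustration, not part of the guide.

```python
import xml.etree.ElementTree as ET

# Namespace of the metafile elements, as declared on <archive>.
NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}

# Shortened copy of the metafile above (schema attributes omitted).
METAFILE = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core ignoreHeaderLines="1">
    <files><location>specimens.csv</location></files>
    <field index="0" term="http://rs.tdwg.org/dwc/terms/catalogNumber"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/individualCount"/>
    <field term="http://rs.tdwg.org/dwc/terms/datasetID"
           default="urn:lsid:tim.lsid.tdwg.org:collections:1"/>
  </core>
</archive>"""


def column_terms(metafile_xml):
    """Return ({column index: term}, {term: constant value}) for the core."""
    core = ET.fromstring(metafile_xml).find("dwc:core", NS)
    columns, constants = {}, {}
    for field in core.findall("dwc:field", NS):
        term = field.get("term")
        if field.get("index") is not None:
            columns[int(field.get("index"))] = term
        else:
            # A field without an index is a constant for all rows.
            constants[term] = field.get("default")
    return columns, constants


columns, constants = column_terms(METAFILE)
print(columns)
print(constants)
```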

These same data could be understood without the metafile if the first row of the CSV file used Darwin Core terms:
catalogNumber,scientificName,individualCount,datasetID
123,"Cryptantha gypsophila Reveal & C.R. Broome",12,urn:lsid:tim.lsid.tdwg.org:collections:1
124,"Buxbaumia piperi",2,urn:lsid:tim.lsid.tdwg.org:collections:1
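
Because the header row carries the term names, a file like this can be read without any metafile. The following sketch uses Python's standard csv module; the DATA string simply repeats the rows above, and the variable names are illustrative.

```python
import csv
import io

DATA = (
    'catalogNumber,scientificName,individualCount,datasetID\n'
    '123,"Cryptantha gypsophila Reveal & C.R. Broome",12,'
    'urn:lsid:tim.lsid.tdwg.org:collections:1\n'
    '124,"Buxbaumia piperi",2,urn:lsid:tim.lsid.tdwg.org:collections:1\n'
)

# DictReader uses the first row (the Darwin Core term names) as the keys.
records = list(csv.DictReader(io.StringIO(DATA)))
for record in records:
    print(record["catalogNumber"], record["scientificName"])
```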

1.2 XML versus Fielded Text

Many resources exist on the web describing the advantages of Extensible Markup Language [XML] over less structured content such as fielded text. The Darwin Core Text Guide (this document) is not promoting the use of fielded text over XML for data exchange, but rather providing recommendations for how to handle such data files when necessary.
Two scenarios that might benefit from the use of fielded text are:

2. Metafile Content

The text metafile schema [TEXTSCHEMA] provides technical details for the structure of a metafile by defining the elements and attributes required to describe the contents and relationships between text files. These elements and attributes, with descriptions and specifications for their use in a metafile, are described in the following tables.

2.1 The <archive> element

The <archive> element is the container for the list of related files (one core and zero or more extensions). The <archive> element has just one attribute, metadata.

Attributes
Attribute Description Required Default
metadata Contains a qualified Uniform Resource Locator (URL) defining the location of a metadata description of the entire archive. The format of the metadata is not prescribed, but a standardized format such as Ecological Metadata Language (EML), Federal Geographic Data Committee (FGDC), or ISO 19115 family is recommended.
Elements
Element Description
<core> An <archive> must contain exactly one <core> element, representing the data entity (the actual file and its column header mappings to Darwin Core terms) upon which records are based.
If extensions are being used, each record in the core data must have a unique identifier. The field for this identifier must be specified in an explicit <id> field in order to associate extension records with the core record.
<extension> An <archive> may define zero or more <extension> elements, each representing an individual extension entity directly related to the core. In addition to the general file attributes described below, every extension entity must have an explicit <coreId> field to relate the extension record to a row in the core entity. The extension itself does not have to have a unique ID field and many rows can point to the same core record.

2.2 The <core> or <extension> element

Attributes
Attribute Description Required Default
rowType A Uniform Resource Identifier (URI) for the term identifying the class of data represented by each row, for example, http://rs.tdwg.org/dwc/terms/Occurrence for Occurrence records or http://rs.tdwg.org/dwc/terms/Taxon for Taxon records. Additional classes may be referenced by URI and defined outside the Darwin Core specification. The row type defaults to the ambiguous SimpleDarwinRecord. For convenience the URIs for classes defined by the Darwin Core are listed below:
  • Simple Darwin Record: http://rs.tdwg.org/dwc/xsd/simpledarwincore/SimpleDarwinRecord
  • Dataset: http://rs.tdwg.org/dwc/terms/Dataset
  • Occurrence: http://rs.tdwg.org/dwc/terms/Occurrence
  • Event: http://rs.tdwg.org/dwc/terms/Event
  • Location: http://purl.org/dc/terms/Location
  • Identification: http://rs.tdwg.org/dwc/terms/Identification
  • Taxon: http://rs.tdwg.org/dwc/terms/Taxon
  • ResourceRelationship: http://rs.tdwg.org/dwc/terms/ResourceRelationship
  • MeasurementOrFact: http://rs.tdwg.org/dwc/terms/MeasurementOrFact
  Default: http://rs.tdwg.org/dwc/xsd/simpledarwincore/SimpleDarwinRecord
fieldsTerminatedBy Specifies the delimiter between fields. Typical values might be "," or "\t" for CSV or Tab files respectively. Default: \t
linesTerminatedBy Specifies the row separator character. Default: \n
fieldsEnclosedBy Specifies the character used to enclose (mark the start and end of) each field. CSV files frequently use the double quote character ("), but the default is no enclosing character. Note that a comma separated value file that has commas within the content of any field must have an enclosing character.
compression Specifies the compression used for the file. May be omitted or specified as one of:
  • GZIP: Data file is compressed as GZIP.
  • ZIP: Data file is compressed as ZIP (PKZIP, WinZip, StuffIt, etc.).
encoding Specifies the character encoding for the data file. The encoding is extremely important, but often ignored. One of:
  • UTF-8: 8-bit Unicode Transformation Format.
  • UTF-16: 16-bit Unicode Transformation Format.
  • ISO-8859-1: Commonly known as Latin-1 and a common default on systems configured for a single western European language.
  • Windows-1252: Commonly known as WinLatin and a common default of legacy versions of Microsoft Windows based operating systems.
  Default: ISO-8859-1
ignoreHeaderLines Specifies the number of lines to ignore from the beginning of the file. This can be used, for example, to skip column headings or preamble comments. Default: 0
dateFormat When verbatim dates are consistent in format, this attribute can be used to indicate the format represented. It is recommended to use the xs:date, xs:dateTime and xs:time field types wherever possible, but where verbatim dates are required, a format may be specified here. This should be considered a 'hint' for consumers. It is recommended that consumers support, at a minimum, the combinations of DD, MM and YYYY with the separators / and -. Examples:
  • DDMMYYYY: For dates of the form 21121978
  • DD-MM-YYYY: For dates of the form 21-12-1978
  • MMDDYYYY: For dates of the form 12211978
  • MM-DD-YYYY: For dates of the form 12-21-1978
  • YYYYMMDD: For dates of the form 19781221
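
One way a consumer might act on the dateFormat hint is to translate it into a pattern for Python's datetime.strptime. This is a sketch only: the to_strptime and parse_verbatim_date helpers are hypothetical names, and only the DD, MM and YYYY tokens listed above are handled.

```python
from datetime import datetime


def to_strptime(date_format):
    """Translate DD, MM and YYYY tokens into strptime directives."""
    return (
        date_format.replace("YYYY", "%Y")
        .replace("MM", "%m")
        .replace("DD", "%d")
    )


def parse_verbatim_date(value, date_format):
    """Parse a verbatim date string using the dateFormat hint."""
    return datetime.strptime(value, to_strptime(date_format)).date()


for fmt, text in [
    ("DDMMYYYY", "21121978"),
    ("MM-DD-YYYY", "12-21-1978"),
    ("YYYYMMDD", "19781221"),
]:
    print(fmt, text, parse_verbatim_date(text, fmt).isoformat())
```

Separators such as / and - pass through the translation unchanged, so a hint like DD/MM/YYYY works with the same helper.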
Elements
Element Description
<files> A <core> or <extension> element must contain one <files> element to locate the data being described.
<id> If extensions are being used, the <core> must contain an <id> element that indicates the identifier for a record.
<coreId> If extensions are being used, the <extension> element must contain a <coreId> element that indicates the column in the extension file that contains the core record identifier (the matching <id> in the core file).
<field> A <core> or <extension> element must contain one or more <field> elements, each representing a 'column' in the row.
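
The fieldsTerminatedBy, linesTerminatedBy, fieldsEnclosedBy and ignoreHeaderLines attributes described above can be mapped onto a generic delimited-text reader. The sketch below uses Python's csv module. The read_rows and unescape helpers are hypothetical, the attribute values are assumed to be passed in as a plain dictionary of strings, and splitting on the line terminator means that line breaks inside enclosed fields are not supported.

```python
import csv

# Two-character sequences as written in the metafile, and what they stand for.
ESCAPES = {"\\t": "\t", "\\n": "\n", "\\r": "\r"}


def unescape(value):
    """Turn sequences such as backslash-t into the real character."""
    for written, actual in ESCAPES.items():
        value = value.replace(written, actual)
    return value


def read_rows(text, attrs):
    """Split text into rows of fields using metafile attribute values."""
    delimiter = unescape(attrs.get("fieldsTerminatedBy", "\\t"))
    line_sep = unescape(attrs.get("linesTerminatedBy", "\\n"))
    quote = attrs.get("fieldsEnclosedBy", "")
    skip = int(attrs.get("ignoreHeaderLines", "0"))

    lines = text.split(line_sep)
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final terminator
    if quote:
        reader = csv.reader(lines, delimiter=delimiter, quotechar=quote)
    else:
        # No enclosing character: treat quote characters as ordinary data.
        reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return list(reader)[skip:]


csv_text = 'ID,Species,Count\n123,"Cryptantha gypsophila, Reveal",12\n'
csv_attrs = {"fieldsTerminatedBy": ",", "fieldsEnclosedBy": '"', "ignoreHeaderLines": "1"}
print(read_rows(csv_text, csv_attrs))
print(read_rows("a\tb\nc\td\n", {}))  # defaults: tab-delimited, no enclosure
```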

2.3 The <files> element

The files element must contain one or more <location> elements, each defining where a file resides. Each core or extension entity can be composed from one or more files. If an entity has data in more than one file, use the <location> element multiple times, once for each file that makes up the entity.

Elements
Element Description
location Specifies the location of the file being described, which may take either of the following forms:
  • A web accessible URL such as "http://www.gbif.org/data/specimen.csv" or "ftp://ftp.gbif.org/tim/specimen.txt".
  • A filepath relative to the location of the metafile such as "specimen.txt", "./specimen.txt", or "data/specimen.txt".
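
Both forms of <location> can be resolved against the address of the metafile itself. The following sketch uses urljoin from Python's standard library; the metafile address http://example.org/archive/meta.xml and the resolve_location helper are assumptions for illustration.

```python
from urllib.parse import urljoin

# Hypothetical address from which the metafile was retrieved.
METAFILE_URL = "http://example.org/archive/meta.xml"


def resolve_location(metafile_url, location):
    """Resolve a <location> value relative to the metafile address."""
    return urljoin(metafile_url, location)


for location in [
    "specimen.txt",
    "./specimen.txt",
    "data/specimen.txt",
    "ftp://ftp.gbif.org/tim/specimen.txt",
]:
    print(location, "->", resolve_location(METAFILE_URL, location))
```

Absolute URLs are returned unchanged, while relative paths are interpreted against the directory that holds the metafile.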

2.4 The <field> element

The field element is used to specify the location and content of data within a file. There must be one field element for every term being shared for the entity, whether explicitly or through the use of a default value for all rows in the file.

Attributes
Attribute Description Required Default
index Specifies the position of the column in the row. The first column has an index of 0, the second column 1, etc. If no column index is specified, then the term and the default may be used to define a constant value for all rows.
term A Uniform Resource Identifier (URI) for the term represented by this field. For example, a field containing the scientific name would have term="http://rs.tdwg.org/dwc/terms/scientificName". Terms outside of the Darwin Core specification may be used, such as those from the Dublin Core Metadata Initiative; for example, dcterms:modified would be term="http://purl.org/dc/terms/modified".
type Specifies the type of the data content in the column. This is restricted to any simple type (xs:integer, xs:nonNegativeInteger, xs:date, etc.; see guidelines for field types). Default: xs:string
default Specifies the value to use if one is not supplied for the field in a given row. If no index is supplied, the default can be used to define a constant for all rows for a field that is not in the data file.
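
The interplay of index and default can be sketched as follows. The map_row helper and the dictionary representation of a <field> are hypothetical, and the sketch assumes that an empty string counts as a value that was "not supplied".

```python
DWC = "http://rs.tdwg.org/dwc/terms/"

# Hypothetical in-memory form of three <field> elements.
FIELDS = [
    {"index": 0, "term": DWC + "catalogNumber"},
    {"index": 2, "term": DWC + "individualCount", "default": "1"},
    # No index: a constant value that applies to every row.
    {"term": DWC + "datasetID", "default": "urn:lsid:tim.lsid.tdwg.org:collections:1"},
]


def map_row(row, fields):
    """Build {term: value} for one row, applying defaults where needed."""
    record = {}
    for field in fields:
        index = field.get("index")
        value = ""
        if index is not None and index < len(row):
            value = row[index]
        # Assumption: an empty string means no value was supplied.
        if value == "" and field.get("default") is not None:
            value = field["default"]
        record[field["term"]] = value
    return record


print(map_row(["123", "Buxbaumia piperi", ""], FIELDS))
print(map_row(["124", "Buxbaumia piperi", "2"], FIELDS))
```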

3. Implementation Guide

3.1 Extension example

The following example illustrates the use of extensions. In this example there are three files in the archive, all of which are located in the same directory as the metafile. The whales.txt file acts as a core file of Taxon records. The whales.txt file is extended by two other files, types.csv and distribution.csv. The types.csv file contains records of a type specified in an external definition at http://rs.gbif.org/terms/1.0/Types and consists of Dublin Core and Darwin Core terms, while the distribution.csv file contains records of a type specified at http://rs.gbif.org/terms/1.0/Distribution and consists of Darwin Core and Dublin Core terms plus an additional term for threatStatus. Both extension files are related to the core file through their <coreId> fields, whose values match the <id> field of the core file. Presumably, this archive contains information about whale species, type specimen records for those species, and lists of countries and the threat status for those species.

<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns:xs="http://www.w3.org/2001/XMLSchema"
	xsi:schemaLocation="http://rs.tdwg.org/dwc/text/ http://rs.tdwg.org/dwc/text/tdwg_dwc_text.xsd">
	
  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" ignoreHeaderLines="1" 
   rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files>
        <location>whales.txt</location>
    </files>
    <id index="0" term="http://rs.tdwg.org/dwc/terms/taxonNameID"/>
    <field index="1" term="http://purl.org/dc/terms/modified" type="xs:dateTime"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/acceptedScientificNameID"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/higherTaxonNameID"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/basionymID"/>
  </core>
	
  <extension encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\n" fieldsEnclosedBy='"' ignoreHeaderLines="1"
   rowType="http://rs.gbif.org/terms/1.0/Types">
    <files>
        <location>types.csv</location>
    </files>
    <coreId index="0" term="http://rs.tdwg.org/dwc/terms/scientificNameID"/>
    <field index="1" term="http://purl.org/dc/terms/bibliographicCitation"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/catalogNumber"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/collectionCode"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/institutionCode"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/typeStatus"/>
  </extension>
	
  <extension encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\n" fieldsEnclosedBy='"' ignoreHeaderLines="1"
     rowType="http://rs.gbif.org/terms/1.0/Distribution">
    <files>
        <location>distribution.csv</location>
    </files>
    <coreId index="0" term="http://rs.tdwg.org/dwc/terms/scientificNameID"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/country"/>
    <field index="2" term="http://www.iucn.org/redlist/3.1/threatStatus"/>
    <field index="3" term="http://purl.org/dc/terms/source"/>
  </extension>
</archive>
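
A consumer could summarise the structure of such an archive before reading any data. The sketch below works on a shortened copy of the metafile (only the rowType, <location>, <id> and <coreId> parts are kept). The describe helper is a hypothetical name, and the check for exactly one <core> follows the rule stated in section 2.1.

```python
import xml.etree.ElementTree as ET

NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}

# Shortened copy of the example metafile: field mappings omitted.
METAFILE = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files><location>whales.txt</location></files>
    <id index="0"/>
  </core>
  <extension rowType="http://rs.gbif.org/terms/1.0/Types">
    <files><location>types.csv</location></files>
    <coreId index="0"/>
  </extension>
  <extension rowType="http://rs.gbif.org/terms/1.0/Distribution">
    <files><location>distribution.csv</location></files>
    <coreId index="0"/>
  </extension>
</archive>"""


def describe(entity, key_tag):
    """Summarise a <core> or <extension> element as a dictionary."""
    key = entity.find(f"dwc:{key_tag}", NS)
    locations = entity.findall("dwc:files/dwc:location", NS)
    return {
        "rowType": entity.get("rowType"),
        "locations": [loc.text for loc in locations],
        "keyIndex": int(key.get("index")) if key is not None else None,
    }


archive = ET.fromstring(METAFILE)
cores = archive.findall("dwc:core", NS)
if len(cores) != 1:
    raise ValueError("an archive must contain exactly one <core> element")
core = describe(cores[0], "id")
extensions = [describe(ext, "coreId") for ext in archive.findall("dwc:extension", NS)]

print(core)
for extension in extensions:
    print(extension)
```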

3.2 Field Type Guidelines

The default value for the type attribute of the <field> element is "xs:string", which should suffice as the data type for most terms. The following is a list of exceptions with recommended data types and example content:

Non string term mappings
Term | Recommended Types | Comments
http://rs.tdwg.org/dwc/terms/maximumDepthInMeters | xs:double, xs:integer | Example: 100
http://rs.tdwg.org/dwc/terms/minimumDepthInMeters | xs:double, xs:integer | Example: 50
http://rs.tdwg.org/dwc/terms/maximumElevationInMeters | xs:double, xs:integer | Example: 3700
http://rs.tdwg.org/dwc/terms/minimumElevationInMeters | xs:double, xs:integer | Example: 3600
http://rs.tdwg.org/dwc/terms/maximumDistanceAboveSurfaceInMeters | xs:double, xs:integer | Example: 10.5
http://rs.tdwg.org/dwc/terms/minimumDistanceAboveSurfaceInMeters | xs:double, xs:integer | Example: 10
http://rs.tdwg.org/dwc/terms/decimalLatitude | xs:double | Valid values between -90 and +90 inclusive, keep all precision.
http://rs.tdwg.org/dwc/terms/decimalLongitude | xs:double | Valid values between -180 and +180, keep all precision.
http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters | xs:double, xs:integer | Example: 30 (0 is not a valid value).
http://rs.tdwg.org/dwc/terms/pointRadiusSpatialFit | xs:double | Examples: 1 (perfect fit), 1.5708 (circle bounding a square).
http://rs.tdwg.org/dwc/terms/footprintSpatialFit | xs:double | Example: 0 (false representation), 1.2732 (square bounding a circle).
http://rs.tdwg.org/dwc/terms/individualCount | xs:integer | Example: 1
http://rs.tdwg.org/dwc/terms/startDayOfYear | xs:integer | Example: 1 is January 1st, 60 is March 1st or February 29th (leap years).
http://rs.tdwg.org/dwc/terms/endDayOfYear | xs:integer | Example: 32 is February 1st, 365 or 366 (leap years) is December 31st.
http://rs.tdwg.org/dwc/terms/day | xs:integer | Example: 1 (the 1st of the month).
http://rs.tdwg.org/dwc/terms/month | xs:integer | Examples: 1 (January), 12 (December).
http://rs.tdwg.org/dwc/terms/year | xs:integer | Use the form CCYY. Example: 2001
http://purl.org/dc/terms/modified | xs:dateTime, xs:date | Examples: 2009-05-25T16:52-0800, 2009-05-25
http://rs.tdwg.org/dwc/terms/eventDate | xs:dateTime, xs:date | Example: 2008-10-13T1400+1000/2100+1000, 2008-10-13
http://rs.tdwg.org/dwc/terms/measurementDeterminedDate | xs:dateTime, xs:date | Example: 1963-03-08T14:07-0600
http://rs.tdwg.org/dwc/terms/dateIdentified | xs:date | Example: 1971-08-03
http://rs.tdwg.org/dwc/terms/relationshipEstablishedDate | xs:date | Example: 2008-05-10
http://rs.tdwg.org/dwc/terms/eventTime | xs:time | Example: 14:09Z
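
A consumer might apply a few of these recommendations when converting text values. The sketch below is illustrative only: the convert helper and the TYPES and RANGES tables are hypothetical, just a handful of terms are covered, and Decimal is used for the xs:double terms as one possible way to keep the digits exactly as written.

```python
from decimal import Decimal

DWC = "http://rs.tdwg.org/dwc/terms/"

# Python type used for a few of the non-string terms listed above.
TYPES = {
    DWC + "decimalLatitude": Decimal,
    DWC + "decimalLongitude": Decimal,
    DWC + "individualCount": int,
    DWC + "month": int,
}

# Inclusive value ranges taken from the comments in the table above.
RANGES = {
    DWC + "decimalLatitude": (-90, 90),
    DWC + "decimalLongitude": (-180, 180),
    DWC + "month": (1, 12),
}


def convert(term, text):
    """Convert a text value to the recommended type and check its range."""
    value = TYPES.get(term, str)(text)
    if term in RANGES:
        low, high = RANGES[term]
        if not low <= value <= high:
            raise ValueError(f"{text!r} is outside {low}..{high} for {term}")
    return value


print(convert(DWC + "decimalLatitude", "37.871650"))
print(convert(DWC + "month", "12"))
print(convert(DWC + "scientificName", "Buxbaumia piperi"))
```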

4. Database Example

4.1 MySQL

It is very easy to produce fielded text using the SELECT ... INTO OUTFILE statement in MySQL. The encoding of the resulting file depends on the server variables and collations in use, which might need to be changed before the export is run. Note that MySQL exports NULL values as \N by default. Use the IFNULL() function, as shown in the following example, to avoid this.
SELECT 
  IFNULL(id, ''), IFNULL(scientific_name, ''), IFNULL(count,'') 
    INTO outfile '/tmp/dwc.txt' 
      FIELDS TERMINATED BY ',' 
      OPTIONALLY ENCLOSED BY '"' 
      LINES TERMINATED BY '\n' 
FROM 
  dwc;
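
If a file has already been exported without IFNULL(), the \N markers can instead be handled when the file is read. The following Python sketch assumes a comma-delimited export with optional double-quote enclosure, as in the query above. The read_mysql_export helper is a hypothetical name, and other backslash escape sequences that MySQL may write are not handled.

```python
import csv
import io


def read_mysql_export(text):
    """Read exported rows, turning the NULL marker \\N into an empty string."""
    rows = []
    for row in csv.reader(io.StringIO(text), delimiter=",", quotechar='"'):
        rows.append(["" if cell == "\\N" else cell for cell in row])
    return rows


# Hypothetical export in which the count column was NULL for the first row.
EXPORT = '123,"Cryptantha gypsophila Reveal & C.R. Broome",\\N\n124,"Buxbaumia piperi",2\n'
for row in read_mysql_export(EXPORT):
    print(row)
```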

5. Tools

For tools and applications, including a Java-based application to read Darwin Core text archives, see the Darwin Core Tools and Applications Wiki page [TOOLS].


Copyright 2009 - Biodiversity Information Standards - TDWG

Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 3.0 United States License.