Data Management
Dataset
Data Publication
Data Management
Backups
Field Collection
Basic File Structure and Organization
Handling Dates and Times
Documentation
Resources
What is a dataset?
A dataset, for the purposes of NM EPSCoR data publication, is any data object collected or generated with full or partial funding from NM EPSCoR that is referenced or used in your thesis, dissertation, or any other publication. It is not limited to the data objects aggregated and packaged into a chart in the publication; rather, it is the entire collection of raw, QAQCed, or analyzed data that went into the publication.
In some cases, a dataset may be the product of a piece of equipment funded by NM EPSCoR that you are responsible for but that is not related to your main research. For example, equipment may be purchased to expand the network of SNOTEL-ready sites in New Mexico, with installation and maintenance as part of your GA responsibilities, while your own research is about water quality. Any data generated by those instruments should be delivered as well, regardless of whether you incorporated the data into your thesis or dissertation.
What is a data publication?
Briefly, a data publication is the published dataset with its associated documentation. Depending on the publication outlet, this documentation can be a narrative text describing the processes used to create the data and the final data product.
When putting together the data and documentation for publication in the NM EPSCoR data portal, please consider the following:
- This is a citable resource.
- This is a publication.
- Your name and your PI/advisor’s name are attached to this data.
The impetus for data publication is not the NM EPSCoR state office or EDAC. This is part of a larger trend towards open data across federal agencies such as the USGS and NSF, with support from recent White House initiatives. Being able to effectively manage and document your data is a valuable skill in this environment. If nothing else, taking the time to put good data management practices in place can have long-term benefits for your research group and your effectiveness in building on previous work.
Before delivering your datasets for publication, take some time to look over the data files.
- Are the field names consistent?
- Are they spelled correctly?
- Do the columns align?
- Are the files complete?
- If you have location data, do the site identifiers match the coordinates? For example, in a spreadsheet with id, x, y where id can be repeated, does each instance of the id have the same values in x and y?
- If you have location data, do the recorded locations make sense? You know your study area - if you plot your site locations in Google Earth, for example, are they located where you expect them to be?
- Are the dates meaningful? For example, a record dated 4/31/2013 cannot be valid, because April has only 30 days.
- Are the dates, and times, formatted correctly and consistently?
- Have you removed any ephemera from the spreadsheets? For example, your spreadsheet includes a chart or a tab with a previous iteration of the analysis.
- Are NODATA values used consistently?
- Have you documented any QAQC flags?
- Have you provided the data in a non-proprietary format if at all possible?
- Are the number of digits following the decimal consistent? Are they meaningful based on the limitations of your equipment?
- Are you relying on color-coding to differentiate between records or values? Keep in mind that this kind of flag will be removed when converting to other formats, leaving the user unaware of potentially important information about the record.
- Can your data be understood in a basic format? If a user only has access to a plain-text version of your spreadsheet, is critical information lost?
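Some of the checks above can be automated. As one sketch, the site/coordinate consistency check can be written with Python's standard library; the file name and column names (`id`, `x`, `y`) here are hypothetical examples, not a prescribed format:

```python
import csv
from collections import defaultdict

def check_site_coordinates(path):
    """Return site ids that appear with more than one (x, y) coordinate pair."""
    coords = defaultdict(set)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            coords[row["id"]].add((row["x"], row["y"]))
    return {site: pairs for site, pairs in coords.items() if len(pairs) > 1}

# A small made-up file for demonstration: RE-43 has inconsistent coordinates.
with open("sites.csv", "w") as f:
    f.write("id,x,y\nRE-43,1.0,2.0\nRE-43,1.0,2.5\nRE-41,5.0,6.0\n")

inconsistent = check_site_coordinates("sites.csv")
print(sorted(inconsistent))  # ['RE-43']
```

A scripted check like this is easy to rerun every time a data file is updated, which catches errors long before publication.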
Take the time to review your data files as you would proofread your dissertation or thesis.
Finally, if you needed to find these data and understand the files a year from now, could you do that? What about after five years? Or if you were new to the research group and needed to use these data, could you?
Related Information: See the Documentation section
- Data Publication Policy
- For information regarding publications, see the NM EPSCoR Data Policy
- Patents
- PIs will need to follow their own institutions' patent policies and rules.
- Embargo Policy
- For information regarding Embargo Policy, see the NM EPSCoR Data Policy
- EDAC’s role in data publication
- Data will be made available through several mechanisms. First, during the data and information gathering portion of the project, most data will be available to all project participants via a password-controlled website that houses the virtual lab notebook, copies of non-copyrighted materials, drafts of working white papers and publications, and other data and information generated during the course of the project. Two exceptions include survey data that contain names of human subjects and data and information related to inventions that are to be patented.
- With the exception of climate and water data that are collected automatically by State and Federal agencies and that will be made available as soon after collection as the agency allows, other data collected through this project may be embargoed for a period of up to one year to allow time for publication by students and researchers; any exceptions to this embargo period must be approved by the Project Director in writing.
File Names and Organization
File names should be applied consistently across a set of data and within projects. Data should also be separated conceptually before publication. Rather than a single spreadsheet containing soils data, tree ring data, water quality data, and channel characterization data, separate the tabs into conceptually meaningful files, e.g. a file for soils data and a file for water quality data. This is more effective for documentation and makes the files easier to locate on the file system.
Avoid spaces and special characters. Use underscores instead of hyphens (or spaces).
When including the date or datetime in a filename, standard practice is to use the yyyyMMDD format for dates and yyyyMMDD_HHMMSS for datetimes.
If you need to maintain different versions of a dataset, the best practice is to use version control (think of MS Word’s “Track Changes” option). There are a number of open source systems, such as Git or Subversion, that can be run locally or within a research group. However, if that isn’t an option, manage the versions using folders rather than simply adding a modifier to the file name.
Bad:
- pump_pressures_synthA20_a.csv
- pump_pressures_synthA20_b.csv
Better:
- pump_pressures_20130503/
  - pump_pressures_synthA20_20130503.csv
- pump_pressures_20130504/
  - pump_pressures_synthA20_20130504.csv
Document the folder naming conventions and file naming conventions. Include definitions for any identifier codes or acronyms used.
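The date and datetime filename conventions above can be generated rather than typed by hand, which avoids inconsistencies. A minimal sketch in Python; the prefix and function name are illustrative:

```python
from datetime import datetime

def timestamped_filename(prefix, when, ext="csv"):
    """Build a filename using the recommended yyyyMMDD_HHMMSS convention."""
    return f"{prefix}_{when.strftime('%Y%m%d_%H%M%S')}.{ext}"

name = timestamped_filename("pump_pressures_synthA20", datetime(2013, 5, 3, 14, 30, 0))
print(name)  # pump_pressures_synthA20_20130503_143000.csv
```

Generating names this way in a logging or export script keeps every file in a project on the same convention automatically.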
Related Information: Field Collection and Site Identifiers
Backups
Make regular backups of your data. This should include original data files (for example, DAT files from your logger), working files (for example, exploratory analysis outputs from R), and any documentation for your work.
Check with your advisor or lab manager for backup options within your research group. Check with your IT department to see if your institution provides storage space for students.
There are a variety of options available to you - inexpensive thumb drives, inexpensive external drives, Dropbox, Google Drive, etc. Take advantage of them and back up your data.
Note: check with your advisor or lab manager about any policies within your research group or related to your grant about making personal copies of your data and follow any policies in place.
Back up your data regularly.
Field Collection
One of the most important things to remember is to keep track of where your sites are. While installing an instrument or making one-off collections, be sure to note the coordinates of the site. The minimum information you should be collecting about a site is: the site identifier, the XY coordinates, and the projection and/or datum used by the equipment you're using to geolocate the site. Double-check the default settings of your GPS before heading into the field.
If you are not involved with initial installation of the instruments but are using those sites for your own data collection, be sure to get that information from whoever is responsible for it in your research group.
Make a note about the equipment you use to collect your data. Things like the manufacturer and name of the sensor. Any limitations related to the sensor. Do you need to calibrate the sensor regularly? If so, make a note and include the calibration step in your documentation.
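One way to capture this minimum site and equipment information is a small metadata file kept alongside the data. The sketch below writes one with Python's standard library; the identifiers, coordinates, sensor name, and field names are all made-up examples:

```python
import csv

# Hypothetical minimum site record: identifier, coordinates, and datum,
# plus sensor details worth noting at installation time.
sites = [
    {"site_id": "RE-43", "x": -106.555, "y": 35.885, "datum": "WGS84",
     "sensor": "example thermistor", "last_calibrated": "2013-04-15"},
]

with open("site_metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(sites[0]))
    writer.writeheader()
    writer.writerows(sites)
```

A plain-text file like this survives format conversions and is trivial to hand to the next member of the research group.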
Site Identifiers
Be sure to give your sites a unique and meaningful site identifier. If the identifier is a code, for example RE-43, we encourage you to provide a descriptive title as well as the code. For our example, RE-43 is a field collection site on Redondo Peak so a descriptive title could be as simple as “Redondo Peak 43”.
Bear in mind also, that you are likely not the first person to be collecting data within a particular research area. So our earlier example of Redondo Peak 43 may be unique to your field work, but not within the larger set of sites across the area. Be mindful of other datasets available in your area and consider a naming convention that will help avoid confusion later.
If your fieldwork involves taking multiple samples at the same location, add an additional field to differentiate between the samples. We strongly discourage combining the site identifier and sample identifier in a single column in the output files. In addition, consider developing a consistent set of terms (when appropriate) to aid in identifying the various samples, and document those terms. For example, at site RE-43, I take soil samples at three different depths. Rather than labelling the records as RE-43a, RE-43b, and RE-43c, include a site field with RE-43 and a separate sample field with its own meaningful identifier. (Realistically, in this example, you should have a depth field anyway.)
Related Information: File Names and Organization
Related Information: Basic File Structure and Organization
Basic File Structure and Organization
When creating files for timeseries and/or spatial data, we recommend using columns for site identifiers and datetimes rather than using those as tab identifiers.
It is also a good idea to look at other examples of data files published in your field. How are those files structured? If you think you need to create an idiosyncratic format, ask yourself if the new structure enhances usability, understandability or long-term viability.
Quality Control Information
If using a NODATA flag, be sure that the value, especially if it’s numeric, falls well outside the range of possible data values for the field. For example, your sensor has a range of -1000 to 1000. A NODATA value of -99 in this field would be mistaken for a valid data point and used in aggregations or analysis, skewing the results. Best practice is to use a NODATA value that falls outside the range of any numeric data field in the data file. A common value is -9999.
Similarly, using zero for NODATA, zero for missing data, and using zero as a valid data value is also confusing. A better option is to use a true NODATA value, such as -9999, for NODATA and missing data in combination with a QAQC flag to explain the NODATA value. For example, you have a data file where you would like to differentiate between sensor failure and gaps in the data related to known downtimes. The data value would be -9999 for both and the QAQC field would be used to note ‘sensor failure’ vs. ‘sensor inactive’.
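The skew described above is easy to demonstrate, and easy to avoid by filtering the sentinel before any aggregation. A minimal sketch, assuming a -9999 NODATA convention:

```python
NODATA = -9999.0

def mean_ignoring_nodata(values, nodata=NODATA):
    """Average a series, excluding the NODATA sentinel value."""
    valid = [v for v in values if v != nodata]
    return sum(valid) / len(valid) if valid else None

readings = [67.1, -9999.0, 68.0, 66.9]
# A naive mean of these readings would be badly skewed by the sentinel;
# excluding it gives the mean of the three valid values.
print(round(mean_ignoring_nodata(readings), 2))  # 67.33
```

The same filtering step applies whether the work is done in Python, R, or a spreadsheet: the sentinel must be removed (or converted to a true missing value) before any statistic is computed.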
If using Excel or similar spreadsheet software, never rely on color or other text formatting alone to encode QAQC information, or anything else that modifies or describes a data value. Always remember that these files can and will be converted to plain-text formats, and that information will be lost.
Document the terms used in any QAQC fields (or any other modifier of a field). What does good mean for that data?
Basic QAQCed data
Site ID | Datetime | Temp_F | Temp_Qual |
---|---|---|---|
RE-43 | 2013-05-23T04:20:00 | 67.10 | Good |
RE-41 | 2013-05-23T05:00:00 | 68.03 | Poor |
Site with sample modifier
Site ID | Sample ID | pH | soil_temp_f |
---|---|---|---|
RE-43 | 101 | 3.50 | 58.00 |
RE-43 | 102 | 3.75 | 57.45 |
Handling Dates and Times
When including the date or datetime in a filename, standard practice is to use the yyyyMMDD format for dates and yyyyMMDD_HHMMSS for datetimes.
When including the date or datetime in a spreadsheet or delimited text file, standard practice is to use the yyyyMMDD format for dates and yyyyMMDDTHHMMSS for datetimes. Note the 'T' between the date and time components of the second format. Dashes to delineate the date parts (e.g. 2013-05-23T04:20:00) are acceptable within the data file; for filenames, use the compact formats above or underscores. Separate fields for the date and time are also acceptable.
Make sure that the dates and times are valid dates and times. 4/31/2013 does not exist.
Keep track of the timezone information when collecting data with a timestamp. Does your equipment adjust for Daylight Savings? Document that. Collecting data in Pacific Standard Time? Document that. Document it when installing or configuring your equipment, collecting data in the field or while using lab equipment that includes a timestamp.
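Parsing timestamps with a strict format is one way to catch invalid dates like 4/31/2013 automatically. A minimal sketch using Python's standard library:

```python
from datetime import datetime

def parse_timestamp(text):
    """Parse a yyyy-MM-ddTHH:mm:ss timestamp; invalid dates raise ValueError."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S")

print(parse_timestamp("2013-05-23T04:20:00").day)  # 23

try:
    parse_timestamp("2013-04-31T00:00:00")  # April has only 30 days
except ValueError:
    print("invalid date rejected")
```

Running every timestamp column through a parser like this before publication turns silent date errors into loud ones.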
Documentation
Basic Documentation
The data publication documentation is not very different from putting together a poster. Very simply, tell us about your research.
- Who collected and analyzed the data?
- What did you do? What equipment did you use? What types of analysis did you perform?
- When did you collect or generate the data?
- If it involved field work, where did you collect the data?
- What did you collect, and what are the units for each kind of measurement?
The documentation process should also be a continuous one as you work on your research project. If you have questions about documentation with regards to your specific research project, please get in touch with EDAC at any time.
It is helpful to consider one of the following scenarios when developing the documentation for your datasets:
You are a new member of the research group and have been tasked with performing additional analysis on an existing dataset. Is there anything describing the file and its contents? Does it contain enough information for you to understand what’s in each field? Does it contain enough information for you to be confident that the analysis you want to perform is possible or appropriate? If you were asked to recreate the dataset, would you be able to given the documentation provided?
You move on to other research tasks, but return to these datasets after a year or two. Ask yourself the same questions: could you understand each field, trust the analysis, and recreate the dataset from the documentation alone?
Everything in your data files should be documentable and documented. This does not mean that you must maintain separate documentation for each file; doing so for data collected at regular intervals from an instrument, for example, is unnecessary. For similar datasets, a single piece of documentation is sufficient. However, "similar datasets" does not mean any set of files, collected by different people, that happen to contain chemistry data. It means that, for the set of files, the processes used to generate each file are the same, the structure of the files is the same in every way, the files relate to the same set of locations (if spatial data), and the kind of data generated is the same (each file contains the same weather parameters and is structured the same way).
If you have instituted good documentation practices throughout the course of your research project, the NM EPSCoR data publication requirements are not difficult to complete. We have found that most students, when they take advantage of the work they’ve done for their thesis/dissertation or even for a poster, have very little trouble providing good documentation for publication.
Other Resources
Data Management Plans
- National Science Foundation - Data Management Plan Requirements (GPG): http://www.nsf.gov/pubs/policydocs/pappguide/nsf13001/gpg_2.jsp#dmp
- DataONE - Best Practices Collection: http://www.dataone.org/all-best-practices
Data Management
- National Science Foundation - Data Sharing Policy: http://www.nsf.gov/pubs/policydocs/pappguide/nsf13001/aag_6.jsp#VID4
- Digital Curation Centre - Managing Research Data (video): http://www.dcc.ac.uk/news/managing-research-data-video
- Library of Congress - Sustainability of Digital Formats: http://www.digitalpreservation.gov/formats/index.shtml
- Federation of Earth Science Information Partners - Data Management Short Course materials: http://commons.esipfed.org/datamanagementshortcourse
- Federation of Earth Science Information Partners - Interagency Data Stewardship Principles: http://commons.esipfed.org/node/419
- DataONE - Best Practices Collection: http://www.dataone.org/all-best-practices
Documentation
- Digital Curation Centre - Disciplinary Metadata Standards Collection: http://www.dcc.ac.uk/resources/metadata-standards
- Library of Congress - Standards at the Library of Congress: http://www.loc.gov/standards/
- Library of Congress - Metadata for Digital Content - Standards: http://www.loc.gov/standards/mdc/
- Federation of Earth Science Information Partners - Data Citation Guidelines: http://commons.esipfed.org/node/308
- DataONE - Best Practices Collection: http://www.dataone.org/all-best-practices
Tools
- Digital Curation Centre - Managing Active Research Data - Tools and Services
- Digital Curation Centre - Open Source Software and Open Standards FAQ
Related information:
See Documentation Reporting (coming soon).
NM EPSCoR contributes its data to the DataONE network as a member node. Through participation in this network, whose mission is to enhance search and discovery of Earth and environmental data, NM EPSCoR reaches a wider audience and maintains high availability of its data.