7. Output file archiving
Status
Draft
Context
Our bluesky implementation contains bluesky callbacks which produce scientist-facing output files, for example:
In addition, we have a developer-facing callback for diagnostics, DocLoggingCallback.
The above callbacks produce files on disk in response to a bluesky scan. These files contain valuable data and so we need to consider how these files are archived for the long term. This must align with the ISIS Data Policy. We should make an attempt to align with FAIR principles.
According to the definitions in the ISIS Data Policy, the data generated by bluesky is generally either “facility generated reduced data” or “metadata”.
This ADR is concerned with the location in which these bluesky output files are stored, and the archiving infrastructure which is therefore used to keep these files for the long term.
At the time of writing this ADR, in June 2025, the scientist-facing files are being written to:
...\inst$\<instrument>\user\bluesky_scans\<rb_number>\
This location has some disadvantages:
It is a network location, which means that a site network break will cause bluesky scans to fail to run
It is not a location designed for long-term scientifically useful data - for example in terms of data integrity
It is not necessarily accessible from downstream systems such as Topcat
Therefore, we would like to define a different, more suitable, location into which bluesky output files can be written.
Some representative use-cases are presented below, showing how data is expected to be used by scientists:
1 Bluesky scan, no neutron runs (e.g. scanning against a block)
sequenceDiagram
    actor PI
    participant NDX
    participant Archive
    participant TopCat
    note over PI: Start of RBNumber experiment
    PI ->> NDX: Start bluesky scan
    note over PI: Time Passes
    note over NDX: Bluesky scan ends
    note over NDX: creates scan.ascii and scan.nxs
    NDX ->> Archive: Sends scan.ascii and scan.nxs
    TopCat ->> Archive: Collects scan.ascii and scan.nxs
    note over PI: 5 months later
    PI ->> TopCat: Show me my data
    TopCat ->> PI: Provides access to scan.ascii and scan.nxs
    note over PI: 1 year later
    PI ->> TopCat: Show me my data
    TopCat ->> PI: Provides access to scan.nxs
1 Bluesky scan, aborted neutron runs
sequenceDiagram
    actor PI
    participant NDX
    participant Archive
    participant TopCat as Online Catalogue
    note over PI: Start of RBNumber experiment
    PI ->> NDX: Start bluesky scan
    note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run
    note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run
    note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run
    note over NDX: Bluesky scan ends
    note over NDX: creates scan.ascii and scan.nxs
    NDX ->> Archive: Sends scan.ascii and scan.nxs
    TopCat ->> Archive: Collects scan.ascii and scan.nxs
    note over PI: 5 months later
    PI ->> TopCat: Show me my data
    TopCat ->> PI: Provides access to scan.ascii and scan.nxs
    note over PI: 1 year later
    PI ->> TopCat: Show me my data
    TopCat ->> PI: Provides access to scan.nxs
1 Bluesky scan, one neutron run
sequenceDiagram
    actor PI
    participant NDX
    participant Archive
    participant TopCat
    note over PI: Start of RBNumber experiment
    PI ->> NDX: Start bluesky scan
    note over NDX: Bluesky scan starts DAE run
    note over PI: Time Passes
    note over NDX: Bluesky scan ends DAE run <br/> Bluesky scan ends
    par
        note over NDX: creates runnumber.nxs with DAE and SE data
    and
        note over NDX: creates scan.ascii and scan.nxs
    end
    NDX ->> Archive: Sends runnumber.nxs, scan.ascii, and scan.nxs
    TopCat ->> Archive: Collects runnumber.nxs, scan.ascii, and scan.nxs
    note over PI: 5 months later
    PI ->> TopCat: Show me my data
    TopCat ->> PI: Provides access to runnumber.nxs, scan.ascii, and scan.nxs
    note over PI: 1 year later
    PI ->> TopCat: Show me my data
    TopCat ->> PI: Provides access to runnumber.nxs and scan.nxs
1 Bluesky scan, N neutron runs
sequenceDiagram
    actor PI
    participant NDX
    participant Archive
    participant TopCat
    note over PI: Start of RBNumber experiment
    PI ->> NDX: Start bluesky scan
    note over NDX: Bluesky scan starts DAE run
    note over PI: Time Passes
    note over NDX: Bluesky scan ends DAE run
    note over NDX: creates runnumber.nxs with DAE and SE data
    NDX ->> Archive: Sends runnumber.nxs
    TopCat ->> Archive: Collects runnumber.nxs
    note over PI: Time Passes
    note over NDX: Bluesky scan starts DAE run
    note over PI: Time Passes
    note over NDX: Bluesky scan ends DAE run
    note over NDX: creates runnumber+1.nxs with DAE and SE data
    NDX ->> Archive: Sends runnumber+1.nxs
    TopCat ->> Archive: Collects runnumber+1.nxs
    note over NDX: Bluesky scan ends
    NDX ->> Archive: Sends scan.ascii and scan.nxs
    TopCat ->> Archive: Collects scan.ascii and scan.nxs
    note over PI: 5 months later
    PI ->> TopCat: Show me my data
    TopCat ->> PI: Provides access to runnumber.nxs, runnumber+1.nxs, scan.ascii, and scan.nxs
    note over PI: 1 year later
    PI ->> TopCat: Show me my data
    TopCat ->> PI: Provides access to runnumber.nxs, runnumber+1.nxs, and scan.nxs
1 Bluesky scan, neutron/muon runs on multiple instruments
sequenceDiagram
    actor PI
    participant NDX-A
    participant NDX-B
    participant NDX-C
    participant Archive
    participant TopCat
    note over PI: Start of RBNumber experiment
    PI ->> NDX-A: Start bluesky scan
    NDX-A ->> NDX-B: Start DAE run
    NDX-A ->> NDX-C: Start DAE run
    note over PI: Time Passes
    NDX-B ->> NDX-A: Provides summary run data
    NDX-C ->> NDX-A: Provides summary run data
    NDX-A ->> NDX-B: End DAE run
    note over NDX-B: creates runnumberB.nxs with DAE and SE data
    NDX-B ->> Archive: Sends runnumberB.nxs
    TopCat ->> Archive: Collects runnumberB.nxs
    NDX-A ->> NDX-C: End DAE run
    note over NDX-C: creates runnumberC.nxs with DAE and SE data
    NDX-C ->> Archive: Sends runnumberC.nxs
    TopCat ->> Archive: Collects runnumberC.nxs
    note over NDX-A: Bluesky scan ends
    NDX-A ->> Archive: Sends scan.ascii and scan.nxs
    TopCat ->> Archive: Collects scan.ascii and scan.nxs
    note over PI: 5 months later
    PI ->> TopCat: Show me my data
    TopCat ->> PI: Provides access to runnumberB.nxs, runnumberC.nxs, scan.ascii, and scan.nxs
    note over PI: 1 year later
    PI ->> TopCat: Show me my data
    TopCat ->> PI: Provides access to runnumberB.nxs, runnumberC.nxs, and scan.nxs
Present
The following people have been involved in discussions leading up to this ADR:
Tom
Chris M-S
George
Kathryn
Jack H
CK (Reflectometry)
This document was additionally reviewed in a regular Thursday code-review slot by the whole IBEX team.
Decisions
File-writing location
Bluesky should write data into the c:\data\RB<rb_number>\bluesky_scans\ folder during a scan.
File naming itself will keep its current scheme (timestamped files).
This location was chosen because it mirrors the archiving setup used by neutron cameras on IMAT.
Attributes & checksums
Bluesky should mark files as read-only, using Windows file attributes, when it has finished writing them. This is so that the archiving process can unambiguously tell whether a file has finished being written. It also reduces the likelihood that a file is accidentally modified.
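The read-only convention above could be sketched as follows. This is a minimal illustration using only the Python standard library, not the actual bluesky callback code; the function names mark_finished and is_finished are hypothetical. On Windows, calling os.chmod with only stat.S_IREAD sets the read-only file attribute.

```python
import os
import stat
from pathlib import Path


def mark_finished(path: Path) -> None:
    """Mark a fully written output file as read-only (hypothetical helper)."""
    # On Windows this sets the read-only file attribute; on POSIX systems it
    # leaves the file readable by its owner only, with no write permission.
    os.chmod(path, stat.S_IREAD)


def is_finished(path: Path) -> bool:
    """Return True if the file carries no owner-write permission bit."""
    return not (path.stat().st_mode & stat.S_IWRITE)
```

A writer would call mark_finished only after closing the file, so that an archiving process using is_finished never picks up a partially written file.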
Checksums should be generated, either at the point when the data is initially generated, or by the archiving process just before it first copies or moves a file.
We have agreed on the desire to generate checksums for data, which is already done for DAE data. These checksums are useful to check for data corruption, which might occur in transit, or in-place on instrument computers or archive servers. A number of checksumming approaches have been considered, and no approach has been chosen yet. The options discussed are:
Use Windows alternate data streams. This is how checksums are done in existing DAE .raw files. It has the advantage that it is relatively simple to implement, but the disadvantage that alternate streams do not map nicely onto Linux file systems.
Generate one checksum per file, for example file.txt would also have an associated file.sha1.txt containing the checksum. The advantage is that this is simple to implement and platform-agnostic. The disadvantage is that it doubles the number of files visible in the archive area.
Generate a single checksum file containing the checksums of all bluesky data, at a higher level of granularity (for example by RB number or by cycle). It is currently unclear exactly how this approach would be implemented, and at what point these checksums would be moved to the archive.
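As an illustration of the one-checksum-per-file option only (no approach has been chosen), a sidecar checksum could be generated and verified roughly as below. The helper names are hypothetical, and the naming rule (file.txt becomes file.sha1.txt) is an assumption taken from the example above.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files are not loaded into memory."""
    digest = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sidecar_path(path: Path) -> Path:
    """Assumed naming scheme: file.txt -> file.sha1.txt."""
    return path.with_name(path.stem + ".sha1.txt")


def write_sidecar_checksum(path: Path) -> Path:
    """Write the SHA-1 of `path` into its sidecar file and return the sidecar."""
    sidecar = sidecar_path(path)
    sidecar.write_text(sha1_of_file(path) + "\n")
    return sidecar


def verify_sidecar(path: Path) -> bool:
    """Return True if the file still matches its recorded checksum."""
    recorded = sidecar_path(path).read_text().strip()
    return recorded == sha1_of_file(path)
```

Under this naming assumption, two files that differ only by extension (for example scan.txt and scan.nxs) would share a sidecar name, so a real implementation would need a naming rule that avoids that collision.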
Moving to the ISIS archive
An automated cron task will look for read-only Bluesky output files, and their associated checksums, in c:\data at regular short intervals (for example, every minute), and will move them to:
The ISIS data archive, under autoreduced/bluesky_scans. The autoreduced folder already exists on the archive.
The data cache disk on the instrument, under c:\data\Export only\RB<rb_number>\bluesky_scans.
Data on the cache disk, under Export only, is kept on the instrument for a short period (usually 24 hours), and then deleted by existing processes.
This is run as a cron task so that, if the network happens to be unavailable at the time when a scan ends, the copy process will catch up when the network becomes available again. This cron task will only move files which sit within a bluesky_scans folder, to prevent it from interfering with other non-bluesky files.
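One pass of such a task might look like the sketch below. It is an assumption-laden illustration, not the agreed implementation: the function archive_pass, its three root arguments, and the choice to mirror the RB<rb_number>\bluesky_scans layout under the archive root are all hypothetical. It copies each finished file to the archive first, and only then moves it into the Export only cache, so a network failure leaves the file in place for the next pass.

```python
import os
import shutil
import stat
from pathlib import Path


def is_read_only(path: Path) -> bool:
    """A file with no owner-write bit is treated as finished."""
    return not (path.stat().st_mode & stat.S_IWRITE)


def archive_pass(data_root: Path, archive_root: Path, export_root: Path) -> list[Path]:
    """Run one pass; return the relative paths of the files that were archived."""
    archived = []
    # Only look inside RB*/bluesky_scans so non-bluesky files are never touched.
    for src in sorted(data_root.glob("RB*/bluesky_scans/*")):
        if not src.is_file() or not is_read_only(src):
            continue  # not a file, or still being written
        rel = src.relative_to(data_root)
        dest = archive_root / rel
        cache = export_root / rel
        try:
            if not dest.exists():
                dest.parent.mkdir(parents=True, exist_ok=True)
                # Copy to a temporary name, then rename, so a failed copy
                # never leaves a partial file under the final name.
                tmp = dest.with_name(dest.name + ".part")
                shutil.copyfile(src, tmp)
                os.replace(tmp, dest)
                shutil.copystat(src, dest)  # carry over timestamps and read-only bit
            cache.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(cache))
        except OSError:
            continue  # e.g. archive unreachable: leave the file for the next pass
        archived.append(rel)
    return archived
```

A scheduler would call archive_pass repeatedly at the chosen interval; because files that fail to copy are left where they are, the process catches up once the network returns.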
Creating a new bluesky_scans folder alongside the existing autoreduced folder was considered, but was felt to be unachievable: it would require too much work relative to using the existing autoreduced folder.
File formats
At present, our scan file output format is explicitly designed to be “human-readable” (and, in fact, the callback which generates these files is explicitly called HumanReadableFileCallback).
We have issue 26 which will implement machine-readable files, using a format such as .hdf5 or .nxs. These files will sit alongside the existing human-readable files; it is acknowledged that while machine-readable files are better from a data preservation and archiving standpoint, we will need to retain the human-readable files to support quick browsing by scientists without using special software.
Consequences
Bluesky output data will be stored in a location suitable for long-term, scientifically useful, data. This includes data integrity and availability concerns.
Bluesky scans will no longer be reliant on a network location being available to run a scan.
The initial location where bluesky writes data (c:\data\RB<rb_number>\bluesky_scans\) will not be the same as its final location (the autoreduced folder on the ISIS archive). This is also true for current DAE data, as generated by the ISISICP.