7. Output file archiving

Status

Draft

Context

Our bluesky implementation contains callbacks which produce scientist-facing output files, for example HumanReadableFileCallback.

In addition, we have a developer-facing callback for diagnostics, DocLoggingCallback.

The above callbacks produce files on disk in response to a bluesky scan. These files contain valuable data and so we need to consider how these files are archived for the long term. This must align with the ISIS Data Policy. We should make an attempt to align with FAIR principles.

According to the definitions in the ISIS Data Policy, the data generated by bluesky is generally either “facility generated reduced data” or “metadata”.

This ADR is concerned with the location in which these bluesky output files are stored, and the archiving infrastructure which is therefore used to keep these files for the long term.

At the time of writing this ADR, in June 2025, the scientist-facing files are being written to

...\inst$\<instrument>\user\bluesky_scans\<rb_number>\

This location has some disadvantages:

It is a network location, so a site network outage will cause bluesky scans to fail.
It is not a location designed for long-term, scientifically useful data, for example in terms of data integrity.
It is not necessarily accessible from downstream systems such as TopCat.

Therefore, we would like to define a different, more suitable, location into which bluesky output files can be written.

Some representative use-cases are presented below, showing how data is expected to be used by scientists:

1 Bluesky scan, no neutron runs (e.g. scanning against a block)

        sequenceDiagram
actor PI
participant NDX
participant Archive
participant TopCat
note over PI:Start of RBNumber experiment
PI ->> NDX: Start bluesky scan
note over PI: Time Passes
note over NDX: Bluesky scan ends
note over NDX: creates scan.ascii and scan.nxs
NDX ->> Archive: Sends scan.ascii and scan.nxs
TopCat ->> Archive: Collects scan.ascii and scan.nxs
note over PI: 5 months later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to scan.ascii and scan.nxs
note over PI: 1 year later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to scan.nxs

1 Bluesky scan, aborted neutron runs

        sequenceDiagram
actor PI
participant NDX
participant Archive
participant TopCat as Online Catalogue
note over PI:Start of RBNumber experiment
PI ->> NDX: Start bluesky scan
note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run
note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run
note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run
note over NDX: Bluesky scan ends
note over NDX: creates scan.ascii and scan.nxs
NDX ->> Archive: Sends scan.ascii and scan.nxs
TopCat ->> Archive: Collects scan.ascii and scan.nxs
note over PI: 5 months later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to scan.ascii and scan.nxs
note over PI: 1 year later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to scan.nxs

1 Bluesky scan, one neutron run

        sequenceDiagram
actor PI
participant NDX
participant Archive
participant TopCat
note over PI:Start of RBNumber experiment
PI ->> NDX: Start bluesky scan
note over NDX: Bluesky scan starts DAE run
note over PI: Time Passes
note over NDX: Bluesky scan ends DAE run <br/> Bluesky scan ends
par
note over NDX: creates runnumber.nxs with DAE and SE data
and
note over NDX: creates scan.ascii and scan.nxs
end
NDX ->> Archive: Sends runnumber.nxs, scan.ascii, and scan.nxs
TopCat ->> Archive: Collects runnumber.nxs, scan.ascii, and scan.nxs
note over PI: 5 months later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumber.nxs, scan.ascii, and scan.nxs
note over PI: 1 year later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumber.nxs and scan.nxs

1 Bluesky scan, N neutron runs

        sequenceDiagram
actor PI
participant NDX
participant Archive
participant TopCat
note over PI:Start of RBNumber experiment
PI ->> NDX: Start bluesky scan
note over NDX: Bluesky scan starts DAE run
note over PI: Time Passes
note over NDX: Bluesky scan ends DAE run
note over NDX: creates runnumber.nxs with DAE and SE data
NDX ->> Archive: Sends runnumber.nxs
TopCat ->> Archive: Collects runnumber.nxs
note over PI: Time Passes
note over NDX: Bluesky scan starts DAE run
note over PI: Time Passes
note over NDX: Bluesky scan ends DAE run
note over NDX: creates runnumber+1.nxs with DAE and SE data
NDX ->> Archive: Sends runnumber+1.nxs
TopCat ->> Archive: Collects runnumber+1.nxs
note over NDX: Bluesky scan ends
NDX ->> Archive: Sends scan.ascii and scan.nxs
TopCat ->> Archive: Collects scan.ascii and scan.nxs
note over PI: 5 months later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumber.nxs, runnumber+1.nxs, scan.ascii, and scan.nxs
note over PI: 1 year later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumber.nxs, runnumber+1.nxs, and scan.nxs

1 Bluesky scan, neutron/muon runs on multiple instruments

        sequenceDiagram
actor PI
participant NDX-A
participant NDX-B
participant NDX-C
participant Archive
participant TopCat
note over PI:Start of RBNumber experiment
PI ->> NDX-A: Start bluesky scan
NDX-A ->> NDX-B: Start DAE run
NDX-A ->> NDX-C: Start DAE run
note over PI: Time Passes
NDX-B ->> NDX-A: Provides summary run data
NDX-C ->> NDX-A: Provides summary run data
NDX-A ->> NDX-B: End DAE run
note over NDX-B: creates runnumberB.nxs with DAE and SE data
NDX-B ->> Archive: Sends runnumberB.nxs
TopCat ->> Archive: Collects runnumberB.nxs
NDX-A ->> NDX-C: End DAE run
note over NDX-C: creates runnumberC.nxs with DAE and SE data
NDX-C ->> Archive: Sends runnumberC.nxs
TopCat ->> Archive: Collects runnumberC.nxs
note over NDX-A: Bluesky scan ends
NDX-A ->> Archive: Sends scan.ascii and scan.nxs
TopCat ->> Archive: Collects scan.ascii and scan.nxs
note over PI: 5 months later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumberB.nxs, runnumberC.nxs, scan.ascii, and scan.nxs
note over PI: 1 year later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumberB.nxs, runnumberC.nxs, and scan.nxs

Present

The following people have been involved in discussions leading up to this ADR:

Tom
Chris M-S
George
Kathryn
Jack H
CK (Reflectometry)

This document was additionally reviewed in a regular Thursday code-review slot by the whole IBEX team.

Decisions

File-writing location

Bluesky should write data into the c:\data\RB<rb_number>\bluesky_scans\ folder during a scan. File naming itself will keep its current scheme (timestamped files).

This location was chosen because it mirrors the archiving setup used by neutron cameras on IMAT.
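
As an illustration of the folder layout described above, the following minimal sketch builds the per-experiment output directory from an RB number. The function name `scan_output_dir` and the constant `DATA_ROOT` are hypothetical names introduced here for illustration and are not part of the existing bluesky implementation; file naming is deliberately left out because it keeps its current scheme.

```python
from pathlib import PureWindowsPath

# Root of the instrument data area, as named in this ADR.
DATA_ROOT = PureWindowsPath(r"c:\data")


def scan_output_dir(rb_number: str) -> PureWindowsPath:
    """Return the folder into which bluesky output files would be written.

    Hypothetical helper: it only joins path components and does not
    create the folder or touch the file system.
    """
    return DATA_ROOT / f"RB{rb_number}" / "bluesky_scans"
```

Using `PureWindowsPath` keeps the sketch runnable on any platform while still producing Windows-style paths.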

Attributes & checksums

Bluesky should mark files as read-only, using Windows file attributes, when it has finished writing them. This is so that the archiving process can unambiguously tell whether a file has finished being written. It also reduces the likelihood that a file is accidentally modified.
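
A minimal sketch of how a file could be marked read-only once writing has finished, and how an archiving process could test for that marker. The helper names `mark_read_only` and `is_finished` are assumptions made for this example. On Windows, `os.chmod` can only change the read-only file attribute, which is the attribute this decision relies on.

```python
import os
import stat
from pathlib import Path


def mark_read_only(path: Path) -> None:
    """Clear all write-permission bits on a file.

    On Windows this sets the read-only file attribute; on POSIX
    systems it removes write permission for user, group and others.
    """
    mode = path.stat().st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def is_finished(path: Path) -> bool:
    """Return True if the file carries the read-only marker."""
    return not (path.stat().st_mode & stat.S_IWUSR)
```

A callback would call `mark_read_only` as the last step after closing the file, so that a file which is still writable is always treated as incomplete.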

Checksums should be generated, either at the point when the data is initially generated, or by the archiving process just before it first copies or moves a file.

We have agreed that checksums should be generated for bluesky data, as is already done for DAE data. Checksums make it possible to detect data corruption, whether it occurs in transit or in place on instrument computers or archive servers. Several checksumming approaches have been considered, but none has been chosen yet. The options discussed are:

Use Windows alternate data streams. This is how checksums are done in existing DAE .raw files. It has the advantage that it is relatively simple to implement, but the disadvantage that alternate data streams do not map nicely onto Linux file systems.
Generate one checksum per file, for example file.txt would also have an associated file.sha1.txt containing the checksum. The advantage is that this is simple to implement and platform-agnostic. The disadvantage is that it doubles the number of files visible in the archive area.
Generate a single checksum file containing the checksums of all bluesky data, at a higher level of granularity (for example by RB number or by cycle). It is currently unclear exactly how this approach would be implemented, and at what point these checksums would be moved to the archive.
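
Since no approach has been chosen, the following is only a sketch of what the second option (one checksum file per data file) might look like. The function names are hypothetical, and SHA-1 is used solely because the example file name above is file.sha1.txt. The naming rule follows that example literally, which means two files sharing a stem (such as scan.ascii and scan.nxs) would map to the same checksum file; a real implementation would need to resolve that.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sidecar_path(path: Path) -> Path:
    """Map file.txt to file.sha1.txt (assumed naming rule).

    Note: files that share a stem would collide under this rule.
    """
    return path.with_name(f"{path.stem}.sha1.txt")


def write_sidecar(path: Path) -> Path:
    """Write the checksum of `path` next to it and return the sidecar."""
    sidecar = sidecar_path(path)
    sidecar.write_text(sha1_of_file(path) + "\n", encoding="ascii")
    return sidecar


def verify_sidecar(path: Path) -> bool:
    """Recompute the checksum and compare it with the stored value."""
    expected = sidecar_path(path).read_text(encoding="ascii").strip()
    return sha1_of_file(path) == expected
```

Verification can then be repeated at any later stage, for example after a file arrives on the archive, by calling `verify_sidecar` on the copied file.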

Moving to the ISIS archive

An automated cron task will look for read-only Bluesky output files, and their associated checksums, in c:\data at regular short intervals (for example, 1 minute), and will move them to:

The ISIS data archive, under autoreduced/bluesky_scans. The autoreduced folder already exists on the archive.
The data cache disk on the instrument, under c:\data\Export only\RB<rb_number>\bluesky_scans.

Data on the cache disk, under Export only, is kept on the instrument for a short period (usually 24 hours), and then deleted by existing processes.

This is run as a cron task so that, if the network happens to be unavailable at the time when a scan ends, the copy process will catch up when the network becomes available again. This cron task will only move files which sit within a bluesky_scans folder, to prevent it from interfering with other non-bluesky files.
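
The behaviour described above can be sketched as a single pass that a scheduler would invoke at each interval. This is an illustration under stated assumptions, not the agreed implementation: the function names are hypothetical, the relative folder layout is assumed to be preserved under both destinations, checksum files are assumed to be ordinary read-only files in the same folder, and the cache folder is assumed to be on the same drive as the source. A file is copied to the archive first and only then moved to the cache, so a network failure leaves it in place for the next pass.

```python
import os
import shutil
import stat
from pathlib import Path

SCAN_DIR_NAME = "bluesky_scans"


def is_read_only(path: Path) -> bool:
    """True if the owner write bit (Windows read-only flag) is cleared."""
    return not (path.stat().st_mode & stat.S_IWRITE)


def finished_scan_files(data_root: Path, export_root: Path) -> list[Path]:
    """Find read-only files inside any bluesky_scans folder.

    Files already under the export (cache) folder are skipped so that
    they are not processed a second time.
    """
    found = []
    for path in data_root.rglob("*"):
        if export_root in path.parents or not path.is_file():
            continue
        parents = path.relative_to(data_root).parts[:-1]
        if SCAN_DIR_NAME in parents and is_read_only(path):
            found.append(path)
    return sorted(found)


def archive_pass(data_root: Path, archive_root: Path, export_root: Path) -> list[Path]:
    """Run one pass; return the relative paths that were handled."""
    handled = []
    for src in finished_scan_files(data_root, export_root):
        rel = src.relative_to(data_root)
        archive_dest = archive_root / rel
        try:
            if not archive_dest.exists():
                archive_dest.parent.mkdir(parents=True, exist_ok=True)
                # Copy under a temporary name, then rename, so a partial
                # copy is never mistaken for a complete archived file.
                part = archive_dest.with_name(archive_dest.name + ".part")
                shutil.copyfile(src, part)
                os.replace(part, archive_dest)
                os.chmod(archive_dest, stat.S_IREAD)
        except OSError:
            # Archive unreachable (for example a network outage):
            # leave the file where it is and retry on the next pass.
            continue
        export_dest = export_root / rel
        export_dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(export_dest))
        handled.append(rel)
    return handled
```

Because each pass is idempotent, it does not matter how often the scheduler runs it or whether an earlier pass was interrupted.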

Creating a new bluesky_scans folder alongside the existing autoreduced folder was considered, but was felt to be unachievable - it would require too much work relative to using the existing autoreduced folder.

File formats

At present, our scan file output format is explicitly designed to be “human-readable” (and, in fact, the callback which generates these files is explicitly called HumanReadableFileCallback).

Issue 26 will implement machine-readable files, using a format such as .hdf5 or .nxs. These files will sit alongside the existing human-readable files. Although machine-readable files are better from a data preservation and archiving standpoint, we will need to retain the human-readable files so that scientists can browse results quickly without special software.

Consequences

Bluesky output data will be stored in a location suitable for long-term, scientifically useful data, which addresses the data integrity and availability concerns raised in the Context section.
Bluesky scans will no longer rely on a network location being available in order to run.
The initial location where bluesky writes data (c:\data\RB<rb_number>\bluesky_scans\) will not be the same as its final location (the autoreduced folder on the ISIS archive). This is also true for current DAE data, as generated by the ISISICP.