DigitalGlassware™: Open Data Standards, As Standard

Welcome to the second blog post demonstrating how the data recorded in DigitalGlassware™ can be processed offline for additional purposes. In our last blog post we looked at PCML (Practical Chemistry Markup Language) and tentatively extracted some information from a chemical recipe. In this post, we will focus on showing you, the user, that the data we collect is robust and fit for purpose. Given that we use this data ourselves in DigitalGlassware™, it is naturally something we feel strongly about.

Here at DeepMatter, data is our business, and as such we’re committed to open data standards. This is why we’re laying out our stall and showing that we adhere to standards for integrity and security: we know it is critical that companies have confidence in the data specifications they use. Our commitment to this transparency allows you, the DigitalGlassware™ user, to get more out of your data.

This blog post starts at the beginning, with the raw data itself. We will access XML data associated with a real chemistry experiment, ingest it into Python, validate it against its schemas and verify its X.509 signatures, before showing how it can easily be transformed for other purposes.

Environment Setup

Just as in the previous blog post, we will use Python to extract information from the data. If you did not run through the first post, it may be worth having a look at it, as it describes the environment setup process. Once you have your environment set up (or if you are continuing on from the previous post), you can carry on.

First, open your terminal/command line and navigate to your dm_datascience Git repository. Running the following command will update your local repository with any changes which have been made on GitHub.

git pull

There should be no issues raised when doing this, but if you find that the process has failed for some reason, it is probably easiest to just delete the directory and clone the full repository again. Alternatively, you can tell Git to discard all the local changes you have made to tracked files and then pull again.

git reset --hard
git pull

Now update your pipenv/Python environment by running the following commands:

pipenv install
pipenv shell

Then start a Python session by executing:

python

As with the previous blog post, there is a Jupyter notebook available under dm_datascience/02_PCML_and_PCRR/notebooks/standards.ipynb if you would rather not copy and paste the code below. Last, but not least, we’ll need to unzip some of the XML data used in this post, located under dm_datascience/02_PCML_and_PCRR/data/pcrr.zip. Just extract it to the same directory using your favourite unzip tool (you’ll understand why we compressed it when you see the file sizes).

The Data

Data in DigitalGlassware comes primarily from two sources: recipe data, which provides context on what the intent of the chemistry is (“take x, do y, make z”), and runtime data, which represents the observations associated with the chemistry. This covers (but is not limited to) real-time sensor data from hardware devices such as DeepMatter’s DeviceX, manual observations recorded by the chemist, and outcomes of the chemistry.

As stated above, we are committed to standards in DigitalGlassware, with both recipe and runtime data adhering to open XML standards. We also offer access to our data via REST endpoints (another open standard), but for now we’ll make use of the offline data provided in the Git repository (which you should have unzipped as per the instructions above). For recipe data, we have PCML (Practical Chemistry Markup Language), and for runtime data we have PCRR (Practical Chemistry Runtime Record). PCML defines what you intend to do (and possibly what you expect to see), and PCRR defines what you did and saw. The following is a snippet of some typical PCML, located under dm_datascience/02_PCML_and_PCRR/data/3a_recipe.pcml (it will look familiar if you’ve read the first blog post).

<?xml version="1.0" encoding="UTF-8"?>
<pcml version="1.3.4" experiment="3a) Synthesis of N-(1-Naphthoyl)-4-methylbenzenesulfonohydrazide - 3a"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="https://schemas.deepmatter.tech/pcml-1.3.4.xsd">
    <description></description>
    <meta>
        <owner>Deepmatter</owner>
        <author>Deepmatter</author>
        <version>1</version>
        <version-author>Deepmatter</version-author>
        <version-owner>Deepmatter</version-owner>
        <publication>Org. Synth. 2018, 95, 276-288</publication>
        <custom-tag>3a</custom-tag>
        <creation-date>13/09/2018</creation-date>
    </meta>
    <synopsis>Synthesis of N-(1-Naphthoyl)-4-methylbenzenesulfonohdrazide</synopsis>
    <chemistry>
        <reaction-scheme>
            <inputs>
                <chemical vol="11.8" volunit="ml" step="synthesis" role="starting-material" moles="0.0788" molweight="190.63" >
                    <name>1-Naphthoyl chloride</name>
                </chemical>
...
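
To give a feel for how approachable this structure is, here is a minimal sketch that reads a cut-down fragment of the snippet above using only Python's standard library. The fragment has been closed off by hand (the real file is much longer), and the derived mass is simply moles multiplied by molecular weight; it is an illustration, not a value stored in the PCML.

```python
import xml.etree.ElementTree as ET

# Hand-closed fragment based on the snippet above; the real recipe file
# contains many more elements and attributes.
FRAGMENT = """<pcml version="1.3.4">
  <meta>
    <publication>Org. Synth. 2018, 95, 276-288</publication>
  </meta>
  <chemistry>
    <reaction-scheme>
      <inputs>
        <chemical vol="11.8" volunit="ml" step="synthesis"
                  role="starting-material" moles="0.0788" molweight="190.63">
          <name>1-Naphthoyl chloride</name>
        </chemical>
      </inputs>
    </reaction-scheme>
  </chemistry>
</pcml>"""

root = ET.fromstring(FRAGMENT)
publication = root.findtext("meta/publication")

# Collect every <chemical> with a few of its attributes.
chemicals = []
for chem in root.iter("chemical"):
    moles = float(chem.get("moles"))
    molweight = float(chem.get("molweight"))
    chemicals.append({
        "name": chem.findtext("name"),
        "role": chem.get("role"),
        "moles": moles,
        "mass_g": moles * molweight,  # derived here, not read from the file
    })

print(publication)
for chem in chemicals:
    print(chem)
```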

We’ve provided some example recipe (PCML) and runtime (PCRR) data under the dm_datascience/02_PCML_and_PCRR/data directory. We’ll start by checking the data integrity – that it is well-formed and that we adhere to our own standards, by validating the XML against the associated schemas.

Validating PCML

We’ll use the lxml library in Python to validate the PCML against its schema. lxml will have been installed by pipenv, but it can be a bit daunting to those unfamiliar with it, so we’ve wrapped up the necessary functionality in our own Python module located under dm_datascience/dm/data/xml.py. Let’s import the XML validation function and define where our PCML and its associated XSD schema are located.

from dm.data.xml import validate_xml

pcml_schema_file = './02_PCML_and_PCRR/data/pcml-1.3.4.xsd'
pcml_recipe_file = './02_PCML_and_PCRR/data/3a_recipe.pcml'

The PCML file represents a real chemical reaction, which has been authored for DigitalGlassware. Specifically, it is the synthesis of N-(1-Naphthoyl)-4-methylbenzenesulfonohydrazide, which appeared in Organic Syntheses 2018, volume 95, 276-288 (DOI: 10.15227/orgsyn.095.0276). This article has been encoded in DigitalGlassware using the content structuring tools available in the commercially available RecipeBuilder module. This write-once, run-many philosophy is key to the DigitalGlassware platform: once a recipe is written it can be run by anyone, anywhere, and the runtime data collected for analysis during and after the reaction.

Validating it against our PCML XSD schema is simple, and should pass without issue. If this fails for any reason, an exception will be thrown.

validate_xml(pcml_recipe_file, pcml_schema_file)
print("XML for '{}' adheres to schema '{}'".format(pcml_recipe_file, pcml_schema_file))

Output:

XML for './02_PCML_and_PCRR/data/3a_recipe.pcml' adheres to schema './02_PCML_and_PCRR/data/pcml-1.3.4.xsd'

Success! By not raising an exception during parsing, lxml has also checked that the XML is well-formed.
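
The internals of validate_xml are not shown in this post, but if you are curious, the following is a minimal sketch of how such a check might be written directly with lxml. The toy schema, the sample documents and the function names validate_bytes and check are all made up for illustration; this is not the code in dm.data.xml.

```python
from lxml import etree

# Toy schema and documents for illustration only (not the real PCML XSD).
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="recipe">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="synopsis" type="xs:string"/>
      </xs:sequence>
      <xs:attribute name="version" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

GOOD = b'<recipe version="1"><synopsis>Example</synopsis></recipe>'
INVALID = b'<recipe><synopsis>Example</synopsis></recipe>'  # no version attribute
MALFORMED = b'<recipe version="1"><synopsis>Example</recipe>'  # unclosed tag


def validate_bytes(xml_bytes, xsd_bytes):
    """Parse the XML, then raise DocumentInvalid if it breaks the schema."""
    schema = etree.XMLSchema(etree.fromstring(xsd_bytes))
    document = etree.fromstring(xml_bytes)  # XMLSyntaxError if not well-formed
    schema.assertValid(document)


def check(xml_bytes, xsd_bytes):
    """Return 'valid', 'invalid' or 'malformed' instead of raising."""
    try:
        validate_bytes(xml_bytes, xsd_bytes)
        return "valid"
    except etree.XMLSyntaxError:
        return "malformed"
    except etree.DocumentInvalid:
        return "invalid"


for label, doc in [("good", GOOD), ("invalid", INVALID), ("malformed", MALFORMED)]:
    print(label, "->", check(doc, XSD))
```

Note the two distinct failure modes: a document that cannot be parsed at all is not well-formed, whereas a document that parses but breaks the schema's rules is well-formed but invalid.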

Note that we have modified the <xs:import> tags in our PCML and PCRR schemas to refer to the XSDs located in the data directory, as opposed to the standard w3.org address, as this tends to time out during the validation calls. If you want the original versions, see https://schemas.deepmatter.tech/pcml-1.3.4.xsd and https://schemas.deepmatter.tech/schemas/pcrr-0.0.1.xsd. Regardless, our validation passed. Now let’s do the same with our 3 PCRR files, which represent what was observed during 3 repetitions of the experiment encoded in the PCML.

Validating PCRR

The PCRR of an experiment is a single record embodying the entire reaction process before, during and after execution in the laboratory. The PCML of the original recipe is embedded in the PCRR too, but now has context on its execution: timestamped observations of how the recipe progressed, responses to questions embedded in the recipe (“the recipe requires 3.2 mg of x, how much did you really add?”), outcomes such as yield and purity and, most critically, the sensor data captured by all the hardware devices which were active.

The following is an example of some snippets contained within a PCRR file. The files themselves are very large due to the verbose nature of XML, but this is needed to demonstrate the adherence to open standards on security and validity. In future, we will demonstrate how the same data can be represented more succinctly in JSON for analysis, but for now let’s continue proving that we meet our own standards.

<pcrr xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="https://schemas.deepmatter.tech/schemas/pcrr-0.0.1.xsd" id="ac110006-68ef-1127-8168-ffd03c84007d">
  <status>completed</status>
  <start_time>2019-02-18T08:54:24.000+01:00</start_time>
  <end_time>2019-02-19T10:04:59.000+01:00</end_time>
  <user user_id="ac110006-6775-158b-8167-976bb52800b3" username="16">
    <first_name>Deepmatter</first_name>
    <last_name>Deepmatter</last_name>
    <email>q@deepmatter.io</email>
    <skill_level>Not Available</skill_level>
  </user>
  <telemetry>
    <dc id="aa1f6aaa-79ca-4971-b0b2-3f84e0b6e826">
      <name>gp200067</name>
    </dc>
    <fume_hood id="ac110006-68a4-10a3-8168-b866266400aa">
      <name>Fumehood1</name>
    </fume_hood>
    <lab id="ac110006-68a4-10a3-8168-b865f6e100a9">
      <name>Deepmatter</name>
    </lab>
    <location id="ac128c24-5f96-19ed-815f-9692a390000d">
      <name>Glasgow</name>
      <latitude>55.873543</latitude>
      <longitude>-4.289058</longitude>
    </location>
    <organisation id="ac128c24-5f96-19ed-815f-9692a38f000c">
      <name>CRG</name>
    </organisation>
  </telemetry>
...
            <sensor_data name="irObjTempI">
              <sensor_data_record>
                <timestamp>2019-02-18T09:56:42.056+01:00</timestamp>
                <value>16.71792</value>
              </sensor_data_record>
              <sensor_data_record>
                <timestamp>2019-02-18T09:56:42.933+01:00</timestamp>
                <value>16.706354</value>
              </sensor_data_record>
              <sensor_data_record>
                <timestamp>2019-02-18T09:56:43.859+01:00</timestamp>
                <value>16.706354</value>
              </sensor_data_record>
              <sensor_data_record>
                <timestamp>2019-02-18T09:56:44.737+01:00</timestamp>
                <value>16.694818</value>
              </sensor_data_record>
    <operation_result operation_id="cf0951d9-774b-4c83-9e26-4f80c649e89a">
...
      <recipe_run_progress>
        <completed>false</completed>
        <percent_completed>95.2381</percent_completed>
        <start_date_time>2019-02-19T10:04:55.000+01:00</start_date_time>
        <end_date_time>2019-02-19T10:04:55.000+01:00</end_date_time>
        <sensors/>
      </recipe_run_progress>
      <interrogation_results/>
      <notes/>
    </operation_result>
...
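
As a small taste of what can be done with these records, the sketch below parses the four irObjTempI readings shown above into timestamped values using only the standard library. The closing </sensor_data> tag has been added by hand so that the fragment parses on its own, and the function name read_sensor is our own invention for this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# The four readings from the snippet above, closed off by hand.
FRAGMENT = """<sensor_data name="irObjTempI">
  <sensor_data_record>
    <timestamp>2019-02-18T09:56:42.056+01:00</timestamp>
    <value>16.71792</value>
  </sensor_data_record>
  <sensor_data_record>
    <timestamp>2019-02-18T09:56:42.933+01:00</timestamp>
    <value>16.706354</value>
  </sensor_data_record>
  <sensor_data_record>
    <timestamp>2019-02-18T09:56:43.859+01:00</timestamp>
    <value>16.706354</value>
  </sensor_data_record>
  <sensor_data_record>
    <timestamp>2019-02-18T09:56:44.737+01:00</timestamp>
    <value>16.694818</value>
  </sensor_data_record>
</sensor_data>"""


def read_sensor(xml_text):
    """Return the sensor name and a list of (datetime, float) readings."""
    root = ET.fromstring(xml_text)
    readings = []
    for record in root.iter("sensor_data_record"):
        timestamp = datetime.fromisoformat(record.findtext("timestamp"))
        readings.append((timestamp, float(record.findtext("value"))))
    return root.get("name"), readings


name, readings = read_sensor(FRAGMENT)
values = [value for _, value in readings]
mean_value = sum(values) / len(values)
span_seconds = (readings[-1][0] - readings[0][0]).total_seconds()

print(name, len(readings), mean_value, span_seconds)
```

The timestamps carry their UTC offset, so datetime.fromisoformat returns timezone-aware values that can be compared safely across laboratories in different time zones.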

Before we take a brief look at the recorded data (more on that to come in later blogs), let’s repeat the validation and verification process we did with PCML.

pcrr_schema_file = './02_PCML_and_PCRR/data/pcrr-0.0.1.xsd'
pcrr_recipe_files = ['./02_PCML_and_PCRR/data/3a_run_01.pcrr',
                     './02_PCML_and_PCRR/data/3a_run_02.pcrr',
                     './02_PCML_and_PCRR/data/3a_run_03.pcrr']

for rr in pcrr_recipe_files:
    validate_xml(rr, pcrr_schema_file)
    print("XML for '{}' adheres to schema '{}'".format(rr, pcrr_schema_file))

Output:

XML for './02_PCML_and_PCRR/data/3a_run_01.pcrr' adheres to schema './02_PCML_and_PCRR/data/pcrr-0.0.1.xsd'
XML for './02_PCML_and_PCRR/data/3a_run_02.pcrr' adheres to schema './02_PCML_and_PCRR/data/pcrr-0.0.1.xsd'
XML for './02_PCML_and_PCRR/data/3a_run_03.pcrr' adheres to schema './02_PCML_and_PCRR/data/pcrr-0.0.1.xsd'

Now that we are happy the PCML and PCRR adhere to their respective XML schemas, let’s confirm that the data represents a complete and syntactically accurate record of the chemistry, and that it has not been interfered with by a third party after creation.

Verifying X.509 Signatures

Given that chemistry data tends to be more than a little sensitive, we recognize that you will want to make sure that the XML you received is exactly the same as what was recorded to the cloud. That’s why we support signing of PCML and PCRR using X.509 certificates, the same open standard which your web browser uses for site security. We’ve started building X.509 security into PCML and PCRR at the point of the REST call, but for this post we are just using a self-signed certificate which has been created for the purposes of demonstrating the process.

We’re building X.509 support into DigitalGlassware for the following reasons:

  1. To prove the integrity of the record generated by DigitalGlassware.
  2. To prove that the recipe run was generated by a certain person and organization.
  3. To allow future co-signing by peers when reviewing the recipe (PCML) and run (PCRR) in industrial and academic settings.
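
To illustrate the first point, here is a deliberately simplified sketch of why a signed record reveals tampering. XML signatures rely on a digest of the signed content, so changing even one character produces a different digest and the check fails. The sketch below shows only that digest comparison using Python's hashlib; it does not canonicalise the XML or involve any certificate or private key, so it is not a substitute for real signature verification.

```python
import hashlib


def digest(xml_bytes):
    """SHA-256 digest of the raw bytes (no XML canonicalisation here)."""
    return hashlib.sha256(xml_bytes).hexdigest()


original = b"<value>16.71792</value>"
tampered = b"<value>16.71793</value>"  # last digit changed

# Imagine this digest was stored (and signed) when the record was created.
recorded_digest = digest(original)

original_ok = digest(original) == recorded_digest
tampered_ok = digest(tampered) == recorded_digest

print("original matches:", original_ok)
print("tampered matches:", tampered_ok)
```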

For verifying the signatures in the XML, we’ll be using signxml. Just as with the XML schema validation there is a verification function wrapped up in our dm.data.xml module. In addition to that, let’s define another bespoke function which provides some more verbose output.

from dm.data.xml import verify_x509_signature
from signxml import InvalidSignature

def verify_sig(xml_file, cer_file):
    # Verify the signature and report the result in plain language
    try:
        verify_x509_signature(xml_file, cer_file)
        print("X.509 signature of '{}' is valid as generated by DigitalGlassware".format(xml_file))
    except InvalidSignature as e:
        raise ValueError("X.509 signature for '{}' is invalid -- it may not have been generated by DigitalGlassware, or has been modified after signing.".format(xml_file)) from e
    except Exception as e:
        raise Exception("An error occurred whilst verifying X.509 signature of '{}'".format(xml_file)) from e

Now verify the X.509 signature embedded in the PCML.

pcml_cer_path = "./02_PCML_and_PCRR/data/x509/pcml.crt"
verify_sig(pcml_recipe_file, pcml_cer_path)

Output:

X.509 signature of './02_PCML_and_PCRR/data/3a_recipe.pcml' is valid as generated by DigitalGlassware

Do the same for the 3 PCRR files. WARNING: The PCRR files are very large, and the verification process will happily consume 10GB of memory, so only run the following snippet if you have enough memory free. It will also probably take a minute or two per PCRR to verify.

pcrr_cer_path = "./02_PCML_and_PCRR/data/x509/pcrr.crt"

for rr in pcrr_recipe_files:
    verify_sig(rr, pcrr_cer_path)

Output:

X.509 signature of './02_PCML_and_PCRR/data/3a_run_01.pcrr' is valid as generated by DigitalGlassware
X.509 signature of './02_PCML_and_PCRR/data/3a_run_02.pcrr' is valid as generated by DigitalGlassware
X.509 signature of './02_PCML_and_PCRR/data/3a_run_03.pcrr' is valid as generated by DigitalGlassware

Now that we’re sure we have valid and verified data, let’s demonstrate that the information embedded in the XML can be easily transformed for other purposes.

Transforming PCML

One of the things you may want to do with PCML is transform it from its elegant but verbose form into a more palatable one for reporting or general visualisation. This can be easily achieved by applying an XSLT stylesheet to the raw PCML and rendering the output in a web browser. As a final step, let’s visualise the flow of the recipe as a linear series of steps (this is actually something we use ourselves in DigitalGlassware’s RecipeBuilder module).

from dm.data.xml import transform as pcml_transform

pcml_xslt_flow_file = "./02_PCML_and_PCRR/data/flow.xsl"
pcml_transform(pcml_recipe_file, pcml_xslt_flow_file, outfile="./02_PCML_and_PCRR/out/3a_pcml_flow.html")

This will convert the recipe PCML into an HTML web page located at dm_datascience/02_PCML_and_PCRR/out/3a_pcml_flow.html, which you can open in your favourite browser. It will look like (a longer version of) the following:

Of course in practice you would be welcome to apply your own company’s XSL transforms for reporting and visualization purposes.
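
If you would like to experiment with your own transforms, the following is a minimal sketch of applying a stylesheet with lxml directly. The toy recipe, its <step> elements, the stylesheet and the function name to_html are all invented for this example and bear no relation to the contents of flow.xsl or to the transform function in dm.data.xml.

```python
from lxml import etree

# Toy input document (hypothetical element names, not real PCML).
RECIPE = b"""<recipe>
  <step>Charge the flask</step>
  <step>Stir for 10 minutes</step>
</recipe>"""

# Toy stylesheet: render each <step> as an item in an ordered HTML list.
STYLESHEET = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/recipe">
    <ol>
      <xsl:for-each select="step">
        <li><xsl:value-of select="."/></li>
      </xsl:for-each>
    </ol>
  </xsl:template>
</xsl:stylesheet>"""


def to_html(xml_bytes, xslt_bytes):
    """Apply the stylesheet to the document and return the result as text."""
    transform = etree.XSLT(etree.fromstring(xslt_bytes))
    result = transform(etree.fromstring(xml_bytes))
    return str(result)


html = to_html(RECIPE, STYLESHEET)
print(html)
```

Because the stylesheet is itself just XML, it can be versioned and reviewed alongside your recipes like any other file.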

Conclusion

This has been a brief introduction to the data captured by DigitalGlassware. In the next post we’ll look at how we can use the same data presented in this blog post in a more tractable JSON format. We will use it to perform some multi-reaction analysis and interrogate the data for interesting features/correlations, which could only have been achieved using tedious manual recording without DigitalGlassware. If you would like to know more about DigitalGlassware, you can register for updates, or contact us at enquiries@deepmatter.io.