Digital Rocks Portal (Digital Porous Media): connecting data, simulation and community

Digital Rocks Portal (DRP, https://www.digitalrocksportal.org) organizes and preserves imaged datasets and experimental measurements of porous materials in the subsurface, and beyond, with the mission to connect them to simulation and analysis, as well as to educate the research community. We have over 150 projects represented in more than 200 publications, and an active community that reuses the data, most recently in multiple machine learning applications for automating image analysis as well as the prediction of transport. Such automation is crucial for performing formation evaluation tasks in near-real time. We present benchmark datasets that have played a role in recent machine learning prediction successes in the field. We further discuss the vision for further research advances, educational materials, and the growth and sustainability plan of this digital rock physics community resource. In particular, we are in the process of expanding into a broader repository of engineered porous materials, specifically those for energy storage, and the portal will transition to Digital Porous Media (DPM) in the near future.


Introduction
The field of digital rock physics combines imaged porous media (such as rocks and soil) with simulations and other research methods, such as machine learning (ML) and data analysis, to understand transport and deformation in the subsurface. Applications include formation evaluation, management of groundwater resources, carbon sequestration, enhanced oil recovery, and contaminant transport [1]. Typical geological systems are composed of a broad spectrum of porous media, with properties such as permeability varying by orders of magnitude within an individual system. Advances in high-resolution imaging techniques (x-ray tomography, scattered electron and optical microscopy) have provided a wealth of 2D and 3D datasets that reveal the microstructure of rocks, soil or model media on scales ranging from nanometers to centimeters; however, analysis and use of such images in simulation is computationally demanding [2]. Our practical questions nevertheless require upscaling to the field scale. Therefore, an efficient connection between curated digital rocks data and scalable computing and ML methods is necessary for applications [3].
Here we describe the organization of Digital Rocks Portal (DRP), a well-adopted portal for curation and visualization of large imaged datasets of geomaterials. DRP's mission is to organize and preserve imaged datasets and experimental measurements of porous materials in the subsurface, and then connect them to research methods. We report on our efforts in educating the research community by organizing workshops and mini-courses in data preservation, analysis and visualization.
Deep learning methods and convolutional neural networks (CNNs) have many applications in image processing and flow prediction [4]. For this, reliable "ground truth" benchmark datasets are needed because (1) supervised deep learning methods require a large amount of validated data to train models; and (2) the capabilities of the trained classifiers must be assessed quantitatively. Applying learning algorithms to the large volume of data stored within DRP is already yielding scientific advances. We successfully applied deep learning algorithms to predict 3D velocity fields in porous media images [5,6]. The implementation was done in TensorFlow [7] using high performance computing (HPC) resources at Texas Advanced Computing Center (TACC). The work in [5] sampled a wide variety of images from DRP and added a benchmark collection [8]. It has specifically enabled deep learning training with geologic porous media whose heterogeneity (including fractures/cracks) likely surpasses that of any engineered porous materials; should data from energy storage materials become available, the results should immediately transfer. This exemplifies a workflow that can enable future real time permeability estimation based on imaged datasets, as the computational time of a deep learning prediction is in seconds compared to hours of simulation on HPC.
The parallel work at this conference [9] is adapting the algorithm for electrical resistivity prediction.
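The quantitative assessment of a trained model mentioned above can be illustrated with a simple error metric. The following is a minimal sketch, not the evaluation procedure used in [5,6]: it compares a predicted field against a simulated reference using a relative L2 error. The function name, the flattened-list representation and the sample values are assumptions made for this example.

```python
import math


def relative_l2_error(predicted, reference):
    """Relative L2 error between two equally sized sequences of field values."""
    if len(predicted) != len(reference):
        raise ValueError("fields must have the same number of values")
    diff_sq = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    ref_sq = sum(r ** 2 for r in reference)
    if ref_sq == 0.0:
        raise ValueError("reference field is identically zero")
    return math.sqrt(diff_sq / ref_sq)


# Hypothetical values standing in for one velocity component per voxel.
reference = [0.0, 1.0, 2.0, 2.0]
predicted = [0.0, 1.0, 2.0, 1.0]
error = relative_l2_error(predicted, reference)
print(f"relative L2 error: {error:.3f}")
```

A single scalar like this is convenient for comparing models across a benchmark collection, although it hides where in the pore space the prediction deviates.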
Visualization algorithms for cross-sections, volumes and surfaces of 3D data (e.g. the marching cubes algorithm) are well known and available in a number of advanced visualization platforms (e.g. ParaView, MayaVi, ImageJ and Dragonfly, to name a few that are open source or free for non-commercial use). However, the complexity of the pore structure and of the processes captured within poses a challenge in visualizing the data, from both the memory and the computational perspective. That said, while this challenge is meant to reuse data from geosciences and subsurface engineering, all of the educational concepts are applicable to any scientific visualization endeavor: for instance, in the biomedical field, the heart or blood vessels in a human body are often imaged using computed tomography (or simulated based on those images) and have particulate flow (that is, blood flow) within. In addition to the materials from the recent visualization contest reusing DRP datasets [10], we provide additional general input/output tools and Jupyter Notebook examples visualizing a 3D velocity field as well as porous media surfaces in a multi-fluid situation, exemplified in this paper [11].
Transitioning to renewable energy from the hydrocarbon-based sources that currently meet 65% of energy demands will require significant engineering effort on several fronts: (1) mitigating the effects of air pollution with solutions such as subsurface carbon dioxide sequestration; and (2) scaling up renewable energy based on the maturation of advanced energy devices, geothermal, and other solutions. Complex microstructure is a common element for many of these physical problems. We discuss our sustainability plan, which includes (1) a paid membership fee scheme for Digital Rocks Portal, thus ensuring its financial stability, (2) strengthening our visualization and analysis services, and (3) generalizing and broadening the user base of DRP to energy storage materials as the Digital Porous Media (DPM) portal. Geometric approaches developed to enhance digital rock images have excellent crossover potential to other complex material structures [12].
High performance methods to enhance experimental data collection workflows are also broadly generalizable [13].

Portal organization
Launched in September 2015, DRP [14] is an established data repository, trusted by the community. DRP is specialized for porous media, as opposed to platforms such as Mendeley Data [15], Dryad [16] or Energy Data Exchange [17] that store data of any origin. The portal is specifically equipped with tools tailored to organize, visualize, analyze, and publish digital images of porous materials. Its data model, shown in Fig. 1, is the foundation for curating, describing and displaying the datasets. The data model enables imaged data curation workflows, including adding provenance information and metadata used for rendering the images, and organizing the dataset in relation to its corresponding research activity such as observations, experiments, and simulations. Thus, the model is the basis for the dataset representations on the landing pages, which display provenance as material entities (samples) and processes (experiments or simulations) from which the data derive. This results in a tree-like representation that is visible on the landing page of any project, which was evaluated through user studies [18] and facilitates understanding of and access to complex and large datasets [19]. Example projects are shown in Fig. 2 and Fig. 3.
DRP does not assume a specific image format at this point. This is because subsurface porous media heterogeneity requires a variety of imaging modalities, ranging from microscopy to x-ray microtomography to medical computed tomography, with no commonly accepted data formats. As a result, DRP has built-in capabilities for translating and visualizing raw data after import, which distinguishes it from other imaged data repositories.
Fig. 4. Part of the metadata collected for an example dataset [22], including the voxel length that relates the image to the physical size of the sample.
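The tree-like provenance representation described above can be sketched with a small recursive data structure. This is only an illustration of the idea, not the DRP data model of Fig. 1: the class name, the `kind` labels and the example entries are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One entry in a provenance tree (hypothetical structure)."""
    name: str
    kind: str  # assumed labels: "project", "sample", "experiment", "simulation", "dataset"
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it so that calls can be chained."""
        self.children.append(child)
        return child


def render(node, depth=0):
    """Return the tree as a list of indented text lines, parents before children."""
    lines = ["  " * depth + f"{node.kind}: {node.name}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical project: a sample, the process applied to it, and the resulting data.
project = Node("Example project", "project")
sample = project.add(Node("Sphere packing", "sample"))
experiment = sample.add(Node("Velocimetry experiment", "experiment"))
experiment.add(Node("3D velocity field", "dataset"))

tree_lines = render(project)
print("\n".join(tree_lines))
```

Keeping material entities and processes as separate node kinds is what lets a landing page show where each dataset came from.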
Data curation is assisted through the publication pipeline interface that carefully guides users through the process. Metadata collected during this process are essential for full documentation of the dataset: images by themselves contain structural information collected at a certain time, but nothing more. Fig. 4 and Fig. 5 exemplify this information for an example project. Note that only part of the information collected is shown for brevity. Prior to publication and before assigning a digital object identifier (DOI), the datasets are verified for completeness and accuracy by a live curator, again unlike many data portals that allow upload of compressed archives. The current list of curators includes the majority of the authors of this publication, and we are open to volunteers. DOIs are provided through The University of Texas Libraries subscription to DataCite Fabrica [23].
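To illustrate why metadata such as the voxel length matter, the sketch below converts image dimensions in voxels to a physical sample size and computes the expected size of a headerless raw file. It is a minimal example under stated assumptions: the function names, the image dimensions, the voxel length and the one-byte-per-voxel encoding are all hypothetical and are not taken from any DRP dataset.

```python
def physical_size_mm(shape_voxels, voxel_length_um):
    """Physical extent of an image along each axis, in millimeters."""
    if voxel_length_um <= 0:
        raise ValueError("voxel length must be positive")
    return tuple(n * voxel_length_um / 1000.0 for n in shape_voxels)


def raw_size_bytes(shape_voxels, bytes_per_voxel):
    """Expected size of a headerless raw file with the given dimensions."""
    total = bytes_per_voxel
    for n in shape_voxels:
        total *= n
    return total


# Hypothetical image: 1000 x 1000 x 500 voxels, 5 micrometer voxels, 8-bit data.
shape = (1000, 1000, 500)
size_mm = physical_size_mm(shape, voxel_length_um=5.0)
expected_bytes = raw_size_bytes(shape, bytes_per_voxel=1)
print(size_mm, expected_bytes)
```

Comparing the expected byte count with the actual file size is a quick sanity check that the recorded dimensions and data type match the raw file before it is read.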
DRP is implemented within the reliable HPC environment deployed and maintained by Texas Advanced Computing Center, and data storage is supported by The University of Texas System Research Cyberinfrastructure (UTRC) in the storage resource Corral.This infrastructure assures data security and persistence, interoperability with different HPC systems for visualization and analysis (TACC Analysis Portal, [24] currently on Lonestar6 and Stampede2 clusters) and continuing operation of the software stack, and thus it contributes to the portal's sustainability.

Metrics
As of June 2022, DRP has published 154 projects, ranging in size from 10 MB to 728 GB, and lists 200 related publications [25]. DRP accepts international contributions. Querying the country of the institutional affiliation of the first author of the current DRP projects, we find 48% from the U.S., 18.2% from the UK, 9.5% from Australia, and the remainder from China, Belgium, Brazil, Canada, South Korea, Japan, Russia and Switzerland.
Since its launch, we have been keen on promoting DRP and on understanding its impact. We implemented Search Engine Optimization (SEO) strategies such as using schema.org vocabularies and Google-friendly metadata in the repository landing page. Using Google Analytics, we observe an increase in total visits over the years as well as consistent spikes in usage, which coincide with the academic events that we undertake (workshops, presentations, publications). We also added a download count (for individual datasets) to each of our dataset pages: it is available under the link "Usage Information" embedded in each project landing webpage. Through Google Scholar Alerts we follow mentions of and citations to the repository and to individual datasets.
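Embedding schema.org metadata in a landing page is commonly done with a JSON-LD record of type `Dataset`. The sketch below illustrates that general technique and is not DRP's actual implementation; the function name, the selection of fields and all example values (including the DOI) are placeholders.

```python
import json


def dataset_jsonld(name, description, doi, creators, keywords):
    """Build a schema.org Dataset record as a plain dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": f"https://doi.org/{doi}",
        "creator": [{"@type": "Person", "name": c} for c in creators],
        "keywords": list(keywords),
    }


record = dataset_jsonld(
    name="Example sandstone microtomography",
    description="Hypothetical 3D x-ray microtomography image of a sandstone sample.",
    doi="10.0000/example",  # placeholder, not a real DOI
    creators=["Jane Doe"],
    keywords=["porous media", "x-ray microtomography"],
)

# The serialized record would be placed inside a
# <script type="application/ld+json"> element of the landing page.
script_body = json.dumps(record, indent=2)
print(script_body)
```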

Digital Object Identifiers (DOIs) and increasing visibility of data
Journal publications have had DOIs for a long time, but providing them for research data or code is a relatively recent practice, adopted thanks to efforts such as the Findable, Accessible, Interoperable, and Reusable (FAIR) data principles launched in 2016 [26]. We have found that authors have variable or incomplete referencing styles for datasets in papers, whether they are referencing their own data or reusing someone else's. Many, for instance, include only the name of the platform where the data is posted. Note that including the dataset DOI and a proper reference improves findability, as search engines presently do not find data as efficiently as they do papers (if at all). One reason for this is that datasets come in variable formats and are not as easily searchable for information as .pdf or text formats are. We thus next describe strategies that readers can use to help other authors find their data.
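Increasingly, journals request that you publish your paper along with the data that support your findings. For example, the AGU journals follow this FAIR data practice. Curating data for publication demands significant effort on the part of the researchers, who have to organize and describe the data so that others can recreate their results or reuse the data for a different purpose. At the same time, data publications can bring significant rewards in terms of citations and exposure of your work. To help you boost the discoverability, and consequently the reuse, of your data, there are some basic strategies that you can follow.
To enhance discoverability:
• Add all the information (metadata) that DRP asks for, including your ORCID [27] ID (which uniquely identifies you in case you change affiliation or someone else shares your name) if you have it, and who funded your project.
• In your data description, use terms that you expect people will search for when looking for this type of data. Highlight the unique characteristics of your dataset and suggest how your data could be useful to others and for what purposes.
• All these metadata are indexed by search engines such as Google, improving the chances that your data are discovered by broader audiences.
To facilitate data reuse:
• Add documentation in the form of readme files or reports to clearly explain how you obtained and processed the data. This improves the understandability of your publication and consequent data reuse.
• You can include code snippets as "Non-image data" in your datasets, or better yet, publish the code on GitHub and add it as a related publication in DRP.
• If you have them prior to publishing the data, add papers and proceedings that discuss the data as related publications. After your dataset is published you can request that DRP curators add your new publications by sending an email to masha@utexas.edu.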
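As a small illustration of referencing a dataset with its DOI rather than only the name of the hosting platform, the sketch below assembles a reference string following the common creator, year, title, publisher, identifier pattern. The function name and all example values are hypothetical, the DOI is a placeholder, and the style required by a particular journal may differ.

```python
def format_dataset_citation(creators, year, title, publisher, doi):
    """Assemble a dataset reference that ends with a resolvable DOI link."""
    if not creators:
        raise ValueError("at least one creator is required")
    authors = "; ".join(creators)
    return f"{authors} ({year}). {title}. {publisher}. https://doi.org/{doi}"


citation = format_dataset_citation(
    creators=["Doe, J.", "Roe, R."],
    year=2021,
    title="Example sphere packing images",
    publisher="Digital Rocks Portal",
    doi="10.0000/example",  # placeholder, not a real DOI
)
print(citation)
```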

Newsletters, project highlights and educational material
Engaging the research community is necessary if the data are to be reused and thus actively contribute to science and engineering. This can be done through (classical) conference presentations, but also through contests, newsletters and social media activities that promote data sharing and reuse, data cyberinfrastructure, and overall data education through the published materials.
To that end, we launched newsletters in October 2020 and have since published them at the rate of two to three per academic semester. See [25] for links to all newsletters as well as links to various news articles written about DRP. The YouTube playlist on Digital Rocks Portal Visualization [28] contains various visualizations of the datasets, as well as a growing list of interviews we recorded to highlight projects on DRP.
Data reuse contests are another activity whose goal is to stimulate engagement and demonstrate the research potential of these datasets. We organized a successful virtual visualization mini-course, followed by a visualization challenge in early 2021 with three categories (video, static image and 3D printed porous microstructure). This is described in the next section.

Visualization mini course and contest
In October 2020, a porous media visualization competition was organized. It consisted of a mini-course (taught using Jupyter Notebooks), followed by a challenge with three categories and monetary awards to promote data reuse and create visualization templates for porous materials (known for their complexity). The task was to reuse any 3D dataset from DRP to create a static image, video, or 3D printed visualization. Porous media are challenging to visualize due to the complexity of pore/grain/fluid surfaces and interfaces, so the contest materials create a resource for those who wish to learn advanced 3D visualization. Resources and materials were stored in DRP to help the participants. The events were sponsored by Southern Big Data Hub, Object Research Systems, Kitware and Dassault Systèmes. Tables 1 through 3 show the list of projects and the datasets that were reused from DRP.

Benchmark datasets
In order to push the state of the art for machine learning (ML) forward and provide comparisons between ML techniques, it would be very useful to have a large amount of labeled data (the results of expensive full-physics simulations) to build models that work in the complexity of real-world pore

Visualization examples
We created two workflows for 3D visualization of surfaces and vector (velocity) fields and publish them in conjunction with this paper on GitHub [10,11]. The workflows download and use the datasets from the portal. In the examples that follow we relied on the Python-based PyVista visualization software and provide both regular Python code (which is somewhat more robust, as it allows PyVista to open a separate window with the plot) and Jupyter Notebooks. Jupyter Notebooks are an excellent teaching tool and embed the results in a web browser. This makes them versatile, but an extension is needed for the embedded plots to work. The figures below show a glimpse of what is available within the workflows; they can be adjusted to any other data on DRP.
Notably, while these workflows are currently set up to download the data and visualize it, they also work on the TACC Visualization and Analysis portal [24], in particular in the Lonestar6 HPC environment, which can access DRP data storage directly without any download and can analyze large datasets beyond the capability of an individual workstation.
Fig. 6. Subset of the entire DRP velocimetry dataset from Fig. 2, visualizing absolute values of velocities in PyVista. See [11].
Fig. 7. Surface visualization in PyVista of an oil blob (in transparent green) within a Ketton limestone sample (transparent gray). The high resolution x-ray microtomography experimental data [38] was used in the example and the visualization code is available [11].
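A rough idea of what visualizing the absolute values of a velocity field involves is sketched below. This is not the code published in [11]: it is a minimal illustration in which the function names, the flattened component lists and the array shape are assumptions. The plotting function additionally assumes that `numpy` and `pyvista` are installed, and it opens an interactive window when called.

```python
import math


def speed_field(vx, vy, vz):
    """Absolute velocity (speed) per voxel from three flattened component lists."""
    if not (len(vx) == len(vy) == len(vz)):
        raise ValueError("velocity components must have the same length")
    return [math.sqrt(a * a + b * b + c * c) for a, b, c in zip(vx, vy, vz)]


def show_speed_volume(speeds, shape):
    """Volume-render the speed field; requires numpy and pyvista (assumed installed)."""
    import numpy as np
    import pyvista as pv

    volume = np.asarray(speeds, dtype=float).reshape(shape)
    grid = pv.wrap(volume)  # wraps the 3D array as a uniform grid
    plotter = pv.Plotter()
    plotter.add_volume(grid, cmap="viridis")
    plotter.show()


# Hypothetical two-voxel example; real datasets hold millions of voxels.
speeds = speed_field([3.0, 0.0], [4.0, 0.0], [0.0, 2.0])
print(speeds)
# show_speed_volume(speeds, shape=(2, 1, 1))  # uncomment to open a PyVista window
```

For large datasets, visualizing a subset of the volume, as in Fig. 6, keeps memory use manageable on an individual workstation.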

Related software
The data in Digital Rocks Portal can be characterized and visualized using various data analysis software, and the previous section provides one example. Beyond that, software that is open source or free for non-commercial use, such as ImageJ, ParaView, or Dragonfly, is available, as well as commercial options (e.g. MATLAB, PerGeos).
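One of the most basic characterizations of a segmented (binarized) image is its porosity, the fraction of voxels labeled as pore space. The sketch below is a minimal, dependency-free illustration; the function name, the nested-list image and the convention that pore voxels carry the label 0 are assumptions, since labeling conventions vary between datasets.

```python
def porosity(segmented, pore_label=0):
    """Fraction of voxels equal to pore_label in a nested-list 3D image."""
    total = 0
    pores = 0
    for image_slice in segmented:
        for row in image_slice:
            for value in row:
                total += 1
                if value == pore_label:
                    pores += 1
    if total == 0:
        raise ValueError("image contains no voxels")
    return pores / total


# Hypothetical 2 x 2 x 2 segmented image: 0 = pore, 1 = solid (assumed convention).
image = [
    [[0, 1], [1, 1]],
    [[1, 0], [1, 1]],
]
phi = porosity(image)
print(f"porosity: {phi:.2f}")
```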
Transport properties of porous materials, however, require simulation with different types of solvers. If segmented (binarized) images are available, they can be used as direct input to a simulation that produces a scalar or vector field of interest. Table 4 lists some example open source libraries.

Sustainability model
Sustainability is understood as a combination of technical and administrative strategies to ensure that users can rely on ongoing data infrastructure services that promote research innovation. While we support open science and specifically open data, neither is unfortunately free and a business model has to ultimately be developed for sustainability. The portal is on the way to scaling up to a go-to location for benchmark datasets, and as such cannot be maintained in an ad hoc manner; it requires consistent financing to provide the required services.

User fees
We formulated a business plan and began its adoption in 2021. Until that point, users publishing data in DRP were not charged any fees, as initial development (2015-2018) was supported by an NSF grant, and to this day rather limited support is available through the Digital Porous Media Industry Affiliate Program at The University of Texas at Austin. In May and June 2021, DRP initiated a request for comments about establishing a funding model based on a) one-time publication fees or b) a DRP subscription plan, to start in September of 2021. The community discussion was initiated via the DRP newsletter to all users (available on [25]) as well as through personal and email communications to the biggest DRP data providers. Documentation on DRP user agreement changes [47] has complete details of the proposed fee structure. Responses were publicly posted, without personally identifiable information, in the DRP newsletter of July 13, 2021, linked to [25]. Five groups that publish regularly expressed support for the fee structure after the necessity was explained. The pushback from the rest of the researchers came from the fact that they do not have open data fees built into their budgets. The fee structure is in Table 5 and the new user agreement is available on [14].

Expanding the user base
Last but not least, many energy storage and materials engineering samples are porous and require image analysis, simulation and ML techniques that are comparable to those of digital rock physics. Technology transfer and the change of the name to Digital Porous Media are imminent.

Fig. 2. Example DRP project: 3D experimental fluid velocity field within a sphere packing (model granular or soil structure) obtained by particle velocimetry [20]. To our knowledge, this is the only open 3D velocimetry dataset, and it has been the basis for the winning visualization entry in the DRP visualization contest.

Fig. 3. Example DRP project [21] with 3D x-ray microtomography images as well as XRF mineral maps of a partially mineralized vein from a core sample (depth of ~10,400 ft) in the Upper Wolfcamp formation. The sample was studied as part of the effort of the Department of Energy's Center for Nanoscale Controls on Geologic CO₂.

Fig. 1. The DRP data model is the foundation for the database linking web content and files on HPC storage.

Fig. 5. Second part of the metadata, showing successful "ingestion" of two datasets of the same project as shown in Fig. 4 ([22]) that are part of the same experiment (as shown by the image cross-sections). The "Action" button allows download, shows a video of all slices (in 3D), and provides basic image statistics such as a histogram. It also includes a button that connects to an HPC cluster at TACC and opens ParaView for visualization. This software stack is, however, currently being updated.

E3S Web of Conferences 367, 01010 (2023), SCA 2022, https://doi.org/10.1051/e3sconf/202336701010


Table 1. Winners of the static image category.

Table 2. Winners of the video category.

Table 3. Winners of the 3D print category.

Table 4. Example open source simulation software using DRP data.

Table 5. The user fee structure implemented in September 2021.

Table 6. Donation and sponsorship levels.