Data access and data sharing are critical to businesses today. Increasingly, automated data pipelines are playing a key role. They help address common challenges and ensure the growing legions of data consumers within an organization get reliable, easy access to the vast volumes of existing and newly generated data. But what does the future hold? One way to get a peek into the future is to look at what the national labs are doing.
Why? For more than two decades, nearly every advance related to computing and large datasets was driven by the major academic computing centers and national labs around the world. Everything that is now mainstream in businesses, including multi-core processors, clustering, distributed computing, high-performance distributed file systems, high-speed interconnects, and internetworking technologies, was proven out in those organizations.
These facilities constantly develop new technologies because their needs often exceed the capabilities of even the most leading-edge technologies of the day. They are always in need of something new to meet their expanding requirements.
The nature of the work being done at these facilities justifies investments in new technology. A quick look at the most recent Top500 list of supercomputers finds that four of the top ten systems in the world are in U.S. Department of Energy labs. These systems are used for energy research, nuclear stockpile stewardship, and more. All these areas are vital to U.S. interests. Similar patterns are found internationally, especially in China and the European Union.
An important factor that allows the large labs and academic research centers to develop and deploy advanced technologies is the available technical staff and their collective expertise. Most businesses do not have the time, and often not the skills, needed to try out a new technology, nor the resources to devote to getting something new working and optimized. In contrast, the labs and especially the academic computing centers have many technically skilled staffers who have the time to try out new things and work out the bugs.
Given that track record in other technical areas, it is worth looking at an initiative that seeks to take a new approach to data access and sharing.
See also: What’s Changing Faster? Data Pipeline Tech or the Role of the Data Scientist?
A look at data sharing in the national labs
One effort, the National Science Data Fabric (NSDF) Initiative, explores many of the issues businesses are dealing with regarding data access and data sharing, only on a much larger scale than most businesses encounter today.
A pilot project within the initiative aims to “democratize data-driven scientific discovery across an open network of institutions via data delivery, shared storage, computing, and more.” The group recently announced that it currently houses nearly 70 repositories ranging from geosciences databases to NASA imagery datasets. Specifically, the group is working with a multi-federation catalog containing more than 1.5 billion records from 68 community repositories, representing over 75 petabytes of data.
Managing that data, and access to it, requires a robust infrastructure and software that makes it easier for scientists to use the data. As such, an important concept that NSDF has been focused on is ensuring that the repositories are set up to be findable, accessible, interoperable, and reusable (FAIR). You can see how developments and any new technologies that come out of the project would be of interest to data-driven businesses. The ideas embraced by the FAIR concept are at the heart of most business efforts that use automated data pipelines.
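To make the FAIR idea more concrete, the minimal sketch below shows what a dataset record in a shared catalog might look like. The field names and the simple readiness check are illustrative assumptions, not the NSDF catalog's actual schema.

```python
# A minimal, hypothetical sketch of a FAIR-style dataset record.
# Field names are illustrative, not an actual NSDF catalog schema.
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    identifier: str                 # findable: a persistent, globally unique ID
    title: str
    access_url: str                 # accessible: retrievable via a standard protocol
    media_type: str                 # interoperable: a well-known, open format
    license: str                    # reusable: clear terms of reuse
    keywords: list = field(default_factory=list)

def is_fair_ready(record: DatasetRecord) -> bool:
    """Rough check that the minimum FAIR metadata is present."""
    return all([record.identifier, record.access_url,
                record.media_type, record.license])

# Example: a record an automated pipeline could register in a shared catalog.
example = DatasetRecord(
    identifier="doi:10.9999/example-dataset",   # placeholder identifier
    title="Example geosciences imagery collection",
    access_url="https://repository.example.org/datasets/example",
    media_type="application/x-netcdf",
    license="CC-BY-4.0",
    keywords=["geosciences", "imagery"],
)

print(is_fair_ready(example))  # True when the core FAIR fields are filled in
```

The same pattern scales down to a business context: a pipeline that registers every dataset it produces with this kind of metadata makes those datasets findable and reusable by other teams.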
Some projects within the purview of the larger effort focus on issues with newer technologies. For example, there is a project called FARR (FAIR in ML, AI Readiness, and AI Reproducibility). It is focused on promoting better practices for AI, improving efficiency and reproducibility, and exploring gaps for data-centric AI.
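One practical facet of AI reproducibility is pinning down exactly which data a model was trained on. The sketch below is a hypothetical illustration of that idea, not anything published by FARR: it fingerprints a training dataset and records the digest alongside the run's parameters so the same inputs can be verified later.

```python
# Hypothetical illustration of data-centric reproducibility:
# fingerprint the training data so a run can be re-verified later.
import hashlib
import json
from pathlib import Path

def fingerprint_file(path: Path) -> str:
    """Return a SHA-256 digest of a data file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_run(data_path: Path, params: dict, manifest_path: Path) -> None:
    """Write a small manifest tying a training run to its exact inputs."""
    manifest = {
        "dataset": str(data_path),
        "dataset_sha256": fingerprint_file(data_path),
        "parameters": params,
    }
    manifest_path.write_text(json.dumps(manifest, indent=2))

# Example usage (paths and parameters are placeholders):
# record_run(Path("train.csv"), {"learning_rate": 0.01, "epochs": 20},
#            Path("run_manifest.json"))
```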
Again, these are all issues that are starting to emerge in the corporate world. They are perhaps not at the level or scale of what the national labs are dealing with today. However, given the rapid growth of data in enterprises and the widespread embrace of AI, it is easy to see how any developments by the labs might be applicable to businesses in the future.
To summarize, the initiative is building the NSDF and introducing a new approach for integrated data delivery and access to shared storage, networking, and computing that will democratize data-driven discovery. Additionally, the effort aims to “provide such democratized access to large-scale scientific data by developing production-grade scalable solutions to data storage, movement, and processing that can be deployed on commodity hardware, cloud computing, and HPC resources.”
Those points align with the challenges most data-driven businesses aim to overcome using automated data pipelines. So, it is fair to say the work and results of the NSDF initiative may be giving us a glimpse at future technologies and best practices that will help businesses keep up as their data volumes and access demands grow over time.
Salvatore Salamone is a physicist by training who has been writing about science and information technology for more than 30 years. During that time, he has been a senior or executive editor at many industry-leading publications including High Technology, Network World, Byte Magazine, Data Communications, LAN Times, InternetWeek, Bio-IT World, and Lightwave, The Journal of Fiber Optics. He also is the author of three business technology books.