Leveraging Open Source Technologies in Analytics Deployments

September 8, 2017



Many organizations are eagerly hiring new data scientists fresh out of college. Many of those millennial data scientists have been educated in software development techniques that move away from reliance on traditional and commercial development platforms, toward adoption of open source technologies. Typically, these individuals arrive with skills in R, Python, or other open-source technologies.

Employers, as well as enterprise software vendors like Statistica, are choosing to support the use of these technologies, rather than forcing the new data scientists (who are scarce and highly valued resources) to adopt commercial tools. People with skills in R, Python, C#, or other languages can integrate their scripts into the Statistica workspace.

This type of framework allows a simple, one-click deployment. Deploying an R script by itself can be complex and difficult, although there are new, high-level technologies that simplify the process. Statistica has chosen to allow integration of the script directly into a workflow. The developer can then deploy that script into the Statistica enterprise platform and leverage its version control, auditing, digital signatures, and the other tools needed to meet a company’s regulatory requirements.
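As a rough illustration of what version control and auditing mean for a deployed script, here is a minimal Python sketch. It is not Statistica's API: the `ScriptRegistry` and `ScriptVersion` names are hypothetical, and a SHA-256 content digest stands in for a real digital signature, which would require a signing key.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ScriptVersion:
    """One immutable, recorded version of a deployed script (hypothetical)."""
    version: int
    digest: str
    author: str


@dataclass
class ScriptRegistry:
    """Hypothetical registry: keeps every version plus an audit trail."""
    versions: List[ScriptVersion] = field(default_factory=list)
    audit_log: List[str] = field(default_factory=list)

    @staticmethod
    def _digest(source: str) -> str:
        # A content hash, used here as a stand-in for a digital signature.
        return hashlib.sha256(source.encode("utf-8")).hexdigest()

    def deploy(self, source: str, author: str) -> ScriptVersion:
        # Record a new version and note who deployed it.
        entry = ScriptVersion(len(self.versions) + 1, self._digest(source), author)
        self.versions.append(entry)
        self.audit_log.append(
            f"deploy v{entry.version} by {author} sha256={entry.digest[:12]}"
        )
        return entry

    def verify(self, source: str, version: int) -> bool:
        # True only if the source matches what was recorded for that version.
        return self.versions[version - 1].digest == self._digest(source)


registry = ScriptRegistry()
r_script = "model <- lm(yield ~ temperature, data = batches)"
first = registry.deploy(r_script, author="analyst")
unchanged = registry.verify(r_script, first.version)
tampered = registry.verify(r_script + "  # edited", first.version)
```

The point of the sketch is only that a deployed script becomes a recorded, checkable artifact: any later change to the source no longer matches the stored version.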

That’s a key advantage: the ability to incorporate an open source script into a regulated environment, with security and version control, without jeopardizing regulatory compliance. This capability is not entirely unique; some other, relatively new platforms can provide it to a degree. But it has been feasible in the Statistica platform for a number of years and is extensively proven in production deployments.

The capability came out of Statistica’s experience in the pharmaceutical industry, one of the most regulated of all commercial environments. Governments require extensive documentation and validation of virtually every business process involved in producing drugs for human consumption. This includes every aspect of data collection and reporting.

We have taken what we learned in this rigorously constrained context and applied it to general-purpose analytics. That body of experience sets Statistica apart from other analytics platforms, as does the way in which scripts are incorporated into the Statistica workspace.

Within a workspace, we can create a script and pull in packages and components from the open source community. This enables Statistica adopters to leverage the collective intelligence of data scientists throughout the world, and to contribute to the development of these open source technologies. This is in character with the open source community, in which developers not only contribute new code but also inspect, test, criticize, and fine-tune each other’s work. Our users are extending the capabilities of Statistica through these collectively developed packages.

The developer creates a node in the workspace that can be parameterized: the developer writes an R script, attaches a set of parameters to the node, and then deploys the node as a reusable template. That template can be used like any other node within the workspace by a non-developer business user, what we also call a “citizen data scientist.”

We can import the data, merge, translate, create variables, etc. If we want to create a subset of data, we can deploy an R model developed specifically for this purpose by a seasoned data scientist, who has in effect added it to a library of gadgets. A business user can drop such a gadget into the workspace, change the parameters, and get the standard output, as well as any downstream data that the script may produce.
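The idea of a parameterized node that a business user reuses without touching the script can be sketched in a few lines of Python. This is only an illustration under assumed names: `ScriptNode`, `subset_rows`, and the sample data are hypothetical and are not part of Statistica.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass
class ScriptNode:
    """Hypothetical stand-in for a parameterized workspace node."""
    name: str
    script: Callable[..., List[Row]]
    defaults: Dict[str, Any] = field(default_factory=dict)

    def run(self, rows: List[Row], **overrides: Any) -> List[Row]:
        # Only parameters the developer exposed may be changed.
        unknown = set(overrides) - set(self.defaults)
        if unknown:
            raise ValueError(f"unknown parameters: {sorted(unknown)}")
        params = {**self.defaults, **overrides}
        return self.script(rows, **params)


def subset_rows(rows: List[Row], column: str, minimum: float) -> List[Row]:
    # The "script" written once by the developer: keep rows at or above a threshold.
    return [row for row in rows if row[column] >= minimum]


# The developer publishes the node as a reusable template with default parameters.
subset_node = ScriptNode("subset", subset_rows, {"column": "value", "minimum": 0})

# A business user changes the parameters only, never the script itself.
data = [
    {"batch": "A", "yield_pct": 91.5},
    {"batch": "B", "yield_pct": 78.0},
    {"batch": "C", "yield_pct": 88.2},
]
result = subset_node.run(data, column="yield_pct", minimum=85.0)
print([row["batch"] for row in result])  # prints ['A', 'C']
```

Rejecting parameters the developer did not expose is one possible design choice here; it keeps the script's behavior within the limits its author intended.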

From a business perspective, committing to open source adoption is attractive:

Because it’s free, the adopter is spared an additional license cost; and
Because it opens up a world of new capabilities. New open-source packages are being developed every day, and some will have quite compelling functionality.

There are, of course, uncertainties in adopting new code from an unregulated developer community. Because Statistica sells into highly regulated markets, we are audited annually to ensure that our code meets the requirements for regulatory compliance. Open-source code does not undergo that kind of audit, and that can introduce certain risks. But the platform enables deployment of the code into a rigorously controlled operational environment, mitigating this risk.

At least as important as the risk management element, the ability to promote the adoption of open-source scripting provides an attractive work environment for the current generation of data scientists. Given the intense competition for recent graduates, this can be a powerful incentive in itself for employers.

Find out more about TIBCO Statistica’s capabilities.