Forget Hadoop: DARPA is turning to frameworks based on a much older and simpler programming language to put its big data into perspective
28 May 2013 by Penny Jones - DatacenterDynamics
Scientific researchers are no strangers to big data. Some can spend 80 to 90% of their time cleaning, normalizing and collecting data. In most cases they even write their own code to carry out projects at scale. But the commercialization of big data means this skill is increasingly required in the business environment as well.
You may have heard plenty about the open-source framework Hadoop by Apache, which is based on Java and uses the Google MapReduce programming model to handle large data sets. But if you want to deal with big data you should also know about the highly readable Python programming language, developed in the late 1980s. Python is up there with Java and Ruby (developed in the 1990s as a general-purpose programming language) as one of the most widely used programming languages in the world. Add new data analysis tools that can be offered as a service and you remove some of the biggest pain points for big data analysis, according to Continuum Analytics president Peter Wang.
His 15-month-old US-based company is made up of developers with deep knowledge of the NumPy, SciPy, PyTables and Chaco libraries, and they are helping to drive development around open-source projects like Numba and Blaze. But they also make money. They offer big data analytics solutions based on Python that are already in use by investment bankers and hedge funds on Wall Street, back-end processors in Hollywood and soon, in much more advanced forms, by DARPA (the US Defense Advanced Research Projects Agency).
Wang says DARPA’s big data challenges, which it is trying to meet through its XDATA program, are not too different from those seen in more standard business environments. “It is just that they have more of a budget and a willingness to spend on cutting-edge stuff. DARPA has bought analytics tools from vendors for decades now but the problem with big data is not the analysis per se, but getting data pulled together so you can start doing the data analysis in the first place,” Wang says.
Continuum Analytics has provided 14 of its own developers for the DARPA project. They are developing the languages they work with into visualization technologies, looking specifically at scalability, interactivity and extensibility while maintaining a conceptual model for non-programmer end users. This last point is crucial for the success of the project, and is why DARPA has gone with a project that supports the open-source Python language.
“A lot of traditional software development languages like C++ and Java are rooted in computer sciences and have a lot of baggage carried over from decades past. Python was developed 20 years ago initially as a scripting language, but one of the cool things about it is the language is very extensible. Scientists then started using it about 12 years ago to create a library using Python as a scientific language. Python then went on to being scripted for high performance computing purposes. What we are building is the next generation of scientific Python analysis tools, but building it as a product and as a software solution for enterprises,” Wang says.
DARPA currently has about 20 different teams working on its XDATA project, which has been designed to meet the big data challenges of modern-day warfare.
“DARPA has analysts using our tools. They are doing advanced analytics at large scale but most commercial off-the-shelf products cannot handle that level of analytics, which is why they use Python,” Wang says.
DARPA says it wants to provide its users with an open-source software toolkit to allow collaboration among the applied mathematics, computer science and data visualization communities, and this must be accessible to a wide variety of end users.
“If all goes well, it will be a system that allows people to do scalable data analytics and visualization on petabytes of data in an efficient way, with structured, unstructured or semi-structured data. We will provide a coherent view on top of that. It is an extremely difficult challenge to solve right now.
“DARPA’s needs may be much bigger than what has traditionally been seen in the industry but because of Python, and what has been invested into it, the banks, and other major players, all seem to be looking at it for extremely large processing. It is all a numbers game for these people.”
This article first appeared in FOCUS magazine, Issue 29, our big data edition.