When Amazon formally rolled out its memory-optimized R3 instance types for AWS last week, the one-time ‘Greatest Bookstore on Earth’ said the idea was to let customers spin up their own data warehouses in the Cloud in minutes. With a full 244 GB of memory on the r3.8xlarge instance type, there should be plenty of room to run an in-memory database such as SAP’s HANA.

But as data warehouse architects will readily point out, a working data warehouse does not simply inflate itself into existence. While HANA’s performance is orders of magnitude greater than that of a traditional relational database engine, existing data can’t just rush into the new order of things like fans breaking down the gates at a rock concert.

Cloud databases
Data has to come from somewhere, and with most enterprise customers today it’s either a traditional physical data warehouse, a ‘data mart’ configuration from the 2000s or a tangle of existing transactional databases. A data warehouse is not a big database, in the same way a factory is not a workbench. Moving data from an existing database to a cloud-based database such as Amazon’s Redshift is not an automated process, though it’s also not particularly difficult.

A typical database migration to Redshift involves setting up an SQL query on-premise that exports data, in a long batch operation, to a new database cluster operating in the AWS cloud. But that assumes the database has just one function in the enterprise, and that the on-premise schema maps onto the in-memory, in-cloud schema without translation.
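For illustration, here is a minimal sketch in Python of one common form that batch operation takes: dump a table to CSV, stage the file in S3, then load it into Redshift with a COPY command. The bucket, table, role and connection details below are placeholders, and a PostgreSQL-compatible source plus the boto3 and psycopg2 libraries are assumed.

```python
# Minimal sketch: export an on-premise table to CSV, stage it in S3,
# then issue a Redshift COPY. All names and credentials are placeholders.
import csv

import boto3      # AWS SDK for Python, used to stage the extract in S3
import psycopg2   # speaks to both the on-premise source and the Redshift cluster

# 1. Export the on-premise table to a local CSV file (the long batch operation).
src = psycopg2.connect("host=onprem-db dbname=erp user=etl password=secret")
with src, src.cursor() as cur, open("orders.csv", "w", newline="") as out:
    cur.execute("SELECT * FROM sales.orders")
    csv.writer(out).writerows(cur)
src.close()

# 2. Stage the extract in S3, where the Redshift cluster can read it.
boto3.client("s3").upload_file(
    "orders.csv", "example-migration-bucket", "landing/orders.csv")

# 3. Load the staged file into Redshift; COPY pulls it from S3.
rs = psycopg2.connect("host=example-cluster.redshift.amazonaws.com port=5439 "
                      "dbname=dw user=etl password=secret")
with rs, rs.cursor() as cur:
    cur.execute("""
        COPY sales.orders
        FROM 's3://example-migration-bucket/landing/orders.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole'
        CSV;
    """)
rs.close()
```

That single-table case is the easy part; the schema translation described above is where the real work lies.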

A data warehouse is not something that can be transplanted like an organ from an on-premise network to a single cloud cluster. The whole point of data warehousing is to enable multiple applications and functions across the business to query data from a single source, as though that source were native to the application. There are intermediate layers that give applications that impression, such as drivers for OLAP and OLTP. It’s hard enough to integrate these existing drivers with HANA on-premise, let alone in the Cloud.

FOCUS approached Amazon AWS with this problem: How should customers with existing data warehouses be expected to migrate those setups from on-premise to the Cloud, either partly or entirely? And if it is to be a hybrid deployment, how should an r3 instance (not to be confused with SAP’s R/3 ERP system) be expected to co-exist with, and integrate into, existing data warehouse architectures? That was last week. As of this week, Amazon’s last word to us is that an answer remains forthcoming.

Landing data
One emerging suggestion from outside Amazon for integrating a data warehouse with a memory-optimized instance comes from database provider MongoDB, and it involves Hadoop. Data warehouse architects suggest using Hadoop as a ‘landing zone’ for unstructured data, which MapReduce jobs can then refine into a more regulated view for incorporation into the data warehouse.

MongoDB’s suggestion is that MapReduce jobs can be scheduled periodically, incorporating new segments of data into the Hadoop landing zone one at a time. “Once the data from MongoDB is available from within Hadoop, and data from other sources are also available,” its suggestion reads, “the larger dataset data can be queried against.”
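As a rough illustration of that incremental pattern, the Python sketch below pulls only the documents added to MongoDB since the previous run and lands them in HDFS. The collection names, paths and checkpoint mechanism are assumptions for illustration; MongoDB’s own recipe relies on the MongoDB Connector for Hadoop and scheduled MapReduce jobs rather than a hand-rolled extract like this.

```python
# Sketch of an incremental landing-zone load: extract the documents added to
# MongoDB since the last run, write them as JSON lines, and drop the file into
# an HDFS directory for downstream Hadoop jobs. Names and paths are hypothetical.
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

from pymongo import MongoClient

CHECKPOINT = Path("last_run.txt")            # remembers where the previous batch ended
HDFS_LANDING_DIR = "/landing/mongo/orders"   # landing zone read by later MapReduce jobs

# Read the timestamp of the last successful run (epoch start on the first run).
since = (datetime.fromisoformat(CHECKPOINT.read_text().strip())
         if CHECKPOINT.exists()
         else datetime(1970, 1, 1, tzinfo=timezone.utc))
now = datetime.now(timezone.utc)

# Extract only the new segment of data from MongoDB.
collection = MongoClient("mongodb://mongo-host:27017")["shop"]["orders"]
batch_file = f"orders_{now:%Y%m%dT%H%M%S}.json"
with open(batch_file, "w") as out:
    for doc in collection.find({"created_at": {"$gte": since, "$lt": now}}):
        doc["_id"] = str(doc["_id"])         # ObjectId is not JSON-serializable
        out.write(json.dumps(doc, default=str) + "\n")

# Land the new segment in HDFS, then advance the checkpoint.
subprocess.run(["hdfs", "dfs", "-put", batch_file, HDFS_LANDING_DIR], check=True)
CHECKPOINT.write_text(now.isoformat())
```

A scheduler such as cron would run a script like this periodically, one new segment per run, ahead of the MapReduce jobs that refine the landed data.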

As attendees of the recent SAP Sapphire Now conference in Orlando learned, even this process may take something closer to months than minutes, and even then may be better suited to prototypes than to production data warehouses. One guest speaker representing auto parts maker Uni-Select described how his team began using Amazon AWS (prior to the release of r3) to roll out HANA at the core of the company’s business intelligence systems, in staged increments, following a plan that is still in progress.

The reason for the careful timing? Uni-Select’s business reporting systems continue to produce as many as 152 BI analytics reports at specific times, and throwing off that timing reduces the reliability of the forecasts built on those reports. For much the same reason, each report should sample data from the same source. But the databases Uni-Select used for warehouse management and business analytics reporting had each grown to 7 terabytes over two years, so using an in-memory database to process data for critical reports became a critical necessity in itself. Uni-Select’s experience teaches us that regardless of how large or memory-optimized your cloud-based in-memory database instance may be, migrating your key data to it must follow a carefully planned agenda, one that requires patience and a calendar, as opposed to a stopwatch.

This article first appeared in FOCUS Issue 36, available online here