Database Research Group Seminar

2012 Oct 10 at 14:30

DC 1304

MixApart: Decoupled Analytics for Shared Storage Systems

Gokul Soundararajan, NetApp

Data analytics and enterprise applications have very different storage functionality requirements. For this reason, enterprise deployments of data analytics run on a separate storage silo. This generates additional costs and inefficiencies in data management, e.g., whenever data needs to be archived, copied, or migrated across silos. We design MixApart, a scalable data processing framework for shared enterprise storage systems. With MixApart, a single consolidated storage back-end manages enterprise data and services all types of workloads, thus simplifying data management and lowering hardware costs for enterprises. In addition, MixApart enables the local storage performance required by data analytics through an integrated data caching and scheduling solution. We expect that our decoupled, stateless cache design will be most useful for cross-data-center deployments and for transparent and consistent refresh of analytics data upon updates to the underlying enterprise data. We evaluate MixApart using a 100-core Amazon EC2 cluster with micro-benchmarks and production workload traces. Our evaluation shows that MixApart provides i) up to 28% faster performance than the traditional ingest-then-compute workflows used in enterprise IT analytics, and ii) comparable performance to an ideal Hadoop setup without data ingest, at similar cluster sizes.