Amazon DynamoDB Part III: MapReducin’ Logs
Why is the following infographic important? For one reason: to remind us that no single machine is able to process this whole sample of data at once.
Instead, we need to resort to distributed computing. About 20 years ago, we had to roll out custom middleware (like the Message Passing Interface and PVM) and devise custom supercomputers to be able to process this data. In fact, such middleware was mostly aimed at producing data on a large scale rather than at really managing the data itself. At the same time, it required custom infrastructure, specialized hardware, and networking gear, which was more often than not only available from select vendors, like Cray and IBM, thus making it impractical for general business usage.
So far in the AWS DynamoDB series we've been through the earlier installments; this part covers MapReducing logs.
On the other hand, about 10 years ago, the Internet reached the very same computing scale, but using inexpensive computing power instead. In fact, once you consider that Google started off by using off-the-shelf hardware to augment its processing power, you're likely to question the need for custom gear at all.
The Case for Hadoop
In my opinion the following infographic from Domo captures the current state of Data:
How did things differ? Quite simply, when off-the-shelf hardware came into use, failure switched sides in the whole picture: instead of being part of the problem, it was factored in as a "non-functional requirement" of the underlying solution: accept failure, and offer countermeasures.
Hadoop is one of those answers: it is a platform which embodies the lessons learned about inexpensive large-scale computing, encompassing the ideas from the original Google papers on MapReduce and the Google File System. Open source, virtualization, and Cloud Computing were then brought in, and this context created another case: Cloud-based Big Data processing.
Elastic MapReduce and DynamoDB
From a user standpoint, Elastic MapReduce is just another AWS Service. As such, you can have an API Endpoint where you can issue calls, as well as a Console.
In Elastic MapReduce, you are able to define Job Flows, where you can define your Hadoop Cluster Topology, which consists of the EC2 Type and Number of Machines in each Instance Group. There are three instance groups:
- Master Node, a single machine which runs the Hadoop Job Tracker and Name Node
- Core Nodes, which keep the actual HDFS Data Nodes and Task Trackers
- Task Nodes, which only run Task Trackers
From those three, Master Node keeps the HDFS Metadata via the Name Node. Core Nodes keep both HDFS Blocks (like the contents of inodes in a regular file system) as well as run the actual Map and Reduce Tasks. Task Nodes only run Task Trackers (like the Core Nodes), so they are only able to run the Tasks, but not keep the Data.
You can resize a job flow dynamically (adding / reducing the number of Core and Task Nodes), and even leverage savings by using Spot Instances. For more information, take a look at the EMR Developer Guide – This one is constantly updated, and I highly recommend it.
As it is, it may look like a specialized platform with no real applications to start from. However, as with any platform, there's a fledgling ecosystem of tools and utilities. Elastic MapReduce leverages that by combining regular Hadoop tools, like Pig and Hive, with extra support for AWS features, like DynamoDB and S3.
Hive and AWS
Hive is one of the most famous Hadoop tools. It offers users an interface to Hadoop data modeled after SQL, which makes it not only a data analysis platform but also a way to carry out ETL (Extract-Transform-Load) tasks more easily.
The AWS version of Hive for Elastic MapReduce also allows you to use DynamoDB and S3 out of the box. Being Java based, Hive also understands that you're likely to use it combined with other tools, and as such it offers integration with JDBC-based applications, so you can interact with tools like SQuirreL SQL out of the box.
Okay, enough talk. First of all, make sure you have an EC2 SSH key in the region you'd like to use. Then, open your AWS Console and go to the Elastic MapReduce section. Simply click on "Create New Job Flow", pick a name for your Job Flow, select "Hive Program" from the list, and then click Next:
In Elastic MapReduce, you can declare a full processing pipeline, specifying all the steps involved. In this case, however, we'll ask for an Interactive Hive Session, so select it and click "Continue":
Now you need to declare your Cluster. Here are the settings I’ve picked.
You can build a smaller job flow, but make sure you have at least one other machine in your other instance groups. Otherwise, Elastic MapReduce will run the master in the so-called "pseudo-distributed mode", where all the daemons run on the master node, leaving very little room for memory-intensive tasks – it also effectively disables your rescaling options.
Then, select your EC2 Key Pair and click “Continue”:
Click on “Proceed with no Bootstrap Actions” then “Continue”. When reviewing, make sure you have set “Interactive Hive Section” and that your EC2 Key Pair is correctly set. If everything is okay, click on “Create Job Flow”:
Meanwhile, resize your DynamoDB tables. For the log_total table, I kept doubling its Read Throughput until I hit 64 Read Units. Remember the DynamoDB throughput resize constraints:
- You can increase or decrease Read and Write Throughput together or individually – even decreasing one while increasing the other
- Each increase can be at most double the currently provisioned value
- You can only decrease a given throughput once per day
Once you've set the throughput you want, switch back to the Elastic MapReduce section of the Console. Wait until the job flow you've created reaches the "RUNNING" state. Then, log into the Master Node via SSH.
Once in SSH, run Hive:
Hive might not be present right away – wait a while, as it may still be being installed and configured locally. Once you're in Hive, it acts like an SQL command shell.
Whatever Happens in Hive, Stays in Hive
Hive looks like a relational SQL database. Internally, though, it is an ETL engine, in the sense that you can export and import records and issue queries, but not do things like deleting or updating individual records.
So we’ll start by declaring the existing DynamoDB Table:
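A declaration along these lines should do it – the table and column names below are illustrative (adjust them to the schema you used in the earlier parts of this series), while the storage handler class and the TBLPROPERTIES keys come from the EMR version of Hive:

```sql
-- Map an existing DynamoDB table into Hive.
-- Table and column names are illustrative; adjust to your own schema.
CREATE EXTERNAL TABLE log_total (
  url STRING,
  total BIGINT
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "log_total",
  "dynamodb.column.mapping" = "url:url,total:total"
);
```

Note that since this is an EXTERNAL table, dropping it in Hive only removes the declaration; the DynamoDB table itself is left untouched.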
Once you do that, you can access your DynamoDB Tables like they were just SQL Tables:
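For instance, assuming a mapped table with hypothetical url and total columns, a top-ten query reads like plain SQL:

```sql
-- Plain SQL-style query over the DynamoDB-backed table.
-- Column names here are hypothetical.
SELECT url, total
FROM log_total
ORDER BY total DESC
LIMIT 10;
```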
How does this integration work?
The secret sauce in AWS' version of Hive is an extra module called "hive-bigbird-handler.jar", which declares the DynamoDB Storage Handler mentioned above. Basically, it offers an interface to DynamoDB based on scanning all the records, mapping Hive columns to DynamoDB attributes.
While it resorts to scanning, it does have built-in logic for managing the available read/write throughput units, along with the needed retry/backoff logic. By default, it consumes only 50% of your Read Units. If you want all of your provisioned capacity available to Hive, you need to set the dynamodb.throughput.read.percent property to a higher factor (the default is 0.5):
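Raising it is a single SET statement inside the Hive session:

```sql
-- Allow Hive to consume 100% of the table's provisioned read units.
SET dynamodb.throughput.read.percent = 1.0;
```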
Originally, the first run took around 470 seconds (7:50 minutes). Setting the property to 1.0 meant it used all the available read units and finished the scan in a shorter time – 287 seconds (4:47 minutes).
Hive is also able to talk to S3. So, if you want to export into an S3 bucket, all you need to do is tell it to run an INSERT OVERWRITE:
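A sketch of such an export – the bucket and path below are placeholders, and the column names match the illustrative schema used earlier:

```sql
-- Export the query results into an S3 location (placeholder path).
INSERT OVERWRITE DIRECTORY 's3://your-bucket/exports/log_total/'
SELECT url, total FROM log_total;
```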
By default, Hive records are CSV-like, except that Control-A is used as the field delimiter. You can also declare tables using S3 sources – this article in particular gives you lots of tips on how to integrate both. If you have large relational datasets to bring into Hive, you can also combine it with tools like Sqoop to make it easier to integrate Hive with your relational databases.
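Declaring an S3-backed table is just a matter of pointing LOCATION at a bucket – here with Control-A ('\001') spelled out explicitly as the delimiter; the table name and path are placeholders:

```sql
-- Table backed by delimited files in S3 (placeholder location).
CREATE EXTERNAL TABLE log_export (
  url STRING,
  total BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
LOCATION 's3://your-bucket/exports/log_total/';
```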
Unfortunately, Amazon hasn't released the sources for hive-bigbird-handler.jar, which is a shame considering its usefulness. Of particular note, it seems to also include built-in support for Hadoop's Input and Output formats, so one could write plain MapReduce jobs that read from and write directly into DynamoDB.
Another thing to note is that Hive keeps its metadata outside Hadoop. In fact, it needs a connection to another JDBC database to keep its so-called Metastore, where it stores data about each "table" it knows and how to access it. By default, in Elastic MapReduce, the Hive Metastore is kept in an internal MySQL database which is created at start-up. If you have on-and-off Hive sessions to deal with, it might come in handy to point your Hive configuration at an external MySQL database, so you could launch new job flows without re-declaring all those tables – or, better yet, share them across multiple job flows.
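Pointing the Metastore at an external database boils down to overriding a handful of standard Hive connection properties in hive-site.xml – the host, database name, and credentials below are placeholders:

```xml
<!-- hive-site.xml: external Metastore (placeholder host and credentials) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>secret</value>
</property>
```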
All things considered, DynamoDB is one of the most attractive services in AWS today. Combined with Hive, it makes you think about your data as part of a whole scenario, including:
- Online Data Services, like Session Management for Web Applications
- Exporting/Importing with S3
- Integrating with your Legacy/Relational Database Systems via Sqoop
- Combining with HDFS and HBase for faster OLAP/analytical workloads
- Combining the newer S3 Integration with Glacier for Auditing / Compliance Requirements
It is also worth having a look at the Articles Section for Elastic MapReduce on the AWS Website. In particular, tools for dealing with Logs (especially CloudFront) are discussed, as well as quick start guides for other Elastic MapReduce features.
DynamoDB is great, and it quickly became my favourite database for my latest designs. I hope I have convinced you why it is worth the reputation it quickly gained.
About the Author
Aldrin Leal, Cloud Architect and Partner at ingenieux. Aldrin works as an architect and QA consultant, especially for Cloud and Big Data cases. Besides his share of years in the trenches on projects ranging across the telecom, aerospace, government, and mining segments, he is also passionate about meeting new paradigms and figuring out ways to bring them into new and existing endeavours.