Getting started with Toad™ for Apache™ Hadoop®

This guide focuses on how to get started with Apache™ Hadoop® easily and comfortably using Toad™ for Apache™ Hadoop® and Cloudera® QuickStart VM. The QuickStart VM contains an entire Hadoop ecosystem, which makes it a great place to start exploring the world of Hadoop.

By following this guide you will learn how to:

  1. Set up a connection to your Cloudera QuickStart VM ecosystem
  2. Create a connection to your relational database
  3. Transfer data from your relational database to Hive
  4. View your transferred data in Hive Table Viewer
  5. Create and execute queries against data in Hive
  6. Manage files and folders stored on Hadoop Distributed File System (HDFS)

Setting up connection to your Cloudera QuickStart VM ecosystem

Before you can connect to your ecosystem using Toad for Apache Hadoop, you need to configure a few things on your local machine.

Local machine configuration

To configure your local machine, do the following:

  • Download and install Oracle VM VirtualBox® from the VirtualBox website.
  • Download Cloudera QuickStart VM from the Cloudera website (see Supported Solutions to learn which versions are supported).
  • Open Oracle VM VirtualBox and import Cloudera QuickStart VM via File Menu | Import appliance.
  • If you have done everything correctly, the result should look similar to the screenshot below (the VM version may differ).


    Your VirtualBox version must be 5.0 or higher in order to work with Toad for Apache Hadoop.
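
If you prefer to check the version requirement from a script, the following is a minimal sketch. It assumes you have a version string in the form that the `VBoxManage --version` command typically prints (for example `5.0.16r105871`); the function names are hypothetical and are not part of Toad for Apache Hadoop.

```python
import re

# Minimum VirtualBox version stated in this guide.
MINIMUM_VERSION = (5, 0)


def parse_virtualbox_version(text):
    """Extract (major, minor) from a string such as '5.0.16r105871'."""
    match = re.match(r"\s*(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def is_supported(text, minimum=MINIMUM_VERSION):
    """Return True if the version string meets the minimum version."""
    return parse_virtualbox_version(text) >= minimum


print(is_supported("5.0.16r105871"))  # True
print(is_supported("4.3.40r110317"))  # False
```

Comparing tuples of integers avoids the pitfalls of comparing version strings alphabetically (where "10.0" would sort before "5.0").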

Adding the ecosystem

Your local machine is now ready. To connect to your Cloudera QuickStart VM Hadoop ecosystem:

  1. Start Toad for Apache Hadoop.
  2. Click the Ecosystem button on Main Toolbar and choose Add New Ecosystem.

  3. Name your ecosystem and make sure that the QuickStart VM for CDH (VirtualBox) option is selected as the Detection Method. Click Next.

  4. Toad for Apache Hadoop detects virtual machines in Oracle VM VirtualBox. Once your Cloudera QuickStart VM is found, its status will be displayed. Assuming you haven't done any manual configuration yet, the result will look similar to the image below.
    The first step is to forward the necessary ports. Simply click the "Please click here to configure" note and Toad for Apache Hadoop does the job automatically.

    If your Oracle VM VirtualBox contains multiple virtual machines, choose QuickStart VM from the dropdown menu to continue.

  5. Now it is time to start your Cloudera QuickStart VM, if it's not running already. Either click the underlined note, or start the virtual machine manually in VirtualBox.

  6. To connect to your Cloudera QuickStart VM, Toad for Apache Hadoop requires an entry in the hosts file. Click the underlined "Please click here..." note to create it automatically.

  7. Once all requirements are fulfilled, the configuration part is successfully finished. Click Finish to end the wizard.
  8. Toad for Apache Hadoop connects to your Cloudera QuickStart VM ecosystem. A successful connection is indicated by the ecosystem name shown in the Information Panel and by a green tick icon next to the Services icon on Main Toolbar.
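
If the wizard reports a problem, it can help to verify the two prerequisites from steps 4 and 6 yourself. The sketch below is one hedged way to do that and is not how Toad for Apache Hadoop performs its checks. The function names are hypothetical, and the host name `quickstart.cloudera` and any port numbers you test (for example 10000 for HiveServer2) are assumptions based on common CDH defaults; this guide does not list the ports that are forwarded.

```python
import socket


def has_hosts_entry(hosts_text, hostname):
    """Return True if a non-comment line of a hosts file names `hostname`."""
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        # fields[0] is the address; the remaining fields are names and aliases.
        if hostname.lower() in (name.lower() for name in fields[1:]):
            return True
    return False


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, you could read your hosts file into a string, call `has_hosts_entry(text, "quickstart.cloudera")`, and then call `port_is_open("quickstart.cloudera", 10000)` while the virtual machine is running.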

Creating a connection to your relational database

In the real world, you will most likely want to transfer data stored in relational databases to Hive, where it can be used more effectively. Toad for Apache Hadoop makes this simple: you only need to create a connection to your database and execute a transfer.

To create a connection to your relational database:

  1. Open Toad for Apache Hadoop, click the Ecosystem button on Main Toolbar and select Configure Ecosystems.

  2. In the opened dialog, click the Connections button to switch to Connections Configuration. Then click the green plus icon and select Oracle Connection.

  3. In the New Database Connection dialog, fill in all the necessary connection details. Once you're done, click the Test Connection button to make sure your connection works properly. If there are no problems, apply the changes by clicking OK.

  4. Your connection is now saved and ready to use.

Transferring data from your relational database to Hive

With a connection to your relational database ready, it is time to transfer your data to Hive.

  1. In Toad for Apache Hadoop, connect to your Cloudera QuickStart VM ecosystem.
  2. When connected, switch to the Transfer perspective using the Transfer button on Perspectives Toolbar.

  3. Click the New Transfer button on Main Toolbar or click the Create New Transfer button on Transfer Explorer Toolbar.

  4. A new transfer will be created and shown in the window. This is the Transfer Definition window where you define the data source, target and also choose database items for transfer.

  5. First, enter a Job Name to describe your transfer. Note the swap button that instantly swaps data source and target.

  6. Now specify Source and Target. In this case, use the connection to your relational database as the Source and Hive as the Target. If you haven't added a connection to your database yet, you can do this now by clicking the Modify button.

  7. In the Source grid, mark all the items you would like to transfer. They will be shown in the Target grid.
    Items with a green plus icon will be added to the Target.
    Items with a red cross and green plus icon will replace existing items in Target.
    Each item is verified before being transferred. The verification result is indicated by the Status icon.



    Icon    Description
     The item has been successfully verified. It can be transferred entirely.
     The item cannot be transferred entirely. It may contain some properties/data types that are not supported in the target location. It can be transferred, but you should read the warning to see how the item will be changed.
     This item hasn't been successfully verified and cannot be transferred. You are not able to execute the transfer until you fix the problem or remove the item from selection.
  8. Problems found during item verification are shown in the Information section at the bottom. Double-clicking any item will highlight it in both Source and Target grids.

  9. Once you have selected items for transfer and they have been successfully verified, you can execute the transfer by clicking the Execute button on Main Toolbar or Transfer Explorer Toolbar.

  10. Your transfer will now be executed. This may take some time depending on the amount of data being transferred. The transfer progress is shown in Transfer Overview. You can select any item in List of Database Tables to see more information about its progress.

  11. Once the transfer is completed, you can access your data in the Hive perspective.
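
The verification rules from step 7 and the condition in step 9 can be restated as a small model. This is only an illustrative sketch of the behavior described above: the names `Status`, `can_execute` and `blocking_items` are hypothetical and do not correspond to anything in Toad for Apache Hadoop, and the item names are made up.

```python
from enum import Enum


class Status(Enum):
    OK = "verified, can be transferred entirely"
    WARNING = "can be transferred, but will be changed"
    ERROR = "not verified, cannot be transferred"


def blocking_items(selected):
    """Return the names of selected items that prevent execution."""
    return sorted(name for name, status in selected.items()
                  if status is Status.ERROR)


def can_execute(selected):
    """A transfer needs at least one selected item and no failed items."""
    return bool(selected) and not blocking_items(selected)


selection = {
    "EMPLOYEES": Status.OK,        # hypothetical item names
    "ORDERS": Status.WARNING,
    "AUDIT_LOG": Status.ERROR,
}
print(can_execute(selection))      # False
print(blocking_items(selection))   # ['AUDIT_LOG']

# Removing the failed item from the selection unblocks the transfer.
del selection["AUDIT_LOG"]
print(can_execute(selection))      # True
```

Items with a warning do not block the transfer, which matches the description of the second status icon: the item is transferred, but in a changed form.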

 

View your data in Hive Table Viewer

With your data stored in Hive, you can now easily view it:

  1. In Toad for Apache Hadoop, connect to your ecosystem and click the Hive button on Perspectives Toolbar to switch to the Hive perspective.

  2. You can see all database items currently stored in Hive displayed in Hive Explorer (located on the left).

  3. Double-click any table you would like to view. The table will be opened in Table Viewer. There are several tabs, each containing a different kind of information about the table. The most noteworthy is the Data tab, which displays a sample of the data stored in the table.

Create and execute queries in Hive perspective

Now that your data is stored in Hive, you will most likely want to work with it using the Hive Query Language:

  1. In Toad for Apache Hadoop, connect to your ecosystem and switch to the Hive perspective.

  2. Click the Editor button on Main Toolbar to open the HiveQL Editor.

  3. You have now opened an instance of HiveQL Editor. Before you create a query, select the schema/database in which the query will be executed by choosing it from the Schema dropdown menu.

  4. Now you can write your query. Note the Content Assist popups suggesting code parts you might want to use in your query.

  5. Once you're done, click the Execute button on Main Toolbar or click the Run Query button on the HiveQL Editor Toolbar.

  6. Wait until the query is executed successfully and you'll see the results. They will be shown in the Result section.


    Please note that in the world of Hadoop, executing queries takes significantly more time than in relational databases. The amount of time varies greatly, depending on the complexity of your query, the amount of relevant data in Hive, the physical location of the data, etc.

    If your query fails, the error will be shown in the Diagnostics tab.
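
If you are unsure what to type as a first query, a row-limited SELECT keeps execution time down while you explore. The sketch below builds such a statement as a string that you could paste into the HiveQL Editor. The function name and the table name `employees` are hypothetical; `default` is the name of Hive's default database.

```python
import re

# Plain identifiers only, so the generated statement is safe to paste.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def sample_query(database, table, limit=10):
    """Build a HiveQL statement that returns at most `limit` rows."""
    for name in (database, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a plain identifier: {name!r}")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    # Backticks quote identifiers in HiveQL.
    return f"SELECT * FROM `{database}`.`{table}` LIMIT {int(limit)}"


print(sample_query("default", "employees"))
# prints: SELECT * FROM `default`.`employees` LIMIT 10
```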

Manage files and folders in HDFS

Hadoop stores all files on HDFS, which in many ways behaves like a local filesystem. You can use Toad for Apache Hadoop to manage files and folders stored in HDFS. This chapter describes the following basic operations:

Preview files

  1. In Toad for Apache Hadoop, connect to your ecosystem and click the HDFS button on Perspectives Toolbar to switch to the HDFS perspective.

  2. Use HDFS Explorer on the left to navigate between folders and files. Once you get to the file you want to preview, simply double-click it.

    Note: All files are opened in a basic plain-text viewer (notepad-like). Images and more complicated file formats are not currently supported, and opening them will result in an incomprehensible preview (similar to opening such files in notepad).

  3. The selected file is opened in File Viewer. You can set the number of kilobytes of data shown and whether they're displayed from the beginning or the end of the file.
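
The idea behind the File Viewer settings (a fixed number of kilobytes, taken from either end of the file) can be illustrated on a local file. This is a minimal sketch with hypothetical function names. It is not Toad's implementation, and it assumes that one kilobyte means 1024 bytes.

```python
def preview_bytes(path, kilobytes=64, from_end=False):
    """Return up to `kilobytes` KB from the start or the end of a file."""
    if kilobytes < 1:
        raise ValueError("kilobytes must be at least 1")
    size = kilobytes * 1024  # assumption: 1 KB = 1024 bytes
    with open(path, "rb") as handle:
        if from_end:
            handle.seek(0, 2)                    # jump to the end of the file
            length = handle.tell()
            handle.seek(max(0, length - size))   # back up by at most `size`
        return handle.read(size)


def preview_text(path, kilobytes=64, from_end=False):
    """Decode the preview as text; undecodable bytes become U+FFFD."""
    raw = preview_bytes(path, kilobytes, from_end)
    return raw.decode("utf-8", errors="replace")
```

Decoding with `errors="replace"` also shows why binary files such as images produce an unreadable preview: most of their bytes do not form valid text.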

Download files/folders

You can download files/folders from HDFS to your local machine, much as you would from a network drive.

  1. Connect to your ecosystem in Toad for Apache Hadoop and switch to the HDFS perspective.
  2. In HDFS Explorer, select the file/folder you would like to download. Now either click the Download button on Main Toolbar or right-click the selected item and select Download.

  3. In the opened file dialog, navigate to the save location and click Save.

  4. Your file/folder will be downloaded to the target location. This may take a while depending on the file/folder size and your download speed.

Upload files

You can also upload files from your local machine to HDFS.

  1. Connect to your ecosystem in Toad for Apache Hadoop and switch to the HDFS perspective.
  2. In HDFS Explorer, right-click the destination folder and select Upload or click the Upload button on Main Toolbar.

  3. In the File Explorer dialog, select the file(s) you want to upload.

  4. The file(s) will be uploaded to HDFS, into the folder you selected (if you haven't selected one, the file(s) will be uploaded to the root folder, if possible). Depending on the file(s) size and your upload speed, this may take some time.
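
Outside of Toad for Apache Hadoop, the same download and upload operations can be scripted against Hadoop's WebHDFS REST API, provided that WebHDFS is enabled on your cluster. The sketch below only builds the request URLs and sends nothing. The function name is hypothetical, and the host `quickstart.cloudera`, the port 50070 and the file paths are assumptions based on common CDH defaults. This guide does not say which mechanism Toad itself uses.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, hdfs_path, operation, **params):
    """Build a WebHDFS URL for an absolute HDFS path and an operation."""
    if not hdfs_path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": operation, **params})
    # quote() escapes spaces and similar characters but keeps the slashes.
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Download: an HTTP GET request with op=OPEN reads a file.
print(webhdfs_url("quickstart.cloudera", 50070,
                  "/user/cloudera/data.csv", "OPEN"))
# prints: http://quickstart.cloudera:50070/webhdfs/v1/user/cloudera/data.csv?op=OPEN

# Upload: an HTTP PUT request with op=CREATE writes a file.
print(webhdfs_url("quickstart.cloudera", 50070,
                  "/user/cloudera/new file.csv", "CREATE", overwrite="true"))
# prints: http://quickstart.cloudera:50070/webhdfs/v1/user/cloudera/new%20file.csv?op=CREATE&overwrite=true
```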