

Offloading Data Warehouse Objects with Hadoop
Concept and Technical Implementation
With example scenario
September 2016

The Topics

Big Data and more
The quick benefit for the data warehouse with Hadoop
Hadoop technology promises help
Offloading the data warehouse? But how?
Finding the right candidates for offloading
Implementing this architecture
Requirements
Selecting the data in the data warehouse
Moving the data into HDFS
The organization of the data in HDFS
The query part
SQL
Hive
Big Data SQL
The use of Big Data SQL and the auxiliary objects
Split fact tables and the combination of in-memory database and HDFS
Summary of the factors that influence the read performance on HDFS
A star schema scenario
The initial data
Infrastructure used
Overview of the objects used in the scenario and the procedure
Step 1: Sqoop, the offload process
Step 2: The data in HDFS
Step 3: The Hive definition
Step 4: External table in the Oracle database
Step 5: The queries
Experiences, evaluation and outlook
Amount of data and reality
Technical know-how is manageable
Implementation, implementation effort and what really changes

Big Data and more

Today there are numerous application scenarios for big data, even if the term is becoming increasingly blurred and often has to serve for everything that software manufacturers offer. The discussion has nevertheless heightened awareness of data in companies and, even more, of the data in our everyday lives.

In the past, data warehouse applications have tried again and again to at least summarize and consolidate company data and to offer it to users in a form suited to analysis. Big data concepts promise more today: you deal with new data sources outside the company, you deal with new types of data such as communication and movement data, and you apply new principles, e.g. collect first, then see what you can get out of it. Without continuing this list, one thing is already clear: the classic data warehouse will continue to do its job, but with big data it will do even more.

The Quick Benefit for the Data Warehouse with Hadoop

The big data discussion is currently characterized by the search for use cases, new business options, or simply for everything that can be done with the data that surrounds us. This paper is not looking for new use cases; it deals with the potential of Hadoop technology as an extension of the existing data warehouse solutions in companies. The fruits of the new technology are within reach, at least for the data warehouse, as this description will show. Hadoop technology offers at least three interesting benefits for the classic data warehouse:

1. Getting a grip on data volumes. It is in the nature of data warehouse systems that they grow. The more important the role of the data warehouse in the company becomes, the greater the amount of data that accumulates, and with it the cost of storing enormous amounts of data in highly developed relational database systems.

2. High-performance evaluation even with extremely large amounts of data. With large data volumes the performance aspect becomes relevant again despite faster database technology; queries on tables of 100 terabytes are no longer unusual. The cluster approach that Hadoop technology brings with it promises to master this performance challenge with a manageable investment.

3. New types of data: not everything has to go into the data warehouse. Analysis strategies and tactics in companies have changed in recent years. Until now, companies have primarily measured their success using reports and statistics; today they look for new options for action and new business areas wherever they can be found. With the new types of data required for this, such as movement, weather, video or audio data, many of those in charge ask themselves whether this data still belongs in a conventional data warehouse. On the other hand, this data should be analysed and combined with the conventional data, so it ends up in the data warehouse after all and makes the data volume grow even more.

Hadoop Technology Promises Help

Hadoop technology, developed in the middle of the last decade, essentially delivers two decisive features:

1. HDFS (Hadoop Distributed File System), a file system distributed over an arbitrarily expandable computer network (cluster), in which all the accumulated data can be stored without wasting a thought on place and space.

2. MapReduce (or Spark as its successor), an extremely parallelizable processing model that can also handle very large amounts of data.

Hadoop also promises a number of advantages:

Costs: HDFS is content with standard hardware. Very large amounts of data can be kept at low cost, and no licenses have to be paid to manufacturers for Hadoop software from the open-source Apache pool.

Schema on write vs. schema on read: Unlike a database system, which first checks incoming data syntactically and, for example, organizes it in columns (schema on write), HDFS stores the data directly, regardless of any structure or syntactic correctness. In this way, large incoming data streams are absorbed very quickly. However, the subsequent analysis process must make up for the structuring of this data (schema on read).

Offloading the Data Warehouse? But How?

With regard to the data warehouse, the magic word is now offloading, i.e. the growing data of the data warehouse is moved to HDFS. The main goal is to save costs: in data storage and by minimizing the effort involved in the ETL process (the schema-on-read idea). In some talks, especially from the open-source community, one can already hear suggestions according to which the data warehouse should be placed completely in HDFS. With such a sweeping approach, however, one should not lose sight of the sense and purpose of a data warehouse. On the one hand, a data warehouse ensures a certain freedom from redundancy and harmonization of the data through normalization. Further, a data warehouse should support the analysis process by deliberately structuring the data; examples are the hierarchically structured dimensions and their relationships to a fact table in the multidimensional star model.

Finding the Right Candidates for Offloading

So that means: we have to reconcile the different goals of the new technology and sensible warehouse concepts. A first step is the selection of data types that are suitable for offloading.¹

(Figure: output architecture of a data warehouse)

In the integration layer (staging area), data had previously only been kept temporarily for checking and processing purposes, and the incoming data volumes were limited by selection as far as possible in order to curb data growth in the data warehouse. With Hadoop, the function of an integration layer can now be interpreted more openly: data can be left there even if it is not yet known who will later analyse it and how. Data is collected in reserve, and with this the idea of the data lake has become reality. Everything is now collected in this data lake, regardless of structure, source, design and subsequent use. Only parts of it really end up in the data warehouse. An integration layer repurposed as a data lake is also opened up to end users, which was previously unthinkable.

In the enterprise layer (core warehouse layer) we find the bulk of the master and reference data. These reflect the company and business process structures and, in connection with the corresponding key figures, represent a strategic value for the company. Such information belongs in a standardized, structured space that has been cleared of redundancies; the best place for this is a relational database. What remains is the voluminous, granular, transactional movement data, which makes up a very high proportion of the total data volume of the data warehouse. The first candidates for offloading emerge in this pool.

In the user view layer (data marts) with the multidimensional star schemas, the large fact tables are found. In the case of fact tables partitioned by time, however, only the older data is offloaded: the partitions of the most recent reporting and analysis periods would be kept for in-memory storage, and the hierarchically structured dimensions also remain in the relational database.

1 For this consideration, an understanding of a basic data warehouse architecture is assumed.

(Figure: possible target architecture of a data warehouse expanded with Hadoop technology)

Implementing this Architecture

Requirements

At least two requirements must be met when implementing such an architecture:

1. It must be possible to select data warehouse data and move it periodically and performantly into HDFS.

2. It must be possible to query the HDFS data in combination with the database data using SQL. Even more: it should be possible to query so-called hybrid tables with a single SELECT, i.e. tables that, depending on the age of the data, have one part in the database (possibly in-memory) and a second part in HDFS.

(Figure: offloading fact tables)

Selecting the Data in the Data Warehouse

Data in the data warehouse can always be selected using a corresponding SQL query. However, this selection can be made much easier, and standardized, by time-related partitioning.
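As a minimal sketch (the partition name and the date boundaries are hypothetical; table and column names follow the scenario used later in this paper), the data of one monthly partition could be selected either by partition name or via the partition key:

-- Select one monthly partition by name (partition name is hypothetical)
SELECT * FROM f_umsatz PARTITION (p_2014_01);

-- Equivalent selection via the partition key, so that partition pruning applies
SELECT *
FROM   f_umsatz
WHERE  zeit_id >= DATE '2014-01-01'
AND    zeit_id <  DATE '2014-02-01';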

Large fact tables are usually partitioned in an Oracle data warehouse anyway, so from the database point of view this procedure is easily possible.

Moving the Data into HDFS

To move data from the database into HDFS, either the bidirectional Sqoop (open source, running on the HDFS cluster) or the Oracle tool Copy2Hadoop (running on the Oracle database side) is used. Sqoop is a batch utility for exporting and importing data from relational databases. Oracle provides an additional add-in for Sqoop (OraOop) that optimizes Sqoop's access to the Oracle database. With Sqoop you can formulate a query parameter that works analogously to the WHERE clause of a SELECT statement. You can also address complete tables and, which suits the concept discussed here, individual table partitions by their names, and load them into HDFS. Thanks to the OraOop add-in, Sqoop knows the structural information exactly as it is contained in the Oracle database dictionary, and field type conversion is supported precisely. Sqoop creates a MapReduce job on the Hadoop cluster, so the process can be massively parallelized independently of the database. With Sqoop, the target world in HDFS can also be controlled very well: Sqoop supports the most important file types such as Parquet, ORC² and AVRO and of course simple CSV files. The target objects can be compressed as they are written, they can be partitioned, and Hive metadata can be generated automatically.

The sample loading times in the table below come from a scenario with an Exadata Quarter Rack and a Big Data Appliance Starter Rack (6 cluster nodes).

(Table: loading times from Sqoop)

2 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-FileStructure

The Oracle utility Copy2Hadoop creates a Data Pump file with the help of an external table and stores it in HDFS. This Data Pump file can later be read back into the database via an external table. And, not to be forgotten, this file can also be read via Hive³ in HDFS, so there is a certain openness. If an Oracle Exadata machine is used in connection with the Oracle Big Data Appliance, the loading process of both tools can be accelerated significantly by using the InfiniBand protocol between the machines instead of TCP/IP.

3 Hive is discussed in more detail below.

The Organization of the Data in HDFS

The storage of the data in HDFS should not be left to chance, because there are options that can have a significant effect on the later reading of the data. A first option is compression. Compression utilities (Gzip, Snappy) can, for example, be invoked by Sqoop during the write process. Compression factors of 2-4 are easily achieved, which minimizes the I/O effort of later reads; reading becomes faster. High-performance reading is also supported by the file type. The Hadoop scene, too, is moving towards storing data in an organized and structured manner: Parquet, ORC and AVRO files collect meta information about the structure of the data and save it in separate objects. Such files are almost independent databases with index blocks, column structures and metadata references. You should therefore consider storing the offloaded warehouse data as Parquet or AVRO files, for example. The Big Data SQL used by Oracle later on copes with all file types and benefits from the optimizations that these file types bring with them.

The Query Part

SQL

The offloaded data warehouse data is now on a Hadoop system. However, the users continue to work with the data warehouse as before; they will not notice anything of the additional data storage in Hadoop. The reason for this is the use of SQL. SQL is used in many business intelligence and other analysis tools as the query language par excellence. If HDFS data can also be accessed with SQL, all these tools can still be used without restriction or additional learning effort.

Hive

In current Hadoop environments, reading is supported, among other things, by Hive. Hive is a kind of SQL engine in the Hadoop world. The most important property of Hive is the metadata-like structuring of HDFS files; CSV files, for example, then appear as if they were tables of a relational database that can be accessed with HiveQL. Hive also differentiates between external and internal tables; internal tables are organized by Hive in its own file directories. (A small Hive example is sketched below.)
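As a small, hedged Hive sketch (the directory, table name and shortened column list are hypothetical; the scenario later in this paper uses a plain text file instead), offloaded data could also be described as an external Hive table stored as Parquet and queried with HiveQL:

-- External Hive table over an offloaded, Parquet-formatted fact table extract
CREATE EXTERNAL TABLE default.f_umsatz_parquet (
  artikel_id INT,
  kunden_id  INT,
  zeit_id    TIMESTAMP,
  umsatz     INT,
  menge      INT
)
STORED AS PARQUET
LOCATION '/user/sqoop_output/f_umsatz_parquet';

-- Simple HiveQL query; the file appears like a relational table
SELECT count(*) FROM default.f_umsatz_parquet;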

Both external and internal Hive tables are, however, ultimately files in the HDFS directory structure that can be written, read and edited without Hive.

Big Data SQL

Oracle Big Data SQL leverages this metadata feature of Hive. To read the actual data, the smart scan processes already known from Exadata are used: the query request of the SELECT issued in the Oracle database is transferred to a small software package on each individual cluster node of the Hadoop system. Two goals are pursued: reading and interpreting the HDFS data quickly, and reducing the amount of data that flows to the actual database instance and thus to the user. As can be seen in the example below, this results in extremely good performance values with horizontal scaling, because the method uses Hadoop's own parallelization across the computing cores of the Hadoop cluster nodes. C-based read routines and native Hadoop classes are used to interpret the data in the HDFS data blocks; only then does a conversion into Oracle-readable column and field structures take place.

(Figure: the reading process in Hadoop with Big Data SQL)

Oracle has built a number of optimizations into this process, for example a storage index with max/min values per column and block. Of course, this effect only becomes noticeable after the HDFS data has been read once and the index has been built up dynamically. The storage index takes into account query predicates such as:

o Equality (=)
o Inequality (<, != or >)
o Less than or equal (<=)
o Greater than or equal (>=)
o IS NULL
o IS NOT NULL
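For illustration, a minimal sketch (the literal values are hypothetical; the column names follow the scenario below): predicates of this form can be compared against the min/max values kept per column and block, so that blocks which cannot contain matching rows are skipped without being read.

SELECT sum(umsatz)
FROM   f_umsatz_hdfs
WHERE  region_id = 7                    -- equality, usable by the storage index
AND    zeit_id  >= DATE '2015-01-01'    -- greater than or equal, also usable
AND    umsatz IS NOT NULL;              -- IS NOT NULL is covered as well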

A storage index is node-local, i.e. it only ever refers to the data that resides on a single Hadoop cluster node. If data on another node is affected by the query, the index has to be built up there as well. Up to 32 columns can be indexed per HDFS file (Hive table); if a second external table is defined in the database (see below), a further 32 indexed columns are obtained. The storage indexes work for numeric fields in the same way as for character and date fields.

In an offload step, Big Data SQL also brings many of the SQL functions known from the database to the cluster nodes. Evaluating such functions on the Hadoop side shifts the greatest part of the interpretation effort to Hadoop, and if the functions are restrictive, the amount of data to be moved is reduced enormously.

(Figure: excerpt from the list of offloadable functions)

In the database, the degree of parallelism of a query can be influenced in the usual way (parallel hints, degree of parallelism, number of threads). However, this only applies to the database part of a query; the Hadoop part described here follows its own, database-independent rules. Hadoop is a parallelization technology: the Oracle Big Data Appliance in the 6-node starter rack version, for example, already has 264 cores, so that, also taking other parallel activities on the machine into account, a parallelization degree of around 264 is possible. As a result, the smart scan process on the Hadoop machine may be carried out with 264 parallel scan parts, while the database part of the SELECT runs with, say, 24 parallel processes. The parallelization factor on the Hadoop machine is determined internally and can only be influenced by internal, non-public parameters. With this understanding in the background, you can assess the potential performance of a query on HDFS data and at least influence the parallelization on the database side:

o If a query has a high proportion of offloadable parts, the parallelism of the Hadoop system serves you well; parallelization in the database does not help here.

o If a query has few offloadable parts, you have to think about higher parallelization in the database.

Bloom filters are already known from database technology as a means of computing joins. Big Data SQL also uses Bloom filters to compute joins between HDFS files or between an HDFS file and Oracle database tables. Joining one large object with several small ones is the sensible case, and it is ideal for a scenario in which a large fact table of a star schema is moved into HDFS while the highly structured dimensions remain in the database. The process creates so-called Bloom filter objects from the small tables in the database, sends them to the Hadoop cluster nodes and uses them there, in the course of the smart scan processes, to filter the large HDFS files.

(Figure: reducing the amount of data moved between Hadoop and the database through Bloom filters)

The Use of Big Data SQL and the Auxiliary Objects

The users continue to work from the Oracle side. However, the HDFS data now resides on the Hadoop cluster, i.e. on a number of separate machines next to the database server. The Oracle means for accessing data outside the database is the external table. To complete the interaction of the offload solution, an external table definition is required within the Oracle database that points the way to the HDFS name node of the Hadoop system via suitable configuration entries (DEFAULT DIRECTORY DEFAULT_DIR). The easiest way to create this external table definition is the database package dbms_hadoop.create_extddl_for_hive (an example is printed below in the scenario, and a call is sketched at the end of this section). The second component in this interaction is a Hive table documenting the column structure of the respective HDFS files. The easiest way to create this Hive definition is the Sqoop utility: Sqoop has a control parameter that stores a DDL definition in Hive for the database table to be transported. If you want to create an external table in Hive, which is what we want in our offload scenario, you start a dummy Sqoop job with a query option such as "1=2". Sqoop then moves no data but creates a Hive definition, which you manually change to external.
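A minimal sketch of such a call; the cluster name and the Hive and Oracle table names are placeholders, and the parameter names are based on the Oracle Big Data SQL documentation and should be checked against the installed version:

SET SERVEROUTPUT ON
DECLARE
  ddl_text CLOB;
BEGIN
  -- Generate (but do not yet execute) the external table DDL for an
  -- existing Hive table; 'bda_cluster' and the table names are placeholders.
  dbms_hadoop.create_extddl_for_hive(
    cluster_id      => 'bda_cluster',
    db_name         => 'default',
    hive_table_name => 'f_umsatz_hdfs',
    hive_partition  => FALSE,
    table_name      => 'F_UMSATZ_HDFS',
    perform_ddl     => FALSE,
    text_of_ddl     => ddl_text);
  dbms_output.put_line(ddl_text);
END;
/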

Split Fact Tables and the Combination of In-Memory Database and HDFS

At this point a special case should be explained. In practice, you may want to offer different areas of the analysis data to users with different levels of performance. For example, the data of the most recent reporting period (the last day, week, decade, month, quarter ...) should be available to the majority of users with particularly high performance, while queries on older data may take a little longer for a smaller user group. The Oracle database offers partitioning for this: with its help, current data (the most recent partitions) can be kept in memory for queries in the millisecond range, while older data can lie in HDFS. So far this is conceivable, but the requirement of the single SQL statement remains open. Users may want to access in-memory and HDFS data at the same time with a single SELECT statement; they will not want to change their query tools or their query behaviour just because the data warehouse has offloaded part of the data to HDFS. There are certainly several solutions for this. One of them is a clever combination of views that are ultimately coupled with UNION. Clever here means that the behaviour of Big Data SQL, e.g. the offloading of functions in the WHERE clause, is not impaired.⁴ (A simple variant is sketched after the checklist below.)

Summary of the Factors that Influence the Read Performance on HDFS

After the presentation of all performance-relevant points, here is a summary and a small best-practice list for the use of offloading:

o As the HDFS replacement for database tables, a suitable file type such as Parquet, AVRO or ORC should be selected. This worsens the write performance of the Sqoop job, but queries become more efficient.
o The files should be partitioned in HDFS.
o Compressing the data in HDFS means less I/O overhead for later reads.
o Parallel query in the database is only used for the database part of the query; with a large number of offloaded functions it is of little use.
o The storage index only becomes fully useful from the second read onwards.
o Joins between very large objects should be avoided. Joins between one very large object and many small tables are very efficient.
o For queries with a partitioning criterion in the WHERE clause, a comparison between HDFS and the Oracle database can quickly turn out in favour of the Oracle database. If this partitioning criterion is missing, HDFS quickly has the advantage.

4 A solution for this can be obtained from the author.
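As announced above, a minimal sketch of such a hybrid construction; all object names and the date boundary are hypothetical, and this is only an illustration of the idea, not the solution referred to in footnote 4:

-- View that combines the recent, in-memory part of the fact table in the
-- database with the older, offloaded part in HDFS; queries address f_umsatz_all.
CREATE OR REPLACE VIEW f_umsatz_all AS
  SELECT artikel_id, kunden_id, zeit_id, region_id, kanal_id, umsatz, menge
  FROM   f_umsatz                     -- current partitions, possibly in-memory
  WHERE  zeit_id >= DATE '2016-01-01'
  UNION ALL
  SELECT artikel_id, kunden_id, zeit_id, region_id, kanal_id, umsatz, menge
  FROM   f_umsatz_hdfs                -- historical data in HDFS
  WHERE  zeit_id <  DATE '2016-01-01';

Keeping the date restriction in both branches lets partition pruning work on the database side and function offloading on the HDFS side, each in its own branch.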

A Star Schema Scenario

Now we have all the components to play the whole thing through in one scenario. The concept on which the scenario is based is a star schema with a partitioned fact table that is to be offloaded to HDFS completely or in parts (with regard to the historical partitions). For test purposes, this fact table is created in different sizes in order to recognize scaling effects. The dimension tables remain in the Oracle database. The data is transported to HDFS with Sqoop. Afterwards, comparisons with sample queries are possible in both environments, i.e. in the conventional database environment as well as in the Hadoop-supported environment.

The Initial Data

The initial data consists of a star schema with a partitioned fact table with the following settings:

o Partitioned by time (month), data from 2011-2016, i.e. about 60 partitions.
o Hybrid Columnar Compression, QUERY LOW variant. This means that Hybrid Columnar Compression is used, but in such a way that the best possible query performance is achieved. (The fact table contains randomly generated test data; all fields are filled to the defined maximum and have very heterogeneous values. Under these circumstances naturally lower compression rates are achieved: in this scenario only a factor of 2, compared to a factor of 4-10 under real conditions. This also affects the test results, which would be even better under real conditions than they are under these restricted conditions.)
o Bitmap indexes on the foreign key fields of the fact table (a requirement for star query transformation).
o Star query transformation activated.
o The fact table is marked with PARALLEL 12, i.e. it can be assumed that the database-related parts of the queries were carried out with a degree of parallelism of 12.
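A minimal DDL sketch of such a fact table; the column list is shortened and the partition and index names are hypothetical, while the compression, partitioning, bitmap index and parallel settings correspond to the list above:

CREATE TABLE f_umsatz (
  artikel_id  NUMBER,
  kunden_id   NUMBER,
  zeit_id     DATE,
  region_id   NUMBER,
  kanal_id    NUMBER,
  umsatz      NUMBER,
  menge       NUMBER
)
COMPRESS FOR QUERY LOW                                      -- Hybrid Columnar Compression
PARTITION BY RANGE (zeit_id)
  INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))                    -- one partition per month
  (PARTITION p_start VALUES LESS THAN (DATE '2011-01-01'))
PARALLEL 12;

-- Bitmap index on a foreign key column (one per dimension key in the scenario)
CREATE BITMAP INDEX ix_f_umsatz_kunde ON f_umsatz (kunden_id) LOCAL;

-- Enable star query transformation for the session
ALTER SESSION SET star_transformation_enabled = TRUE;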

The size relationships of the fact table can be seen in the following overview. It contains the size information for the data in the database and in HDFS, in compressed and uncompressed form, for the CSV and Parquet file formats. The columns under "factor 3" indicate the absolute volumes after 3-fold mirroring by HDFS.

(Table: proportions of tables and corresponding HDFS files)

Infrastructure Used

An Exadata Quarter Rack and a Big Data Appliance⁵ were used as the test environment. Of course, environments without this hardware would also have been possible, but a Big Data Appliance already comes with a fully installed and configured Hadoop infrastructure, so optimal test conditions can be expected in this environment.⁶ The Cloudera distribution is preinstalled on the Big Data Appliance:

o Apache Hadoop (CDH)
o Cloudera Impala
o Cloudera Search
o HBase
o Accumulo
o Apache Spark
o Apache Kafka
o Cloudera Manager

The machine has 6 cluster nodes, each with:

o 2 x 22-core (2.2 GHz) Intel Xeon E5-2699 v4
o 8 x 32 GB DDR4-2400 memory
o 12 x 8 TB 7,200 RPM high-capacity SAS drives
o 2 x QDR 40 Gb/sec InfiniBand ports
o 4 x 10 Gb Ethernet ports
o 1 x ILOM Ethernet port

The Exadata machine is coupled with the Big Data Appliance via InfiniBand.

5 Fact sheet: http://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdataappliance-datasheet-1883358.pdf
6 Thanks at this point to ISE Information Systems Engineering GmbH in Nuremberg, which has provided its Exa demo center for these tests. Companies can test their requirements in this demo center. www.ise-informatik.de/

Overview of the Objects Used in the Scenario and the Procedure

The following 5 steps were carried out in the scenario, steps 1 and 5 of course several times for the tables of different sizes (100/200/400 and 800 million records) and for the different compression variants and file types.

1. Sqoop job on the fact table with the corresponding parameters.
2. The redistribution of the newly created files with a factor of 3 is done automatically by HDFS.
3. The entry of the metadata in the Hive metastore is done manually after a DDL template has been generated by Sqoop.
4. Creation of an external table in the database for later SQL; the package dbms_hadoop.create_extddl_for_hive helps here.
5. Big Data SQL calls / SQL queries; these are the actual query tests.

(Figure: sequence of the comparison test scenario)

Step 1: Sqoop, the Offload Process

Sqoop is a batch utility that is started from a Linux cluster node. The following example establishes a connection to an Oracle database and the DWH schema. It reads the table F_UMSATZ and transfers it to a text file (default CSV). The OraOop add-in is included. A file in HDFS is represented by a directory; possibly existing data is deleted beforehand (otherwise the file could also be appended to). Sqoop creates a MapReduce job, which is parameterized here with a parallelism of 100, i.e. 100 mappers start. The file is compressed while it is being written to HDFS (parameter -z).

sqoop import -D mapred.job.name='default oraoop' \
  --direct --connect jdbc:oracle:thin:@21.108.5.88:1521:dbm088 \
  --username DWH --password DWH --table F_UMSATZ \
  --as-textfile --delete-target-dir \
  --target-dir /user/sqoop_output/f_umsatz -m 100 -z

This is only a small selection of the possible parameters; all parameters are described in the Sqoop User Guide⁷. The following overview contains the loading times for the different table sizes and Sqoop parallelization degrees, i.e. the number of mappers:

7 https://sqoop.apache.org/docs/1.4.2/sqoopuserguide.html#_controlling_type_mapping

(Table: offload loading times from Sqoop)

Step 2: The Data in HDFS

HDFS replicates the newly created file in the background with a factor of 3 to different cluster nodes. There is no need to comment further on this automatic process; it just happens. Below is a view of the data in HDFS: first the file itself, consisting of 149 individual files (stripes) in the directory f_umsatz, and then a look at the first 4 records. You can see that the data is perfectly readable.

(Figure: the file F_UMSATZ in HDFS, implemented as a directory)

Step 3: The Hive Definition

The scenario works with a file directly in HDFS. Hive's only job is to describe the metadata; from Hive's point of view this is therefore an external table. Hive will only keep the metadata ready and otherwise leave the data alone. For internal Hive tables, Hive creates its own structure at another location in HDFS. Type conversion questions may arise during the transfer. In this example this actually happened for the ZEIT_ID field: the original type DATE was not transferred correspondingly by Sqoop, so that the later join queries on the star model returned empty result sets. Only the correction to TIMESTAMP was ultimately successful. The Sqoop User Guide explicitly points out these Oracle-specific type conversions.⁸

8 https://sqoop.apache.org/docs/1.4.2/sqoopuserguide.html#_controlling_type_mapping

CREATE EXTERNAL TABLE default.f_umsatz_hdfs (
  ARTIKEL_ID        int,
  KUNDEN_ID         int,
  ZEIT_ID           timestamp,
  REGION_ID         int,
  KANAL_ID          int,
  UMSATZ            int,
  MENGE             int,
  UMSATZ_GESAMT     int,
  UMSATZ_KZ         string,
  STEUER            decimal(15,10),
  VERPACKUNGS_ART   string,
  HANDELSKLASSE     string,
  KOMMISSIONSWARE   string,
  LAGERWARE         string,
  BESCHREIBUNG      string,
  LIEFER_DATUM      date,
  RECHNUNGS_DATUM   date,
  ZAHLUNGS_ZIEL     date
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/sqoop_output/f_umsatz';

Step 4: External Table in the Oracle Database

The definition of the matching external table is provided by the package dbms_hadoop.create_extddl_for_hive. This package accesses the Hadoop cluster and reads the corresponding metadata, including the storage locations, from Hive. If the Hive entries are correct, this external table definition can be activated in the database immediately, and the first queries can be formulated right afterwards.

(Figure: generation of the external table from the Hive metadata)
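The generated definition itself is not reproduced here. As a rough sketch (column list shortened; directory object and cluster name are placeholders, and parameter spelling may differ between versions), an external table based on the ORACLE_HIVE access driver of Big Data SQL looks roughly like this:

CREATE TABLE F_UMSATZ_HDFS (
  ARTIKEL_ID  NUMBER,
  KUNDEN_ID   NUMBER,
  ZEIT_ID     TIMESTAMP,
  REGION_ID   NUMBER,
  KANAL_ID    NUMBER,
  UMSATZ      NUMBER,
  MENGE       NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
    com.oracle.bigdata.cluster=bda_cluster            -- placeholder cluster name
    com.oracle.bigdata.tablename=default.f_umsatz_hdfs
  )
)
REJECT LIMIT UNLIMITED;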

Step 5: The Queries

An excerpt from the query tests is documented here. The queries come from the seminar series "Data Warehouse Technik im Fokus" and are printed in the accompanying booklet.⁹ A somewhat expanded fact table was used for these tests in order to get a little closer to practice. Only the database variant of each script is printed here; the HDFS variant looks the same, except that the external table F_UMSATZ_HDFS is listed in the FROM clause instead of F_UMSATZ.

Preliminary remark on these results: the results of this test serve only to determine the feasibility of star schema queries on objects in HDFS. They are less suitable for a performance comparison between the relational database system and HDFS. For that, more would have to be optimized on both sides, e.g. In-Memory would be used in the Oracle database for performance-critical queries, or additional techniques such as caching, flash, materialized views and others; on the HDFS side, different file types (Parquet, ORC) would be used. It should also be mentioned that the good query performance values of the HDFS variant shown here are only achieved thanks to the optimizations of Oracle's Big Data SQL (function offloading, Bloom filters, storage index, column pruning) and the technical architecture of the Big Data Appliance (e.g. the InfiniBand protocol); such query values cannot be achieved with Hive alone. Finally, the database queries run with PARALLEL 12, while the HDFS variant can parallelize in the three-digit range; the relational database would of course also allow a parallelism of e.g. 128. In the result tables below, the first column gives the size of the fact table in millions of rows; times are in seconds (longer times in minutes:seconds).

9 http://oracledwh.de/downloads/autoindex-2.2.4/index.php?dir=downloads/01_DWH_Seminare_Kurs_Materialien_Folien_und_Skripte/01_01_DWH%20Technik%20im%20Fokus%20Rei&file=CMD_DWH_Kurzreferenz_V3.docx

General query: number of rows in the largest table instance

select count(*) from f_umsatz;

  COUNT(*)
----------
 819200000

Elapsed: 00:00:02.85

select count(*) from F_UMSATZ_HDFS;

  COUNT(*)
----------
 819200132

Elapsed: 00:00:17.97

Comment: The database query runs over the existing bitmap index and is therefore particularly fast.

Query 1b) Query with direct partition criterion

SELECT /*+ no cache */ sum(umsatz) umsatz
FROM   f_umsatz
WHERE  zeit_id = to_date('10.03.2011', 'dd.mm.yyyy');

Rows (millions)  Query  Comment               DB table  HDFS file
102              1b     Query on partition    00.66     03.14
204              1b                           00.30     04.65
408              1b                           00.33     09.04
819              1b                           00.89     08.52

Comment: The query contains a time restriction to one day, so the partitioning criterion in the database is met. Even if the total table grows, the amount of data in one partition does not grow at the same rate; for this query it does not matter to the database how big the table really is, because the segment of one partition always remains the same size. The situation is different with HDFS: here the amount of data increases and, apparently, so does the effort of the query. These HDFS query times could certainly be improved if the HDFS file were also partitioned, but that is not the case here. Seen in this way, this comparison is skewed.

Query 1c) With and without time criterion

1st attempt:

SELECT /*+ no cache */ sum(umsatz) umsatz
FROM   f_umsatz u, d_vertriebskanal v
WHERE  u.kanal_id = v.kanal_id
AND    zeit_id = to_date('10.03.2011', 'dd.mm.yyyy')
AND    v.vertriebskanal = 'Shop';

2nd attempt (time criterion commented out):

SELECT /*+ no cache */ sum(umsatz) umsatz
FROM   f_umsatz u, d_vertriebskanal v
WHERE  u.kanal_id = v.kanal_id
-- AND zeit_id = to_date('10.03.2011', 'dd.mm.yyyy')
AND    v.vertriebskanal = 'Shop';

Rows (millions)  Query  Comment                           DB table  HDFS file
102              1c     Query with time criterion         00.05     05.12
204              1c     (partitioning)                    00.75     02.05
408              1c                                       00.51     09.81
819              1c                                       01.04     18.01
102              1c     Query without time criterion      12.53     06.64
204              1c     (without partitioning)            15.51     09.81
408              1c                                       18.71     06.64
819              1c                                       35.11     18.61

Comment: This example has two restrictive conditions, one that meets the partitioning criterion and one that does not. The test is carried out twice; the second time the time criterion is commented out, so that only the restriction on a non-partition attribute remains. In the second attempt the database query clearly lags behind the HDFS query, because it no longer has the possibility of limiting the amount of data via partitioning.

Query 2: Higher number of joins / GROUP BY / ORDER BY / 5 filters / with and without database PARALLEL 24

SELECT /*+ parallel(f_umsatz 24) */
       sum(u.umsatz) umsatz, r.region, z.quartals_nummer quartal
FROM   f_umsatz u, d_artikel a, d_region r, d_zeit z, d_kunde k, d_vertriebskanal v
WHERE  u.kunden_id = k.kunden_id
AND    u.zeit_id = z.zeit_id
AND    u.region_id = r.region_id
AND    u.artikel_id = a.artikel_id
AND    u.kanal_id = v.kanal_id
AND    z.jahr_nummer = 2011
AND    a.gruppe_name = 'Bad_Sanitaer'
AND    k.berufsgruppe = 'Arbeiter'
AND    r.region IN ('Middle', 'South', 'North')
AND    v.vertriebskanal = 'Shop'
GROUP BY r.region, z.quartals_nummer
ORDER BY z.quartals_nummer;

Rows (millions)  Query  Comment                                      DB table  HDFS file
102              2      5 joins / GROUP BY / ORDER BY / 5 filters,   02.40     03.85
204              2      DB PARALLEL 12                               03.65     13.44
408              2                                                   04.95     18.08
819              2                                                   06.81     20.69

Rows (millions)  Query  Comment                                      DB table  HDFS file
102              2      5 joins / GROUP BY / ORDER BY / 5 filters,   01.59     03.10
204              2      DB PARALLEL 24                               02.48     13.36
408              2                                                   03.43     17.80
819              2                                                   04.07     21.20

Comment: This example shows that the parallelization in the database accelerates the query. The parallelization has no effect on the HDFS-based query: the Hadoop environment follows its own parallelization behaviour, and this also applied to the first test run with PARALLEL 12.

Query 3: 4 joins / 2 filters / higher GROUP BY computation portion in the database

SELECT /*+ no cache */ a.sparte_name, z.jahr_nummer, r.land, sum(u.umsatz), sum(u.menge)
FROM   f_umsatz u, d_region r, d_zeit z, d_artikel a
WHERE  u.zeit_id = z.zeit_id
AND    u.region_id = r.region_id
AND    u.artikel_id = a.artikel_id
AND    z.jahr_nummer = 2012
AND    r.land = 'Bayern'
GROUP BY a.sparte_name, z.jahr_nummer, r.land;

Rows (millions)  Query  Comment                       DB table  HDFS file
102              3      4 joins + 2 filters / GROUP   02.55     08.41
204              3      BY                            03.60     13.73
408              3                                    04.81     12.47
819              3                                    06.82     18.23

Comment: This example is similar to the previous one. However, the database has to compute more here because of the larger GROUP BY portion and the larger number of select fields. The response times of the HDFS queries are almost unchanged.

Query 6: Analytical function RANK

select sum(u.umsatz) umsatz, z.jahr_nummer jahr, z.quartals_nummer quartal,
       RANK() OVER (PARTITION BY z.jahr_nummer ORDER BY sum(u.umsatz) ASC) AS ranking
from   f_umsatz u, d_zeit z
where  z.zeit_id = u.zeit_id
and    z.jahr_nummer between 2010 and 2013
group by z.jahr_nummer, z.quartals_nummer
order by z.jahr_nummer, ranking;

Rows (millions)  Query  Comment           DB table  HDFS file
102              6      RANK + 2 joins    02.07     03.15
204              6                        10.61     12.26
408              6                        15.50     10.63
819              6                        22.47     14.11

Comment: The response time of the database query grows with the amount of data. The HDFS query initially grows as well, but then stagnates at a low level, while the database response time keeps increasing and eventually exceeds the HDFS time. Note the restriction on the time dimension: the time-based partitioning helps the database with this query.

Query 7: Analytical function RANK and subselect

Select /*+ no cache */ *
from (select sum(u.umsatz) umsatz, z.jahr_nummer jahr, z.quartals_nummer quartal,
             RANK() OVER (PARTITION BY z.jahr_nummer ORDER BY sum(u.umsatz) ASC) AS ranking
      from   f_umsatz u, d_zeit z
      where  z.zeit_id = u.zeit_id
      and    z.jahr_nummer between 2010 and 2013
      group by z.jahr_nummer, z.quartals_nummer)
where ranking = 1
order by jahr;

Rows (millions)  Query  Comment               DB table  HDFS file
102              7      RANK and subselect    04.00     02.93
204              7                            08.07     12.30
408              7                            10.39     10.62
819              7                            20.11     17.08

Comment: Analogous to query 6.

Query 16: Analytical function RANK / GROUP BY / 2 subselects

select top.bundesland, top.prod_grp, top.top_umsatz, top.top_sequence,
       bottom.prod_grp, bottom.bottom_umsatz, bottom.bottom_sequence
from
 (SELECT * FROM
   (SELECT r.land bundesland, a.gruppe_name prod_grp, sum(u.umsatz) AS top_umsatz,
           RANK() OVER (PARTITION BY r.land ORDER BY sum(u.umsatz) DESC) AS top_sequence
    FROM   f_umsatz u, d_artikel a, d_region r
    WHERE  u.artikel_id = a.artikel_id
    AND    u.region_id = r.region_id
    GROUP BY r.land, a.gruppe_name
    ORDER BY r.land)
  WHERE top_sequence < 3) top,
 (SELECT * FROM
   (SELECT r.land bundesland, a.gruppe_name prod_grp, sum(u.umsatz) AS bottom_umsatz,
           RANK() OVER (PARTITION BY r.land ORDER BY sum(u.umsatz) ASC) AS bottom_sequence
    FROM   f_umsatz u, d_artikel a, d_region r
    WHERE  u.artikel_id = a.artikel_id
    AND    u.region_id = r.region_id
    GROUP BY r.land, a.gruppe_name
    ORDER BY r.land)
  WHERE bottom_sequence < 3) bottom
where top.bundesland = bottom.bundesland
and   top.top_sequence = bottom.bottom_sequence
order by top.bundesland;

Rows (millions)  Query  Comment                           DB table  HDFS file
102              16     RANK / GROUP BY / 2 subselects    17.69     09.69
204              16                                       44.84     21.65
408              16                                       01:19.20  33.94
819              16                                       02:35.01  01:04.53

Comment: With this query, the response times of the HDFS query also increase, but it takes only about half the time of the database query. No restrictive partitioning criterion helps with this query; there is no restriction on the time dimension in the WHERE clause.

Query 19: Several subselects

select a.land,
       round(a.top_3_umsatz / 1000000, 2)                 top_3_in_mill,
       round(b.gesamt_pro_land / 1000000, 2)              gesamt_pro_land_in_mill,
       round(a.top_3_umsatz / b.gesamt_pro_land * 100, 2) anteil_prozent
from
 (SELECT land, sum(umsatz) top_3_umsatz
  FROM (SELECT r.land, a.artikel_name artikel, sum(u.umsatz) AS umsatz,
               RANK() OVER (PARTITION BY r.land ORDER BY sum(u.umsatz) DESC) AS ranking
        FROM   f_umsatz u, d_artikel a, d_region r
        WHERE  u.artikel_id = a.artikel_id
        AND    u.region_id = r.region_id
        GROUP BY r.land, a.artikel_name
        ORDER BY r.land)
  WHERE ranking < 4
  GROUP BY land) a,
 (SELECT land, sum(umsatz) gesamt_pro_land
  FROM   f_umsatz u, d_region r
  WHERE  u.region_id = r.region_id
  GROUP BY land) b
where a.land = b.land;

Rows (millions)  Query  Comment                DB table  HDFS file
102              19     Multiple subselects    29.86     02.08
204              19                            43.67     21.48
408              19                            01:04.46  02.29
819              19                            01:48.54  50.31

Comment: Analogous to query 16.

Query 20: NTILE + nested subselects

SELECT sum(umsatz), anteil, (sum(umsatz) * 100 / gesamt_umsatz) as prozent
FROM
 (SELECT k.nachname as kunde, sum(u.umsatz) as umsatz,
         ntile(4) over (order by sum(u.umsatz)) as anteil
  FROM   d_kunde k, f_umsatz u
  WHERE  k.kunden_id = u.kunden_id
  GROUP BY k.nachname),
 (SELECT sum(u.umsatz) as gesamt_umsatz FROM f_umsatz_hdfs u)
GROUP BY anteil, gesamt_umsatz;

Rows (millions)  Query  Comment                      DB table  HDFS file
102              20     NTILE + nested subselects    15.33     04.99
204              20                                  20.21     20.04
408              20                                  33.01     16.59
819              20                                  58.04     40.14

Comment: Analogous to query 16.

Query 22: CUBE function

SELECT a.gruppe_name, region, land, sum(u.menge),
       grouping(a.gruppe_name) as ag, grouping(region) as re
FROM   d_artikel a, f_umsatz u, d_region r
WHERE  u.artikel_id = a.artikel_id
AND    r.region_id = u.region_id
AND    land = 'Schleswig Holstein'
GROUP BY cube (a.gruppe_name, region, land);

Rows (millions)  Query  Comment  DB table  HDFS file
102              22     CUBE     08.57     02.50
204              22              16.47     10.93
408              22              18.66     07.56
819              22              32.11     19.01

Comment: Similar to query 16.

Query 23: Previous-year comparison (LAG)

SELECT /*+ no cache */ k.nachname, z.jahr_nummer as jahr, z.monats_nummer as mon,
       sum(u.umsatz) as umsatz,
       lag(sum(u.umsatz), 12) over (ORDER BY z.jahr_nummer, z.monats_nummer) as vorjahr
FROM   d_kunde k, f_umsatz u, d_zeit z
WHERE  k.kunden_id = u.kunden_id
AND    z.zeit_id = u.zeit_id
AND    z.jahr_nummer in (2010, 2011)
AND    k.nachname = 'Bauer'
group by k.nachname, z.jahr_nummer, z.monats_nummer
order by z.jahr_nummer, z.monats_nummer;

Rows (millions)  Query  Comment                            DB table  HDFS file
102              23     Previous-year comparison (LAG)     01.02     03.21
204              23                                        01.55     12.46
408              23                                        02.13     11.00
819              23                                        04.25     13.38

Comment: Analogous to query 16.

Experiences, Evaluation and Outlook

Amount of Data and Reality

The scenario has shown that offloading into HDFS does not only make sense for extremely large amounts of data: it was carried out with tables in the lower three-digit gigabyte range, and the HDFS variant was able to show its strengths even with this manageable data volume. However, queries that are merely a few seconds faster will not justify introducing a new infrastructure in the form of a Hadoop environment. But tables in the terabyte range with several billion records are conceivable, as are warehouse environments that exceed 50 or 100 terabytes of data volume, or environments in which a handful of tables of 5, 10, 20 or more terabytes reside. In these cases an HDFS supplement is worthwhile.

Technical Know-How Is Manageable

Working through the scenario also shows that the know-how needed to cope with the technical environment is manageable. Experienced warehouse administrators can master such a scenario after a few days. Many points are familiar from the Linux world: Hadoop runs on Linux machines, and the Hadoop commands are modelled on Linux commands. SQL is well known, and Hive is an SQL vehicle. The Oracle database, e.g. the behaviour of external tables, should also be familiar. There is no need to learn a new programming or scripting language, and Java knowledge is not required. Individual utilities such as Sqoop bring their own manual containing all the call parameters; from the Oracle world one knows, for example, SQL*Loader, which behaves similarly. What is really new is merely a certain understanding of the Hadoop mindset, i.e. thinking in terms of cluster nodes and this type of parallelization.

Implementation, Implementation Effort and What Really Changes

The scenario shows that not very many components need to be taken into account. However, anyone who first grapples with the technical architecture of a Hadoop cluster and tries to install the Cloudera Hadoop distribution themselves or to test the performance limits of a self-built cluster will reach the actual goals only very late. With a finished appliance such as Oracle's Big Data Appliance (BDA), you have a ready solution in which the offload concept is implemented after 2-3 weeks. The particular advantage of this approach is that nothing needs to change in the logical concept of the existing data warehouse:

o The data models and layer structure in the data warehouse remain the same.
o The type of access for all applications and users working on the data warehouse remains the same.
o Query and BI tools in use do not have to be exchanged.
o Users do not have to be retrained.
o No new strategic direction is forced on further developments in the future.

Questions about the test and the document: [email protected], Tel. 0177 5949132