Showing posts with label Oracle Database. Show all posts
Showing posts with label Oracle Database. Show all posts

Monday, August 7, 2017

Auto enabling APPROX_* function in the Oracle Database

With the releases of 12.1 and 12.2 of Oracle Database we have seen some new functions that perform approximate calculations. These include:

  • APPROX_COUNT_DISTINCT
  • APPROX_COUNT_DISTINCT_DETAIL
  • APPROX_COUNT_DISTINCT_AGG
  • APPROX_MEDIAN
  • APPROX_PERCENTILE
  • APPROX_PERCENTILE_DETAIL
  • APPROX_PERCENTILE_AGG

These functions can be used when approximate answers can be used instead of the exact answer. Yes can have many scenarios for these and particularly as we move into the big data world, the ability to process our data quickly is slightly more important and exact numbers. For example, is there really a difference between 40% of our customers being of type X versus 41%. The real answer to this is, 'It Depends!', but for a lot of analytical and advanced analytical methods this difference doesn't really make a difference.

There are various reports of performance improvement of anything from 6x to 50x with the response times of the queries that are using these functions, instead of using the more traditional functions.

If you are a BI or big data analyst and you have build lots of code and queries using the more traditional functions. But what if you now want to use the newer functions. Does this mean you have go and modify all the code you have written over the years? you can imagine getting approval to do this!

The simple answer to this question is 'No'. No you don't have to change any code, but with some parameter changes for the DB or your session you can tell the database to automatically switch from using the traditional functions (count, etc) to the newer more optimised and significantly faster APPROX_* functions.

So how can you do this magic?

First let us see what the current settings values are:

SELECT name, value 
FROM   v$ses_optimizer_env 
WHERE  sid = sys_context('USERENV','SID') 
AND    name like '%approx%';

NewImage

Now let us run a query to test what happens using the default settings (on a table I have with 10,500 records).

set auto trace on

select count(distinct cust_id) from test_inmemory;

COUNT(DISTINCTCUST_ID)
----------------------
		  1500


Execution Plan
----------------------------------------------------------
Plan hash value: 2131129625

--------------------------------------------------------------------------------------
| Id  | Operation	     | Name	     | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |		     |	   1 |	  13 |	  70   (2)| 00:00:01 |
|   1 |  SORT AGGREGATE      |		     |	   1 |	  13 |		  |	     |
|   2 |   VIEW		     | VW_DAG_0      |	1500 | 19500 |	  70   (2)| 00:00:01 |
|   3 |    HASH GROUP BY     |		     |	1500 |	7500 |	  70   (2)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| TEST_INMEMORY | 10500 | 52500 |	  69   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Let us now set the automatic usage of the APPROX_* function.

alter session set approx_for_aggregation = TRUE;

SQL> select count(distinct cust_id) from test_inmemory;

COUNT(DISTINCTCUST_ID)
----------------------
		  1495


Execution Plan
----------------------------------------------------------
Plan hash value: 1029766195

---------------------------------------------------------------------------------------
| Id  | Operation	      | Name	      | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      | 	      |     1 |     5 |    69	(0)| 00:00:01 |
|   1 |  SORT AGGREGATE APPROX| 	      |     1 |     5 | 	   |	      |
|   2 |   TABLE ACCESS FULL   | TEST_INMEMORY | 10500 | 52500 |    69	(0)| 00:00:01 |
---------------------------------------------------------------------------------------

We can see above that the APPROX_* equivalent function was used, and slightly less work. But we only used this on a very small table.

The full list of session level settings is:
alter session set approx_for_aggregation = TRUE;
alter session set approx_for_aggregation = FALSE;

alter session set approx_for_count_distinct = TRUE;
alter session set approx_for_count_distinct = FALSE;

alter session set approx_for_percentile = 'PERCENTILE_CONT DETERMINISTIC';
alter session set approx_for_percentile = PERCENTILE_DISC;
alter session set approx_for_percentile = NONE;

Or at a system wide level:

alter system set approx_for_aggregation = TRUE;
alter system set approx_for_aggregation = FALSE;

alter system set approx_for_count_distinct = TRUE;
alter system set approx_for_count_distinct = FALSE;

alter system set approx_for_percentile = 'PERCENTILE_CONT DETERMINISTIC';
alter system set approx_for_percentile = PERCENTILE_DISC;
alter system set approx_for_percentile = NONE;

And to reset back to the default settings:

alter system reset approx_for_aggregation;
alter system reset approx_for_count_distinct;
alter system reset approx_for_percentile;

Wednesday, January 4, 2017

Explicit Semantic Analysis in Oracle 12.2c Database

A new Oracle Data Mining algorithm in the Oracle 12.2c Database is called Explicit Semantic Analysis.

[The following examples are built using Oracle Data Miner 4.2 (SQL Developer 4.2) and the Oracle 12.2 Database cloud service (extreme edition) ]

The Explicit Semantic Analysis algorithm is an unsupervised algorithm used for feature extraction. ESA does not discover latent features but instead uses explicit features based on an existing knowledge base. There is no setup or install necessary to use this algorithm All you need is a licence for the Advanced Analytics Option for the database. The out from the algorithm is a distance measure that indicates how similar or dis-similar the input texts are, using the ESA model (and the training data set used). Let us look at an example. Setup training data for ESA Algorithm

Oracle Data Miner 4.2 (that comes with SQL Developer 4.2) has a data Wiki data set from 2005. This contains over 200,000 features. To locate the file go to.

...\sqldeveloper\dataminer\scripts\instWikiSampleData.sql

This file contains the DDL and the insert statements for the Wiki data set.

NewImage

After you run this script a new table called WIKISAMPLE table exists and contains records

NewImage

This gives us the base/seed data set to feed into the ESA algorithm.

Create the ESA Model using ODMr

To create the ESA model we have 2 ways of doing this. In this blog post I'll show you the easiest way by using the Oracle Data Miner (ODMr) tool. I'll have another blog post that will show you the SQL needed to create the model.

In an ODMr workflow create a new Data Source node. Then set this node to have the WIKISAMPLE table as it's data source.

Next you need to create the ESA node on the workflow. This node can be found in the Models section, of the Workflow Editor. The node is called Explicit Feature Extraction. Click on this node, in the model section, and then move your mouse to your workflow and click again. The ESA node will be created.

Join the Data Node to the ESA node by right clicking on the data node and then clicking on the ESA node.

Double click on the ESA node to edit the properties of the node and the algorithm.

NewImage

Explore the ESA Model and ESA Model Features

After the model node has finished you can now explore the results generated by the ESA model. Right click on the model node and select 'View Model'. The model properties window opens and it has 2 main tabs. The first of these is the coefficients tab. Here you can select a particular topic (click on the search icon beside the Feature ID) and select it from the list. The attributes and their coefficient values will be displayed.

NewImage

Next you can examine the second tab that is labeled as Features. In this table we can select a particular record and have a tag cloud and coefficients displayed. The tag cloud is a great way to see visually what words are important.

NewImage

How to use the ESA model to Compare new data using SQL

Now that we have the ESA model created, we can not use it model to compare other similar sets of documents.

You will need to use the FEATURE_COMPARE SQL function to evaluate the input texts, using the ESA model to compare for similarity. For example,

SELECT FEATURE_COMPARE(feat_esa_1_1
          USING 'Oracle Database is the best available for managing your data' text 
          AND USING 'The SQL language is the one language that all databases have in common' text) similarity 
FROM DUAL;
NewImage

The result we get is 0.7629.

The result generate by the query is a distance measure. The FEATURE_COMPARE function returns a comparison number in the range 0 to 1. Where 0 indicates that the text are not similar or related. If a 1 is returned then that indicated that the text are very similar or very related.

You can use this returned value to make a decision on what happens next. For example, it can be used to decide what the next step should be in your workflow and you can easily write application logic to manage this.

The examples given here are for general text. In the real world you would probably need a bigger data set. But if you were to use this approach in other domains, such as legal, banking, insurance, etc. then you would need to create a training data set based on the typical language that is used in each of those domains. This will then allow you to compare documents with each domain with greater accuracy.

[The above examples are built using Oracle Data Miner 4.2 (SQL Developer 4.2) and the Oracle 12.2 Database cloud service (extreme edition) ]