Showing posts with label Oracle. Show all posts
Showing posts with label Oracle. Show all posts

Friday, April 19, 2013

Part 2–Getting start with Statistics for Oracle Data Science projects

This is the second blog on getting started with Statistics for Oracle Data Science projects.

In this blog post I will look at 3 more useful statistical functions that are available in the Oracle database. Remember these come are standard with the database. The first function I will look at is the WIDTH_BUCKET function. This can be used to create some histograms of the data. A common task in analytics projects is to produce some cross tabs of the data. Oracle has the STATS_CROSSTAB. The last function I will look the different ways you an sample the data.

Histograms using WIDTH_BUCKET

When exploring your data it is useful to group values together into a number of buckets. Typically you might want to define the width of each bucket yourself before passing the data into your data mining tools, but before you can decide what these are you need to do some exploring using a variety of widths. A good way to do this is to use the WIDTH_BUCKET function. This takes the following inputs:

Expression: This is the expression or attribute on which the you want to build the histogram.

Min Value: This is the lower or starting value of the first bucket

Max Value: This is the last or highest value for the last bucket

Num Buckets: This is the number of buckets you want created.

Typically the Min Value and the Max Value can be calculated using the MIN and MAX functions. As a starting point you generally would select 10 for the number of buckets. This is the number you will change, downwards as well as upwards, to if a particular pattern exists in the attribute.

Using the example scenario that I used in the first blog post, let us start by calculating the MIN and MAX for the AGE attribute.

image

Lets say that we wanted to create 10 buckets. This would create a bucket width of 7.3 for each bucket, giving us the following.

Bucket 1 : 17-24.3
Bucket 2: 24.3-31.6
Bucket 3: 31.6-38.8
Bucket 4: 38.8-46.1
Bucket 5: 46.1-53.4
Bucket 6: 53.4-60.7
Bucket 7: 60.7-68
Bucket 8: 68-75.3
Bucket 9: 75.3-82.6
Bucket 10: 82.6-90

These are the buckets that the WIDTH_BUCKET function gives us in the following:

SELECT cust_id,
       age,
       width_bucket(age,
                    (SELECT min(age) from mining_data_build_v),
                    (select max(age)+1 from mining_data_build_v),
                    10)  bucket
from mining_data_build_v
where rownum <=12
group by cust_id, age

image

An additional level of detail that is needed to allow us to plot the histograms for AGE, we need to aggregate up for all the records by bucket.

select intvl, count(*) freq
from (select width_bucket(salary,
(select min(salary) from employees),
(select max(salary)+1 from employees), 10) intvl
from HR.employees)
group by intvl
order by intvl;

image

We can take this code and embed it into the GATHER_DATA_STATS procedure that I gave in my Part 1 blog post.

Cross Tabs using STATS_CROSSTAB

Typically cross tabulation (or crosstabs for short) is a statistical process that summarises categorical data to create a contingency table. They provide a basic picture of the interrelation between two variables and can help find interactions between them.

Because Crosstabs creates a row for each value in one variable and a column for each value in the other, the procedure is not suitable for continuous variables that assume many values.

In Oracle we can perform crosstabs using one of their reporting tools. But if you don’t have one of these we will need to use the in-database function STATS_CROSSTAB. This function takes three parameters, the first two of these are the attributes you want to compare and the third is what test we want to perform. The tests available include:

  • CHISQ_OBS: Observed value of chi-squared
  • CHISQ_SIG: Significance of observed chi-squared
  • CHISQ_DF: Degree of freedom for chi-squared
  • PHI_COEFFICIENT: Phi coefficient
  • CRAMERS_V: Cramer’s V statistic
  • CONT_COEFFICIENT: Contingency coefficient
  • COHENS_K: Cohen’s kappa

CHISQ_SIG is the default.

Now let us look at some examples using our same data set.

image

Sampling Data

When our datasets are of relatively small size consisting of a few hundred thousand records we can explore the data is a relatively short period of time. But if your data sets are larger that that you may need to explore the data by taking a sample of it. What sampling does is that it takes a “random” selection of records from our data set up to the new number of records we have specified in the sample.

In Oracle the SAMPLE function takes a percentage figure. This is the percentage of the entire data set you want to have in the Sampled result. 

image

There is also a variant called SAMPLE BLOCK and the figure given is the percentage of records to select from each block.

image

Each time you use the SAMPLE function Oracle will generate a random seed number that it will use as a Seed for the SAMPLE function. If you omit a Seed number (like in the above examples), you will get a different result set in each case and the result set will have a slightly different number of records. If you run the sample code above over and over again you will see that the number of records returned varies by a small amount.

If you would like to have the same Sample data set returned each time then you will need to specify a Seed value. The Seed much be an integer between 0 and 4294967295.

image

In this case because we have specified the Seed we get the same “random” records being returned with each execution.

Monday, March 25, 2013

Review of Oracle Magazine–May/June 1999

The headline articles for the May/June 1999 edition of Oracle Magazine were focused on using internet technologies to allow businesses to work together more efficiently.

image

Other articles included:

  • Oracle and Hewlett-Packard announce the a prebuild Oralce 8i appliance. This had a code-name of Raw Iron.
  • Oracle announce a $100million venture fund to promote innovation by companies developing products and services based on Oracle 8i
  • Oracle 8i comes with the an enhanced feature that automatically keeps a standby database synchronized with the production database. This is called the Automated Standby Databases (ASD) and hopes to reduce the amount of manual work DBAs need to perform.
  • Some helpful suggestions on how to go about implementing parallel DML in Oracle 8.
    • Rules for Parallel Insert
      • The insert statement must be of the form ‘insert into table_name select …’
      • The table being modified must have a specified parallel declaration or you must specify a parallel hint directive in the insert statement
      • You can perform parallel insert on non-partitioned as well as partitioned tables
      • After the parallel DML is complete no other SQL statements can access the same table until a Commit is issued.
    • Rules for Parallel Update and Delete
      • Table table must have a parallel declaration specified or you must specify a parallel hint directive in the update/delete statement
      • You can perform parallel update or delete on partitioned tables only
      • You cannot see the result of the parallel update or delete during the transaction
  • By using the parallel options, data intensive SQL statements, database recovery, and data loads can be executed by multiple processes simultaneously. All the following operations can be executed in parallel
    • table scan
    • sort merge join
    • Not In
    • select distinct
    • aggregation
    • cube
    • create table as select
    • rebuild index partition
    • move partition
    • update
    • Insert ….. select
    • Enable constraint
    • PL/SQL functions called from SQL
    • Nested loop join
    • Hash join
    • Group by
    • Union and union all
    • Order by
    • Rollup
    • Create index
    • Rebuild index
    • Split partition
    • Delete
    • Star transformation

To view the cover page and the table of contents click on the image at the top of this post or click here.

My Oracle Magazine Collection can be found here. You will find links to my blog posts on previous editions and a PDF for the very first Oracle Magazine from June 1987.

Monday, March 11, 2013

Oracle Magazine–March/April 1999

The headline articles for the March/April 1999 edition of Oracle Magazine were on the evolving world of the DBA. With some much new technology available in the database the role of the DBA is moving from a back office type role to one having a significant strategic influence in the organisation.
image
Other articles included:
  • Oracle releases a web based version of their Oracle Strategic Procurement application that includes three key parts: Strategic Sourcing, Internet Procurement and Process Automation.
  • Sun and Oracle announce a strategic agreement that allows both companies to enhance their product offerings by exchanging key technologies. Oracle will use the core of the Sun Solaris operating environment to deliver the industry’s first database server appliances.
  • Oracle Data Mart Suite releases version 2.5. It includes, Oracle Data Mart Builder, Oracle Data Mart Designer, Oracle 8 Enterprise Edition, Oracle Discoverer, Oracle Application Server and Oracle Reports and Reports Server.
  • New integration between Oracle Reports release 6.0 and Oracle Express Server release 6.2 to give users the ability to distribute high quality reports of information held in a multi-dimensional database across the enterprise.
  • The need for the DBA to know and understand the V$ views has been increasing during the later releases of 7.3 and 8i. The can be used for a variety of purposes, including understanding locked users, system resources, licencing and parameter settings.
  • One thing that all DBAs need to plan for is a database recovery. Planning it is one thing, but practicing it is another thing. A typical recovery plan will include, choosing a data file, create a backup, take the damaged tablespace offline, restore the damaged data file, bring the tablespace back online, recover the tablespace, bring the tablespace back online and test it.
  • Avoiding trigger errors, including Mutating and constraining table errors.
  • There is an article by Bryan Laplante on using Historgrams to Optimize Data Mart Performance.

To view the cover page and the table of contents click on the image at the top of this post or click here.
My Oracle Magazine Collection can be found here. You will find links to my blog posts on previous editions and a PDF for the very first Oracle Magazine from June 1987.

Wednesday, February 13, 2013

Clustering in ODM–Part 1

This is a the first part of a five (5) part blog post on building and using Clustering in Oracle Data Miner. The following outlines the contents of each post in this series on Clustering.

  1. This post part we will look at what clustering features exist in ODM and how to setup the data that we will be using in the examples
  2. The second part will focus on how to building Clusters and examining the clusters produced in ODM .
  3. The third post will focus on using the Clusters to apply to new data using ODM.
  4. The fourth post will look at how you can build and evaluate a Clustering model using the ODM SQL and PL/SQL functions.
  5. The fifth and final post will look at how you can apply your Clustering model to new data using the ODM SQL and PL/SQL functions.

Clustering is an unsupervised technique designed groupings of related data that are more similar to each other and are less similar to other groups.  Typically clustering is used in customer segmentation analysis to try an better understand what type of customers you have.

Like with all data mining techniques, Clustering will not tell you or give you some magic insight into your data. Instead it gives you more information for you to interpret and add the business meaning to them. With Clustering you can explore the data that forms each cluster to understand what it really means.

The Clusters give by Oracle Data Miner are just patterns that it has found in the data.

image

Oracle has two Clustering algorithms:

K-Means : Oracle Data Miner runs an enhanced version of the typical k-means algorithm. ODM builds models in a hierarchical manner, using a top-down approach with binary splits and refinements of all nodes at the end. The centroid of the inner nodes in the hierarchy are updated to reflect changes as the tree grows. The tree grows one node at a time. The node wit the largest variance is split to increase the size of the tree until the desired number of clusters is reached.

O-Cluster : O-Cluster is an Orthogonal Partitioning Clustering that creates a hierarchical grid based clustering model. It operates recursively, generating a hierarchical structure. The resulting clusters define dense areas.

The Data Set for out Clustering examples

I’m going to use a data set that is available on OTN (somewhere) and has been used for demos in the prior versions of ODM before the 11gR2 version (SQL Developer 3). It has gone by many names but the table name we care going to use is INSURANCE_CUST_LTV.

The file is in CSV format and we will use the Import feature in SQL Developer to import it.

1. In the schema you are using for Oracle Data Miner, right click Tables in the Connections tab. The Import option will appear on the menu. Select this.

image

2. Go to the directory where you saved the file, select it and then click on the Open button.

SNAGHTML60a28d3

3. You need to set the file Format to be ‘Delimited’ and the Delimiter set to ‘|’

SNAGHTML60c6d04

4. In the next step give the table name as INSURANCE_CUST_LTV

5.In the next step Select all the Attributes. It should default to this. Click next.

6. In Step 4 of the Wizard you can set the data types for each attribute. The simplest way is to set the character attributes to VARCHAR2 (50) :

CUSTOMER_ID,  LAST,  FIRST,  STATE,  REGION,  SEX,  PROFESSION,  BUY_INSURANCE (set this one to 3), MARITAL_STATUS, LTV_BIN

Set all the number attributes (all the others) to NUMBER without any precision or scale.

7. Click the next button and then the finish button.  SQL Developer will now load 15,342 records into the INSURANCE_CUST_LTV table, with no errors (hopefully!)

 

We are now ready to start our work with the Clustering algorithms in ODM.

In the next blog post we will look at exploring the data, building our Clustering models and examining the clusters that were produced by ODM.

Thursday, February 7, 2013

OUG Ireland 2013 Agenda is now live

The agenda for the OUG Ireland 2013 event is now live. The event will be in the Dublin Convention Centre on the 12th March. There are lots of excellent sessions, across 7 tracks!!  So there will be something (or lots of things) for everyone who works in the Oracle world here in Ireland.

I’m sure the Oracle Database track will be very popular. I wonder why!!!

Agenda : http://www.ukoug.org/2013-events/oug-ireland-2013/agenda/

Remember registration is FREE. You don’t have to be a member of the User Group to come to this event. It is open to everyone and did I mention that it is FREE.  Registration is now open.

I’ll be there. Well I suppose I have to as I’ll be presenting Smile

I hope to see you there.

agenda launch oug ireland

Wednesday, February 6, 2013

Oracle ACE Director

Towards the end of last week I received and email from Oracle saying that I had been nominated and accepted, by Oracle, to become an Oracle ACE Director.

This is something that makes me very proud and honours the work I have been doing over the past few years on Data Mining in Oracle (Advanced Analytics Option, Data Science/Predictive Analaytics, or whatever you want to call it). Thank you to everyone who nominated me.

If you are not familiar with the Oracle ACE Program, it is a way for Oracle to acknowledge not only technical skills but also personal engagement with the Oracle Community and Technology overall. There is even a FAQ that explains how this program works.

There are a few perks that come with the title, and Oracle have a few  expectations too. Most of these expectations I’m already doing!!  What I’m looking forward to later this years is my first Oracle ACE Director briefing at Oracle Open World (22-26 September)

Wednesday, January 30, 2013

Oracle Magazine-Nov/Dec. 1998

The headline articles for the Nov/Dec 1998 edition of Oracle Magazine were on building web based applications and thin client computing. A large part of the magazine was dedicated to these topics.  This was a bumper edition with a total of 152 pages of content.

image

Other articles included:

  • There was a few articles on using Oracle 8i, including how to use Java in the Database, the Internet File System, Intermedia and Data Warehousing.  Oracle 8i comes with over 150 new features
  • There was a couple of articles on the Millennium Bug and how to approach such projects. There was also some advice for organisations who would have to look at how to deal with the introduction of the Euro currency in Europe.
  • There was a section for articles on new product announcements from Oracle partners, including Quest, Nextek, Maxager, ObjectShare, Constellar (Warehouse Builder), Prism, DataMetrics, IQ Software, Eventus, DataMirror, Precise, Saville, DataShark, J-Database Exchange, Andataco, GeoMedia
  • Oracle makes available Oracle 8i and the Application Server on a Linux platform for the first time.
  • With Oracle 8i we have a number of ways of managing our constraints, including:
    • Deferrable integrity constraints
    • Non unique indexes for primary key and unique constraints
    • Immediate constraint enabling
  • Detecting lock and waiting transactions was always a task that consumed a lot of time for a DBA. A number of scripts was given to help you identify these and to resolve these problems.
  • For allow of Oracle Certified DBAs out there. There was an article promoting the OCP DBA program and Exam. Some hints and tips about the exam were given, along with some practice questions.
  • Plus there was 12 pages on adverts at the back of the magazine.

To view the cover page and the table of contents click on the image at the top of this post or click here.

My Oracle Magazine Collection can be found here. You will find links to my blog posts on previous editions and a PDF for the very first Oracle Magazine from June 1987.

Friday, January 25, 2013

OUG Norway Agenda is now live

The OUG Norway spring conference (17th April – 19th April) agenda is now live and is open for registrations.

Click here for the Conference Agenda

Click here for the Conference Registration

This is a 3 day conference. The first day (17th April) will be held in the Radisson BLU Scandinavia ( Holbergsplass ) and the next two (and a bit) days will be on the Color Magic boat that will be travelling between Oslo and Kiel in Germany and back to Oslo. The boat will be arriving back in Oslo on the Saturday morning (20th April).

There will be some presentations in Norwegian, but it looks like most of the presentations will be in English. There will also be some well known names from the Oracle world presenting at this conference.

In addition to these people, I will be giving two presentations on using Predictive Analytics in Oracle using the Oracle Data Miner tool and in-database functionality.

My first presentation will be an overview of the advanced analytics option and a demonstration of what you can do using the Oracle Data Miner tool (part of SQL Developer). This presentation is currently scheduled for Thursday (18th April) at 5pm.

My second presentation will be at 9:30am on the Friday morning (19th April). In this presentation we will look at the in-database features, what can we do in SQL and PL/SQL, and we will look at what you need to do deploy you Oracle Data Mining models in a production environment.

If possible we might be able to review some new 12c new features for Oracle Data Miner Smile

Saturday, January 19, 2013

BIWA Oracle Data Scientist Certificate

Last week I had had the opportunity to present at the BIWA Summit conference. This was held in the Sofitel Hotel beside the Oracle HQ buildings at Redwood Shores just out side of San Francisco.

This conference was a busy 2 days of with 4 parallel streams of presentations and another stream for Hands-on Labs. The streams covered Big Data, Advanced Analytics, Business Intelligence and Data Warehousing. There was lots of great presentations from well known names in the subject areas.

The BIWA Oracle Data Scientist Certificate was launched at the summit. The requirements for this certificate was to attend my presentation on ‘The Oracle Data Scientist’ (this was compulsory) and then to attend a number of other data science related presentations and hands-on labs. In addition to these presentations there is a short exam to take. This consists of some 30-ish questions, which were based on my presentation and some of the other presentations and hand-on labs. The main topic areas covered in the exam include what is data science about, Oracle Data Miner, Oracle R Enterprise and then some questions based on the key notes, in particular the keynote by Ari Kaplan.

There are a few days left to take the exam. Your answers to the questions will be reviewed and you should receive an email within a couple of days with your result and hopefully your certificate.

https://www.surveymonkey.com/s/BiwaSummitDataScientistCertificate

image

This was my first trip to Redwood Shores and I had some time to go for a walk around the Oracle HQ campus. Hopefully it wont be my last. Here is a photo I took of some of the Oracle buildings.

image

The BIWA Summit conference returns to Redwood Shores again in 2014 around the 14th and 15th January. It will be in the Oracle Conference centre that is part of the Oracle HQ campus.

Maybe I’ll see you there in 2014.

Thursday, January 17, 2013

The ‘Oh No You Don’t’ of (Oracle) Data Science

Over the past couple of weeks I’ve had conversations with a large number of people about Data Science in the Oracle arena.

A few things have stood out. The first and perhaps the most important of these is that there is confusion of what Data Science actually means. Some think it is just another name for Statistics or Advanced Statistics, some Predictive Analytics or Data Mining, or Data Analysis, Data Architecture, etc.. The reality is it is not. It is more than what these terms mean and this is a topic for discussion for another day.

During these conversations the same questions or topics keep coming up and the simplest answer to all of these is taken from a Pantomime (Panto).

We need to have lots of statisticians
       'Oh No You Don't !'
We can only do Data Science if we have Big Data
        'Oh No You Don't !'
We can only do data mining/data science if we have 10’s or 100’s of Million of records
        'Oh No You Don't !'
We need to have an Exadata machine
        'Oh No You Don't !'
We need to have an Exalytics machine
        'Oh No You Don't !'
We need extra servers to process the data
        'Oh No You Don't !'
We need to buy lots of Statistical and Predictive Analytics software
        'Oh No You Don't !'
We need to spend weeks statistically analysing a predictive model
        'Oh No You Don't !'
We need to have unstructured data to do Data Science
        'Oh No You Don't !'
Data Science is only for large companies
        'Oh No You Don't !'
Data Science is very complex, I can not do it
        'Oh No You Don't !'

Let us all say it together for one last time ‘Oh No You Don’t

In its simplest form, performing Data Science using the Oracle stack, just involves learning and using some simple SQL and PL/SQL functions in the database.

Maybe we (in the Oracle Data Science world and those looking to get into it) need to adopt a phrase that is used by Barrack Obama of ‘Yes We Can’, or as he said it in Irish when he visited Ireland back in 2011, ‘Is Feidir Linn’.

Remember it is just SQL.

Friday, January 4, 2013

My Blog Stats for 2012

Here are the stats from my blog for 2012.

In total I’ve had almost 28,000 blog post views. This is a 7 fold increase on the number of blog post views I had in 2011.

I had 92 blog posts in 2012 and the most popular blog posts were

Top search keywords used to find my blog

  • exalytics pricing
  • oracle data mining
  • oracle data miner
  • data science
  • brendan tierney

Top Countries

  • United States  52%
  • Ireland  8%
  • United Kingdom  8%
  • India  4%
  • Russia  4%
  • Germany  3%
  • France  3%
  • Netherlands  1%
  • Canada  1%
  • Turkey  1%

Top OS

  • Windows  59%
  • Macintosh  28%
  • Linux  5%
  • iPhone  2%
  • iPad  1%

Top Browsers

  • Firefox  47%
  • Internet Explorer  26%
  • Chrome  15%
  • Safari  4%

Wednesday, January 2, 2013

OUG Norway April 2013 - New Year’s News

I received an email at 23:24 on the 1st January from the OUG in Norway telling me that I’ve had two presentations accepted for the Annual OUG Norway seminar event. This will be on during the 17th-19th April.

The first day of this event (17th April) will be held in a hotel in Oslo. Then on the morning of 18th April we board the Color Magic cruise for the next two days of the conference. The ferry/cruise will go from Oslo to Kiel in Germany and then back again to Oslo, returning around 10am on Saturday 20th April.

I will be giving two presentations on the Oracle Advanced Analytics Option. The first presentation, ‘Using Predictive Analytics in Oracle’, will give an overview of the Oracle Advanced Analytics Option and will then focus on the Oracle Data Miner work-flow tool. This will presentation will include a live demo of using Oracle Data Miner to create some data mining models.

The second presentation, ‘How to Deploy and Use your Oracle Data Miner Models in Production’, builds on the examples given in the first presentation and will show how you can migrate, user and update your Oracle Data Miner models using the features available in SQL and PL/SQL. Again a demo will be given.

Thursday, December 20, 2012

Articles wanted for Oracle Scene–Spring 2013

The Call for Articles is now open for the Spring edition of Oracle Scene magazine. This is a publication of the UKOUG.

We are looking for technical articles covering all product offerings from Oracle. 

Typically articles will range from 3 pages to 8 pages (MS Word format). These will convert into 2 to 5 page articles in Oracle Scene.

Check out the Article Formatting Guidelines before submitting.
All pictures and images should be 300dpi.
Include a 100(max) word Bio and your photo
Email your article and images to

articles@ukoug.org.uk
For more details about submitting an article, check out
http://www.ukoug.org/what-we-offer/oracle-scene/article-submissions/

Wednesday, December 19, 2012

Association Rules in ODM-Part 4

This is a the final part of a four part blog post on building and using Association Rules in the Oracle Database using Oracle Data Miner. The following outlines the contents of each post in the series on Association Rules

  1. This first part will focus on how to building an Association Rule model
  2. The second post will be on examining the Association Rules produced by ODM – This blog post
  3. The third post will focus on using the Association Rules on your data.
  4. The final post will look at how you can do some of the above steps using the ODM SQL and PL/SQL functions.

In my previous posts I showed how you can go about setting up for Association Rule analysis in Oracle Data Miner and how to examine the rules that are generated.

This post will focus on how we build and use association rules using the functionality that is available in SQL and PL/SQL.

Step 1 – Build the Settings Table

As with all Oracle Data Mining functions in SQL and PL/SQL you will need to setup or build a settings table. This table contains all the settings necessary to run the model build functions. It is a good idea to create a separate settings table for each model build that you complete.

CREATE TABLE assoc_sample_settings (
setting_name VARCHAR2(30),
setting_value VARCHAR2(4000));

Step 2 – Define the Settings for the Model

Before you go to generate your model you need to set some of the parameters for the algorithm. To start with you need to defined that we are going to generate an Association Rules model, turn off the Automatic Data Preparation.

We can also set 3 additional settings for Association Rules.


image



The ASSO_MIN_SUPPORT has a default of 0.1 or 10%. That means that only rules that exist in 10% or more of the cases will be generated. This is really a figure that is too high. In the code below we will set this to a 1%. This matches the settings that we used in SQL Developer in my previous posts.


BEGIN



INSERT INTO assoc_sample_settings (setting_name, setting_value) VALUES

(dbms_data_mining.algo_name, dbms_data_mining.ALGO_APRIORI_ASSOCIATION_RULES);



INSERT into assoc_sample_settings (setting_name, setting_value) VALUES

(dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_off);



INSERT into assoc_sample_settings (setting_name, setting_value) VALUES

(dbms_data_mining.ODMS_ITEM_ID_COLUMN_NAME, ‘PROD_ID’);



INSERT into assoc_sample_settings (setting_name, setting_value) VALUES

(dbms_data_mining.ASSO_MIN_SUPPORT, 0.01);



COMMIT;



END;

/



Step 3 – Prepare the Data



In our example scenario we are using the SALE data that is part of the SH schema. The CREATE_MODEL function needs to have an attribute (CASE_ID) that identifies the key of the shopping basket. In our case we have two attributes, so we will need to use a combined key. This combined key consists of the CUST_ID and the TIME_ID. This links all the transaction records related to the one shopping event together.



We also just need the attribute that has the information that we need. In our Association Rules (Market Basket Analysis) scenario, we will need to include the PROD_ID attribute. This contains the product key of each product that was included in the basket



CREATE VIEW ASSOC_DATA_V AS (

SELECT RANK() OVER (ORDER BY CUST_ID, TIME_ID) CASE_ID,


t.PROD_ID


FROM SH.SALES t );



Step 4 – Create the Model



We will need to use the DBMS_DATA_MINING.CREATE_MODEL function. This will use the settings in our ASSOC_SAMPLE_SETTINGS table. We will use the view created in Step 3 above and use the CASE_ID attribute we created as the Case ID in the function all.



BEGIN 
   DBMS_DATA_MINING.CREATE_MODEL( 
     model_name          => 'ASSOC_MODEL_2', 
     mining_function     => DBMS_DATA_MINING.ASSOCIATION, 
     data_table_name     => 'ASSOC_DATA_V', 
     case_id_column_name => ‘CASE_ID’, 
     target_column_name  => null, 
     settings_table_name => 'assoc_sample_settings');


END;



On my laptop this took approximately 5 second to run on just over 918K records involving just over 143K cases or baskets.



Now that is quick!!!



Step 5 – View the Model Outputs



There are a couple of functions that can be used to extract the rules produced in our previous step. These include:



GET_ASSOCIATION_RULES : This returns the rules from an association model.



SELECT rule_id, 
       antecedent, 
       consequent, 
       rule_support,


       rule_confidence


FROM TABLE(DBMS_DATA_MINING.GET_ASSOCIATION_RULES('assoc_model_2', 10));



The 10 here returns the top 10 records or rules.image GET_FREQUENT_ITEMSETS : returns a set of rows that represent the frequent item sets from an association model. In the following code we want the top 30 item sets to be returned, but filtered to only display item sets where there are 2 or more rules.



SELECT itemset_id,

       items,


       support,


       number_of_items


FROM TABLE(DBMS_DATA_MINING.GET_FREQUENT_ITEMSETS('assoc_model_2', 30))


WHERE number_of_items >= 2;


image

Tuesday, December 18, 2012

BIWA Summit–9th & 10th January, 2013

The BIWA Summit will be on the 9th and 10th January, 2013. It is being held in the Sofitel Hotel beside the Oracle HQ at Redwood Shores, just outside of San Francisco.

The BIWA Summit looks to be leading event in 2013 focused on Analytics, Data Warehousing, Big Data and BI. If you are a data architect or a data scientist this is certainly one event that you should consider attending in 2013.

All the big names (in the Oracle world) will be there Tom Kyte, Mark Rittman, Maria Colgan, Balaji Yelmanchili, Vaishnavi Sashikanth, Charlie Berger, Mark Hornick, Karl Rexter, Tim and Dan Vlamis.

Oh and then there is me. I’ll be giving a presentation on the Oracle Data Scientist. This will be on the first day of the event (9th) at 11:20am.

For anyone interest in the Oracle Data Scientist World there are lots of presentations to help you get start and up to speed in this area. Here is a list of presentations and hands on labs that I can recommend.

image

As is typical with all good conferences there are many presentations on at the same time that I would like to attend. If only I could time travel.

This is a great event to start off the new year and for everyone who is thinking of moving into or commencing a project in the area. So get asking you manager to see if there is any training budget left for 2012 or get first dibs on the training budget for 2013.

Registration is open and at the moment the early bird discount still seems to be available. You can also book a room in the hotel using the registration page.

To view the full agenda – click here

Tom Kyte in Dublin 21st January 2013

Tom Kyte will be back in Dublin on the 21st January, 2013.  He will be giving a number of presentations covering some of his popular Oracle Open World sessions and will also include a AskTom session

It will be a full day, kicking off at 9am and finishing around 3:30pm.

There is no better way to kick off the new year with a full day of FREE Oracle training and up skilling with Tom Kyte.

To register for the event send an email to marketing-ie_ie@oracle.com

As they say places are limited, so book early,  I have Smile so I’ll see you there.

Friday, December 14, 2012

OUG Ireland 2013–Call for Presentations

The call for presentations at the OUG Ireland Conference is now open. The conference will be on Tuesday 12th March in Dublin city centre.

It is hoped to have at a number of concurrent tracks covering all the main topic areas, including application development, database administration, business intelligence, applications, etc.

If you are interested in submitting a presentation then you need to fill in some of the detail at

OUG Ireland – Submit a Paper

Follow the OUG Ireland conversation on twitter using the tag  #oug_ire

call for papers

Thursday, November 29, 2012

Association Rules in ODM-Part 3

This is a the third part of a four part blog post on building and using Association Rules in Oracle Data Miner. The following outlines the contents of each post in the series on Association Rules

  1. This first part will focus on how to building an Association Rule model
  2. The second post will be on examining the Association Rules produced by ODM – This blog post
  3. The third post will focus on using the Association Rules on your data.
  4. The final post will look at how you can do some of the above steps using the ODM SQL and PL/SQL functions.

In my previous posts I showed how you can go about setting up for Association Rule analysis in Oracle Data Miner and how to examine the rules that are generated.

This post will focus on how we can extract and use these rules in Oracle Data Miner.

Step 1 – Model Details

Association Rules are an unsupervised method of data mining. In Oracle Data Miner we cannot use the Apply node to to score new data. What we have to do is to generate the Model Details. These in turn can then be used.

The Model Details node is used when we do unsupervised learning to extract the rules that are generated.

To do this we need to click on the Model Details node in the Models section of the Component Palette and then click on our workspace, just to the right of the Association Rule node.

The Edit Model Selection window will open. Connect the Association Rule node to the Model Details node. Then Run the node. This will then generate the Association Rules in a format what we can reuse.

image

When you get the small green tick on the Model Details node you can then view what was generated.

Right click on the Model Details node and click on View Details from the menu.

image

The output is similar to what we would have seen under the Association Rule node with the addition of a few more attributes that include the schema name and model name.

We can order the rules based on the Confidence level by double clicking on the Confidence column header. You might need to do this twice to get the rule appearing based on a descending confidence value.

At this point we can no look at persisting the Association Rules. See step 2 below.

We can also view the SQL that was used to generate the Association Rules that we see in the Model Details node. While still viewing the rules, click on the SQL tab.

image

Step 2 – Persisting the Association Rules

To make the rules persist and be useable outside of ODM we can persist the Association Rules in a table. The first step to do this is to create a new Table Node. This can be found under the Data section of the Component Palette. Click this Create Table or View node in the component palette and then click on the workspace, just to the right of the Model Details node.

Connect the Model Details node to the Output node, by right clicking on the Model Details node, select Connect from the menu and then click on the Output Node.

We can now edit the format of the Output i.e. specify what attributes are to be in our Output table. Double click on the Output node or right click and select Edit from the menu. We now get the Edit Create Table or View Node.

SNAGHTML18801036

We can give the output a meaningful name e.g. AR_OUTPUT_RULES. We can also specify what rule properties we can to export to attributes in out table.

We will need to un-tick the Auto Input Columns Selection tick box before we can remove any of the output attributes. In my case I only want to have ANTECENDENT_ITEMS, CONSEQUENT_ITEMS, ID, LENGTH, CONFIDENCE and SUPPORT in my out put. So I need to select and highlight all the other attributes (holding the control button). After selecting all the attributes I do not want included in the final output table, I need to click on the red X icon.

SNAGHTML18859128

When complete click on the OK button to go back to the workflow.

To generate the table right click on the AR_OUTPUT_RULES node and select Run from the menu. When you get the green tick mark on the AR_OUTPUT_RULES node the table has been created with records containing the details of each rules.

image

To view the contents of the AR_OUTPUT_RULES table we can right click on this node and select view data from the menu.

image

We can now use these rules in our applications.

 

Check out the next post in the series (Part 4) where we will look at the functionality available in the ODM SQL & PL/SQL functions to perform Association Rule analysis.

Tuesday, November 27, 2012

Association Rules in ODM–Part 2

This is a the second part of a four part blog post on building and using Association Rules in Oracle Data Miner.  The following outlines the contents of each post in the series on Association Rules

  1. This first part will focus on how to building an Association Rule model
  2. The second post will be on examining the Association Rules produced by ODM – This blog post
  3. The third post will focus on using the Association Rules on your data.
  4. The final post will look at how you can do some of the above steps using the ODM SQL and PL/SQL functions. 

In the previous post I looked at the steps needed to setup a data source and to setup the Association Rule node. When everything was setup we ran the workflow.

Step 1 – Viewing the Model

We the workflow has finished running we will have the green tick marks on each node. This is where we left thing at the end of the previous post (Part 1). To view the model details, right click on the Association Role Node and select View Models from the menu.

image

There are 3 main concepts that are important in relation to Association Rules:

  • Support: is the proportion of transactions in the data set that contain the item set i.e. the number of times the rule occurs
  • Confidence: is the proportion of the occurrences of the antecedent that result in the consequent e.g. how many times do we get C when we have A and B  {A, B} => C
  • Lift: indicates the strength of a rule over the random co-occurrence of the antecedent and the consequent

Support and Confidence are the primary measures that are used to access the usefulness of an association rule.

In our example we can see that the the antecedent and the consequent has numbers separated by the word AND. These numbers correspond to the product numbers.

Step 2 – Examining the Model Rules

To read the antecedent and the consequent for the first rule in our example we have:

Antecedent: 137 AND 143 AND 128

Consequent: 144

To read this association rule we would say that if a Customer bought product 137 and product 143 and product 128, then we have a Confidence value of almost 71%. This is a strong association.

We can check the ordering of the rules by changing the Sort By criteria. As Confidence and Support are the main ways to evaluate the rules, we can change the Sort By criteria to be Confidence. Then click on the Query button to refresh the rules section.

image

Here get a list of the strongest rules listed in descending order.

Below the section of the screen that has the Rules, we have the Rule Details section.

image 

Here we can see that the rule gets formatted into an IF statement. The first rule in the list has a confidence of almost 97%. As it is a simple IF statement it can be easily implemented in our applications.

We want use the information that these rules provides in a number of ways. One such consequence of these rules is that we can look at improving the ordering and distribution of these products to ensure that we have sufficient numbers of each. Another consequence is that we can enhance the front end selling mechanism to make sure that if a customer is buying product 114, 118 and 115 then we can remind the customer of product 119. We can also ensure that all these products are not located beside each other, so that the customer will have to walk past many other products in order to find them. That is why we never see milk and bread beside each other in a grocery store.

Step 3 – Applying Filters to the Model Rules

In the previous step we were able to sort our rules based on some of the measures of our Association Rules and to see how these rules are structured.

Association Rule Analysis can generate many thousands of possible rules for a small data set. In some cases the similar rules can appear and we can have lots of rules that occur so infrequently that they are perhaps meaningless.

ODM provides us with a number of filters that we can apply to the rules that enables use to look for the rules that are of must interest to use. We can access these filters by clicking on the More button, that is located just under the Query button.

We can refine our query on the rules based on the various measures and the number if items in the rule. In addition to this we can also filter based on the values of the items. This is particularly useful if we want to concentrate on specific items (in our example Products). To illustrate this use focus on the rules that involve Product 115. Click on the green + symbol on the right hand side of the window. Select 115 from the list provided. Next we need to decide if we want Product 115 involved in the Antecedent or the Consequent. In our example select the Consequent. This is located to the bottom right of the window. Then click the OK button and then click on the Query button to update the list of rules that correspond with the new filter.

image

We can see that we only have rules that have Product 115 in the Consequent column.

We can also see that we have 134 rules for this scenarios out of a total of 20,988 (your results might differ slightly to mine and that’s OK. It really depends on what version of the sample data you are using)

 

Check out the next post in the series (Part 3) where we will look at how you can use the Association Rules produced by ODM.

Friday, November 23, 2012

Association Rules in ODM–Part 1

This is a the first part of a four part blog post on building and using Association Rules in Oracle Data Miner. The following outlines the contents of each post in the series on Association Rules

  1. This first part will focus on how to building an Association Rule model
  2. The second post will be on examining the Association Rules produced by ODM – This blog post
  3. The third post will focus on using the Association Rules on your data.
  4. The final post will look at how you can do some of the above steps using the ODM SQL and PL/SQL functions.

image

The data set we will be using for Association Rule Analysis will be the sample data that comes with the SH schema in the database. Access to this schema and it’s data was setup when we created our data mining schema and ODM Repository.

Step 1 – Getting setup

As with all data mining projects you will need a workspace that will contain your workflows. Based on my previous ODM blog posts you will have already created a Project and some workflows. You can either reuse an existing workflow you have used for one of the other ODM modeling algorithms or you can create a new Workflow called Association Rules.

Step 2 – Define your Data Set

Assuming that your database has been setup to have the Sample schemas and their corresponding data, we will be using the data that is in the SH schema. In a previous post, I gave some instructions on setting up your database to use ODM and part of that involved a step to give your ODM schema access to the sample schema data.

We will start off by creating a Data Source Node. Click on the Data Source Node under the Component Palette. Then move your mouse to your your workspace area and click. A Data Source Node will be created and a window will open. Scroll down the list of Available Tables until you find the SH.SALES table. Click on this table and then click on the Next button. We want to include all the data so we can now click the Finish Button.

image

Our Data Source Node will now be renamed to SALES.

Step 3 – Setup the Association Build Node

Under the Model section of the Component Palette select Association. Move the mouse to your work area (and perhaps just the to right of the SALES node) click. Our Association Node will be created.

image

For the next step we need to join the our data source (SALES) with the Association Build Node. Right click on the SALES data node and select Connect from the drop down menu. Then move the mouse to the Association Build node and click. You should now have the two nodes connected.

We will now get the Edit Association Build Node property window opening for us. We will need to enter the following information:

  • Transaction ID: This is the attribute(s) that can be used to uniquely identify each transaction. In our example the Customer ID and the Time ID of the transaction allows us to identify what we want to analyse by i.e. the basket. This will group all the related transactions together
  • Item ID: What is the attribute of the thing you want to analyse. In our case we want to analyse the Products purchased, so select PROD_ID in this case
  • Value: This is an identifier used to specify another column with the transaction data to combine with the Item ID. <Existence> means that you want to see if there are any type of common bundling among all values of the selected Item ID. Use this.

image

Like all data mining products, Oracle has just one Algorithm to use for Association Rule Analysis, the Apriori Algorithm.

Click the OK button. You are now ready to run the Association Build Node. Right click on the node and select Run from the menu. After a short time everything should finish and we will have the little green tick makes on each of the nodes.

image

 

Check out the next post in the series (Part 2) where we will look at how you can examine the rules produced by our model in ODM.