Monday, March 13, 2017
Tuesday, July 5, 2016
This is the fourth and last blog post in a series that looks at how you can examine the details of predicted clusters using Oracle Data Mining. In the previous blog posts I looked at how to use CLUSTER_ID, CLUSTER_PROBABILITY and CLUSTER_SET.
In this blog post we will look at CLUSTER_DISTANCE. We can use this function to determine how close a record is to the centroid of its cluster. Perhaps we can use this to determine which customers, etc., we might want to focus on most. The customers who are closest to the centroid are the ones we want to focus on first. So we can use it as a way to prioritise our workflows, particularly when it is used in combination with the value of CLUSTER_PROBABILITY.
Here is an example of using CLUSTER_DISTANCE to list all the records that belong to Cluster 14 and the results are ordered based on closeness to the centroid of this cluster.
SELECT customer_id,
       cluster_probability(clus_km_1_37 USING *) as cluster_Prob,
       cluster_distance(clus_km_1_37 USING *) as cluster_Distance
FROM   insur_cust_ltv_sample
WHERE  cluster_id(clus_km_1_37 USING *) = 14
order by cluster_Distance asc;
Here is a subset of the results from this query.
When you examine the results you may notice that the record listed first, the one closest to the centre of cluster 14, has a very low probability. You need to remember that we are working in an N-dimensional space here. Although this first record is closest to the centre of cluster 14, it has a really low probability, and if we examine this record in more detail we will find that it sits at an overlapping point between a number of clusters.
This is why we need to use the CLUSTER_DISTANCE and CLUSTER_PROBABILITY functions together in our workflows and applications to determine how we need to process records like these.
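To make that idea a little more concrete, here is a minimal Python sketch of how an application might combine the two values once they have been fetched from the database. It is only an illustration: the ScoredRecord class, the triage function, the 0.5 threshold and the sample rows are all made up for this example and are not part of Oracle Data Mining.

```python
from dataclasses import dataclass


@dataclass
class ScoredRecord:
    customer_id: str
    cluster_prob: float      # value as returned by CLUSTER_PROBABILITY
    cluster_distance: float  # value as returned by CLUSTER_DISTANCE


def triage(records, min_prob=0.5):
    """Order records by closeness to the centroid, then split on probability.

    min_prob is a hypothetical threshold chosen for illustration only.
    Returns (focus, review); both lists are ordered closest-first.
    """
    ordered = sorted(records, key=lambda r: r.cluster_distance)
    focus = [r for r in ordered if r.cluster_prob >= min_prob]
    review = [r for r in ordered if r.cluster_prob < min_prob]
    return focus, review


# Made-up rows standing in for the output of the Cluster 14 query.
rows = [
    ScoredRecord("A", 0.12, 1.8),  # closest to the centroid, but low probability
    ScoredRecord("B", 0.91, 2.3),
    ScoredRecord("C", 0.77, 2.9),
]

focus, review = triage(rows)
print([r.customer_id for r in focus])   # records to act on first
print([r.customer_id for r in review])  # records that need a closer look
```

The record that is nearest the centroid ends up in the review list rather than at the top of the focus list, which is the kind of special handling an overlapping record may need.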
Thursday, June 23, 2016
This is the third blog post in my series on examining the clusters that were predicted by an Oracle Data Mining model. Check out the previous blog posts.
- Part 1 - Examining predicted Clusters and Cluster details using SQL
- Part 2 - Cluster Details with Oracle Data Mining
In the previous posts we were able to list the predicted cluster for each record in our data set. This is the cluster that the record belongs to the most. I also mentioned that a record could belong to many clusters.
So how can you list all the clusters that a record belongs to?
You can use the CLUSTER_SET SQL function. It returns an array containing the set of all clusters that the record belongs to, with the Cluster Id and a probability measure for each cluster.
The following example illustrates how to use the CLUSTER_SET function for a particular cluster model.
SELECT t.customer_id,
       s.cluster_id,
       s.probability
FROM   (select customer_id,
               cluster_set(clus_km_1_37 USING *) as Cluster_Set
        from   insur_cust_ltv_sample
        WHERE  customer_id in ('CU13386', 'CU100')) T,
       TABLE(T.cluster_set) S
order by t.customer_id, s.probability desc;
The output from this query will be an ordered data set based on the customer id and then the clusters listed in descending order of probability. The cluster with the highest probability is what would be returned by the CLUSTER_ID function. The output from the above query is shown below.
If you would like to see the details of each of the clusters and to examine the differences between these clusters then you will need to use the CLUSTER_DETAILS function (see previous blog post).
You can specify topN and cutoff to limit the number of clusters returned by the function. By default, both topN and cutoff are null and all clusters are returned.
- topN is the N most probable clusters. If multiple clusters share the Nth probability, then the function chooses one of them.
- cutoff is a probability threshold. Only clusters with probability greater than or equal to cutoff are returned. To filter by cutoff only, specify NULL for topN.
You may want to use these individually or combined together if you have a large number of customers. To return up to the N most probable clusters that are greater than or equal to cutoff, specify both topN and cutoff.
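If it helps to see the two rules side by side, the following Python sketch mimics them on a plain list of (cluster id, probability) pairs. It is my own illustration of the behaviour described above, not how the database implements CLUSTER_SET; the limit_cluster_set function and the sample probabilities are invented for the example.

```python
def limit_cluster_set(cluster_set, top_n=None, cutoff=None):
    """Mimic the topN and cutoff rules on (cluster_id, probability) pairs.

    With both arguments left as None, every cluster is returned,
    ordered from most to least probable.
    """
    ranked = sorted(cluster_set, key=lambda pair: pair[1], reverse=True)
    if cutoff is not None:
        # Keep only clusters with probability >= cutoff.
        ranked = [(cid, prob) for cid, prob in ranked if prob >= cutoff]
    if top_n is not None:
        # Keep at most the N most probable clusters.
        ranked = ranked[:top_n]
    return ranked


# Made-up probabilities for a single record.
sample = [(14, 0.45), (3, 0.30), (7, 0.15), (9, 0.06), (1, 0.04)]

print(limit_cluster_set(sample))                        # all clusters
print(limit_cluster_set(sample, top_n=4))               # top 4 only
print(limit_cluster_set(sample, cutoff=0.05))           # probability >= 0.05
print(limit_cluster_set(sample, top_n=2, cutoff=0.05))  # both combined
```

Applying the cutoff first and then taking the first N entries gives "up to the N most probable clusters that are greater than or equal to cutoff".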
The following example illustrates using the topN value to return the top 4 clusters.
SELECT t.customer_id,
       s.cluster_id,
       s.probability
FROM   (select customer_id,
               cluster_set(clus_km_1_37, 4, null USING *) as Cluster_Set
        from   insur_cust_ltv_sample
        WHERE  customer_id in ('CU13386', 'CU100')) T,
       TABLE(T.cluster_set) S
order by t.customer_id, s.probability desc;
The output from this query shows only 4 clusters displayed for each record.
Alternatively you can select the clusters based on a cut off value for the probability. In the following example this is set to 0.05.
SELECT t.customer_id,
       s.cluster_id,
       s.probability
FROM   (select customer_id,
               cluster_set(clus_km_1_37, NULL, 0.05 USING *) as Cluster_Set
        from   insur_cust_ltv_sample
        WHERE  customer_id in ('CU13386', 'CU100')) T,
       TABLE(T.cluster_set) S
order by t.customer_id, s.probability desc;
The output this time looks a bit different.
Finally, yes, you can combine these two parameters to work together.
SELECT t.customer_id,
       s.cluster_id,
       s.probability
FROM   (select customer_id,
               cluster_set(clus_km_1_37, 2, 0.05 USING *) as Cluster_Set
        from   insur_cust_ltv_sample
        WHERE  customer_id in ('CU13386', 'CU100')) T,
       TABLE(T.cluster_set) S
order by t.customer_id, s.probability desc;
Tuesday, June 7, 2016
The 4 blog posts will consist of:
- 1 - (this blog post) will look at how to determine the predicted cluster and cluster probability for your record.
- 2 - will show you how to examine the details behind and used to predict the cluster.
- 3 - A record could belong to many clusters. In this blog post we will look at how you can determine what clusters a record can belong to.
- 4 - Cluster distance is a measure of how far the record is from the cluster centroid. As a data point or record can belong to many clusters, it can be useful to know the distances as you can build logic to perform different actions based on the cluster distances and cluster probabilities.
Right. Let's have a look at the first set of these cluster functions. These are CLUSTER_ID and CLUSTER_PROBABILITY.
CLUSTER_ID : Returns the number of the cluster that the record most closely belongs to. This is measured by the cluster distance to the centroid of the cluster. A data point or record can belong to, or be part of, many clusters. So the CLUSTER_ID is the cluster number that the data point or record most closely belongs to.
CLUSTER_PROBABILITY : A probability measure of the likelihood that the data point or record belongs to a cluster. The cluster with the highest probability score is the cluster that is returned by the CLUSTER_ID function.
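The relationship between the two functions can be sketched in a few lines of Python. This is only a toy illustration of "the cluster with the highest probability is the one that is returned"; the predicted_cluster function and the probabilities in it are made up and say nothing about how the model calculates them.

```python
def predicted_cluster(probabilities):
    """Return (cluster_id, probability) for the most probable cluster.

    probabilities maps a cluster id to the record's probability of
    belonging to that cluster.
    """
    cluster_id = max(probabilities, key=probabilities.get)
    return cluster_id, probabilities[cluster_id]


# Made-up per-cluster probabilities for one record.
sample = {14: 0.62, 3: 0.25, 7: 0.13}

cluster_id, cluster_prob = predicted_cluster(sample)
print(cluster_id, cluster_prob)  # the pair the two SQL functions would report
```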
Now let us have a quick look at the SQL for these two functions. This first query returns the cluster number that each record most strongly belongs to.
SELECT customer_id,
       cluster_id(clus_km_1_37 USING *) as Cluster_Id
FROM   insur_cust_ltv_sample
WHERE  customer_id in ('CU13386', 'CU6607', 'CU100');
Now let us add in the cluster probability function.
SELECT customer_id,
       cluster_id(clus_km_1_37 USING *) as Cluster_Id,
       cluster_probability(clus_km_1_37 USING *) as cluster_Prob
FROM   insur_cust_ltv_sample
WHERE  customer_id in ('CU13386', 'CU6607', 'CU100');
These functions give us some insight into what the cluster predictive model is doing. In the remaining blog posts in this series I will look at how you can delve deeper into the predictions that the cluster algorithm is making.
Thursday, March 17, 2016
In a previous blog post I showed how you can install and get started with using RStudio on a server by using RStudio Server. My previous post showed how you could do that on the Oracle BigDataLite VM. On this VM everything was nicely scripted and set up for you. But when it comes to installing it on a different server, well things can be a bit different.
The purpose of this blog post is to go through the install steps you need to follow on your own server or Oracle Database server. The following is based on a server that is set up with Oracle Linux. (I'm actually using the Oracle DB Developer VM).
1. Download the latest version of RStudio Server.
Use the following command to download RStudio Server, but first do a quick check on the RStudio website to get the current version number.
wget https://download2.rstudio.org/rstudio-server-rhel-0.99.892-x86_64.rpm
The following shows you what you will see when you run this command.
--2016-03-16 06:22:30-- https://download2.rstudio.org/rstudio-server-rhel-0.99.892-x86_64.rpm
Resolving download2.rstudio.org (download2.rstudio.org)... 220.127.116.11, 18.104.22.168, 22.214.171.124, ...
Connecting to download2.rstudio.org (download2.rstudio.org)|126.96.36.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38814908 (37M) [application/x-redhat-package-manager]
Saving to: ‘rstudio-server-rhel-0.99.892-x86_64.rpm’
100%[============================================================>] 38,814,908 6.54MB/s in 6.0s
2016-03-16 06:22:37 (6.17 MB/s) - ‘rstudio-server-rhel-0.99.892-x86_64.rpm’ saved [38814908/38814908]
2. Install RStudio Server
sudo yum install --nogpgcheck rstudio-server-rhel-0.99.892-x86_64.rpm
When prompted if it is OK to install, enter y at the "Is this ok [y/d/N]:" prompt.
Loaded plugins: langpacks
Examining rstudio-server-rhel-0.99.892-x86_64.rpm: rstudio-server-0.99.892-1.x86_64
Marking rstudio-server-rhel-0.99.892-x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package rstudio-server.x86_64 0:0.99.892-1 will be installed
--> Finished Dependency Resolution
ol7_UEKR3/x86_64 | 1.2 kB 00:00:00
ol7_addons/x86_64 | 1.2 kB 00:00:00
ol7_latest/x86_64 | 1.4 kB 00:00:00
ol7_optional_latest/x86_64 | 1.2 kB 00:00:00
Dependencies Resolved
===========================================================================================================
Package Arch Version Repository Size
===========================================================================================================
Installing:
rstudio-server x86_64 0.99.892-1 /rstudio-server-rhel-0.99.892-x86_64 280 M
Transaction Summary
===========================================================================================================
Install 1 Package
Total size: 280 M
Installed size: 280 M
Is this ok [y/d/N]: y
Downloading packages:
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : rstudio-server-0.99.892-1.x86_64 1/1
groupadd: group 'rstudio-server' already exists
rsession: no process found
ln -s '/etc/systemd/system/rstudio-server.service' '/etc/systemd/system/multi-user.target.wants/rstudio-server.service'
rstudio-server.service - RStudio Server
Loaded: loaded (/etc/systemd/system/rstudio-server.service; enabled)
Active: active (running) since Wed 2016-03-16 10:46:00 PDT; 1s ago
Process: 3191 ExecStart=/usr/lib/rstudio-server/bin/rserver (code=exited, status=0/SUCCESS)
Main PID: 3192 (rserver)
CGroup: /system.slice/rstudio-server.service
├─3192 /usr/lib/rstudio-server/bin/rserver
└─3205 /usr/lib64/R/bin/exec/R --slave --vanilla -e cat(R.Version()$major,R.Version()$minor,~+~sep=".")
Mar 16 10:46:00 localhost.localdomain systemd: Started RStudio Server.
Verifying : rstudio-server-0.99.892-1.x86_64 1/1
Installed:
rstudio-server.x86_64 0:0.99.892-1
Complete!
3. Open RStudio using a web browser.
Open your favourite web browser and put in the host name or the IP address of your server. In my example I'm using the Oracle DB Developer VM to demonstrate the install, so I can use localhost, followed by the port number for RStudio Server, which is 8787.
Log in using your server username and password. This is oracle/oracle on the VM.
4. Use and Enjoy
If you get logged into RStudio Server then you will see a screen something like the following!
Job Done and Enjoy!
5. An Extra Step if using the Oracle DB Developer VM
If you want to use RStudio on the Oracle DB Developer VM from your local OS, then you will need to open port 8787 on the VM. To do this, power down the VM if you have it open. Then open the Network section of the VM settings (I'm using VirtualBox) and click on Port Forwarding.
Click on OK to save your Port Forwarding setting and then click on the OK button again to close the Network settings for the VM.
Now start up the VM. When it has loaded and you have the desktop displayed in the VM window, you should now be able to connect to RStudio in the VM, from your local machine.
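To do this, open your web browser on your local machine and type in the address of the forwarded port (for example, localhost followed by port 8787 if that is the host port you set up).
You should now get the RStudio login screen that is shown in point 3 above. Go ahead, log in and enjoy.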
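If the login screen does not appear, one quick thing to check is whether anything is answering on the forwarded port at all. The following is a small Python sketch for doing that from your local machine. It is my own helper, not part of RStudio Server or VirtualBox, and it assumes the forwarded host port is 8787; change the host and port to match your own setup.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing is listening on the other end.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling port_is_open("localhost", 8787) from the local machine returns False if the VM is not running or the Port Forwarding rule is missing. A True result only tells you that something accepted the connection, not that it is RStudio Server.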
6. A little warning
Make sure to log out of RStudio when you are finished using it. If you don't, then your R environment may not have been saved and you will get a message when you log in next. Now we don't want that happening, so just log out of RStudio. You can do that by looking at the top right hand corner of the RStudio Server application.
I will have one more blog post on how you can configure RStudio Server to work with an Oracle Database server that has Oracle R Enterprise installed.
Monday, March 14, 2016
A very popular tool for data scientists is RStudio. This tool allows you to interactively work with your R code, view the R console, the graphs and charts you create, manage the various objects and data frames you create, as well as having easy access to the R help documentation. Basically it is a core everyday tool.
The typical approach is to have RStudio installed on your desktop or laptop. What this really means is that the data is pulled to your desktop or laptop and all analytics is performed there. In most cases this is fine, but as your data volumes grow, so do the limitations of using R on your local machine.
An alternative is to install a version called RStudio Server on an analytics server or on the database server. You can now use the computing capabilities of this server to overcome some of the limitations of using R or RStudio locally. Now you will use your web browser to access RStudio Server on your database server.
In this blog post I will walk you through how to install and get connected to RStudio Server on the Oracle BigDataLite VM.
After starting up the Oracle BigDataLite VM and logging into the Oracle user (password=welcome1) you will see the Start Here icon on the desktop. You will need to double click on this.
This will open a webpage on the VM that contains details of all the various tools that are installed on the VM or are ready for you to install and configure. This information contains all the http addresses and ports you need to access each of these tools via a web browser or some other way, along with the usernames and passwords you need to use them.
One of the tools listed is RStudio Server. This product is not installed on the VM but Oracle has provided a script that you can run to perform the install in an automated way. This script is located in:
[oracle@bigdatalite ~]$ cd /home/oracle/scripts/
Use the following command to run the RStudio Server install script.
[oracle@bigdatalite scripts]$ ./install_rstudio.sh
The following is the output from running this script and it will be displayed in your terminal window. You can use this to monitor the progress of the installation.
Retrieving RStudio
--2016-03-12 02:06:15-- https://download2.rstudio.org/rstudio-server-rhel-0.99.489-x86_64.rpm
Resolving download2.rstudio.org... 188.8.131.52, 184.108.40.206, 220.127.116.11, ...
Connecting to download2.rstudio.org|18.104.22.168|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34993428 (33M) [application/x-redhat-package-manager]
Saving to: `rstudio-server-rhel-0.99.489-x86_64.rpm'
100%[======================================>] 34,993,428 5.24M/s in 10s
2016-03-12 02:06:26 (3.35 MB/s) - `rstudio-server-rhel-0.99.489-x86_64.rpm' saved [34993428/34993428]
Installing RStudio
Loaded plugins: refresh-packagekit, security, ulninfo
Setting up Install Process
Examining rstudio-server-rhel-0.99.489-x86_64.rpm: rstudio-server-0.99.489-1.x86_64
Marking rstudio-server-rhel-0.99.489-x86_64.rpm to be installed
public_ol6_UEKR3_latest | 1.2 kB 00:00
public_ol6_UEKR3_latest/primary | 22 MB 00:03
public_ol6_UEKR3_latest 568/568
public_ol6_latest | 1.4 kB 00:00
public_ol6_latest/primary | 55 MB 00:12
public_ol6_latest 33328/33328
Resolving Dependencies
--> Running transaction check
---> Package rstudio-server.x86_64 0:0.99.489-1 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
================================================================================
Package Arch Version Repository Size
================================================================================
Installing:
rstudio-server x86_64 0.99.489-1 /rstudio-server-rhel-0.99.489-x86_64 251 M
Transaction Summary
================================================================================
Install 1 Package(s)
Total size: 251 M
Installed size: 251 M
Downloading Packages:
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : rstudio-server-0.99.489-1.x86_64 1/1
useradd: user 'rstudio-server' already exists
groupadd: group 'rstudio-server' already exists
rsession: no process killed
rstudio-server start/running, process 5037
Verifying : rstudio-server-0.99.489-1.x86_64 1/1
Installed:
rstudio-server.x86_64 0:0.99.489-1
Complete!
Restarting RStudio
rstudio-server stop/waiting
rsession: no process killed
rstudio-server start/running, process 5066
When the installation is finished you are now ready to connect to the RStudio Server. So open your web browser and enter the following into the address bar.
The initial screen you are presented with is a login screen. Enter your Linux username and password. In the case of the BigDataLite VM this will be oracle/welcome1.
Then you will be presented with the RStudio Server application in your web browser, as shown below. As you can see it is very similar to using RStudio on your desktop. Happy Days! You are now set up and able to run RStudio on the database server.
Make sure to log out of RStudio Server before closing down the window.
If you don't log out of RStudio Server then the next time you open RStudio Server your session will automatically open. Perhaps this is not the best for security, so try to remember to log out each time.
By using RStudio Server on the Oracle Database server I can now get some of the benefits of the computing capabilities of this server, although there are still the typical limitations of using R. But now I can access RStudio on the database server and process the data on the database server, all from my local PC or laptop.
Everything is nicely set up and ready for you to install on the BigDataLite VM (thank you Oracle). But what about when we want to install RStudio Server on a different server? What are the steps necessary to install, configure and log in? Yes, they should be similar, but I will give a complete list of steps in my next blog post.
Thursday, November 5, 2015
At Oracle Open World (OOW15) I gave 2 presentations on the Sunday during the Oracle User Group Forum. The slides are now available for download from the Oracle Open World website.
Go get them now!
During this session I was one of 16 presenters talking about various features in the Oracle Database. All of the presenters were from the EOUC region.
I co-presented with Antony Heljula from Peak Indicators. During this presentation we talked about some of the Advanced Analytics projects we have worked on over the past 18-24 months. We also announced a new Analytics-as-a-Service offering.
The slides are also available for most of the other Oracle Open World Presentations and these can be accessed here. Just go search for the topic you are interested in.
Thursday, March 12, 2015
Everyone is doing advanced analytics. Right? Hmm
Everyone is talking about advanced analytics? Yes that is true.
Everyone is an expert in advanced analytics? This is so not true. Watch out for these Great Pretenders. You know what I mean! You know who I mean! Maybe you know some of them already? If not, watch out for these Great Pretenders!!!
Some people are going around talking about data mining, predictive analytics, advanced analytics, machine learning etc as if this is some new topic. Well it isn't. It isn't anything new and most of the techniques have been about for 10, 20, 30+ years.
Some people are saying you should only use language X or tool Y, just because. Everything else is basically rubbish.
What we do have is a wider understanding of how to use these techniques on our various data sources.
What we have is a lot more tools that allow us to perform these tasks a lot easier, at greater speed, with more functionality and without the need to fully understand the hard core maths that is going on behind the scenes.
What we have is a lot more languages to perform these tasks and to support the vast amount of work that goes into understanding the data and preparing the data.
Something for all of us to watch out for, when we read about these topics, is what kind of problem area they are addressing. The following table illustrates the three main types or categories of Analytics. These categories are Descriptive Analytics, Predictive Analytics and Prescriptive Analytics. I think most people would agree that the Descriptive and Predictive Analytics categories are very mature at this stage. With Prescriptive Analytics we are perhaps still evolving, and a lot more work needs to be done before this becomes widespread.
Some people talk as if Predictive Analytics is some new and exciting topic. But it isn't all that new. It has been around for the past 30+ years. If you go back over the Gartner Hype Cycle that comes out every September, Predictive Analytics is no longer being shown on this graph. The last time it appeared on the Gartner Hype Cycle was back in 2013, and it was positioned on the far right of the graph in the section called Plateau of Productivity.
So Predictive Analytics is very mature and main stream. Part of the reason that it is main stream is that Predictive Analytics has allowed for a new category of Analytics to evolve and this is Automatic Analytics.
Automatic Analytics is where Advanced and Predictive Analytics has been built into the day to day applications that are used to run our business. We do not need the hard core type of data scientists to perform various analytics on our data. Instead these tasks, once they have been defined, can be added to our applications to process, evaluate and make decisions, all automatically. This is where we need the data scientists to be able to communicate with the business and to work with them to solve real world business problems. This is a different type of data scientist to the "hard" core data scientist who delves into the various statistical methods, machine learning methods, data management methods, etc.
The following table extends the table given above to include Automatic Analytics, and is my own take on how and where Automatic Analytics fits.
Every time we get an insurance quote or a health insurance quote, get a "random" call from our Telco offering a free upgrade, get our loyalty card statements, get a loan from the bank, or look at or buy a book on Amazon (the list could go on and on), these are all examples of how predictive analytics has been automated into our everyday business applications.
But this is nothing new. When I first got into data mining/predictive analytics over 16 years ago, it was considered a common thing that certain types of companies did. What has happened in the time since and particularly in the past few years is that a lot more people are seeing the value in using it.
Before I finish off this post we can have a quick look at what Oracle has been doing in this area. They have their Advanced Analytics Option and Real-Time Decisions tools to allow data scientists to do their magic. But over the past X years (nobody can give me an exact number) they have been very, very active in building lots and lots of predictive analytics into their various business applications, particularly into Fusion Apps and BI Apps.
A recent quote from Oracle highlights their aim with this,
" ... products designed to close the gap between data scientists and businesses."
Now with Oracle making a big push to the cloud, they are busy adding more and more Automatic (Predictive) Analytics into their Cloud Applications. What we need from Oracle is a clearer identification of where they have done this. Plus, with the migration of their Apps to the cloud, their Advanced Analytics Option is a core part of their Cloud platform. As they upgrade or add new features to their Cloud Apps, you will be able to get the benefit of these Automatic (Predictive) Analytics as they become available.