Friday, February 12, 2016

Spark versus Flink

Spark is an open source Apache project that provides a framework for multi stage in-memory analytics. Spark is based on the Hadoop platform and can interface with Cassandra OpenStack Swift, Amazon S3, Kudu and HDFS. Spark comes with a suite of analytic and machine learning algorithm allowing you to perform a wide variety of analytics on you distribute Hadoop platform. This allows you to generate data insights, data enrichment and data aggregations for storage on Hadoop and to be used on other more main stream analytics as part of your traditional infrastructure. Spark is primarily aimed at batch type analytics but it does come with a capabilities for streaming data. When data needs to be analysed it is loaded into memory and the results are then written back to Hadoop.


Flink is another open source Apache project that provides a platform for analyzing and processing data that is in a distributed stream and/or batch data processing. Similarly to Spark, Flink comes with a set of APIs that allows for each integration in with Java, Scala and Python. The machine learning algorithms have been specifically tuned to work with streaming data specifically but can also work in batch oriented data. As Flink is focused on being able to process streaming data, it run on Yarn, works with HDFS, can be easily integrated with Kafka and can connect to various other data storage systems.


Although both Spark and Flink can process streaming data, when you examine the underlying architecture of these tools you will find that Flink is more specifically focused for streaming data and can process this data in a more efficient manner.

There has been some suggestions in recent weeks and months that Spark is now long the tool of choice for analytics on Hadoop. Instead everyone should be using Flink or something else. Perhaps it is too early to say this. You need to consider the number of companies that have invested significant amount of time and resources building and releasing products on top of Spark. These two products provide similar-ish functionality but each product are designed to process this data in a different manner. So it really depends on what kind of data you need to process, if it is bulk or streaming will determine which of these products you should use. In some environments it may be suitable to use both.

Will these tool replace the more traditional advanced analytics tools in organisations? the simple answer is No they won't replace them. Instead they will complement each other and if you have a Hadoop environment you will will probably end up using Spark to process the data on Hadoop. All other advanced analytics that are part of your more traditional environments you will use the traditional advanced analytics tools from the more main stream vendors.

Tuesday, February 2, 2016

ORE video : Demo Code Part 5

The following is the fifth and final set of demo code from my video on using R in the Oracle Database. Check out the video before using the following code. The blog post for the video will be updated to contain links to all blog posts that have the various demo code.

The following examples illustrate how you can use the Oracle R Enterprise capabilities within SQL and PL/SQL. The following illustrate building a GLM model using the glm algorithm that comes with the R language, and then uses this mode to score or label new data that is stored in a table. The last part of the example illustrates how you can perform What-If analysis using this ORE model

-- Build & save the R script, called Demo_GLM in the DB
--  This builds a GLM  DM model in the DB
      'function(dat,datastore_name) {
          mod <- glm(AFFINITY_CARD ~ CUST_GENDER + AGE + CUST_MARITAL_STATUS + COUNTRY_NAME + CUST_INCOME_LEVEL + EDUCATION + HOUSEHOLD_SIZE + YRS_RESIDENCE, dat, family = binomial()), name=datastore_name, overwrite=TRUE)   }');

-- After creating the script you need to run it to create the GLM model
select * 	
from table(rqTableEval(
             cursor(select CUST_GENDER,
                    from mining_data_build_v),
             cursor(select 1 as "ore.connect", 'myDatastore' as "datastore_name” from dual),
                  'XML', 'Demo_GLM' ));

-- There are 2 ways to use the GLM model : in Batch and in Real-Time mode
-- First Step : Build the in-database R script to score you new data
      'function(dat, datastore_name) {
      prd <- predict(mod, newdata=dat)
      prd[as.integer(rownames(prd))] <- prd
      res <- cbind(dat, PRED = prd)

-- Now you can run the script to score the new data in Batch model
--   The data is located in the table MINING_DATA_APPLY
select * from table(rqTableEval(
                     from   MINING_DATA_APPLY_V
                     where rownum <= 10),
             cursor(select 1 as "ore.connect", 'myDatastore' as "datastore_name" from dual),
order by 1, 2, 3;

-- Now let us use the Demo_GLM_Batch script to score data in Real-Time
--  The data values are passed to the GLM model
select * from table(rqTableEval(
              cursor(select 'M' CUST_GENDER,
                            23 AGE,
                            'Married' CUST_MARITAL_STATUS, 
                            'United States of America' COUNTRY_NAME,
                            'B: 30,000 - 49,999' CUST_INCOME_LEVEL, 
                            'Assoc-A' EDUCATION,
                            '3' HOUSEHOLD_SIZE, 
                            5 YRS_RESIDENCE
                     from dual),
              cursor(select 'myDatastore' "datastore_name", 1 "ore.connect" from dual),
              'Demo_GLM_Batch')) order by 1, 2, 3; 

Wednesday, January 27, 2016

BIWA Summit : HOL notes for ODM using SQL

The following notes and documents are for the BIWA Summit 2016 attendees who are taking the HOL on using ODM using SQL and PL/SQL.

Getting connected to the Cloud

The following document outlines the steps you need to perform to get connected to the Cloud Database we are using.

If you have attended the HOL (yesterday) then you can reuse the same connection details (and number).

If you didn't attend the HOL yesterday then you will need to be assigned a Number Ask me or Charlie for this. After you get assigned a number then follow the instructions.


It might be a good idea to download this document to your local machine.

HOL Notes

The following links are for the HOL documents. The first is the slides we will work through and this document contains the exercises that you will complete.


The second document is demonstration script and contains all the code that is in the HOL slides document. Download load this file and open it in Worksheet in SQL Developer that is connected to your schema in the database. (Do not use the ODM connections and no need to open the ODM GUI).


Wednesday, January 20, 2016

BIWA Summit 2016

The annual BIWA Summit 2016 will be next week from the 26-28 January, and it is back in Oracle HQ at Redwood Shores. If you are into the Oracle Database, Business Intelligence, Big Data, Advanced Analytics, etc then this is the conference for you.


Over the 3 days there is an action packed agenda of 5 parallel tracks, plus a full 3 days of Hands-on Labs. The agenda is filled with the whos-who of the Oracle BI and Analytics world, so if this is your area then BIWA Summit is the conference for you and your training budget. (I'm sure it is not too late to book your place)

I've been lucky this year in that I will have 2 Hands-on Labs and 1 presentation to give. Yes that is 5 hours of presenting/hosting to do. The presentation I will be giving is 'Is Oracle SQL the best language for Statistics?' (on Tuesday 26th). This presentation is listed for the BIWA Summit and also for the NoCOUG Yes SQL conference that is running at the same time in the same venue (i.e. co-located). I've also written a brand new 2-hour Hands-on Lab titled 'Predictive Analytics using SQL and PL/SQL'. The first outing for this will be on Wednesday 27th. I will also be co-hosting, with Charlie Berger, the 'Learn Predictive Analytics in 2hours with Oracle Data Miner' Hands-on Lab on Tuesday 26th.

Come to my Hands-on Labs to be in with chance to win a copy of my book on Oracle Data Mining.


Hopefully I'll see you there!

Check out the full agenda by clicking on the image below.


Tuesday, January 19, 2016

ORE video : Demo Code Part 4

The following is the fourth set of demo code from my video on using R in the Oracle Database. Check out the video before using the following code. The blog post for the video will be updated to contain links to all blog posts that have the various demo code.

The following code example illustrate how you can build a Data Mining model using the in-database data mining algorithms. In this example a Decision Tree model is created. This model is then applied to new data, scoring this data with the predicted values.

> #
> # Build am in-database ODM Decision Tree
> #
> dtData <- ore.get("MINING_DATA_BUILD_V")
> # Create a ODM DT model in the DB : Only a temporary model. It is deleted when you logout
> dtModel <- ore.odmDT(AFFINITY_CARD ~ ., dtData)
> # View the details of the ODM model
> #summary(dtModel)
> names(dtModel)
 [1] "name"          "settings"      "attributes"    "costs"         "distributions”
 [6] "nodes"         "formula"       "extRef"        "call"         
> dtModel$name
 [1] "ORE$208_210”
> dtModel$settings
                          value                    on
impurity.metric   impurity.gini
term.max.depth                7
term.minpct.node           0.05
term.minpct.split           0.1
term.minrec.node             10
term.minrec.split            20
> dtModel$attributes
                 name        type data.type data.length precision scale
1       AFFINITY_CARD categorical    number          22         0     0      TRUE
2                 AGE   numerical    number          22        NA    NA     FALSE
3 CUST_MARITAL_STATUS categorical  varchar2          20        NA    NA     FALSE
4           EDUCATION categorical  varchar2          21        NA    NA     FALSE
5      HOUSEHOLD_SIZE categorical  varchar2          21        NA    NA     FALSE
6          OCCUPATION categorical  varchar2          21        NA    NA     FALSE
7       YRS_RESIDENCE   numerical    number          22        NA    NA     FALSE

> ## Compute the Compusion Matrix
> dtResults <- predict(dtModel, dtData, "AFFINITY_CARD")
> with(dtResults, table(AFFINITY_CARD, PREDICTION))
            0 1056   64
            1  201  179
> ## How do you persist the model in the DB
> ##     Rename and save the model in the database
> dtModel$name
 [1] "ORE$208_210"

> ## Save the ODM model in the in-database R datastore
>, name = "ORE_MODELS", overwrite=TRUE)
> ore.load(name = "ORE_MODELS")
 [1] "dtModel"

> ## Score new data using the DM Model
> ore.sync(table = c("MINING_DATA_APPLY"))
 [7] "MINING_DATA_TEST_V"   > dtApply <- ore.get("MINING_DATA_APPLY")
> dim(dtApply)
 [1] 1500   18
> class(dtApply)
 [1] "ore.frame”
 [1] "OREbase”
> DTAPPLY <- ore.push(dtApply)
> dtApplyResult <- predict(dtModel, DTAPPLY)

> dtApplyResult <- predict(dtModel, DTAPPLY)
> head(dtApplyResult)
             '0'        '1' PREDICTION
100001 0.9521912 0.04780876          0
100002 0.9521912 0.04780876          0
100003 0.9521912 0.04780876          0
100004 0.9521912 0.04780876          0
100005 0.2633745 0.73662551          1
100006 0.9521912 0.04780876          0
> dim(dtApplyResult)
 [1] 1500    3
> dim(dtApply)
 [1] 1500   18
> dtResults <- cbind(dtApply, dtApplyResult)
> dim(dtResults)
 [1] 1500   21
> ore.drop(table = "DEMO_R_APPLY_RESULT")
> ore.create(dtApplyResult, table="DEMO_R_APPLY_RESULT")
> ## Run the following for the first time you will rename a mode
> # ore.exec(paste("BEGIN> 
  #                  DBMS_DATA_MINING.RENAME_MODEL(model_name => '", dtModel$name, "',> 
  #                      new_model_name => 'DEMO_R_DT_MODEL'); END;",sep=""))> 
  ## Run the following to refresh an existing model
> ore.exec(paste("BEGIN
+ DBMS_DATA_MINING.RENAME_MODEL(model_name => '", dtModel$name,"',
+ new_model_name => 'DEMO_R_DT_MODEL');
+ END;",sep=""))

Tuesday, January 12, 2016

ORE video : Demo Code Part 3

The following is the third set of demo code from my video on using R in the Oracle Database. Check out the video before using the following code. The blog post for the video will be updated to contain links to all blog posts that have the various demo code.

The following code is illustrates some simple examples of using Oracle R Enterprise. In these example you will see how to connect to the Oracle Database, how to query and process some of the tables and views in the Oracle Database, how to check that you are working with objects in the database, how to move data to the database and query it.

> library(ORE)
> # ore.connect(user="rquser", sid="orcl", host="localhost", password="rquser", port=1521, all=TRUE);
> ore.connect(user="dmuser", sid="orcl", host="localhost", password="dmuser", port=1521, all=FALSE);
> # Test the connection
 [1] TRUE
> # List all the tables and views
> # Use ore.sync to only include the tables and views listed
> ore.sync()
> # Disconnect and reattached with no meta-data sync
> ore.disconnect()
> ore.connect(user="dmuser", sid="orcl", host="localhost", password="dmuser", port=1521, all=FALSE);
> # Check for the existance of a table or view
> ore.exists("MINING_DATA_BUILD_V")
 [1] TRUE
> # list the objects in the DMUSER schema
> #
> # Load data from a file into a new table
> ore.exists("DEMO_R_TABLE")
 [1] TRUE
> ore.drop(table='DEMO_R_TABLE')
> titanic <- read.table("c:/R/titanic2.txt", header=T, sep="\t")
> ore.create(titanic, table="DEMO_R_TABLE")
> tData <- ore.get("DEMO_R_TABLE")
> head(tData)
                 NAME PCLASS AGE    SEX SURVIVED
1 Fynney, Mr Joseph J    2nd  35   male        0
2      Gale, Mr Harry    2nd  35   male        0
3   Gale, Mr Shadrach    2nd  38   male        0
4 Garside, Miss Ethel    2nd  24 female        1
5  Gaskell, Mr Alfred    2nd  16   male        0
6  Gavey, Mr Lawrence    2nd  26   male        0
> # Use ORE to pull data from the Database to local R
> # ore.pull  -- United States of America
> mdbv <- ore.get("MINING_DATA_BUILD_V")
> mdbv_data <- ore.pull(mdbv)
Warning message:ORE object has no unique key - using random order 
> head(mdbv_data,3)
1  101501           F  41              NeverM United States of America J: 190,000 - 249,999   Masters      Prof.
2  101502           M  27              NeverM United States of America I: 170,000 - 189,999     Bach.      Sales
3  101503           F  20              NeverM United States of America H: 150,000 - 169,999   HS-grad    Cleric.
1              2             4             0                   1                  1                    1
2              2             3             0                   1                  1                    0
3              2             2             0                   1                  0                    0
1                       1                1           0                0
2                       1                1           1                0
3                       1                1           1                0
> class(mdbv_data)
[1] "data.frame”
> summary(mdbv_data)

Wednesday, January 6, 2016

ORE video : Demo Code Part 2

The following is the second set of demo code from my video on using R in the Oracle Database. Check out the video before using the following code. The blog post for the video will be updated to contain links to all blog posts that have the various demo code.

The following code gives a very quick demonstration of using the RORACLE R package to access the data in your Oracle schema. ROracle has a number of advantages over using RJDBC and most of the advantages are about the performance improvements. Typically when using ROracle you will see a many fold improvement with selecting data and moving it to your R client, processing data in the database and also writing data back to the Oracle Database. In some tests you can see a 7 times improvement in performance over RJDBC. Now that is a big difference.

But the problem with ROracle is that it is only available on certain platforms/OS. For example it is not officially available for the Mac. But if you google this issue carefully you will find unofficial ways over coming this problem.

ROracle is dependent on Oracle Client. So you will need to have Oracle Client installed on you machine and have it available on the search path.

When you have Oracle Client installed and the ROracle R package installed you are ready to start using it.

So here is the demo code from the video.

> library(ROracle)
> drv <- dbDriver("Oracle")
> # Create the connection string
> host <- "localhost"
> port <- 1521
> sid <- "orcl"
>connect.string <- paste("(DESCRIPTION=”, "(ADDRESS=(PROTOCOL=tcp)(HOST=", host, ")(PORT=", port, "))",
>    "(CONNECT_DATA=(SID=", sid, ")))", sep = "")

> con <- dbConnect(drv, username = "dmuser", password = "dmuser",dbname=connect.string)

> rs <- dbSendQuery(con, "select view_name from user_views")
> # fetch records from the resultSet into a data.frame
> data <- fetch(rs)
> # extract all rows
> dim(data)
[1] 6 1
> data
> dbCommit(con)
> dbClearResult(rs)
> dbDisconnect(con)

Wednesday, December 30, 2015

ORE Video : Demo Code part 1

In a previous blog post I posted a video on using R with the Oracle Database and using Oracle R Enterprise. This is a part 1 extension of that blog post that gives the first set of demo code.

This first set of demonstration code is for using RJDBC to connect to the Oracle Database. Using RJDBC relies on using the JDBC jar file for Oracle. It is easily found in various installations of Oracle products and will be called something like ojdbc.jar. I like to take a copy of this file and place it in the root/home directory.

> library(RJDBC)
> # Create connection driver and open 
> connectionjdbcDriver <- JDBC(driverClass="oracle.jdbc.OracleDriver", classPath="c:/ojdbc6.jar")
> jdbcConnection <- dbConnect(jdbcDriver, "jdbc:oracle:thin:@//localhost:1521/orcl", "dmuser", "dmuser")
> #list the tables in the schema
> #dbListTables(jdbcConnection)
> #get the DB connections details - it get LOTS of info - Do not run unless it is really needed
> dbGetInfo(jdbcConnection)
> # Query on the Oracle instance name.
> #instanceName <- dbGetQuery(jdbcConnection, "SELECT instance_name FROM v$instance")
1  INSUR_CUST_LTV_SAMPLE2            
2              OUTPUT_1_2
> #print(instanceName)tableNames <- dbGetQuery(jdbcConnection, "SELECT table_name from user_tables where  
                                                 table_name not like 'DM$%' and table_name not like 'ODMR$%'")
> print(tableNames)
> viewNames <- dbGetQuery(jdbcConnection, "SELECT view_name from user_views")print(viewNames)

> v <- dbReadTable(jdbcConnection, "MINING_DATA_BUILD_V")
> names(v)
[1] "CUST_ID"                 "CUST_GENDER"             "AGE"                     
[7] "EDUCATION"               "OCCUPATION"              "HOUSEHOLD_SIZE"         
[10] "YRS_RESIDENCE"           "AFFINITY_CARD"           "BULK_PACK_DISKETTES"    
[16] "PRINTER_SUPPLIES"        "Y_BOX_GAMES"             "OS_DOC_SET_KANJI" 
> dim(v)
[1] 1500   18
> summary(v)
Min.   :101501   Length:1500        Min.   :17.00   Length:1500         Length:1500        
1st Qu.:101876   Class :character   1st Qu.:28.00   Class :character    Class :character   
Median :102251   Mode  :character   Median :37.00   Mode  :character    Mode  :character   
Mean   :102251                      Mean   :38.89                                          
3rd Qu.:102625                      3rd Qu.:47.00                                          
Max.   :103000                      Max.   :90.00                                          
Length:1500        Length:1500        Length:1500        Length:1500        Min.   : 0.000   
Class :character   Class :character   Class :character   Class :character   1st Qu.: 3.000   
Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median : 4.000                                                                               
                                                                            Mean   : 4.089                                                                               
                                                                            3rd Qu.: 5.000                                                                               
                                                                            Max.   :14.000 
> hist(v$RESIDENCE)
> hist(v$AGE)
> dbDisconnect(jdbcConnection)

Make sure to check out the other demonstration scripts that are shown in the video.

Tuesday, December 29, 2015

Oracle R Enterprise 1.5 (new release)

The Oracle Santa had a busy time just before Christmas with the release of several new version of products. One of these was Oracle R Enterprise version 1.5.

Oracle R Enterprise (1.5) is part of the Oracle Advanced Analytics option for the enterprise edition of the Oracle Database.

As with every new release of a product there are a range of bug fixes. But with ORE 1.5 there are also some important new features. These important new features include:

  • New Random Forest specific for ORE.
  • New ORE Data Store functions and privileges.
  • Partitioning on multiple columns for ore.groupApply.
  • Multiple improvements to ore.summary.
  • Now performs parallel in-database execution for functions prcomp and svd.
  • BLOB and CLOB data types are now supported in some of the ORE functions.

Check out the ORE 1.5 Release Notes for more details on the new features.

ORE 1.5 is only certified (for now) on R 3.2.x in both the open source version and the Oracle R Distribution version 3.2.

Check out the ORE 1.5 Documentation.

You can download ORE 1.5 Server side and Client side software here.

Monday, December 21, 2015

Running R in the Oracle Database video

Earlier this year I was asked by the Business Analysics & Big Data SIG (of the UKOUG) to give a presentation on Oracle R Enterprise. Unfortunately I had already committed to giving the same presentation at the OUG Norway conference on the same day.

But then they asked me if I could record a video of the presentation and they would show it at the SIG. The following video is what I recorded.

At the UKOUG annual (2015) conferences I was supposed to give a 2 hour presentation during their Super Sunday event. Unfortunately due to a storm passing over Ireland on the Saturday all flights going to the UK were cancelled. This meant that I would miss my 2 hour presentation.

Instead of trying to find an alternative speaker for my presentation slot at such sort notice, the committee suggested that they would show the video.

Based on the feedback and the people who thanked me in person during the rest of the conference, I've decided to make it available to everyone. Hopefully you will find it useful.

The following are the links to the demo code that is shown or referred to in the video.

People have been asking me if the demo scripts I used in video are available. You will probably find some of these on various blog posts. So to make it easier for everyone I will post the demo scripts in one or more blog posts over the coming weeks. When these are available I will update this blog post with the links.

I have a few new presentations on Oracle R Enterprise in 2016 so watch out for these at an Oracle User Group conference.

Saturday, December 12, 2015

KScope 2016 Acceptances

I've never been to KScope. Yes never.

I've always wanted to. Each year you hear of all of these stories about how much people really enjoy KScope and how much they learn.

So back in October I decided to submit 5 presentations to KScope. 4 of these presentations are solo presentations and 1 joint presentation.

This week I have received the happy news that 2 of my solo presentations have been accepted, plus my joint presentation with Kim Berg Hansen.

So at the end of June 2016 I will be making my way to Chicago for a week of Oracle geekie fun at KScope.

My presentations will be:

  • Is Oracle SQL the best language for Statistic?
  • Running R in your Oracle Database using Oracle R Enterprise

and my join presentations is called

Forecasting in Oracle using the Power of SQL (this will talk about ROracle, Forecasting in R, Using Oracle R Enterprise and SQL)

I was really hoping that one of my rejected presentations would have been accepted. I really enjoy this presentation and I get to share stories about some of my predictive analytics projects. Ah well, maybe in 2017.

The last time I was in Chicago was over 15 years ago when I sent 5 days in Cellular One (The brand was sold to Trilogy Partners by AT&T in 2008 shortly after AT&T had completed its acquisition of Dobson Communications). I was there to kick off a project to build them a data warehouse and to build their first customer churn predictive model. I stayed in a hotel across the road from their office which was famous because a certain person had stayed in it why one the run. Unfortunately I didn't get time to visit downtown Chicago.

Wednesday, December 2, 2015

OUG Ireland 2015 : Call for Papers

We finally have confirmed dates for the OUG Ireland Annual Conference. It will be on the 3rd & 4th March, 2016.

In 2016 we are expanding the conference and marking it a 2 day conference. Over the past few years some of the feedback from delegates has been, can we have more sessions and add a second day. So we have listened and will now have a 2 day conference on 3rd & 4th March, 2016.

We also have a new venue too. The 2016 conference will be held in the Gresham Hotel, on O'Connell Street in central Dublin. It is so close to lots of the tourist attractions, lots of hotels and lots of entertainment.

The call for papers is now open and we are looking for presentations for the typical Business Analytics/Big Data, Hardcore DBA, Development, Cloud and Applications.

Do you have a story to share? Have you discovered something recently? Have you used a new product? Is there anything about some part of an Oracle Product you would like to share?

If so then click on the image below to go to the Submissions Website and get submitting. Closing date for submissions is 9am on 4th January.


Hopefully I'll see you in Dublin in March.