Showing posts with label Visualizations. Show all posts
Showing posts with label Visualizations. Show all posts

Tuesday, September 6, 2016

Change the size of ORE PNG graphics using in-database R functions

In a previous blog post I showed you how create and display a ggplot2 R graphic using SQL. Make sure to check it out before reading the rest of this blog post.
In my previous blog post, I showed and mentioned that the PNG graphic returned by the embedded R execution SQL statement was not the same as what was produced if you created the graphic in an R session.
Here is the same ggplot2 graphic. The first one is what is produced in an R session and the section is what is produced by SQL query and the embedded R execution in Oracle.
NewImage NewImage
As you can see the second image (produced using the embedded R execution) gives a very square image.
The reason for this is that Oracle R Enterprise (ORE) creates the graphic image in PNG format. The default setting from this is 480 x 480. You will find this information when you go digging in the R documentation and not in the Oracle documentation.
So, how can I get my ORE produced graphic to appear like what is produced in R?
What you need to do is to change the height and width of the PNG image produced by ORE. You can do this by passing parameters in the SQL statement used to call the user defined R function, that in turn produces the ggplot2 image.
In my previous post, I gave the SQL statement to call and produce the graphic (shown above). One of the parameters to the rqTableEval function was set to null. This was because we didn't have any parameters to pass, apart from the data set.
We can replace this null with any parameters we want to pass to the user defined R function (demo_ggpplot). To pass the parameters we need to define them using a SELECT statement.
cursor(select 500 as "ore.png.height", 850 as "ore.png.width" from dual),
The full SELECT statement now becomes
select *
from table(rqTableEval( cursor(select * from claims),
                        cursor(select 500 as "ore.png.height", 850 as "ore.png.width" from dual),
When you view the graphic in SQL Developer, you will get something that looks a bit more like what you would expect or want to see.
For each graphic image you want to produce using ORE you will need to figure out that are the best PNG height and width settings to use. Plus it also depends on what tool or application you are going to use to display the images (eg. APEX etc)

Monday, July 11, 2016

Creating ggplot2 graphics using SQL

Did you read the title of this blog post! Read it again.

Yes, Yes, I know what you are saying, "SQL cannot produce graphics or charts and particularly not ggplot2 graphics".

You are correct to a certain extent. SQL is rubbish a creating graphics (and I'm being polite).

But with Oracle R Enterprise you can now produce graphics on your data using the embedded R execution feature of Oracle R Enterprise using SQL. In this blog post I will show you how.

1. Pre-requisites

You need to have installed Oracle R Enterprise on your Oracle Database Server. Plus you need to install the ggplot2 R package.

In your R session you will need to setup a ORE connection to your Oracle schema.

2. Write and Test your R code to produce the graphic

It is always a good idea to write and test your R code before you go near using it in a user defined function.

For our (first) example we are going to create a bar chart using the ggplot2 R package. This is a basic example and the aim is to illustrate the steps you need to go through to call and produce this graphic using SQL.

The following code using the CLAIMS data set that is available with/for Oracle Advanced Analytics. The first step is to pull the data from the table in your Oracle schema to your R session. This is because ggplot2 cannot work with data referenced by an ore.frame object.

data.subset <- ore.pull(CLAIMS) 

Next we need to aggregate the data. Here we are counting the number of records for each Make of car.

aggdata2 <- aggregate(data.subset$POLICYNUMBER,
                      by = list(MAKE = data.subset$MAKE),
                      FUN = length)

Now load the ggplot2 R package and use it to build the bar chart.

ggplot(data=aggdata2, aes(x=MAKE, y=x, fill=MAKE)) + 
       geom_bar(color="black", stat="identity") +
       xlab("Make of Car") + 
       ylab("Num of Accidents") + 
       ggtitle("Accidents by Make of Car")

The following is the graphic that our call to ggplot2 produces in R.


At this point we have written and tested our R code and know that it works.

3. Create a user defined R function and store it in the Oracle Database

Our next step in the process is to create an in-database user defined R function. This is were we store R code in our Oracle Database and make this available as an R function. To create the user defined R function we can use some PL/SQL to define it, and then take our R code (see above) and in it.

   -- sys.rqScriptDrop('demo_ggpplot');
      'function(dat) {
         aggdata2 <- aggregate(dat$POLICYNUMBER,
                      by = list(MAKE = dat$MAKE),
                      FUN = length)

        g <-ggplot(data=aggdata2, aes(x=MAKE, y=x, fill=MAKE)) + geom_bar(color="black", stat="identity") +
                   xlab("Make of Car") + ylab("Num of Accidents") + ggtitle("Accidents by Make of Car")


We have to make a small addition to our R code. We need need to include a call to the plot function so that the image can be returned as a BLOB object. If you do not do this then the SQL query in step 4 will return no rows.

4. Write the SQL to call it

To call our defined R function we will need to use one of the ORE SQL API functions. In the following example we are using the rqTableEval function. The first parameter for this function passes in the data to be processed. In our case this is the data from the CLAIMS table. The second parameter is set to null. The third parameter is set to the output format and in our case we want this to be PNG. The fourth parameter is the name of the user defined R function.

select *
from table(rqTableEval( cursor(select * from claims),

5. How to view the results

The SQL query in Step 4 above will return one row and this row will contain a column with a BLOB data type.


The easiest way to view the graphic that is produced is to use SQL Developer. It has an inbuilt feature that allows you to display BLOB objects. All you need to do is to double click on the BLOB cell (under the column labeled IMAGE). A window will open called 'View Value'. In this window click the 'View As Image' check box on the top right hand corner of the window. When you do the R ggplot2 graphic will be displayed.


Yes the image is not 100% the same as the image produced in our R session. I will have another blog post that deals with this at a later date.

But, now you have written a SQL query, that calls R code to produce an R graphic (using ggplot2) of our data.

6. Now you can enhance the graphics (without changing your SQL)

What if you get bored with the bar chart and you want to change it to a different type of graphic? All you need to do is to change the relevant code in the user defined R function.

For example, if we want to change the graphic to a polar plot. The following is the PL/SQL code that re-defines the user defined R script.

      'function(dat) {
         aggdata2 <- aggregate(dat$POLICYNUMBER,
                      by = list(MAKE = dat$MAKE),
                      FUN = length)

         n <- nrow(aggdata2)
         degrees <- 360/n

        aggdata2$MAKE_ID <- 1:nrow(aggdata2)

        g<- ggplot(data=aggdata2, aes(x=MAKE, y=x, fill=MAKE)) + geom_bar(color="black", stat="identity") +
               xlab("Make of Car") + ylab("Num of Accidents") + ggtitle("Accidents by Make of Car") + coord_polar(theta="x") 

We can use the exact same SQL query we defined in Step 4 above to call the next graphic.


All done.

Now that was easy! Right?

I kind of is easy once you have been shown. There are a few challenges when working in-database user defined R functions and writing the SQL to call them. Most of the challenges are around the formatting of R code in the function and the syntax of the SQL statement to call it. With a bit of practice it does get easier.

7. Where/How can you use these graphics ?

Any application or program that can call and process a BLOB data type can display these images. For example, I've been able to include these graphics in applications developed in APEX.

Thursday, June 30, 2016

googleVis R package for creating google charts in R

I've recently come across the 'googleVis' R package. This allows you to create a variety of different (typical and standard) charts in R but with the look and feel of the charts we can get from a number of different Google sites.

I won't bore you with some examples in the post but I'll point you to a good tutorial on the various charts.

Here is the link to the mini-tutorial.

Before you can use the package you will need to install it. The simplest way is to run the following in your R session.

> install.packages("googleVis")

Depending on your version of R you may need to upgrade.

Here is a selection of some of the charts you can create, and there are many, many more.


Some of you might be familiar with the presenting that Hans Rosling gives. Some of the same technology is behind these bubble charts from Google, as they bought the software years ago. Hans typically uses a data set that consists of GDP, Population and Life Expectancy for countries around the World. You too can use this same data set and is available from rdatamarket. The following R codes will extract this data set to you local R session and you can then use it as input to the various charts in the googleVis functions.


# Pull in life expectancy and population data
life_expectancy <- dmlist("15r2!hrp")
population <- dmlist("1cfl!r3d")

# Pull in the yearly GDP for each country
gdp <- dmlist("15c9!hd1")

# Load in the plyr package

# Rename the Value for each dataset
names(gdp)[3] <- "GDP"

# Use plyr to join your three data frames into one: development 
gdp_life_exp <- join(gdp, life_expectancy)
names(gdp_life_exp)[4] <- "LifeExpectancy"
development <- join(gdp_life_exp, population)
names(development)[5] <- "Population"

Here is an example of the bubble chart using this data set.


There are a few restrictions with using this package. All the results will be displayed in a web browser, so you need to make sure that this is possible. Some of the charts are require flash. Again you need to make sure you are the latest version and/or you many have restrictions in your workplace on using it.

Tuesday, July 28, 2015

Charting Number of R Packages over time (Part 3)

This is the third and final blog post on analysing the number of new R packages that have been submitted over time.

Check out the previous blog posts:

In this blog post I will show you how you can perform Forecasting on our data to make some predictions on the possible number of new packages over the next 12 months.

There are 2 important points to note here:

  1. Only time will tell if these predictions are correct or nearly connect. Just like with any other prediction techniques.
  2. You cannot use just one of the Forecasting techniques in isolation to make a prediction. You need to use a number of functions/algorithms to see which one suits your data best.

The second point above is very important with all prediction techniques. Sometimes you see people/articles talking about them only using algorithm X. They have not considered any of the other techniques/algorithms. It is their favourite or preferred method. But that does not mean it works or is suitable for all data sets and all scenarios.

In this blog post I'm going to use 3 different forecasting functions, the in-build Forecast function in R, using HoltWinters and finally using ARIMA. Yes there are many more (it is R after all) and I'll leave these for you to explore.

1. Convert data set to Time Series data format

The first thing I need to do is to convert the data I want analyze into TimeSeries format (ts). This looks to have one record or instance for each data point.

So you cannot not have any missing data, or in my case any missing dates. Yes (for my data set) we could have some months where we do not have any submissions. What I could do is to work out mean values (or things like that) and fill for the missing months. But I'm feeling a bit lazy and after examining the data I see that we have a continuous set of data from September 2009 onwards. This is fine as most of the growth up to that point is flat.

So I need to subset the data to only include cases greater than or equal to September 2009 and less than or equal to June 2015. I wanted to explore July 2015 as the number for this month is incomplete.

The following code builds on the work we did in the second blog post in the series


# Subset the data
sub_data <- subset(data.sum, >= as.Date("2009-08-01", "%Y-%m-%d"))
sub_data <- subset(sub_data, <= as.Date("2015-06-01", "%Y-%m-%d"))

# Subset again to only take out the data we want to use in the time series
sub_data2 <- sub_data[,c("R_NUM")]

# Create the time series data, stating that it is monthly (12) and giving the start and end dates
ts_data <- ts(sub_data2, frequency=12, start=c(2009, 8), end=c(2015, 6))

# View the time series data
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2009                               2   3   4   3   4
2010   5   5   1  11   5   2   5   4   4   5   1   3
2011  11   4   3   6   6   5   9  15   5   8  23  18
2012  33  17  51  28  37  33  50  71  41 231  51  60
2013  75  67  76  81  76  74  77  89 111  96 111 200
2014 155 129 175 140 145 133 155 207 232 162 229 310
2015 308 343 332 378 418 558    

We now have the data prepared for input to our Forecasting functions in R.

2. Using Forecast in R

For the Forecast function all you need to do is pass in the Time Series dataset and tell the function how my steps into the future you want it to predict. In all my examples I'll ask the functions to predict for the next 12 months.

ts_forecast <- forecast(ts_data, h=12)

         Point Forecast     Lo 80     Hi 80 Lo 95     Hi 95
Jul 2015       447.1965  95.67066  784.3163     0  958.6669
Aug 2015       499.4344 115.94329  873.1822     0 1069.8689
Sep 2015       551.7875 123.88426  952.0773     0 1230.4707
Oct 2015       603.7212 156.89486 1078.0395     0 1370.2069
Nov 2015       654.7647 143.29718 1179.4903     0 1603.8335
Dec 2015       704.5162 135.76829 1352.8925     0 1844.7230
Jan 2016       752.6447 151.09936 1502.6088     0 2100.9708
Feb 2016       798.8877 156.37383 1652.0847     0 2575.3715
Mar 2016       843.0474 159.67095 1848.1703     0 2888.1738
Apr 2016       884.9849 154.59456 2061.7990     0 3281.6062
May 2016       924.6136 148.04651 2325.9060     0 3891.5064
Jun 2016       961.8922 138.67935 2531.7578     0 4395.6033



For this we get a very large range of values and very wide predictive intervals. If we limit the y axis we can get a better picture of the actual predictions.

plot(ts_forecast, ylim=range(0:1000))

3. Using HoltWinters

For HoltWinters we can use the in-built R function for this. All we need to do is to pass in the Time Series data set. The first part we can plot the HoltWinters for the existing data set

hw <- HoltWinters(ts_data)

Now we want to predict for the next 12 months

forecast <- predict(hw, n.ahead = 12, prediction.interval = T, level = 0.95)

              fit       upr      lwr
Jul 2015 519.9304  599.8097 440.0512
Aug 2015 560.1083  648.4183 471.7983
Sep 2015 601.4528  701.0163 501.8892
Oct 2015 643.9639  757.3750 530.5529
Nov 2015 681.5168  811.0727 551.9608
Dec 2015 724.7363  872.4508 577.0218
Jan 2016 773.8308  941.4768 606.1848
Feb 2016 809.8836  999.0401 620.7272
Mar 2016 847.1448 1059.2371 635.0525
Apr 2016 898.4476 1134.7795 662.1158
May 2016 933.8755 1195.6532 672.0977
Jun 2016 972.3866 1260.7376 684.0356

plot(hw, forecast)

4. Using ARIMA

For ARIMA we need to perform a simple conversion of the Time Series data into ARIMA format and then perform the forecase

fc_arima <- auto.arima(ts_data)
fc_fc_arima  <- forecast(fc_arima, h=12)

         Point Forecast    Lo 80     Hi 80    Lo 95     Hi 95
Jul 2015       524.4758 476.2203  572.7314 450.6753  598.2764
Aug 2015       567.1156 513.2301  621.0012 484.7048  649.5265
Sep 2015       609.7554 548.3239  671.1869 515.8041  703.7068
Oct 2015       652.3952 581.6843  723.1062 544.2522  760.5383
Nov 2015       695.0350 613.5293  776.5408 570.3828  819.6873
Dec 2015       737.6748 644.0577  831.2920 594.4998  880.8499
Jan 2016       780.3147 673.4319  887.1974 616.8516  943.7777
Feb 2016       822.9545 701.7797  944.1292 637.6337 1008.2752
Mar 2016       865.5943 729.2004 1001.9881 656.9978 1074.1907
Apr 2016       908.2341 755.7718 1060.6963 675.0631 1141.4050
May 2016       950.8739 781.5559 1120.1918 691.9244 1209.8233
Jun 2016       993.5137 806.6027 1180.4246 707.6581 1279.3693

plot(fc_fc_arima, ylim=range(0:800))

As you can see there are very different results from each of these forecasting techniques. If this was a real life project on real data then we would go about exploring a lot more of the Forecasting function available in R. The reason for this is to identify which R function and Forecasting algorithm works best for our data.

Which Forecasting technique would you choose from the selection above?

But will this function and algorithm always work with our data? The answer is NO. As our data evolves so may the algorithm that works best for our data. This is why the data science/analytics world is iterative. We need to recheck/revalidate the functions/algorithms to see if we need to start using something else or not. When we do need to use another function/algorithm you need to ask yourself why this has happened, what has changed in the data, what has changed in the business, etc.

Wednesday, July 22, 2015

Charting Number of R Packages over time (Part 2)

This is the second blog post on charting the number of new R Packages over time.

Check out the first blog post that looked at getting the data, performing some simple graphing and then researching some issues that were identified using the graph.

In this blog post I will look at how you can aggregate the data, plot it, get a regression line, then plot it using ggplot2 and we will include a trend line using the geom_smooth.

1. Prepare the data

In my previous post we extracted and aggregated the data on a daily bases. This is the plot that was shown in my previous post. This gives us a very low level graph and perhaps we might get something a little bit more useable is we aggregated the data. I have the data in an Oracle Database so it would be easy for me to write another query to perform the necessary aggregation. But let's make things a little bit trickier. I'm going to use R to do the aggregation.

Our data set is in the data frame called data. What I want to do is to aggregate it up to monthly level. The first thing I did was to create a new column that contains the values of the new aggregate level.

data$R_MONTH <- format(rdate2, "%Y%m01")
data$R_MONTH <- as.Date(data$R_MONTH3, "%Y%m%d")
data.sum <- aggregate(x = data[c("R_NUM")],
                    FUN = sum,
                    by = list( = data$R_MONTH)

2. Plot the Data

We now have the data aggregated at monthly level. We can now plot the graph. Ignore the last data point on the chart. This is for July 2015 and I extracted the data on the 9th of July. So we do not have a full months of data here.

plot(as.Date(data.sum$, data.sum$R_NUM, type="b", xaxt="n", cex=0.75 , ylab="Num New Packages", main="Number of New Packages by Month")
axis(1, as.Date(data.sum$, "%Y-%d"), as.Date(data.sum$, "%Y-%d"), cex.axis=0.5, las=1)

This gives us the following graph.


3. Plot the data using ggplot2

The basic plot function of R is great and allows us to quickly and easily get some good graphs produced. But it is a bit limited and perhaps we want to create something that is a bit more elaborate. ggplot2 is a very popular package that can allow us to create a graph, building it up in a number of steps and layers to give something that is a lot more professional.

In the following example I've kept things simple and Yes I could have done so much more. I'll leave that as an exercise for you go off an do.

The first step is to use the qplot function to produce a basic plot using ggplot2. This gives us something similar to what we got from the plot function.

qplot(x=factor(data.sum$, y=data.sum$R_NUM, data=data.sum, 
       xlab="Year/Month", ylab='Num of New Packages', asp=0.5)

This gives us the following graph.


Now if we use ggplot2 then we need to specify a lot more information. Here is the equivalent plot using ggplot2 (with a line plot).


4. Include a trend line

We can very easily include a trend line in a ggplot2 graph using the geom_smooth command. In the following example we have the same chart and include a linear regression line.

plt <- ggplot(data.sum, aes(x=factor(data.sum$, y=data.sum$R_NUM)) + geom_line(aes(group=1)) +
  theme(text = element_text(size=7),
        axis.text.x = element_text(angle=90, vjust=1)) + xlab("Year / Month") + ylab("Num of New Packages") +
  geom_smooth(method='lm', se=TRUE, size = 0.75, fullrange=TRUE, aes(group=20))


We can tell a lot from this regression plot.

But perhaps we would like to see a trend line on the chart, with something like a moving averages plot. Plus I've added in a bit of scaling to help with representing the data at a monthly level.

plt <- ggplot(data.sum, aes(x=as.POSIXct(data.sum$, y=data.sum$R_NUM)) + geom_line() + geom_point() +
  theme(text = element_text(size=12),
        axis.text.x = element_text(angle=90, vjust=1)) + xlab("Year / Month") + ylab("Num of New Packages") +
  geom_smooth(method='loess', se=TRUE, size = 0.75, fullrange=TRUE) +
  scale_x_datetime(breaks = date_breaks("months"), labels = date_format("%b"))

In the third blog post on this topic I will look at how we can use some of the forecasting and predicting functions available in R. We can use these to see help us visualize what the future growth patterns might be for this data. I have some interesting things to show.

Wednesday, July 15, 2015

Charting Number of R Packages over time (Part 1)

This is the first of a three part blog post on charting and analysing the number of R package submissions.

(I will update this blog post with links to the other two posts as they come available)

I'm sure most of you have heard of the R programming language. If not then perhaps it is something that you might want to go off an learn a bit about. Why? well it is one of the most popular languages for performing various types of statistics, advanced topics on statistics and machine learning and for generating lots of cool looking graphs.

If this is not something that you might be interested then it is time to go to another website/blog.

In this blog post I'm going to chart the number of packages submitted to R and are available for download and installation.

Why am I doing this? I got bored one day after coming back from my vacation and I though it would be a useful thing to do. Then after doing this I decided to use these graphs somewhere else, but you will have to wait until 2016 to find out!

The R website has a listing of all the packages and the dates that they were submitted.


There are a variety of tools available that you can use to extract the information on this webpage and there are lots of examples or R code too. I'll leave that as a little exercise for you to do.

I extracted all of this information and stored it in a table in my Oracle Database (of course I did as I work with Oracle databases day in day out). This will allow me to easily reuse this data whenever I need it plus I can update this table with new packages from time to time.


The following R code:

  1. Setups up and ROracle connection to my schema in my database
  2. Connects to the database
  3. Setups up a query to extract the data from the table
  4. Fetches this data into an R data frame called data
  5. Reformat the date columns to remove the time element to it
  6. Plot the data

drv <- dbDriver("Oracle")
# Create the connection string
host <- "localhost"
port <- 1521
service <- "pdb12c"
connect.string <- paste(
  "(ADDRESS=(PROTOCOL=tcp)(HOST=", host, ")(PORT=", port, "))",
  "(CONNECT_DATA=(SERVICE_NAME=", service, ")))", sep = "")

con <- dbConnect(drv, username = "brendan", password = "brendan",dbname=connect.string)

res<-dbSendQuery(con, "select r_date, count(*) r_num from r_packages
                       group by r_date order by 1 asc")
data <- fetch(res)

rdate<- data$R_DATE

plot(data$R_NUM~rdate2, data, type="l" , xaxt="n")
axis(1, rdate2, format(rdate2, "%b %y"), cex.axis=.7, las=1)

After I run the above code I get the following plot.


(Yes I could have done a better job on laying out the chart with all sorts of labels and colors etc)

This chart gives us a plot of the number of new submissions by day.

There are 2 very obvious things that stand out from this graph. The easiest one to deal with is that we can see that there has been substantical growth in new submissions over the past 3 years. Perhaps we need to examine these a bit closer and when you do you will find that a lot of these are existing packages that have been resubmitted with updates.

There is a very obvious peak just over half ways along the chart. We really need to investigate this to understand what has happended. This peak occurs on the 29th October 2012. What happened on the 29th October 2012 as this is clearly an anomaly with the rest of the data. Well on this date R version 2.15.2 was release and a there was a lot of update pagackes got resubmitted.

Check out my next two blog posts were I will explore this data in a bit more detail.

Part 2 blog post

Part 3 blog post