Wednesday, May 27, 2015

R (ROracle) and Oracle DATE formats

When you comes to working with R to access and process your data there are a number of little features and behaviours you need to look out for.

One of these is the DATE datatype.

The main issue that you have to look for is the TIMEZONE conversion that happens then you extract the data from the database into your R environment.

There is a datatype conversions from the Oracle DATE into the POSIXct format. The POSIXct datatype also includes the timezone. But the Oracle DATE datatype does not have a Timezone part of it.

When you look into this a bit more you will see that the main issue is what Timezone your R session has. By default your R session will inherit the OS session timezone. For me here in Ireland we have the time timezone as the UK. You would time that the timezone would therefore be GMT. But this is not the case. What we have for timezone is BST (or British Standard Time) and this takes into account the day light savings time. So on the 26th May, BST is one hour ahead of GMT.

This issue is not just between R and Oracle, but there are some inconsistencies in some of Oracle's other tools list SQL Developer, SQLcl and SQL*Plus.
SQL*Plus

NewImage
SQLcl & SQL Developer

NewImage

NewImage

OK. Let's get back to the issue with we are using R to analyse our data.

The Problem

As mentioned above, when I select date of type DATE from Oracle into R, using ROracle, I end up getting a different date value than what was in the database. Similarly when I process and store the data.

The following outlines the data setup and some of the R code that was used to generate the issue/problem.

Data Set-up

Create a table that contains a DATE field and insert some records.

CREATE TABLE STAFF
    (STAFF_NUMBER VARCHAR2(20),
    FIRST_NAME VARCHAR2(20),
    SURNAME VARCHAR2(20),
    DOB DATE,
    PROG_CODE VARCHAR2(6 BYTE),
    PRIMARY KEY (STAFF_NUMBER));

insert into staff values (123456789, 'Brendan', 'Tierney', to_date('01/06/1975', 'DD/MM/YYYY'), 'DEPT_1');
insert into staff values (234567890, 'Sean', 'Reilly', to_date('21/10/1980', 'DD/MM/YYYY'), 'DEPT_2');
insert into staff values (345678901, 'John', 'Smith', to_date('12/03/1973', 'DD/MM/YYYY'), 'DEPT_3');
insert into staff values (456789012, 'Barry', 'Connolly', to_date('25/01/1970', 'DD/MM/YYYY'), 'DEPT_4');


You can query this data in SQL without any problems. As you can see there is no timezone element to these dates.

Selecting the data

I now establish my connection to my schema in my 12c database using ROracle. I won't bore you with the details here of how to do it but check out point 3 on this post for some details.

When I select the data I get the following.


[1] "1975-06-01 01:00:00 BST" "1980-10-21 01:00:00 BST" "1973-03-12 00:00:00 BST"
[4] "1970-01-25 01:00:00 BST"

As you can see two things have happened to my date data when it has been extracted from Oracle. Firstly it has assigned a timezone to the data, even though there was no timezone part of the original data. Secondly it has performed some sort of timezone conversion to from GMT to BST. The difference between GMT and BTS is the day light savings time. Hence the 01:00:00 being added to the time element that was extract. This time should have been 00:00:00. You can see we have a mixture of times!

So there appears to be some difference between the R date or timezone to what is being used in Oracle.

To add to this problem I was playing around with some dates and different records. I kept on getting this scenario but I also got the following, where we have a mixture of GMT and BST times and timezones. I'm not sure why we would get this mixture.

 > data$DOB
[1] "1995-01-19 00:00:00 GMT" "1965-06-20 01:00:00 BST" "1973-10-20 01:00:00 BST"
[4] "2000-12-28 00:00:00 GMT"

This is all a bit confusing and annoying. So let us look at how you can now fix this.

The Solution

Fixing the problem : Setting Session variables

What you have to do to fix this and to ensure that there is consistency between that is in Oracle and what is read out and converted into R (POSIXct) format, you need to define two R session variables. These session variables are used to ensure the consistency in the date and time conversions.

These session variables are TZ for the R session timezone setting and Oracle ORA_SDTZ setting for specifying the timezone to be used for your Oracle connections.

The trick there is that these session variables need to be set before you create your ROracle connection. The following is the R code to set these session variables.


 > Sys.setenv(TZ = "GMT")
 > Sys.setenv(ORA_SDTZ = "GMT")

So you really need to have some knowledge of what kind of Dates you are working with in the database and if a timezone if part of it or is important. Alternatively you could set the above variables to UDT.

Selecting the data (correctly this time)

Now when we select our data from our table in our schema we now get the following, after reconnecting or creating a new connection to your Oracle schema.

 > data$DOB
[1] "1975-06-01 GMT" "1980-10-21 GMT" "1973-03-12 GMT" "1970-01-25 GMT"

Now you can see we do not have any time element to the dates and this is correct in this example. So all is good.

We can now update the data and do whatever processing we want with the data in our R script.

But what happens when we save the data back to our Oracle schema. In the following R code we will add 2 days to the DOB attribute and then create a new table in our schema to save the updated data.


I've used the R package Libridate to do the date and time processing.

When we look at this newly created table in our Oracle schema we will see that we don't have DATA datatype for DOB, but instead it is created using a TIMESTAMP data type.



If you are working with TIMESTAMP etc type of data types (i.e. data types that have a timezone element that is part of it) then that is a slightly different problem. Perhaps one that I'll look at soonish.