Showing posts with label Python. Show all posts
Showing posts with label Python. Show all posts

Monday, March 5, 2018

Python and Oracle : Fetching records and setting buffer size

If you used other languages, including Oracle PL/SQL, more than likely you will have experienced having to play buffering the number of records that are returned from a cursor. Typically this is needed when you are processing more than a few hundred records. The default buffering size is relatively small and by increasing the size of the number of records to be buffered can dramatically improve the performance of your code.

As with all things in coding and IT, the phrase "It Depends" applies here and changing the buffering size may not be what you need and my not help you to gain optimal performance for your code.

There are lots and lots of examples of how to test this in PL/SQL and other languages, but what I'm going to show you here in this blog post is to change the buffering size when using Python to process data in an Oracle Database using the Oracle Python library cx_Oracle.

Let us begin with taking the defaults and seeing what happens. In this first scenario the default buffering is used. Here we execute a query and the process the records in a FOR loop (yes these is a row-by-row, slow-by-slow approach.

import time

i = 0
# define a cursor to use with the connection
cur2 = con.cursor()
# execute a query returning the results to the cursor
print("Starting cursor at", time.ctime())
cur2.execute('select * from sh.customers')
print("Finished cursor at", time.ctime())

# for each row returned to the cursor, print the record
print("Starting for loop", time.ctime())
t0 = time.time()
for row in cur2:
    i = i+1
    if (i%10000) == 0:
        print(i,"records processed", time.ctime())

              
t1 = time.time()
print("Finished for loop at", time.ctime())
print("Number of records counted = ", i)

ttime = t1 - t0
print("in ", ttime, "seconds.")

This gives us the following output.

Starting cursor at  10:11:43
Finished cursor at  10:11:43
Starting for loop  10:11:43
10000 records processed  10:11:49
20000 records processed  10:11:54
30000 records processed  10:11:59
40000 records processed  10:12:05
50000 records processed  10:12:09
Finished for loop at  10:12:11 
Number of records counted =  55500
in  28.398550033569336 seconds.

Processing the data this way takes approx. 28 seconds and this corresponds to the buffering of approx 50-75 records at a time. This involves many, many, many round trips to the the database to retrieve this data. This default processing might be fine when our query is only retrieving a small number of records, but as our data set or results set from the query increases so does the time it takes to process the query.

But we have a simple way of reducing the time taken, as the number of records in our results set increases. We can do this by increasing the number of records that are buffered. This can be done by changing the size of the 'arrysize' for the cursor definition. This reduces the number of "roundtrips" made to the database, often reducing networks load and reducing the number of context switches on the database server.

The following gives an example of same code with one additional line.

cur2.arraysize = 500

Here is the full code example.

# Test : Change the arraysize and see what impact that has
import time

i = 0
# define a cursor to use with the connection
cur2 = con.cursor()
cur2.arraysize = 500
# execute a query returning the results to the cursor
print("Starting cursor at", time.ctime())
cur2.execute('select * from sh.customers')
print("Finished cursor at", time.ctime())

# for each row returned to the cursor, print the record
print("Starting for loop", time.ctime())
t0 = time.time()
for row in cur2:
    i = i+1
    if (i%10000) == 0:
        print(i,"records processed", time.ctime())

              
t1 = time.time()
print("Finished for loop at", time.ctime())
print("Number of records counted = ", i)

ttime = t1 - t0
print("in ", ttime, "seconds.")

Now the response time to process all the records is.

Starting cursor at 10:13:02 Finished cursor at 10:13:02 Starting for loop 10:13:02 10000 records processed 10:13:04 20000 records processed 10:13:06 30000 records processed 10:13:08 40000 records processed 10:13:10 50000 records processed 10:13:12 Finished for loop at 10:13:13 Number of records counted = 55500 in 11.780734777450562 seconds.

All done in just under 12 seconds, compared to 28 seconds previously.

Here is another alternative way of processing the data and retrieves the entire results set, using the 'fetchall' command, and stores it located in 'res'.

# Test : Change the arraysize and see what impact that has
import time

i = 0
# define a cursor to use with the connection
cur2 = con.cursor()
cur2.arraysize = 500
# execute a query returning the results to the cursor
print("Starting cursor at", time.ctime())
cur2.execute('select * from sh.customers')

t0 = time.time()
print("starting FetchAll at", time.ctime())
res = cur2.fetchall()
              
t1 = time.time()
print("finished FetchAll at", time.ctime())

ttime = t1 - t0
print("in ", ttime, "seconds.")

Tuesday, February 13, 2018

Oracle and Python setup with cx_Oracle

Is Python the new R?

Maybe, maybe not, but that I'm finding in recent months is more companies are asking me to use Python instead of R for some of my work.

In this blog post I will walk through the steps of setting up the Oracle driver for Python, called cx_Oracle. The documentation for this drive is good and detailed with plenty of examples available on GitHub. Hopefully there isn't anything new in this post, but it is my experiences and what I did.

1. Install Oracle Client
The Python driver requires Oracle Client software to be installed. Go here, download and install. It's a straightforward install. Make sure the directories are added to the search path.

2. Download and install cx_Oracle
You can use pip3 to do this.
 
pip3 install cx_Oracle

Collecting cx_Oracle
  Downloading cx_Oracle-6.1.tar.gz (232kB)
    100% |████████████████████████████████| 235kB 679kB/s
Building wheels for collected packages: cx-Oracle
  Running setup.py bdist_wheel for cx-Oracle ... done
  Stored in directory: /Users/brendan.tierney/Library/Caches/pip/wheels/0d/c4/b5/5a4d976432f3b045c3f019cbf6b5ba202b1cc4a36406c6c453
Successfully built cx-Oracle
Installing collected packages: cx-Oracle
Successfully installed cx-Oracle-6.1  

3. Create a connection in Python
Now we can create a connection. When you see some text enclosed in angled brackets <>, you will need to enter your detailed for your schema and database server.
 
# import the Oracle Python library
import cx_Oracle

# define the login details
p_username = ""
p_password = ""
p_host = ""
p_service = ""
p_port = "1521"

# create the connection
con = cx_Oracle.connect(user=p_username, password=p_password, dsn=p_host+"/"+p_service+":"+p_port)

# an alternative way to create the connection
# con = cx_Oracle.connect('/@/:1521')

# print some details about the connection and the library
print("Database version:", con.version)
print("Oracle Python version:", cx_Oracle.version)


Database version: 12.1.0.1.0
Oracle Python version: 6.1

4. Query some data and return results to Python
In this example the query returns the list of tables in the schema.
 
# define a cursor to use with the connection
cur = con.cursor()
# execute a query returning the results to the cursor
cur.execute('select table_name from user_tables')
# for each row returned to the cursor, print the record
for row in cur:
    print("Table: ", row)

Table:  ('DECISION_TREE_MODEL_SETTINGS',)
Table:  ('INSUR_CUST_LTV_SAMPLE',)
Table:  ('ODMR_CARS_DATA',)

Now list the Views available in the schema.
 
# define a second cursor
cur2 = con.cursor()
# return the list of Views in the schema to the cursor
cur2.execute('select view_name from user_views')
# display the list of Views
for result_name in cur2:
    print("View: ", result_name)

View:  ('MINING_DATA_APPLY_V',)
View:  ('MINING_DATA_BUILD_V',)
View:  ('MINING_DATA_TEST_V',)
View:  ('MINING_DATA_TEXT_APPLY_V',)
View:  ('MINING_DATA_TEXT_BUILD_V',)
View:  ('MINING_DATA_TEXT_TEST_V',)

5. Query some data and return to a Panda in Python
Pandas are commonly used for storing, structuring and processing data in Python, using a data frame format. The following returns the results from a query and stores the results in a panda.
 
# in this example the results of a query are loaded into a Panda
# load the pandas library
import pandas as pd

# execute the query and return results into the panda called df
df = pd.read_sql_query("SELECT * from INSUR_CUST_LTV_SAMPLE", con)

# print the records returned by query and stored in panda
print(df.head())

 CUSTOMER_ID     LAST    FIRST STATE     REGION SEX    PROFESSION  \
0     CU13388     LEIF   ARNOLD    MI    Midwest   M        PROF-2   
1     CU13386     ALVA   VERNON    OK    Midwest   M       PROF-18   
2      CU6607   HECTOR  SUMMERS    MI    Midwest   M  Veterinarian   
3      CU7331  PATRICK  GARRETT    CA       West   M       PROF-46   
4      CU2624  CAITLYN     LOVE    NY  NorthEast   F      Clerical   

  BUY_INSURANCE  AGE  HAS_CHILDREN   ...     MONTHLY_CHECKS_WRITTEN  \
0            No   70             0   ...                          0   
1            No   24             0   ...                          9   
2            No   30             1   ...                          2   
3            No   43             0   ...                          4   
4            No   27             1   ...                          4   

   MORTGAGE_AMOUNT  N_TRANS_ATM  N_MORTGAGES  N_TRANS_TELLER  \
0                0            3            0               0   
1             3000            4            1               1   
2              980            4            1               3   
3                0            2            0               1   
4             5000            4            1               2   

  CREDIT_CARD_LIMITS  N_TRANS_KIOSK  N_TRANS_WEB_BANK       LTV  LTV_BIN  
0               2500              1                 0  17621.00   MEDIUM  
1               2500              1               450  22183.00     HIGH  
2                500              1               250  18805.25   MEDIUM  
3                800              1                 0  22574.75     HIGH  
4               3000              2              1500  17217.25   MEDIUM  

[5 rows x 31 columns]

6. Wrapping it up and closing things
Finally we need to wrap thing up and close our cursors and our connection to the database.
 
# close the cursors
cur2.close()
cur.close()

# close the connection to the database
con.close()

Useful links
cx_Oracle website
cx_Oracle documentation
cx_Oracle examples on GitHub

Watch out for more blog posts on using Python with Oracle, Oracle Data Mining and Oracle R Enterprise.