Crunch Value Line Option Data With Oracle and Ruby
Dan Bikle -- http://bikle.com -- 2009-07-13
Topics
Motivation
Value Line
Value Line Options Data
Extract VL Options Data
Transform VL Options Data
Load VL Options Data
Mix SVM With VL Options Data
Conclusion
Motivation
An investor searching for indicators of future prices of options
contracts will likely be drawn to the web.
A company named Value Line sells information about options contracts:
http://www.google.com/search?q=value+line+options+survey
Is Value Line information useful to us?
Value Line
Value Line is a well known company.
Some information about Value Line can be found on wikipedia:
http://en.wikipedia.org/wiki/Value_Line
Many reference departments of public libraries subscribe to Value Line
publications:
http://www.google.com/search?q=public+libraries+value+line
Value Line Options Data
For subscribers, at the end of each trading day, Value Line publishes
zip files. Each zip file contains CSV files which can be loaded into
a spreadsheet:
http://www.google.com/search?q=value+line+options+survey+csv
Since CSV data is organized into rows and columns, it is easily loaded into Oracle.
If you want information about the names of columns and their
significance, we suggest that you register for a Value Line Options
Survey account and then contact their support department:
https://www.ec-server.valueline.com/products/product.html
https://www.ec-server.valueline.com/products/elect6.html
Support: vloptions@valueline.com
In the next three sections we describe some of the details associated
with loading CSV data from Value Line into Oracle.
Extract VL Options Data
In database jargon ETL is an acronym for "Extract, Transform, and Load".
The first step to extraction is to download a pair of zip files from
a section of the Value Line web site which is available to subscribers.
When we do this we see something similar to the following:
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$ ls -l CALLS.ZIP PUTS.ZIP
-rw-r--r-- 1 oracle dba 8059986 2009-06-19 17:02 CALLS.ZIP
-rw-r--r-- 1 oracle dba 8427163 2009-06-19 17:02 PUTS.ZIP
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$ du -sh CALLS.ZIP PUTS.ZIP
7.7M CALLS.ZIP
8.1M PUTS.ZIP
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$
A demo of unzipping the files is displayed below:
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$ unzip -o CALLS.ZIP
Archive: CALLS.ZIP
inflating: calls1.csv
inflating: calls2.csv
inflating: calls3.csv
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$ unzip -o PUTS.ZIP
Archive: PUTS.ZIP
inflating: puts1.csv
inflating: puts2.csv
inflating: puts3.csv
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$ du -sh calls?.csv puts?.csv
9.8M calls1.csv
9.0M calls2.csv
9.8M calls3.csv
9.8M puts1.csv
9.0M puts2.csv
9.8M puts3.csv
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$ ls -l calls?.csv puts?.csv
-rw-r--r-- 1 oracle dba 10187142 2009-06-19 17:00 calls1.csv
-rw-r--r-- 1 oracle dba 9351794 2009-06-19 17:00 calls2.csv
-rw-r--r-- 1 oracle dba 10160915 2009-06-19 17:01 calls3.csv
-rw-r--r-- 1 oracle dba 10199457 2009-06-19 17:01 puts1.csv
-rw-r--r-- 1 oracle dba 9365793 2009-06-19 17:01 puts2.csv
-rw-r--r-- 1 oracle dba 10183314 2009-06-19 17:02 puts3.csv
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$
To understand the structure of the CSV data, we load the first 30 rows
of calls1.csv into a spreadsheet 54 columns wide:
After studying the above spreadsheet we built a SQL*Loader control file to
copy the data into an oracle table:
LOAD DATA
INFILE 'calls1.csv'
APPEND INTO TABLE vlc
FIELDS TERMINATED BY ','
TRAILING NULLCOLS
(
COMPANY CHAR NULLIF (COMPANY=BLANKS)
, DIVD CHAR NULLIF (DIVD=BLANKS)
, CMRK CHAR NULLIF (CMRK=BLANKS)
, TECHRANK CHAR NULLIF (TECHRANK=BLANKS)
, UPDOWNCM CHAR NULLIF (UPDOWNCM=BLANKS)
, HISTVO CHAR NULLIF (HISTVO=BLANKS)
, INDUST CHAR NULLIF (INDUST=BLANKS)
, CMPRICE CHAR NULLIF (CMPRICE=BLANKS)
, VOLF CHAR NULLIF (VOLF=BLANKS)
, TKR CHAR NULLIF (TKR=BLANKS)
, PORC CHAR NULLIF (PORC=BLANKS)
, STRIKE CHAR NULLIF (STRIKE=BLANKS)
, OPTKR CHAR NULLIF (OPTKR=BLANKS)
, RECORDNMBR CHAR NULLIF (RECORDNMBR=BLANKS)
, EXPDAY DATE "MM/DD/YYYY" NULLIF (EXPDAY=BLANKS)
, OI CHAR NULLIF (OI=BLANKS)
, VOL CHAR NULLIF (VOL=BLANKS)
, BID CHAR NULLIF (BID=BLANKS)
, ASK CHAR NULLIF (ASK=BLANKS)
, EST CHAR NULLIF (EST=BLANKS)
, DLTA CHAR NULLIF (DLTA=BLANKS)
, BIDOPRK CHAR NULLIF (BIDOPRK=BLANKS)
, ASKOPRK CHAR NULLIF (ASKOPRK=BLANKS)
, BIDUNOV CHAR NULLIF (BIDUNOV=BLANKS)
, ASKUNOV CHAR NULLIF (ASKUNOV=BLANKS)
, RVOPT CHAR NULLIF (RVOPT=BLANKS)
, TENPERC CHAR NULLIF (TENPERC=BLANKS)
, TENPERC_MINUS CHAR NULLIF (TENPERC_MINUS=BLANKS)
, CCRK CHAR NULLIF (CCRK=BLANKS)
, CCBE CHAR NULLIF (CCBE=BLANKS)
, CCRV CHAR NULLIF (CCRV=BLANKS)
, CCPA CHAR NULLIF (CCPA=BLANKS)
, CCPRT CHAR NULLIF (CCPRT=BLANKS)
, CCMAX CHAR NULLIF (CCMAX=BLANKS)
, ITMOTM CHAR NULLIF (ITMOTM=BLANKS)
, THETA CHAR NULLIF (THETA=BLANKS)
, BIDIMPLIE CHAR NULLIF (BIDIMPLIE=BLANKS)
, ASKIMPLIE CHAR NULLIF (ASKIMPLIE=BLANKS)
, VOLFADJ CHAR NULLIF (VOLFADJ=BLANKS)
, FORECASTPRTBSTRK CHAR NULLIF (FORECASTPRTBSTRK=BLANKS)
, BIDPRBSTRK CHAR NULLIF (BIDPRBSTRK=BLANKS)
, ASKPRBSTRK CHAR NULLIF (ASKPRBSTRK=BLANKS)
, PCTDBLE CHAR NULLIF (PCTDBLE=BLANKS)
, MARGIN CHAR NULLIF (MARGIN=BLANKS)
, YLDMGN CHAR NULLIF (YLDMGN=BLANKS)
, MXMGN CHAR NULLIF (MXMGN=BLANKS)
, ROBUY CHAR NULLIF (ROBUY=BLANKS)
, ROWRITE CHAR NULLIF (ROWRITE=BLANKS)
, ROCC CHAR NULLIF (ROCC=BLANKS)
, GAMMA CHAR NULLIF (GAMMA=BLANKS)
, VEGA CHAR NULLIF (VEGA=BLANKS)
, RHO CHAR NULLIF (RHO=BLANKS)
, TIMPRPCT CHAR NULLIF (TIMPRPCT=BLANKS)
, BIDASKPCNT CHAR NULLIF (BIDASKPCNT=BLANKS)
, pricedate CHAR NULLIF (pricedate=BLANKS)
)
Then, we built a table to receive the data:
22:33:51 SQL>
22:33:52 SQL> desc vlc
Name Null? Type
-------------------------- -------- ------------------
COMPANY VARCHAR2(55)
DIVD VARCHAR2(55)
CMRK VARCHAR2(55)
TECHRANK VARCHAR2(55)
UPDOWNCM VARCHAR2(55)
HISTVO VARCHAR2(55)
INDUST VARCHAR2(55)
CMPRICE VARCHAR2(55)
VOLF VARCHAR2(55)
TKR VARCHAR2(55)
PORC VARCHAR2(55)
STRIKE VARCHAR2(55)
OPTKR VARCHAR2(55)
RECORDNMBR VARCHAR2(55)
EXPDAY DATE
OI VARCHAR2(55)
VOL VARCHAR2(55)
BID VARCHAR2(55)
ASK VARCHAR2(55)
EST VARCHAR2(55)
DLTA VARCHAR2(55)
BIDOPRK VARCHAR2(55)
ASKOPRK VARCHAR2(55)
BIDUNOV VARCHAR2(55)
ASKUNOV VARCHAR2(55)
RVOPT VARCHAR2(55)
TENPERC VARCHAR2(55)
TENPERC_MINUS VARCHAR2(55)
CCRK VARCHAR2(55)
CCBE VARCHAR2(55)
CCRV VARCHAR2(55)
CCPA VARCHAR2(55)
CCPRT VARCHAR2(55)
CCMAX VARCHAR2(55)
ITMOTM VARCHAR2(55)
THETA VARCHAR2(55)
BIDIMPLIE VARCHAR2(55)
ASKIMPLIE VARCHAR2(55)
VOLFADJ VARCHAR2(55)
FORECASTPRTBSTRK VARCHAR2(55)
BIDPRBSTRK VARCHAR2(55)
ASKPRBSTRK VARCHAR2(55)
PCTDBLE VARCHAR2(55)
MARGIN VARCHAR2(55)
YLDMGN VARCHAR2(55)
MXMGN VARCHAR2(55)
ROBUY VARCHAR2(55)
ROWRITE VARCHAR2(55)
ROCC VARCHAR2(55)
GAMMA VARCHAR2(55)
VEGA VARCHAR2(55)
RHO VARCHAR2(55)
TIMPRPCT VARCHAR2(55)
BIDASKPCNT VARCHAR2(55)
PRICEDATE DATE
22:33:56 SQL>
22:33:57 SQL>
Then we used SQL*Loader to copy the CSV data in the table:
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$ sqlldr trade/t control=vlc_append.ctl \
skip=12 bindsize=20971520 readsize=20971520 rows=123456
SQL*Loader: Release 11.1.0.6.0 - Production on Sat Jun 20 22:37:05 2009
Copyright (c) 1982, 2007, Oracle. All rights reserved.
Commit point reached - logical record count 1477
Commit point reached - logical record count 2954
Commit point reached - logical record count 4431
Commit point reached - logical record count 5908
Commit point reached - logical record count 7385
Commit point reached - logical record count 8862
Commit point reached - logical record count 10339
Commit point reached - logical record count 11816
Commit point reached - logical record count 13293
Commit point reached - logical record count 14770
Commit point reached - logical record count 16247
Commit point reached - logical record count 17724
Commit point reached - logical record count 19201
Commit point reached - logical record count 20678
Commit point reached - logical record count 22155
Commit point reached - logical record count 23632
Commit point reached - logical record count 25109
Commit point reached - logical record count 26586
Commit point reached - logical record count 28063
Commit point reached - logical record count 29540
Commit point reached - logical record count 31017
Commit point reached - logical record count 32494
Commit point reached - logical record count 33971
Commit point reached - logical record count 34162
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$ grep loaded vlc_append.log
Table VLC, loaded from every logical record.
34162 Rows successfully loaded.
0 Rows not loaded due to data errors.
0 Rows not loaded because all WHEN clauses were failed.
0 Rows not loaded because all fields were null.
oracle@zareason:~/vlsql/md/csv/VL3$
oracle@zareason:~/vlsql/md/csv/VL3$
After we filled the vlc table with call options (and vlp which
captured put options) we were ready to start the next step in our effort.
Transform VL Options Data
With vlc and vlp full of data from the CSV files, we relied on syntax
similar to that displayed below to transform the options data:
UPDATE vlc
SET pricedate = TO_DATE('06/19/2009','MM/DD/YYYY')
WHERE pricedate IS NULL
/
UPDATE vlp
SET pricedate = TO_DATE('06/19/2009','MM/DD/YYYY')
WHERE pricedate IS NULL
/
INSERT INTO porcs_c
(
pricedate
,company
,cmrk
,techrank
,updowncm
,histvo
,indust
,cmprice
,volf
,tkr
,porc
,strike
,optkr
,expday
,oi
,vol
,bid
,ask
,bidoprk
,askoprk
,bidunov
,askunov
,itmotm
)
SELECT
pricedate
,company
,cmrk
,techrank
,updowncm
,histvo
,indust
,cmprice
,volf
,tkr
,porc
,strike
,optkr
,expday
,oi
,vol
,bid
,ask
,bidoprk
,askoprk
,bidunov
,askunov
,itmotm
FROM vlc
/
INSERT INTO porcs_p
(
pricedate
,company
,cmrk
,techrank
,updowncm
,histvo
,indust
,cmprice
,volf
,tkr
,porc
,strike
,optkr
,expday
,oi
,vol
,bid
,ask
,bidoprk
,askoprk
,bidunov
,askunov
,itmotm
)
SELECT
pricedate
,company
,cmrk
,techrank
,updowncm
,histvo
,indust
,cmprice
,volf
,tkr
,porc
,strike
,optkr
,expday
,oi
,vol
,bid
,ask
,bidoprk
,askoprk
,bidunov
,askunov
,itmotm
FROM vlp
/
The steps and syntax listed above describe the general method we used
to fill the tables porcs_c and porcs_p from CSV files we downloaded
each day.
Some of the details of filling both porcs_c and porcs_p from CSV files
are proprietary and thus omitted from this paper.
We can, however, show the amount of data we have collected.
The queries below list the amount of data we have gathered thus far in
both porcs_c and porcs_p:
23:20:13 SQL>
23:20:13 SQL> l
1 SELECT TO_CHAR(pricedate,'YYYY-MM'),COUNT(TO_CHAR(pricedate,'YYYY-MM'))
2 FROM porcs_c
3 GROUP BY TO_CHAR(pricedate,'YYYY-MM')
4* ORDER BY TO_CHAR(pricedate,'YYYY-MM')
23:20:14 SQL> /
TO_CHAR COUNT(TO_CHAR(PRICEDATE,'YYYY-MM'))
------- -----------------------------------
2003-02 828706
2003-03 954166
2003-04 940774
2003-05 924963
2003-06 971897
2003-07 1045454
2003-08 920535
2003-09 1095871
2003-10 1239138
2003-11 1041251
2003-12 1170174
2004-01 1035176
2004-02 968493
2004-03 1256077
2004-04 332646
2004-05 625697
2004-06 1215128
2004-07 1013141
2004-08 1101926
2004-10 65000
2007-10 315606
2007-11 1483933
2007-12 1593079
2008-01 1835203
2008-02 1885670
2008-03 2087839
2008-04 2288295
2008-05 1339221
2008-06 2070624
2008-07 2305573
2008-08 2226775
2008-09 2264874
2008-10 2725266
2008-11 2473364
2008-12 2809052
2009-01 2258234
2009-02 2218512
2009-03 2576386
2009-04 2463333
2009-05 2364456
40 rows selected.
Elapsed: 00:01:46.99
A chart of the above report is displayed below.
The counts for puts are nearly identical so they are not displayed:
Notice that we have a large gap in our data.
This data is simply not available to us.
23:22:05 SQL>
23:22:50 SQL>
23:22:50 SQL> SELECT COUNT(pricedate)FROM porcs_c;
COUNT(PRICEDATE)
----------------
60331508
Elapsed: 00:00:59.39
23:24:07 SQL>
23:24:23 SQL>
23:25:23 SQL>
23:25:24 SQL>
23:25:24 SQL> l
1 SELECT TO_CHAR(pricedate,'YYYY-MM'),COUNT(TO_CHAR(pricedate,'YYYY-MM'))
2 FROM porcs_p
3 GROUP BY TO_CHAR(pricedate,'YYYY-MM')
4* ORDER BY TO_CHAR(pricedate,'YYYY-MM')
23:25:25 SQL> /
TO_CHAR COUNT(TO_CHAR(PRICEDATE,'YYYY-MM'))
------- -----------------------------------
2003-02 825322
2003-03 951415
2003-04 939670
2003-05 926197
2003-06 975568
2003-07 1046564
2003-08 919801
2003-09 1096446
2003-10 1242371
2003-11 1043345
2003-12 1172303
2004-01 1041257
2004-02 971822
2004-03 1258312
2004-04 333671
2004-05 621764
2004-06 1153453
2004-07 1008096
2004-08 1094053
2007-10 253254
2007-11 1483970
2007-12 1593105
2008-01 1835232
2008-02 1709651
2008-03 1879090
2008-04 2288286
2008-05 1339231
2008-06 2070522
2008-07 2305734
2008-08 2226740
2008-09 2264998
2008-10 2725272
2008-11 2473965
2008-12 2809273
2009-01 2258972
2009-02 2218577
2009-03 2576254
2009-04 2462821
2009-05 2364429
39 rows selected.
Elapsed: 00:01:26.32
23:26:53 SQL>
23:29:08 SQL>
23:29:09 SQL> SELECT COUNT(pricedate)FROM porcs_p;
COUNT(PRICEDATE)
----------------
59760806
Elapsed: 00:01:17.42
23:30:35 SQL>
23:31:59 SQL>
We can see that we have roughly 39 months worth of data.
The total number of rows is 60331508 + 59760806 rows.
This is 120 million rows of data.
Certainly this is enough data to determine if Value Line provides us
with predictive attributes when we buy their data.
Also it is enough data for SVM to find correlations, assuming they
exist, between option-contract-attributes and corresponding future
price behavior.
With porcs_c and porcs_p full of 120 million rows of data, we consider
the Transform (the T in ETL) step to be done.
Load VL Options Data
We consider the L in ETL to correspond with the task of filling a fact
table with data to be query-able.
The syntax displayed below does this for us:
--
-- cr_vlm20.sql
--
-- Builds a table full of data useful for backtesting
CONNECT u2/u
ALTER SESSION SET NLS_DATE_FORMAT='YYYY-MM-DD';
CREATE TABLE vlm10 COMPRESS AS SELECT
'c_'||tkr||strike||expday||pricedate optid
,'c_'||tkr||strike||expday optsig
,1 porc,ask,bid,pricedate
FROM porcs_c
WHERE ask != 0 -- We dont divide by 0
ORDER BY 'c_'||tkr||strike||expday||pricedate
/
INSERT INTO vlm10 SELECT
'p_'||tkr||strike||expday||pricedate optid
,'p_'||tkr||strike||expday optsig
,-1 porc,ask,bid,pricedate
FROM porcs_p
WHERE ask != 0 -- We dont divide by 0
ORDER BY 'p_'||tkr||strike||expday||pricedate
/
ANALYZE TABLE vlm10 ESTIMATE STATISTICS SAMPLE 9 PERCENT;
CREATE TABLE vlmpg COMPRESS AS SELECT optid,pg20,porc,pdate FROM
(
SELECT
optid
,porc
,pricedate pdate
,(LEAD(bid,10,bid)OVER(PARTITION BY optsig ORDER BY optid)-ask)/ask pg20
,(LEAD(pricedate,20,pricedate)OVER(PARTITION BY optsig ORDER BY optid)) pdate20
FROM vlm10
WHERE ask != 0 -- We dont divide by 0
ORDER BY optid
)
-- Avoid large date gaps in the data
WHERE pdate20 - pdate < 35
/
ANALYZE TABLE vlmpg ESTIMATE STATISTICS SAMPLE 9 PERCENT;
-- Filter out contracts we know we do not want.
-- Look at the WHERE clauses below:
CREATE TABLE vlm20 COMPRESS AS SELECT
'c_'||tkr||strike||expday||pricedate optid
,1 porc
,tkr
,pricedate pdate -- shorten the name of pricedate
,expday - pricedate days2exp
,0+REPLACE(volf,'%','')volf
,0+ask ask
,0+cmprice cmprice
,0+oi oi
,0+vol vol
,(ask-bid)/ask ask_bid_r
,0+bidoprk bidoprk
,0+askoprk askoprk
,0+REPLACE(bidunov,'%','')bidunov
,0+REPLACE(askunov,'%','')askunov
,0+REPLACE(itmotm,'%','')itmotm
,0+cmrk cmrk
,0+techrank techrank
FROM porcs_c
WHERE 0+ask>0 AND 0+cmprice>0
AND 0+ask BETWEEN 2.0 AND 20.0 AND (ask-bid)/ask < 0.07
AND expday - pricedate > 33
AND (ABS(cmprice - strike)/cmprice) < 0.2
AND 0+oi+vol > 0
/
INSERT INTO vlm20 SELECT
'p_'||tkr||strike||expday||pricedate optid
,-1 porc
,tkr
,pricedate pdate -- shorten the name of pricedate
,expday - pricedate days2exp
,0+REPLACE(volf,'%','')volf
,0+ask ask
,0+cmprice cmprice
,0+oi oi
,0+vol vol
,(ask-bid)/ask ask_bid_r
,0+bidoprk bidoprk
,0+askoprk askoprk
,0+REPLACE(bidunov,'%','')bidunov
,0+REPLACE(askunov,'%','')askunov
,0+REPLACE(itmotm,'%','')itmotm
,0+cmrk cmrk
,0+techrank techrank
FROM porcs_p
WHERE 0+ask>0 AND 0+cmprice>0
AND 0+ask BETWEEN 2.0 AND 20.0 AND (ask-bid)/ask < 0.07
AND expday - pricedate > 33
AND (ABS(cmprice - strike)/cmprice) < 0.2
AND 0+oi+vol > 0
/
ALTER TABLE vlm20 MODIFY optid VARCHAR2(33);
ANALYZE TABLE vlm20 ESTIMATE STATISTICS SAMPLE 9 PERCENT;
-- Now build back-testing table full of attributes
CREATE TABLE vlmatt COMPRESS AS SELECT
b.optid
,b.porc
,tkrnum
,b.pdate -- Cannot have this in sme or bme, but I need it to build them
,0 + TO_CHAR ( ROUND(b.pdate, 'MM'),'MM' ) mnth_round
,days2exp
,sysdate - b.pdate dayspast
,b.ask_bid_r
,b.askoprk
,b.askunov
,b.bidoprk
,b.bidunov
,b.ask
,b.cmprice
,b.cmrk
,b.itmotm
,b.oi
,b.techrank
,b.vol
,b.volf
,pg20
,CASE WHEN pg20 < 0.2 THEN 'nup' ELSE 'up' END aapg20
FROM vlm20 b, vlmpg p, tkr2num t
WHERE b.optid = p.optid AND b.tkr = t.tkr
/
ANALYZE TABLE vlmatt ESTIMATE STATISTICS SAMPLE 9 PERCENT;
CREATE TABLE vfact COMPRESS AS SELECT
pg20
,porc
,0+TO_CHAR(pdate,'YYYYMMDD')day_id
,mnth_round
,askoprk
,CASE WHEN ABS(askunov) < 60 THEN ROUND(askunov,-1) WHEN askunov < -60 THEN -60 ELSE 60 END askunov
,bidoprk
,CASE WHEN ABS(bidunov) < 60 THEN ROUND(bidunov,-1) WHEN bidunov < -60 THEN -60 ELSE 60 END bidunov
,cmrk
,techrank
FROM vlmatt
/
ANALYZE TABLE vfact ESTIMATE STATISTICS SAMPLE 9 PERCENT;
Creation of the the vfact table completes the "L" phase of our ETL effort.
We then focus on running a few simple queries against vfact.
We are on a quest for obvious correlations between dimension values in
that table and Average 20 Day Percent Gain.
Results:
00:49:48 SQL>
00:49:54 SQL>
00:49:54 SQL> @qry_vfact.sql
00:49:55 SQL> --
00:49:55 SQL> -- qry_vfact.sql
00:49:55 SQL> --
00:49:55 SQL>
00:49:55 SQL> -- Compare calls to puts
00:49:55 SQL> SELECT
00:49:55 2 porc, ROUND(AVG(pg20),3),COUNT(*)
00:49:55 3 FROM vfact GROUP BY porc;
PORC ROUND(AVG(PG20),3) COUNT(*)
---------- ------------------ ----------
1 -.073 4291658
-1 -.017 4188353
Elapsed: 00:00:00.02
The above output was charted in a spreadsheet:
Compare the above query to the query below:
00:49:55 SQL>
00:49:55 SQL> -- Maybe askoprk IN (1,2) gives better results?
00:49:55 SQL> SELECT
00:49:55 2 porc, ROUND(AVG(pg20),3),COUNT(*)
00:49:55 3 FROM vfact
00:49:55 4 WHERE askoprk IN (1,2) GROUP BY porc
00:49:55 5 /
PORC ROUND(AVG(PG20),3) COUNT(*)
---------- ------------------ ----------
1 -.054 948785
-1 -.041 938635
Elapsed: 00:00:00.00
The above output was charted in a spreadsheet:
It appears that when askoprk is either 1 or 2, it yields better
performing calls and worse performing puts.
00:49:55 SQL>
00:49:55 SQL> -- See if cmrk is predictive
00:49:55 SQL> SELECT
00:49:55 2 porc, cmrk, ROUND(AVG(pg20),3),COUNT(*)
00:49:55 3 FROM vfact
00:49:55 4 GROUP BY porc, cmrk ORDER BY porc, cmrk;
PORC CMRK ROUND(AVG(PG20),3) COUNT(*)
---------- ---------- ------------------ ----------
-1 0 .021 589739
-1 1 .006 445170
-1 2 -.016 877977
-1 3 -.037 1744745
-1 4 -.021 434807
-1 5 .009 95915
1 0 -.116 537111
1 1 -.088 472675
1 2 -.074 941900
1 3 -.061 1825549
1 4 -.058 424037
1 5 -.071 90386
12 rows selected.
Elapsed: 00:00:00.01
The above output was charted in a spreadsheet:
It appears that cmrk is more predictive for calls than it is for puts.
For calls, I was suprised that cmrk-5 calls performed better than cmrk-1 calls.
00:49:55 SQL>
00:49:55 SQL> -- Perhaps askunov is predictive?
00:49:55 SQL> SELECT
00:49:55 2 porc, askunov, ROUND(AVG(pg20),3),COUNT(*)
00:49:55 3 FROM vfact
00:49:55 4 GROUP BY porc, askunov ORDER BY porc, askunov;
PORC ASKUNOV ROUND(AVG(PG20),3) COUNT(*)
---------- ---------- ------------------ ----------
-1 -60 -.035 14133
-1 -50 -.051 94552
-1 -40 -.07 281441
-1 -30 -.082 350557
-1 -20 -.078 327871
-1 -10 -.043 299264
-1 0 -.003 293087
-1 10 .024 371764
-1 20 .028 416791
-1 30 .028 398466
-1 40 .017 334495
-1 50 .005 271326
-1 60 -.025 734606
1 -60 -.017 13209
1 -50 -.022 57537
1 -40 -.021 218413
1 -30 -.026 356614
1 -20 -.035 407121
1 -10 -.053 412378
1 0 -.075 405638
1 10 -.093 480037
1 20 -.1 498811
1 30 -.106 431890
1 40 -.102 321582
1 50 -.095 227363
1 60 -.083 461065
26 rows selected.
Elapsed: 00:00:00.00
The above output was charted in a spreadsheet:
The attribute askunov seems more predictive for calls than it does for puts.
00:49:55 SQL>
00:49:55 SQL> -- Techrank? It has a nice name. Is it predictive?
00:49:55 SQL> SELECT
00:49:55 2 porc,techrank, ROUND(AVG(pg20),3),COUNT(*)
00:49:55 3 FROM vfact
00:49:55 4 GROUP BY porc,techrank ORDER BY porc,techrank;
PORC TECHRANK ROUND(AVG(PG20),3) COUNT(*)
---------- ---------- ------------------ ----------
-1 0 .001 669645
-1 1 -.02 205593
-1 2 -.02 709231
-1 3 -.019 1883974
-1 4 -.029 587946
-1 5 -.014 131964
1 0 -.093 647787
1 1 -.077 211059
1 2 -.074 746762
1 3 -.071 1957821
1 4 -.059 597633
1 5 -.063 130595
1 -.598 1
13 rows selected.
Elapsed: 00:00:00.00
The above output was charted in a spreadsheet:
The attribute techrank seems more predictive for calls than it does for puts.
For calls, I was suprised that techrank-5 calls performed better than
techrank-1 calls.
00:49:55 SQL>
00:49:55 SQL> -- look at years
00:49:55 SQL> SELECT
00:49:55 2 porc, ROUND(AVG(pg20),3),SUBSTR('x'||day_id,2,4),COUNT(*)
00:49:55 3 FROM vfact
00:49:55 4 GROUP BY porc,SUBSTR('x'||day_id,2,4)
00:49:55 5 ORDER BY porc,SUBSTR('x'||day_id,2,4)
00:49:55 6 /
PORC ROUND(AVG(PG20),3) SUBS COUNT(*)
---------- ------------------ ---- ----------
-1 -.13 2003 842907
-1 -.04 2004 536124
-1 .055 2007 327003
-1 .066 2008 1717196
-1 -.096 2009 765123
1 .014 2003 1033343
1 -.054 2004 633744
1 -.159 2007 358239
1 -.138 2008 1599296
1 -.026 2009 667036
10 rows selected.
Elapsed: 00:00:00.01
The above output was charted in a spreadsheet:
Obviously, puts had a much better return than calls during 2008.
00:49:55 SQL>
00:49:55 SQL> -- look at months
00:49:55 SQL> SELECT
00:49:55 2 porc, mnth_round, ROUND(AVG(pg20),3),COUNT(*)
00:49:55 3 FROM vfact
00:49:55 4 GROUP BY porc,mnth_round
00:49:55 5 ORDER BY porc,mnth_round
00:49:55 6 /
PORC MNTH_ROUND ROUND(AVG(PG20),3) COUNT(*)
---------- ---------- ------------------ ----------
-1 1 .076 373293
-1 2 -.053 438289
-1 3 -.016 460640
-1 4 -.183 357438
-1 5 -.166 300050
-1 6 -.033 401058
-1 7 .072 294085
-1 8 -.064 414379
-1 9 .094 321075
-1 10 .311 181996
-1 11 .088 304826
-1 12 -.146 341224
1 1 -.153 382791
1 2 -.073 448388
1 3 -.059 446156
1 4 .044 360514
1 5 .029 304051
1 6 -.045 425891
1 7 -.145 300545
1 8 -.051 424543
1 9 -.126 333979
1 10 -.185 186725
1 11 -.146 329514
1 12 -.035 348561
24 rows selected.
Elapsed: 00:00:00.00
00:49:55 SQL>
00:51:29 SQL>
The above output was charted in a spreadsheet:
The above results are a bit abnormal when compared with data collected
for stocks over a much wider time span (1998 to 2009):
http://bikle.com/protected/sim
We compare the results of the calls to the stocks in the above link.
During the spring months calls performed well.
This coincides with the pattern we found for stocks between the years 1998 and 2009.
In the autumn though, calls bucked the trend. This is a consequence of
the data we collected; it is dominated by 2008. During the autumn of
2008, stocks and thus calls fell hard.
Mix SVM With VL Options Data
The SVM algorithm is useful for classifying data into two classes.
Specifically we want to divide option contracts (both puts and calls)
into a two classes: a class which has a future value increase and a
class which will not increase. We consider the dividing line to be
20%. Any contract with a future value increase of 20% or more in 20
trading days is in the "up" class. All other contracts will be in
the "nup" class.
A general discussion about the SVM algorithm and its classification
abilities can be found on the web:
We continue this section with a discussion about the script, vlm.rb,
which is displayed below:
#!/usr/bin/env ruby
# vlm.rb
# Demo: ruby -r vlm.rb -e 'Vl.new.write_score_sql_script'
# Ruby script which uses nested loops to generate much SQL.
# The SQL drives the SVM algorithm resident within Oracle Data Miner.
class Vlm
require 'rubygems'
require 'oci8'
def write_score_sql_script
some_dates_a = []
date_sql = "SELECT DISTINCT TO_CHAR(pdate,'YYYY-MM-DD') "
date_sql << "FROM vlmatt WHERE pdate>TO_DATE('2003-03-01','YYYY-MM-DD') "
OCI8.new('u2','u').exec(date_sql){ |r| some_dates_a << r.to_s }
p("Found #{some_dates_a.size.to_s} dates in vlmatt.")
p("Check your directory for a new file.")
some_dates_a.sort.each do |pdate|
sqlplus_syntax = <<ENDsqlplus_syntax1
--
CREATE OR REPLACE VIEW sme AS SELECT
optid
,porc
,mnth_round
,tkrnum
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,null aapg20
FROM vlmatt
WHERE pdate='#{pdate}'
/
CREATE OR REPLACE VIEW bme AS SELECT
optid
,porc
,tkrnum
,mnth_round
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,aapg20
FROM vlmatt
WHERE 35+pdate < '#{pdate}'
/
ALTER SYSTEM FLUSH SHARED_POOL;
@cr_bme.sql
PURGE RECYCLEBIN;
--
ENDsqlplus_syntax1
(fha = File.open("vlm_scorer.sql","a")).write(sqlplus_syntax); fha.close
end # some_dates_a.each do
end # def write_score_sql_script
end # class
Vlm.new.write_score_sql_script
# end of vlm.rb
The main idea behind the above script is that it creates a large SQL script.
The large SQL script is named vlm_scorer.sql which makes a series of
calls to another SQL script.
This second SQL script, cr_bme.sql, is called for each day in the past
which we have Value Line Data for.
How many days?
A simple grep command will tell us:
oracle@zareason:~/vlsql/vlm$
oracle@zareason:~/vlsql/vlm$
oracle@zareason:~/vlsql/vlm$ grep cr_bme.sql vlm_scorer.sql | wc -l
732
oracle@zareason:~/vlsql/vlm$
oracle@zareason:~/vlsql/vlm$
oracle@zareason:~/vlsql/vlm$
So, the Ruby script vlm.rb is a script which allows use to call
cr_bme.sql for each of the 732 days which we have data for.
We show the first and last calls from vlm_scorer.sql to give you an
idea about how it calls cr_bme.sql:
oracle@zareason:~/vlsql/vlm$
oracle@zareason:~/vlsql/vlm$
oracle@zareason:~/vlsql/vlm$
oracle@zareason:~/vlsql/vlm$
oracle@zareason:~/vlsql/vlm$ head -56 vlm_scorer.sql
--
CREATE OR REPLACE VIEW sme AS SELECT
optid
,porc
,mnth_round
,tkrnum
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,null aapg20
FROM vlmatt
WHERE pdate='2003-03-03'
/
CREATE OR REPLACE VIEW bme AS SELECT
optid
,porc
,tkrnum
,mnth_round
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,aapg20
FROM vlmatt
WHERE 35+pdate < '2003-03-03'
/
@cr_bme.sql
oracle@zareason:~/vlsql/vlm$
oracle@zareason:~/vlsql/vlm$
oracle@zareason:~/vlsql/vlm$
oracle@zareason:~/vlsql/vlm$
oracle@zareason:~/vlsql/vlm$ tail -56 vlm_scorer.sql
--
CREATE OR REPLACE VIEW sme AS SELECT
optid
,porc
,mnth_round
,tkrnum
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,null aapg20
FROM vlmatt
WHERE pdate='2009-05-29'
/
CREATE OR REPLACE VIEW bme AS SELECT
optid
,porc
,tkrnum
,mnth_round
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,aapg20
FROM vlmatt
WHERE 35+pdate < '2009-05-29'
/
@cr_bme.sql
oracle@zareason:~/vlsql/vlm$
oracle@zareason:~/vlsql/vlm$
oracle@zareason:~/vlsql/vlm$
oracle@zareason:~/vlsql/vlm$
Notice the columns in the bme view. Many of them correspond to
attributes provided by Value Line. It is our hope that some of them
alone or when combined with others will provide predictive power to
the SVM algorithm.
If you want information about the names of columns and their
significance, we suggest that you register for a Value Line Options
Survey account and then contact their support department:
https://www.ec-server.valueline.com/products/product.html
https://www.ec-server.valueline.com/products/elect6.html
Support: vloptions@valueline.com
Turning our attention to the SQL script, we see that the mechanics of
vlm_scorer.sql are simple. The script creates a view named sme.
Next, It creates a view named bme. It then calls cr_bme.sql
And it does this over and over 732 times.
Notice that the columns of the views sme and bme match. Well, except
one column; the most important column, which is aapg20.
In the view sme, aapg20 is NULL; and for good reason; we are
simulating the situation that we do not know what the value of aapg20
is yet. We will rely on SVM to predict the value.
In other words, we are pretending that we don't know the aapg20 value
so we can ask SVM to predict it and then we will determine if SVM can
accurately predict the value of aapg20.
In the view bme, aapg20 is not NULL; we know what it is since it is
derived from the percentage gain of option contracts in the past.
Also notice the WHERE clause of the sme view:
WHERE pdate='2009-05-29'
Obviously this means that sme will contain rows with information about
option contracts for one specific day.
The view bme has a different WHERE clause:
WHERE 35+pdate < '2009-05-29'
This means that bme will contain rows for many days where pdate is 35
days before '2009-05-29'.
Perhaps now it might be obvious how SVM will work in this setup.
The view bme contains many rows which SVM learns from. SVM then
processes this information in a way which allows it to predict the
value of aapg20 in the view sme.
After vlm_scorer.sql creates sme and bme, it calls cr_bme.sql which is
displayed below:
--
-- cr_bme.sql
--
-- usage: cr_bme.sql
-- A script which feeds data to SVM and captures the results in table vlm_predictions
DEFINE target = 'aapg20'
DEFINE model_name = 'vl_model'
DEFINE bldtable = 'bme'
DEFINE scoretable = 'sme'
DEFINE case_id = 'optid'
-- Define a variable to help me exclude some columns from some data mining calls
DEFINE exclude1 = "'&target','&case_id'"
-- Builds an SVM model using pl/sql.
-----------------------------------------------------------------------
-- BUILD THE MODEL
-----------------------------------------------------------------------
-- Cleanup old build data preparation objects for repeat runs
DELETE svmc_miss_num;
DELETE svmc_miss_cat;
DELETE svmc_clip;
DELETE svmc_norm;
-- CREATE AND POPULATE A SETTINGS TABLE
--
-- DROP TABLE svmc_settings ;
-- CREATE TABLE svmc_settings ( setting_name VARCHAR2(30), setting_value VARCHAR2(30));
-- DELETE svmc_settings;
-- The default classification algorithm is Naive Bayes. So override
-- this choice to SVM using a settings table.
-- SVM chooses a kernel type automatically. This choice can be overriden
-- by the user. Linear kernel is preferred high dimensional data, and
-- Gaussian kernel for low dimensional data. Here we use linear kernel
-- to demonstrate the get_model_details_svm() API, which applies only for
-- models.
--
-- Do this once and then comment it out.
-- That makes script go faster.
-- -- BEGIN
-- -- -- Populate settings table
-- -- INSERT INTO svmc_settings (setting_name, setting_value) VALUES
-- -- (dbms_data_mining.algo_name, dbms_data_mining.algo_support_vector_machines);
-- --
-- -- INSERT INTO svmc_settings (setting_name, setting_value) VALUES
-- -- (dbms_data_mining.svms_kernel_function, dbms_data_mining.svms_gaussian);
-- --
-- -- INSERT INTO svmc_settings (setting_name, setting_value) VALUES
-- -- (dbms_data_mining.svms_kernel_cache_size,320123123);
-- -- COMMIT;
-- -- END;
-- -- /
-- SELECT * FROM svmc_settings;
BEGIN EXECUTE IMMEDIATE 'DROP VIEW svmc_winsor';
EXCEPTION WHEN OTHERS THEN NULL; END;
/
BEGIN EXECUTE IMMEDIATE 'DROP VIEW svmc_build_prep';
EXCEPTION WHEN OTHERS THEN NULL; END;
/
BEGIN EXECUTE IMMEDIATE 'DROP VIEW xformed_build_miss_num';
EXCEPTION WHEN OTHERS THEN NULL; END;
/
BEGIN EXECUTE IMMEDIATE 'DROP VIEW xformed_build_miss_cat';
EXCEPTION WHEN OTHERS THEN NULL; END;
/
--------------------------------
-- PREPARE BUILD (TRAINING) DATA
--
-- 1. Missing Value treatment for all Predictors and
-- 2. Outlier Treatment and
-- 3. Normalization are performed below.
-- NOTE: that unlike SVM regression, the classification target is NOT
-- normalized here.
BEGIN
-- Perform missing value treatment for all predictors
-- create miss tables
-- DBMS_DATA_MINING_TRANSFORM.CREATE_MISS_NUM (miss_table_name => 'svmc_miss_num');
-- DBMS_DATA_MINING_TRANSFORM.CREATE_MISS_CAT (miss_table_name => 'svmc_miss_cat');
-- populate miss tables
DBMS_DATA_MINING_TRANSFORM.INSERT_MISS_NUM_MEAN (
miss_table_name => 'svmc_miss_num',
data_table_name => '&bldtable',
exclude_list => DBMS_DATA_MINING_TRANSFORM.COLUMN_LIST (&exclude1));
DBMS_DATA_MINING_TRANSFORM.INSERT_MISS_CAT_MODE (
miss_table_name => 'svmc_miss_cat',
data_table_name => '&bldtable',
exclude_list => DBMS_DATA_MINING_TRANSFORM.COLUMN_LIST (&exclude1));
-- xform input data to replace missing values
DBMS_DATA_MINING_TRANSFORM.XFORM_MISS_NUM(
miss_table_name => 'svmc_miss_num',
data_table_name => '&bldtable',
xform_view_name => 'xformed_build_miss_num');
DBMS_DATA_MINING_TRANSFORM.XFORM_MISS_CAT(
miss_table_name => 'svmc_miss_cat',
data_table_name => '&bldtable',
xform_view_name => 'xformed_build_miss_cat');
-- Perform outlier treatment.
-- create clip table
-- DBMS_DATA_MINING_TRANSFORM.CREATE_CLIP (clip_table_name => 'svmc_clip');
-- populate clip table
DBMS_DATA_MINING_TRANSFORM.INSERT_CLIP_WINSOR_TAIL (
clip_table_name => 'svmc_clip',
data_table_name => '&bldtable',
tail_frac => 0.025,
exclude_list => DBMS_DATA_MINING_TRANSFORM.COLUMN_LIST (&exclude1));
-- xform input data to winsorized data
DBMS_DATA_MINING_TRANSFORM.XFORM_CLIP(
clip_table_name => 'svmc_clip',
data_table_name => '&bldtable',
xform_view_name => 'svmc_winsor');
-- create normalization table
-- DBMS_DATA_MINING_TRANSFORM.CREATE_NORM_LIN (norm_table_name => 'svmc_norm');
-- populate normalization table based on winsorized data
DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_MINMAX (
norm_table_name => 'svmc_norm',
data_table_name => 'svmc_winsor',
exclude_list => DBMS_DATA_MINING_TRANSFORM.COLUMN_LIST (&exclude1));
-- normalize the original data
DBMS_DATA_MINING_TRANSFORM.XFORM_NORM_LIN (
norm_table_name => 'svmc_norm',
data_table_name => '&bldtable',
xform_view_name => 'svmc_build_prep');
END;
/
---------------------
-- CREATE A NEW MODEL
--
-- Cleanup old model with the same name for repeat runs
BEGIN DBMS_DATA_MINING.DROP_MODEL('&model_name');
EXCEPTION WHEN OTHERS THEN NULL; END;
/
-- Build a new SVM Model
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => '&model_name',
mining_function => dbms_data_mining.classification,
data_table_name => 'svmc_build_prep',
case_id_column_name => '&case_id',
target_column_name => '&target',
settings_table_name => 'svmc_settings');
END;
/
-----------------------------------------------------------------------
-- APPLY/score THE MODEL
-----------------------------------------------------------------------
-- Cleanup old scoring data preparation objects for repeat runs
BEGIN EXECUTE IMMEDIATE 'DROP VIEW xformed_apply_miss_num';
EXCEPTION WHEN OTHERS THEN NULL; END;
/
BEGIN EXECUTE IMMEDIATE 'DROP VIEW xformed_apply_miss_cat';
EXCEPTION WHEN OTHERS THEN NULL; END;
/
BEGIN EXECUTE IMMEDIATE 'DROP VIEW svmc_apply_prep';
EXCEPTION WHEN OTHERS THEN NULL; END;
/
-----------------------
-- PREPARE SCORING DATA
--
-- If the data for model creation has been prepared, then the data
-- to be scored using the model must be prepared in the same manner
-- in order to obtain meaningful results.
--
-- 1. Missing Value treatment for all Predictors and
-- 2. Normalization
-- No outlier treatment will be performed during test and apply. The
-- normalization step is sufficient, since the normalization parameters
-- already capture the effects of outlier treatment done with build data.
--
BEGIN
-- Xform Test data to replace missing values
DBMS_DATA_MINING_TRANSFORM.XFORM_MISS_NUM(
miss_table_name => 'svmc_miss_num',
data_table_name => '&scoretable',
xform_view_name => 'xformed_apply_miss_num');
DBMS_DATA_MINING_TRANSFORM.XFORM_MISS_CAT(
miss_table_name => 'svmc_miss_cat',
data_table_name => '&scoretable',
xform_view_name => 'xformed_apply_miss_cat');
-- Normalize the data to be scored
DBMS_DATA_MINING_TRANSFORM.XFORM_NORM_LIN (
norm_table_name => 'svmc_norm',
data_table_name => '&scoretable',
xform_view_name => 'svmc_apply_prep');
END;
/
-- Get prob and the prediction:
-- DROP TABLE vlm_predictions;
-- CREATE TABLE vlm_predictions(optid VARCHAR2(35),prob_up NUMBER,run_date DATE);
INSERT INTO vlm_predictions (optid,prob_up,run_date)
SELECT
optid
,PREDICTION_PROBABILITY(&model_name,'up' USING *)prob_up
,sysdate
FROM svmc_apply_prep
/
The comments in the above script do a good job of describing what the
script does. We summarize the actions in the list below:
Define a set of sqlplus variables
Prepare database objects required by SVM model builder
Call DBMS_DATA_MINING.CREATE_MODEL()
Prepare database objects required to apply the model
Apply the model via SQL function call: PREDICTION_PROBABILITY()
Collect SVM predictions in table: vlm_predictions
We offer links to some documentation:
After we used vlm.rb to write vlm_scorer.sql we started vlm_scorer.sql
at 2009-06-21 09:58. The script finished at 2009-07-07 09:32 which is
a duration of 16 days. The hardware was a simple desktop Linux box.
Vendor:
http://ZaReason.com
CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 6000+
http://en.wikipedia.org/wiki/Athlon_64_X2
RAM: 2 GB
1 Disk:
HITACHI Deskstar P7K500 HDP725050GLA360 (0A35415) 500GB 7200 RPM 16MB
Cache SATA 3.0Gb/s 3.5-inch Internal Hard Drive
http://www.newegg.com/Product/Product.aspx?Item=N82E16822145215
The Oracle RDBMS was an unpatched install of 11g:
Oracle Database 11g Enterprise Edition Release 11.1.0.6.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
http://www.oracle.com/technology/products/database/oracle11g/index.html
The Linux Kernel:
- Linux zareason 2.6.24-21-generic #1 SMP Mon Aug 25 16:57:51 UTC 2008 x86_64 GNU/Linux
-
http://en.wikipedia.org/wiki/Linux_kernel
The Linux Distribution:
oracle@zareason:/etc$
oracle@zareason:/etc$ head /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=8.04
DISTRIB_CODENAME=hardy
DISTRIB_DESCRIPTION="Ubuntu 8.04.1"
oracle@zareason:/etc$
oracle@zareason:/etc$
http://en.wikipedia.org/wiki/List_of_Ubuntu_releases#Ubuntu_8.04_LTS_.28Hardy_Heron.29
After vlm_scorer.sql finished, we ran a simple set of SQL queries to
analyze the results:
13:41:24 SQL> --
13:41:24 SQL> -- qry_vlm_predictions.sql
13:41:24 SQL> --
13:41:24 SQL>
13:41:24 SQL> -- First look at just the predictions:
13:41:24 SQL> SELECT ROUND(prob_up,1),COUNT(ROUND(prob_up,1))
13:41:24 2 FROM vlm_predictions
13:41:24 3 GROUP BY ROUND(prob_up,1)
13:41:24 4 ORDER BY ROUND(prob_up,1)
13:41:24 5 /
ROUND(PROB_UP,1) COUNT(ROUND(PROB_UP,1))
---------------- -----------------------
0 110977
.1 784944
.2 1210350
.3 936659
.4 692765
.5 671133
.6 769430
.7 969763
.8 1192671
.9 912205
1 73932
11 rows selected.
Elapsed: 00:00:10.78
The above output was charted in a spreadsheet:
18:35:18 SQL> CREATE TABLE vlm_pred_pg COMPRESS AS SELECT
18:35:18 2 v.optid
18:35:18 3 ,prob_up
18:35:18 4 ,pg20
18:35:18 5 ,porc
18:35:18 6 ,pdate
18:35:18 7 FROM vlm_predictions v, vlmpg g
18:35:18 8 WHERE v.optid = g.optid
18:35:18 9 /
Table created.
Elapsed: 00:19:35.62
18:54:54 SQL>
18:54:54 SQL> ANALYZE TABLE vlm_pred_pg ESTIMATE STATISTICS SAMPLE 9 PERCENT;
Table analyzed.
Elapsed: 00:00:14.61
19:04:51 SQL> -- Look for correlation between prediction and pg20
19:04:51 SQL> SELECT ROUND(prob_up,1),ROUND(AVG(pg20),3)avg_pg20,COUNT(ROUND(prob_up,1))
19:04:51 2 FROM vlm_pred_pg
19:04:51 3 GROUP BY ROUND(prob_up,1)
19:04:51 4 ORDER BY ROUND(prob_up,1)
19:04:51 5 /
ROUND(PROB_UP,1) AVG_PG20 COUNT(ROUND(PROB_UP,1))
---------------- ---------- -----------------------
0 -.053 110971
.1 -.068 784887
.2 -.069 1210257
.3 -.053 936567
.4 -.042 692692
.5 -.034 671077
.6 -.031 769358
.7 -.026 969684
.8 -.028 1192522
.9 -.048 912090
1 -.076 73924
11 rows selected.
Elapsed: 00:00:06.47
The above output was charted in a spreadsheet:

Notice the label at the bottom of the spreadsheet:
"Predicted Probability of being 'up'"
Another term for this measure is SVM confidence level, or just
confidence level. It tells us in precise numeric terms how confident
SVM is about its predictions.
The above chart shows that as SVM confidence increases, its ability to
pinpoint higher performing contracts also increases. Well, except at the
ends were SVM is extremely unconfident or where it is over 90% confident.
So, if we accept confidence levels between 0.2 and 0.8, we can be
confident that SVM has the ability to separate higher performing
contracts from lower performing contracts.
Next, we look at puts, confidence levels, average 20 day percent gain
for months which historically tend to be good for puts:
19:04:57 SQL>
19:04:57 SQL> -- Look at puts for rounded months: 1,6,7,8,9
19:04:57 SQL> SELECT ROUND(prob_up,1),ROUND(AVG(pg20),3)avg_pg20,COUNT(ROUND(prob_up,1))
19:04:57 2 FROM vlm_pred_pg
19:04:57 3 WHERE porc = -1
19:04:57 4 AND TO_CHAR(ROUND(pdate,'MM'),'MM')IN('01','06','07','08','09')
19:04:57 5 GROUP BY ROUND(prob_up,1)
19:04:57 6 ORDER BY ROUND(prob_up,1)
19:04:57 7 /
ROUND(PROB_UP,1) AVG_PG20 COUNT(ROUND(PROB_UP,1))
---------------- ---------- -----------------------
0 -.024 26427
.1 -.047 178585
.2 -.041 257261
.3 -.01 213094
.4 .034 159965
.5 .056 140764
.6 .053 156699
.7 .048 199664
.8 .037 266759
.9 .02 189968
1 -.006 14581
11 rows selected.
Elapsed: 00:00:08.31
The above output was charted in a spreadsheet:

The shape of the above chart suggests that for confidence levels
between 0.1 and 0.5, SVM works very well. Then as it becomes more
confident, its ability to locate better performing puts becomes
degraded. Still, however, confidence levels between 0.6 and 0.9 would
lead the investor to money-making-puts.
Also the confidence levels between 0.6 and 0.9 would
lead the investor to puts which perform better than average.
The simple query listed below shows us how to quickly obtain the average:
17:16:07 SQL> -- Look at entire average
17:16:07 SQL> SELECT 100*ROUND(AVG(pg20),3), COUNT(*)
17:16:07 2 FROM vfact
17:16:07 3 WHERE porc = -1 -- porc is 1 for calls, -1 for puts
17:16:07 4 AND mnth_round IN('01','06','07','08','09')
17:16:07 5 /
100*ROUND(AVG(PG20),3) COUNT(*)
---------------------- ----------
2.2 1803890
Elapsed: 00:00:00.00
17:16:07 SQL>
Next, we look at calls, confidence levels, average 20 day percent gain
for months which historically tend to be good for calls:
19:05:05 SQL>
19:05:05 SQL> -- Look at calls for rounded months: 3,4,10,11
19:05:05 SQL> SELECT ROUND(prob_up,1),ROUND(AVG(pg20),3)avg_pg20,COUNT(ROUND(prob_up,1))
19:05:05 2 FROM vlm_pred_pg
19:05:05 3 WHERE porc = 1
19:05:05 4 AND TO_CHAR(ROUND(pdate,'MM'),'MM')IN('03','04','10','11')
19:05:05 5 GROUP BY ROUND(prob_up,1)
19:05:05 6 ORDER BY ROUND(prob_up,1)
19:05:05 7 /
ROUND(PROB_UP,1) AVG_PG20 COUNT(ROUND(PROB_UP,1))
---------------- ---------- -----------------------
0 -.085 15818
.1 -.077 135626
.2 -.115 247059
.3 -.084 162303
.4 -.069 115892
.5 -.057 108087
.6 -.048 112322
.7 -.047 128418
.8 -.049 140420
.9 -.049 102603
1 -.117 6989
11 rows selected.
Elapsed: 00:00:07.31
The above output was charted in a spreadsheet:

The shape of the above chart suggests that for confidence levels
between 0.2 and 0.9, which is most of the levels, that as SVM becomes
more confident, it can pinpoint calls which have a higher average 20
day percent gain.
To drive the point home, we show some data about the average 20 day
percent gain for all calls:

The average 20 day percent gain for all calls is -7.3%.
Notice that for confidence levels 0.6, 0.7, 0.8, and 0.9, SVM easily
picks calls which perform better than the average:
Confidence Avg20day_pct_gain
.6 -.4.8%
.7 -.4.7%
.8 -.4.9%
.9 -.4.9%
It even outperformed calls which Value Line flags as superior:

It is obvious that all of the calls lost money. This is not the fault of SVM.
It is a consequence of the actual data which is dominated by the year 2008.
Stocks, and thus calls, lost much value in 2008.

Since SVM does a better job of classifying calls than puts, can we use
confidence levels of calls to classify puts?
The answer is yes.
We used the script below to correlate call-confidence-levels with
average 20 day percent gain for puts.
16:39:02 SQL> --
16:39:02 SQL> -- qry_cjp.sql
16:39:02 SQL> --
16:39:02 SQL>
16:39:02 SQL> -- Queries calls joined with puts sort by call-prob_up and put_pg20
16:39:02 SQL>
16:39:02 SQL> -- Get some calls 1st
16:39:02 SQL> CREATE OR REPLACE VIEW cjp_c AS SELECT
16:39:02 2 optid optid_c
16:39:02 3 ,REPLACE(optid,'c_','p_')optid_p
16:39:02 4 ,prob_up confidence_c
16:39:02 5 FROM vlm_predictions
16:39:02 6 WHERE optid LIKE 'c%'
16:39:02 7 /
View created.
Elapsed: 00:00:00.73
16:39:03 SQL>
16:39:03 SQL> -- Now, Get some puts
16:39:03 SQL> CREATE OR REPLACE VIEW cjp_p AS SELECT
16:39:03 2 optid optid_p
16:39:03 3 FROM vlm_predictions
16:39:03 4 WHERE optid LIKE 'p%'
16:39:03 5 /
View created.
Elapsed: 00:00:00.05
16:39:03 SQL>
16:39:03 SQL> -- Join em to connect each put with corresponding call-confidence-level
16:39:03 SQL> CREATE OR REPLACE VIEW cjp AS SELECT
16:39:03 2 p.optid_p, confidence_c
16:39:03 3 FROM cjp_c c, cjp_p p
16:39:03 4 WHERE c.optid_p = p.optid_p
16:39:03 5 /
View created.
Elapsed: 00:00:00.05
16:39:03 SQL>
16:39:03 SQL> -- Now join again to get connect pg20 to the put and the call-confidence-level
16:39:03 SQL> CREATE OR REPLACE VIEW cjp_pg20 AS SELECT
16:39:03 2 optid_p, confidence_c, pg20
16:39:03 3 FROM cjp, vlmpg v
16:39:03 4 WHERE optid_p = optid
16:39:03 5 AND TO_CHAR(ROUND(pdate,'MM'),'MM')IN('01','06','07','08','09')
16:39:03 6 /
View created.
Elapsed: 00:00:00.02
16:39:03 SQL>
16:39:03 SQL> -- Correlate confidence_c to pg20
16:39:03 SQL> SELECT ROUND(confidence_c,1)call_confidence,ROUND(AVG(pg20),3)put_avg_pg20,COUNT(*)
16:39:03 2 FROM cjp_pg20
16:39:03 3 GROUP BY ROUND(confidence_c,1)
16:39:03 4 ORDER BY ROUND(confidence_c,1)
16:39:03 5 /
CALL_CONFIDENCE PUT_AVG_PG20 COUNT(*)
--------------- ------------ ----------
0 -.036 10822
.1 0 89556
.2 .03 112371
.3 .047 100076
.4 .048 91157
.5 .047 86579
.6 .041 93258
.7 .028 108699
.8 .018 118745
.9 .036 76421
1 .076 3584
11 rows selected.
Elapsed: 00:05:47.29
16:44:50 SQL>
The above output was charted in a spreadsheet:

The above graph tells us that call confidence levels do not work as
well for puts as they do for calls. Still however, the investor would
make money from puts if he used the lower call confidence levels 0.2,
0.3, and 0.4.
Compare this to the return for all puts which were bought using a
strategy based solely on the month of the year:
17:16:07 SQL> --
17:16:07 SQL> -- qry_puts16789.sql
17:16:07 SQL> --
17:16:07 SQL>
17:16:07 SQL> -- Query the return of puts bought based on a "time of year" strategy developed during study of ETFs:
17:16:07 SQL> -- http://bikle.com/protected/sim
17:16:07 SQL>
17:16:07 SQL> -- Look at monthly averages first
17:16:07 SQL> SELECT ROUND(AVG(pg20),3), mnth_round, COUNT(*)
17:16:07 2 FROM vfact
17:16:07 3 WHERE porc = -1 -- porc is 1 for calls, -1 for puts
17:16:07 4 AND mnth_round IN('01','06','07','08','09')
17:16:07 5 GROUP BY mnth_round
17:16:07 6 ORDER BY mnth_round
17:16:07 7 /
ROUND(AVG(PG20),3) MNTH_ROUND COUNT(*)
------------------ ---------- ----------
.076 1 373293
-.033 6 401058
.072 7 294085
-.064 8 414379
.094 9 321075
Elapsed: 00:00:00.01
The above output was charted in a spreadsheet:

Notice that we have 3 points (months) which were significantly above
zero and two that were below zero.
Now average over all five months.
17:16:07 SQL>
17:16:07 SQL> -- Look at entire average
17:16:07 SQL> SELECT 100*ROUND(AVG(pg20),3), COUNT(*)
17:16:07 2 FROM vfact
17:16:07 3 WHERE porc = -1 -- porc is 1 for calls, -1 for puts
17:16:07 4 AND mnth_round IN('01','06','07','08','09')
17:16:07 5 /
100*ROUND(AVG(PG20),3) COUNT(*)
---------------------- ----------
2.2 1803890
Elapsed: 00:00:00.00
17:16:07 SQL>
The above output was charted in a spreadsheet:

A comparison of the above average and the averages of puts linked to
low call-confidence-levels reveals that low call-confidence-levels can
lead us to superior returns for puts.
So yes, we can use confidence levels of calls to classify puts.
Conclusion
We are interested in the average 20 day price percentage gain of stock
option contracts (puts and calls).
Currently, each day that the options market closes, Value Line provides us over
230,00 rows of option contract data.
We have collected over 100 million of these rows spanning 39 months.
Additionally, each row has a rich set of attributes.
When we load these rows into a fact table and then query it, no single
attribute appears to be strongly correlated with the average 20 day
percentage gain of the rows.
When we feed this data to the SVM algorithm, however, we found that
SVM has an ability to correlate a mixture of the attributes with
future option contract performance.
Also we found that Oracle's implementation of SVM is both easy to
operate and well suited for processing a large amount of data.
The amount of syntax we used to operate SVM in Oracle easily fit
within this web page. This is an important finding; SVM has a
reputation for being difficult to operate due to the programming
effort required to feed data to SVM.
Our operation of the algorithm yielded the observation that
percentage gain of call option contracts is easier to classify than
the gain of puts.
The predictive power of SVM coupled with Value Line data and monthly
patterns of stock market movements provides the options trader an edge
in the market.