Data Mining Valueline Puts Data With SVM Algorithm
Dan Bikle -- http://bikle.com -- 2009-07-25 After I finished the document, Crunch Value Line Option Data With Oracle and Ruby , I asked two questions: 1. "What if I separate the puts-data from the calls-data?" 2. "Will this help SVM run faster and/or be more accurate?" I started this effort by writing an SQL script which prepares puts-data for SVM. The script is listed below:
--
-- prep4svm.sql
--
CONNECT u2/u
ALTER SESSION SET NLS_DATE_FORMAT='YYYY-MM-DD';
DROP TABLE vlp10;
CREATE TABLE vlp10 COMPRESS AS SELECT
tkr||strike||expday||pricedate optid
,tkr||strike||expday optsig
,ask,bid,pricedate
FROM porcs_p
WHERE ask != 0 -- We dont divide by 0
AND LENGTH(tkr) <6 -- avoid strange looking tkrs
AND LENGTH(tkr||strike||expday||pricedate) <33
-- AND cmrk='5'
ORDER BY tkr||strike||expday||pricedate
/
ALTER TABLE vlp10 MODIFY optid VARCHAR2(33);
ANALYZE TABLE vlp10 ESTIMATE STATISTICS SAMPLE 9 PERCENT;
DROP TABLE vlppg;
CREATE TABLE vlppg COMPRESS AS SELECT optid,pg20,pdate FROM
(
SELECT
optid
,pricedate pdate
,(LEAD(bid,10,bid)OVER(PARTITION BY optsig ORDER BY optid)-ask)/ask pg20
,(LEAD(pricedate,20,pricedate)OVER(PARTITION BY optsig ORDER BY optid)) pdate20
FROM vlp10
WHERE ask != 0 -- We dont divide by 0
ORDER BY optid
)
-- Avoid large date gaps in the data
WHERE pdate20 - pdate < 35
/
ALTER TABLE vlppg MODIFY optid VARCHAR2(33);
ANALYZE TABLE vlppg ESTIMATE STATISTICS SAMPLE 9 PERCENT;
DROP TABLE vlp20;
CREATE TABLE vlp20 COMPRESS AS SELECT
tkr||strike||expday||pricedate optid
,tkr
,pricedate pdate -- shorten the name of pricedate
,expday - pricedate days2exp
,0+REPLACE(volf,'%','')volf
,0+REPLACE(histvo,'%','')histvo
,0+ask ask
,0+cmprice cmprice
,0+oi oi
,0+vol vol
,(ask-bid)/ask ask_bid_r
,0+bidoprk bidoprk
,0+askoprk askoprk
,0+REPLACE(bidunov,'%','')bidunov
,0+REPLACE(askunov,'%','')askunov
,0+REPLACE(itmotm,'%','')itmotm
,0+cmrk cmrk
,0+techrank techrank
FROM porcs_p
WHERE 0+ask>0 AND 0+cmprice>0
AND 0+ask BETWEEN 2.0 AND 20.0 AND (ask-bid)/ask < 0.07
AND expday - pricedate > 33
AND (ABS(cmprice - strike)/cmprice) < 0.2
AND 0+oi+vol > 0
AND LENGTH(tkr||strike||expday||pricedate) <33
/
ALTER TABLE vlp20 MODIFY optid VARCHAR2(33);
ANALYZE TABLE vlp20 ESTIMATE STATISTICS SAMPLE 9 PERCENT;
-- Now build back-testing table full of attributes
DROP TABLE vlpatt;
CREATE TABLE vlpatt COMPRESS AS SELECT
b.optid
,tkrnum
,b.pdate -- Cannot have this in sme or bme, but I need it to build them
,days2exp
,sysdate - b.pdate dayspast
,b.ask_bid_r
,b.askoprk
,b.askunov
,b.bidoprk
,b.bidunov
,b.ask
,b.cmprice
,b.cmrk
,b.itmotm
,b.oi
,b.techrank
,b.vol
,b.volf
,b.histvo
,pg20
,CASE WHEN pg20 < 0.2 THEN 'nup' ELSE 'up' END aapg20
FROM vlp20 b, vlppg p, tkr2num t
WHERE b.optid = p.optid AND b.tkr = t.tkr
/
ANALYZE TABLE vlpatt ESTIMATE STATISTICS SAMPLE 9 PERCENT;
The duration of the above script was 116 minutes.
Next, I wrote a ruby script to generate a large SQL script:
#!/usr/bin/env ruby
# vlp.rb
# Demo: ruby -r vlp.rb -e 'Vl.new.write_score_sql_script'
# Ruby script which uses nested loops to generate much SQL.
# The SQL drives the SVM algorithm resident within Oracle Data Miner.
class Vlp
require 'rubygems'
require 'oci8'
def write_score_sql_script
some_dates_a = []
date_sql = "SELECT DISTINCT TO_CHAR(pdate,'YYYY-MM-DD') FROM vlpatt"
date_sql << " WHERE pdate BETWEEN TO_DATE('2003-04-01','YYYY-MM-DD') AND TO_DATE('2009-06-05','YYYY-MM-DD')"
OCI8.new('u2','u').exec(date_sql){ |r| some_dates_a << r.to_s }
p("Found #{some_dates_a.size.to_s} dates in vlpatt. Check your directory for a new file.")
some_dates_a.sort.each do |pdate|
sqlplus_syntax = <<ENDsqlplus_syntax1
--
CREATE OR REPLACE VIEW sme AS SELECT
optid
,tkrnum
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,null aapg20
FROM vlpatt
WHERE pdate='#{pdate}'
/
CREATE OR REPLACE VIEW bme AS SELECT
optid
,tkrnum
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,aapg20
FROM vlpatt
WHERE 35+pdate<'#{pdate}'
/
@cr_bme.sql
PURGE RECYCLEBIN;
--
ENDsqlplus_syntax1
(fha = File.open("vlp_scorer.sql","a")).write(sqlplus_syntax); fha.close
end # some_dates_a.each do
end # def write_score_sql_script
end # class
Vlp.new.write_score_sql_script
The top and bottom of the resulting SQL script are displayed below:
CREATE OR REPLACE VIEW sme AS SELECT
optid
,tkrnum
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,null aapg20
FROM vlpatt
WHERE pdate='2003-04-01'
/
CREATE OR REPLACE VIEW bme AS SELECT
optid
,tkrnum
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,aapg20
FROM vlpatt
WHERE 35+pdate<'2003-04-01'
/
@cr_bme.sql
PURGE RECYCLEBIN;
-- snip snip --
CREATE OR REPLACE VIEW sme AS SELECT
optid
,tkrnum
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,null aapg20
FROM vlpatt
WHERE pdate='2009-06-05'
/
CREATE OR REPLACE VIEW bme AS SELECT
optid
,tkrnum
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,aapg20
FROM vlpatt
WHERE 35+pdate<'2009-06-05'
/
@cr_bme.sql
PURGE RECYCLEBIN;
Then, I ran the SQL script.
The duration of the SQL script was about 3.5 days. This was a huge
improvement over the 16 day duration of my previous effort. From that
perspective alone, it makes sense to separate puts-data from
calls-data.
Next, I wrote a SQL script to interpret the results.
Output from the script is displayed below:
23:45:16 SQL> @qry_vlp_predictions.sql
23:45:24 SQL> --
23:45:24 SQL> -- qry_vlp_predictions.sql
23:45:24 SQL> --
23:45:24 SQL>
23:45:24 SQL> -- First look at just the predictions:
23:45:24 SQL> SELECT COUNT ( ROUND ( run_date )) , ROUND ( run_date )
23:45:24 2 FROM vlp_predictions
23:45:24 3 GROUP BY ROUND ( run_date )ORDER BY ROUND ( run_date )
23:45:24 4 /
COUNT(ROUND(RUN_DATE)) ROUND(RUN_
---------------------- ----------
49483 2009-07-14
2705037 2009-07-15
639051 2009-07-16
476288 2009-07-17
236318 2009-07-18
Elapsed: 00:00:07.18
23:45:31 SQL>
23:45:31 SQL> SELECT ROUND(prob_up,1),COUNT(ROUND(prob_up,1))
23:45:31 2 FROM vlp_predictions
23:45:31 3 GROUP BY ROUND(prob_up,1)
23:45:31 4 ORDER BY ROUND(prob_up,1)
23:45:31 5 /
ROUND(PROB_UP,1) COUNT(ROUND(PROB_UP,1))
---------------- -----------------------
0 59296
.1 169621
.2 184730
.3 221885
.4 359259
.5 252299
.6 353005
.7 555355
.8 988589
.9 888923
1 73215
11 rows selected.
Elapsed: 00:00:02.12
23:45:33 SQL>
23:45:33 SQL> -- Join vlp_predictions with vlppg to see how well the predictions did.
23:45:33 SQL> SELECT COUNT(*)FROM vlp_predictions;
COUNT(*)
----------
4106177
Elapsed: 00:00:00.21
23:45:33 SQL>
23:45:33 SQL> -- Do we get a good join?
23:45:33 SQL> SELECT COUNT(*)
23:45:33 2 FROM vlp_predictions v, vlppg g
23:45:33 3 WHERE v.optid = g.optid
23:45:33 4
23:45:33 SQL>
23:45:33 SQL> DROP TABLE vlp_pred_pg;
Table dropped.
Elapsed: 00:00:00.82
23:45:34 SQL> CREATE TABLE vlp_pred_pg COMPRESS AS SELECT
23:45:34 2 v.optid
23:45:34 3 ,prob_up
23:45:34 4 ,pg20
23:45:34 5 ,pdate
23:45:34 6 FROM vlp_predictions v, vlppg g
23:45:34 7 WHERE v.optid = g.optid
23:45:34 8 -- Newer data should give a better SVM result.
23:45:34 9 AND pdate > '2004-01-01'
23:45:34 10 /
Table created.
Elapsed: 00:05:01.64
23:50:36 SQL>
23:50:36 SQL> ANALYZE TABLE vlp_pred_pg ESTIMATE STATISTICS SAMPLE 9 PERCENT;
Table analyzed.
Elapsed: 00:00:05.01
23:50:41 SQL>
23:50:41 SQL> -- Look for correlation between prediction and pg20
23:50:41 SQL> SELECT ROUND(prob_up,1),COUNT(ROUND(prob_up,1)),ROUND(AVG(pg20),3)avg_pg20
23:50:41 2 FROM vlp_pred_pg
23:50:41 3 GROUP BY ROUND(prob_up,1)
23:50:41 4 ORDER BY ROUND(prob_up,1)
23:50:41 5 /
ROUND(PROB_UP,1) COUNT(ROUND(PROB_UP,1)) AVG_PG20
---------------- ----------------------- ----------
0 8277 -.035
.1 65641 -.034
.2 108971 -.037
.3 156784 -.021
.4 296825 -.01
.5 192950 -.013
.6 292783 .008
.7 485257 .01
.8 906942 .021
.9 816193 .011
1 60896 .008
11 rows selected.
Elapsed: 00:00:02.90
23:50:44 SQL>
23:50:44 SQL> -- Look at puts for rounded months: 1,6,7,8,9
23:50:44 SQL> SELECT ROUND(prob_up,1),COUNT(ROUND(prob_up,1)),ROUND(AVG(pg20),3)avg_pg20
23:50:44 2 FROM vlp_pred_pg
23:50:44 3 WHERE TO_CHAR(ROUND(pdate,'MM'),'MM')IN('01','06','07','08','09')
23:50:44 4 GROUP BY ROUND(prob_up,1)
23:50:44 5 ORDER BY ROUND(prob_up,1)
23:50:44 6 /
ROUND(PROB_UP,1) COUNT(ROUND(PROB_UP,1)) AVG_PG20
---------------- ----------------------- ----------
0 1215 .023
.1 23807 -.016
.2 50419 -.024
.3 64156 -.015
.4 75874 -.003
.5 92369 .014
.6 136113 .058
.7 214266 .06
.8 422620 .058
.9 406018 .05
1 35399 .062
11 rows selected.
Elapsed: 00:00:06.19
23:50:50 SQL>
23:50:50 SQL> -- for spreadsheet:
23:50:50 SQL> -- Look for correlation between prediction and pg20
23:50:50 SQL> SELECT ROUND(prob_up,1)||','||ROUND(100*AVG(pg20),1)||','||COUNT(ROUND(prob_up,1))x
23:50:50 2 FROM vlp_pred_pg
23:50:50 3 GROUP BY ROUND(prob_up,1)
23:50:50 4 ORDER BY ROUND(prob_up,1)
23:50:50 5 /
X
---------------------------------------------------------------------------------------------------
0,-3.5,8277
.1,-3.4,65641
.2,-3.7,108971
.3,-2.1,156784
.4,-1,296825
.5,-1.3,192950
.6,.8,292783
.7,1,485257
.8,2.1,906942
.9,1.1,816193
1,.8,60896
11 rows selected.
Elapsed: 00:00:02.51
23:50:52 SQL>
23:50:52 SQL> -- for spreadsheet:
23:50:52 SQL> -- Look at puts for rounded months: 1,6,7,8,9
23:50:52 SQL> SELECT ROUND(prob_up,1)||','||ROUND(100*AVG(pg20),1)||','||COUNT(ROUND(prob_up,1))x
23:50:52 2 FROM vlp_pred_pg
23:50:52 3 WHERE TO_CHAR(ROUND(pdate,'MM'),'MM')IN('01','06','07','08','09')
23:50:52 4 GROUP BY ROUND(prob_up,1)
23:50:52 5 ORDER BY ROUND(prob_up,1)
23:50:52 6 /
X
---------------------------------------------------------------------------------------------------
0,2.3,1215
.1,-1.6,23807
.2,-2.4,50419
.3,-1.5,64156
.4,-.3,75874
.5,1.4,92369
.6,5.8,136113
.7,6,214266
.8,5.8,422620
.9,5,406018
1,6.2,35399
11 rows selected.
Elapsed: 00:00:06.17
23:50:59 SQL>
I copied some output from the above script into to spreadsheets and charted the results:
It is clear in the above chart that as SVM becomes more confident, its predictions yield more lucrative put contracts.
At this URL,
http://bikle.com/protected/sim/
, we point out that stocks are cold during the months 1,6,7,8,9
The chart below displays SVM behavior for those months:
For a large range of confidence levels, 0.2 to 1.0, higher SVM confidence levels yielded better results.
An options trader using SVM would have gained an average of about 5.5% for each month that he used this strategy.
When we compare the above results to the results described here,
http://bikle.com/protected/vlm
, we see that SVM was more accurate when it was focused solely on puts-data.
Conclusion
It is a good idea to separate puts-data from calls-data before feeding them to SVM for data mining.
In particular the duration of the SVM run was cut over 50%. Also
comparison of charts from the two methods showed a clear increase in
accuracy when the two types of data were separated.