Data Mining Valueline Puts Data With SVM Algorithm
Dan Bikle -- http://bikle.com -- 2009-07-25

After I finished the document, Crunch Value Line Option Data With Oracle and Ruby , I asked two questions:

1. "What if I separate the puts-data from the calls-data?"

2. "Will this help SVM run faster and/or be more accurate?"

I started this effort by writing an SQL script which prepares puts-data for SVM.

The script is listed below:


--
-- prep4svm.sql
--

CONNECT u2/u

ALTER SESSION SET NLS_DATE_FORMAT='YYYY-MM-DD';

DROP   TABLE vlp10;
CREATE TABLE vlp10 COMPRESS AS SELECT 
 tkr||strike||expday||pricedate optid
,tkr||strike||expday            optsig
,ask,bid,pricedate
FROM porcs_p
WHERE ask != 0 -- We dont divide by 0
AND LENGTH(tkr) <6 -- avoid strange looking tkrs
AND LENGTH(tkr||strike||expday||pricedate) <33
-- AND cmrk='5'
ORDER BY tkr||strike||expday||pricedate 
/

ALTER TABLE vlp10 MODIFY optid VARCHAR2(33);
ANALYZE TABLE vlp10 ESTIMATE STATISTICS SAMPLE 9 PERCENT;

DROP   TABLE vlppg;
CREATE TABLE vlppg COMPRESS AS SELECT optid,pg20,pdate FROM
(
  SELECT
  optid
  ,pricedate pdate
  ,(LEAD(bid,10,bid)OVER(PARTITION BY optsig ORDER BY optid)-ask)/ask     pg20
  ,(LEAD(pricedate,20,pricedate)OVER(PARTITION BY optsig ORDER BY optid)) pdate20
  FROM vlp10
  WHERE ask != 0 -- We dont divide by 0
  ORDER BY optid
)
-- Avoid large date gaps in the data
WHERE pdate20 - pdate < 35
/

ALTER TABLE vlppg MODIFY optid VARCHAR2(33);
ANALYZE TABLE vlppg ESTIMATE STATISTICS SAMPLE 9 PERCENT;

DROP   TABLE vlp20;
CREATE TABLE vlp20 COMPRESS AS SELECT
tkr||strike||expday||pricedate optid
,tkr
,pricedate pdate -- shorten the name of pricedate
,expday - pricedate days2exp
,0+REPLACE(volf,'%','')volf
,0+REPLACE(histvo,'%','')histvo
,0+ask     ask
,0+cmprice cmprice
,0+oi     oi
,0+vol    vol
,(ask-bid)/ask ask_bid_r
,0+bidoprk bidoprk
,0+askoprk askoprk
,0+REPLACE(bidunov,'%','')bidunov
,0+REPLACE(askunov,'%','')askunov
,0+REPLACE(itmotm,'%','')itmotm
,0+cmrk cmrk
,0+techrank techrank
FROM porcs_p
WHERE 0+ask>0 AND 0+cmprice>0
AND   0+ask BETWEEN 2.0 AND 20.0 AND (ask-bid)/ask < 0.07
AND expday - pricedate > 33
AND (ABS(cmprice - strike)/cmprice) < 0.2
AND 0+oi+vol > 0
AND LENGTH(tkr||strike||expday||pricedate) <33
/

ALTER TABLE vlp20 MODIFY optid VARCHAR2(33);

ANALYZE TABLE vlp20 ESTIMATE STATISTICS SAMPLE 9 PERCENT;

-- Now build back-testing table full of attributes

DROP   TABLE vlpatt;
CREATE TABLE vlpatt COMPRESS AS SELECT
b.optid
,tkrnum
,b.pdate -- Cannot have this in sme or bme, but I need it to build them
,days2exp
,sysdate - b.pdate dayspast
,b.ask_bid_r
,b.askoprk
,b.askunov
,b.bidoprk
,b.bidunov
,b.ask
,b.cmprice
,b.cmrk
,b.itmotm
,b.oi
,b.techrank
,b.vol
,b.volf
,b.histvo
,pg20
,CASE WHEN pg20 < 0.2 THEN 'nup' ELSE 'up' END aapg20
FROM vlp20 b, vlppg p, tkr2num t
WHERE b.optid = p.optid AND b.tkr = t.tkr
/

ANALYZE TABLE vlpatt ESTIMATE STATISTICS SAMPLE 9 PERCENT;

The duration of the above script was 116 minutes.

Next, I wrote a ruby script to generate a large SQL script:


#!/usr/bin/env ruby

# vlp.rb

# Demo: ruby -r vlp.rb -e 'Vl.new.write_score_sql_script'

# Ruby script which uses nested loops to generate much SQL.
# The SQL drives the SVM algorithm resident within Oracle Data Miner.

class Vlp
  require 'rubygems'
  require 'oci8'
  def write_score_sql_script
    some_dates_a = []
    date_sql = "SELECT DISTINCT TO_CHAR(pdate,'YYYY-MM-DD') FROM vlpatt"
    date_sql << " WHERE pdate BETWEEN TO_DATE('2003-04-01','YYYY-MM-DD') AND TO_DATE('2009-06-05','YYYY-MM-DD')"
    OCI8.new('u2','u').exec(date_sql){ |r| some_dates_a << r.to_s }
    p("Found #{some_dates_a.size.to_s} dates in vlpatt.  Check your directory for a new file.")
    some_dates_a.sort.each do |pdate|

sqlplus_syntax = <<ENDsqlplus_syntax1
--
CREATE OR REPLACE VIEW sme AS SELECT 
optid
,tkrnum
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,null aapg20
FROM vlpatt
WHERE pdate='#{pdate}'
/
CREATE OR REPLACE VIEW bme AS SELECT 
optid
,tkrnum
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,aapg20
FROM vlpatt
WHERE 35+pdate<'#{pdate}'
/
@cr_bme.sql
PURGE RECYCLEBIN;
--

ENDsqlplus_syntax1
      (fha = File.open("vlp_scorer.sql","a")).write(sqlplus_syntax); fha.close
    end # some_dates_a.each do
  end # def write_score_sql_script
end # class

Vlp.new.write_score_sql_script


The top and bottom of the resulting SQL script are displayed below:


CREATE OR REPLACE VIEW sme AS SELECT 
optid
,tkrnum
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,null aapg20
FROM vlpatt
WHERE pdate='2003-04-01'
/
CREATE OR REPLACE VIEW bme AS SELECT 
optid
,tkrnum
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,aapg20
FROM vlpatt
WHERE 35+pdate<'2003-04-01'
/
@cr_bme.sql
PURGE RECYCLEBIN;


-- snip snip --

CREATE OR REPLACE VIEW sme AS SELECT 
optid
,tkrnum
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,null aapg20
FROM vlpatt
WHERE pdate='2009-06-05'
/
CREATE OR REPLACE VIEW bme AS SELECT 
optid
,tkrnum
,days2exp
,dayspast
,ask_bid_r
,askoprk
,askunov
,bidoprk
,bidunov
,ask
,cmprice
,cmrk
,itmotm
,oi
,techrank
,vol
,volf
,histvo
,aapg20
FROM vlpatt
WHERE 35+pdate<'2009-06-05'
/
@cr_bme.sql
PURGE RECYCLEBIN;

Then, I ran the SQL script.

The duration of the SQL script was about 3.5 days. This was a huge improvement over the 16 day duration of my previous effort. From that perspective alone, it makes sense to separate puts-data from calls-data.

Next, I wrote a SQL script to interpret the results.

Output from the script is displayed below:



23:45:16 SQL> @qry_vlp_predictions.sql 
23:45:24 SQL> --
23:45:24 SQL> -- qry_vlp_predictions.sql
23:45:24 SQL> --
23:45:24 SQL> 
23:45:24 SQL> -- First look at just the predictions:
23:45:24 SQL> SELECT COUNT ( ROUND ( run_date )) , ROUND ( run_date )
23:45:24   2  FROM vlp_predictions
23:45:24   3  GROUP BY ROUND ( run_date )ORDER BY ROUND ( run_date )
23:45:24   4  /

COUNT(ROUND(RUN_DATE)) ROUND(RUN_
---------------------- ----------
                 49483 2009-07-14
               2705037 2009-07-15
                639051 2009-07-16
                476288 2009-07-17
                236318 2009-07-18

Elapsed: 00:00:07.18
23:45:31 SQL> 
23:45:31 SQL> SELECT ROUND(prob_up,1),COUNT(ROUND(prob_up,1))
23:45:31   2  FROM vlp_predictions
23:45:31   3  GROUP BY ROUND(prob_up,1)
23:45:31   4  ORDER BY ROUND(prob_up,1)
23:45:31   5  /

ROUND(PROB_UP,1) COUNT(ROUND(PROB_UP,1))
---------------- -----------------------
               0                   59296
              .1                  169621
              .2                  184730
              .3                  221885
              .4                  359259
              .5                  252299
              .6                  353005
              .7                  555355
              .8                  988589
              .9                  888923
               1                   73215

11 rows selected.

Elapsed: 00:00:02.12
23:45:33 SQL> 
23:45:33 SQL> -- Join vlp_predictions with vlppg to see how well the predictions did.
23:45:33 SQL> SELECT COUNT(*)FROM vlp_predictions;

  COUNT(*)
----------
   4106177

Elapsed: 00:00:00.21
23:45:33 SQL> 
23:45:33 SQL> -- Do we get a good join?
23:45:33 SQL> SELECT COUNT(*)
23:45:33   2  FROM vlp_predictions v, vlppg g
23:45:33   3  WHERE v.optid = g.optid
23:45:33   4  
23:45:33 SQL> 
23:45:33 SQL> DROP   TABLE vlp_pred_pg;

Table dropped.

Elapsed: 00:00:00.82
23:45:34 SQL> CREATE TABLE vlp_pred_pg COMPRESS AS SELECT
23:45:34   2  v.optid
23:45:34   3  ,prob_up
23:45:34   4  ,pg20
23:45:34   5  ,pdate
23:45:34   6  FROM vlp_predictions v, vlppg g
23:45:34   7  WHERE v.optid = g.optid
23:45:34   8  -- Newer data should give a better SVM result.
23:45:34   9  AND pdate > '2004-01-01'
23:45:34  10  /

Table created.

Elapsed: 00:05:01.64
23:50:36 SQL> 
23:50:36 SQL> ANALYZE TABLE vlp_pred_pg ESTIMATE STATISTICS SAMPLE 9 PERCENT;

Table analyzed.

Elapsed: 00:00:05.01
23:50:41 SQL> 
23:50:41 SQL> -- Look for correlation between prediction and pg20
23:50:41 SQL> SELECT ROUND(prob_up,1),COUNT(ROUND(prob_up,1)),ROUND(AVG(pg20),3)avg_pg20
23:50:41   2  FROM  vlp_pred_pg
23:50:41   3  GROUP BY ROUND(prob_up,1)
23:50:41   4  ORDER BY ROUND(prob_up,1)
23:50:41   5  /

ROUND(PROB_UP,1) COUNT(ROUND(PROB_UP,1))   AVG_PG20
---------------- ----------------------- ----------
               0                    8277      -.035
              .1                   65641      -.034
              .2                  108971      -.037
              .3                  156784      -.021
              .4                  296825       -.01
              .5                  192950      -.013
              .6                  292783       .008
              .7                  485257        .01
              .8                  906942       .021
              .9                  816193       .011
               1                   60896       .008

11 rows selected.

Elapsed: 00:00:02.90
23:50:44 SQL> 
23:50:44 SQL> -- Look at puts for rounded months: 1,6,7,8,9
23:50:44 SQL> SELECT ROUND(prob_up,1),COUNT(ROUND(prob_up,1)),ROUND(AVG(pg20),3)avg_pg20
23:50:44   2  FROM  vlp_pred_pg
23:50:44   3  WHERE TO_CHAR(ROUND(pdate,'MM'),'MM')IN('01','06','07','08','09')
23:50:44   4  GROUP BY ROUND(prob_up,1)
23:50:44   5  ORDER BY ROUND(prob_up,1)
23:50:44   6  /

ROUND(PROB_UP,1) COUNT(ROUND(PROB_UP,1))   AVG_PG20
---------------- ----------------------- ----------
               0                    1215       .023
              .1                   23807      -.016
              .2                   50419      -.024
              .3                   64156      -.015
              .4                   75874      -.003
              .5                   92369       .014
              .6                  136113       .058
              .7                  214266        .06
              .8                  422620       .058
              .9                  406018        .05
               1                   35399       .062

11 rows selected.

Elapsed: 00:00:06.19
23:50:50 SQL> 
23:50:50 SQL> -- for spreadsheet:
23:50:50 SQL> -- Look for correlation between prediction and pg20
23:50:50 SQL> SELECT ROUND(prob_up,1)||','||ROUND(100*AVG(pg20),1)||','||COUNT(ROUND(prob_up,1))x
23:50:50   2  FROM  vlp_pred_pg
23:50:50   3  GROUP BY ROUND(prob_up,1)
23:50:50   4  ORDER BY ROUND(prob_up,1)
23:50:50   5  /

X
---------------------------------------------------------------------------------------------------
0,-3.5,8277
.1,-3.4,65641
.2,-3.7,108971
.3,-2.1,156784
.4,-1,296825
.5,-1.3,192950
.6,.8,292783
.7,1,485257
.8,2.1,906942
.9,1.1,816193
1,.8,60896

11 rows selected.

Elapsed: 00:00:02.51
23:50:52 SQL> 
23:50:52 SQL> -- for spreadsheet:
23:50:52 SQL> -- Look at puts for rounded months: 1,6,7,8,9
23:50:52 SQL> SELECT ROUND(prob_up,1)||','||ROUND(100*AVG(pg20),1)||','||COUNT(ROUND(prob_up,1))x
23:50:52   2  FROM  vlp_pred_pg
23:50:52   3  WHERE TO_CHAR(ROUND(pdate,'MM'),'MM')IN('01','06','07','08','09')
23:50:52   4  GROUP BY ROUND(prob_up,1)
23:50:52   5  ORDER BY ROUND(prob_up,1)
23:50:52   6  /

X
---------------------------------------------------------------------------------------------------
0,2.3,1215
.1,-1.6,23807
.2,-2.4,50419
.3,-1.5,64156
.4,-.3,75874
.5,1.4,92369
.6,5.8,136113
.7,6,214266
.8,5.8,422620
.9,5,406018
1,6.2,35399

11 rows selected.

Elapsed: 00:00:06.17
23:50:59 SQL> 


I copied some output from the above script into to spreadsheets and charted the results:

It is clear in the above chart that as SVM becomes more confident, its predictions yield more lucrative put contracts.

At this URL, http://bikle.com/protected/sim/ , we point out that stocks are cold during the months 1,6,7,8,9

The chart below displays SVM behavior for those months:

For a large range of confidence levels, 0.2 to 1.0, higher SVM confidence levels yielded better results.

An options trader using SVM would have gained an average of about 5.5% for each month that he used this strategy.

When we compare the above results to the results described here, http://bikle.com/protected/vlm , we see that SVM was more accurate when it was focused solely on puts-data.

Conclusion

It is a good idea to separate puts-data from calls-data before feeding them to SVM for data mining.

In particular the duration of the SVM run was cut over 50%. Also comparison of charts from the two methods showed a clear increase in accuracy when the two types of data were separated.