GSoC 2015 porting TISEAN package

czwartek, 20 sierpnia 2015

Summary of project

I have not written a post in a long while because I have had family issues. Although my progress was hampered, it was not completely halted. I was able to add some more features and clean some of the old code up. I have added two new sections with five new functions:

Testing for nonlinearity:

surrogates
endtoend
timerev

Spike trains:

spikeauto
spikespec

They are complete with documentation and some demos. The wiki has also been updated with a demo of surrogates. Thus the original plan has been completed apart from some of the functions in the final section Tutorial and randomize. There have been some additional functions: endtoend, spikeauto, spikespec and av_d2. These were added as a logical addition to the existing function.

What happened to `randomize`?

The first time I read the documentation and tried to plan the project I did not realize how big randomize is. Therefore, once I realized that it was a toolbox and not a function I needed to design how this toolbox should work with/in Octave. This took about a week of my time. I have not had any experience designing anything similar to randomize. I considered it an interesting challenge. I tried to create a hybrid that could use C++ and Octave functions to run simulated annealing. I wanted the main loop to be in C++ because this algorithm cannot be vectorized, so it will run slowly on Octave.

My plan was to create some "runner" that can call some abstract methods of the base cost function class. I was able to accomplish this only partly, even though I introduced some polymorphism I only created one cost function (I did not finish the second one). This one cost function and the runner worked properly (as tested against the results from TISEAN). I was even able to do a type of pausing (like in other functions in the package), which allows the user to break the execution without having to kill Octave. All of this worked correctly with the one cost function. My plan was to modify the code to ensure the runner could manage any type of cost functions (that inherit from the base cost function class) once the second cost function was completed.

Afterwards, I planned to create a base abstract class in Octave using classdef and similar keywords. Then I would have created an instance of the C++ class that would run the Octave methods using polymorphism. This was not possible however because classdef (and similar keywords) are yet not fully supported by Octave. They are not parsed properly by help, are not documented and not all of the Matlab functionality is present as of now.

Looking back at this attempt of porting the randomize toolbox I believe that with a good plan it could have taken more than half of my programming time this summer to create a complete toolbox with good tests and documentation. I do not regret trying to port it, but I think I can say that this part was not completed, even though much work was put into it.

All of the code created is placed in 'devel' folder. There are also some tests there that utilize the cgreen (cgreen++ actually) framework. It is almost completely undocumented, as I spent most of my time developing new code.

Other thoughts

I am very pleased with the support of the community. I expected they wouldn't have time to help, and at first I had trouble receiving the help I need, but once I understood what time (depending on who was online) to ask for help and which channel to use it became my biggest help in completing the project. My project needed help from people who knew things about how Octave was written in C++ and not just Octave developers. There were plenty of people who could assist me in just that.

One of the most exciting moments of the whole project was when I saw a problem on the mailing list that I encountered a few weeks earlier and could give some ideas on how to solve it.

If I were to do the project all over again and have to choose between the help of the community and the experience I have today after working on this project for over 3 months I would choose the community, because there are so many knowledgeable people there that know the inns and outs of Octave.

niedziela, 19 lipca 2015

Plans for randomize

This past week I have spent time trying to establish some testing framework for C++ methods and also trying to create a model for what the abstract base classes for cool, cost and permute should look like. I would like them to have the following methods/members:

Cost (Cost_fcn):

const Matrix *series
-> pointer to the series for which a surrogate is generated
double cost
-> current cost
void cost_transform (Matrix *)
-> initial transformation which is used for better calculation of cost
Matrix cost_invert () const
-> assigns to the input variable the inverse of the transformation performed above (to get the actual surrogate, not just a representation of it in a different form)
double cost_update (octave_idx_type nn1, octave_idx_type nn2, double cmax, bool &accept)
-> perform quick update of cost (for a swap of elements under index nn1 and nn2) and decide if cost is smaller than maximum cost (cmax) if yes, accept new cost and return true otherwise reject new cost and return false
double cost_full()
-> performs a full calculation of the cost, takes longer than cost_update()
getters/setters for cost

Cool (Cool_fcn):

double temp
-> holds the current temperature
double cool (double cost, bool accept, bool &end)
-> takes current cost, accept which holds whether the last cost_update() was accepted or not and returns the new temp and sets flag end to indicate if the simulated annealing is over

Permute (Permute_fcn):

Matrix *series
-> holds the pointer to the series, and modifies the series only when exch() is called
void permute (octave_idx_type &n1, octave_idx_type &n2) const
-> generates two indexes n1 and n2 that can be used to calculate Cost_fcn::cost_update()
void exch (octave_idx_type n1, octave_idx_type n2)
-> exchange element under n1 with element under n2 in the series

Those are the methods I intend to have in the base/abstract classes which will be called by the Simulated Annealing runner code. I have not decided what that code should look like, but the current version seems to be working as well as the randomize program from TISEAN package.

I was also hoping to create a subclass of each of those abstract classes built to call GNU Octave code. This would allow the user to create their own functions without having to write anything in C++. However, this idea might not be practical for the following reasons:

The example of Simulated Annealing provided in the TISEAN package (ver. 3.0.1), takes about 0.7 seconds to run using only C++ code and performs on average 900,000 calls to Cost_fcn::cost_update() and calling a simple function in Octave that many times (using the for loop) took 16 seconds
I have trouble deciding how to neatly pass these functions/classes to randomize along with some parameters the user might want to include. I originally thought of using classdef - the new keyword introduced in Octave 4.0.0. I hoped to create an abstract Octave class and then let anyone subclass it to create their own cost, cool and permute classes. The problem is that classdef and all of the associated keywords are not documented, moreover according to Carnë Draug the help function will not recognize this new type of Octave class. So even if all of the needed functionality was available in Octave I might not be able to document it for the user

If obstacle 2. can be overcome it might still be beneficial for the package to create this type of functionality, regardless of how long the code will execute.

This week I plan to refine the design of the abstract classes as well as port more of the cost function options from TISEAN.

[Update]: I modified the design a bit and updated this post to fit the new code.

czwartek, 9 lipca 2015

Progress report and plans

So far my progress has been as planned. Before the end of the midterm evaluation I was able to publish on my repository version 0.2.0 of the package, which included all of the functions from section Dimensions and entropies from the TISEAN documentation. As I mentioned in my previous post the functions that needed to be ported in this section are slightly different from what I wrote in my outline. The ported functions are:

d2
d1
boxcount
c2t
c2g
c2d
av_d2

I also wrote demos for most of those functions and updated the tutorial on the wiki page.

The first part of this week I spent improving on the build process. The function __c2g__ relies on C++ lambdas to work, therefore a configure script needed to be introduced to ensure the compiler has this capability. As was suggested by John Eaton, I tried to make the impact of not having that capability as small as possible. Currently if the compiler does not recognize C++ lambdas simply __c2g__ is not compiled and the function c2g does not work.

The plans

I was hoping to port all of the functions in the next section, Testing for Nonlinearity, by the end of the week. This might not be possible as randomize turned out to be a bigger function than I anticipated. It is actually not a function at all but, as the author of the TISEAN documentation puts it, "a toolbox". It generates surrogate data using simulated annealing. It needs to be supplied with three functions:

the cost function
the cooling function -- how the temperature decreases
the permutation function -- what to change every step

So currently if the user wants their own version of any of the functions above the user needs to write it in FORTRAN. My goal for this project would be to allow the user to write (use) their own octave function. The SA algorithm is an iterative method so using Octave code is not a good idea (as each line must be parsed when using for or while loops). As far as I understand the samin routine from the optim package will not suffice as it does not generate surrogate data, and has fewer options. Due to the size of this function it might take me some time to complete it.

I plan to tackle this problem as follows: I will rewrite in C++ the equivalent function to randomize_auto_exp_random and then try to refactor and modify the code to accept other functions. I plan to include all of the functions that are available in TISEAN in the Octave package, either through rewriting them or through linking to them. And I would like to make it easy for new functions to be added.

Further reading on randomize is available on the TISEAN documentation in the General Constrained Randomization, Surrogates paper Appendix, randomize description and randomize function extension description.

piątek, 26 czerwca 2015

Progress report

The main goal of this post will be to create a progress report before the coming midterm assessment.
As I mentioned before I planned to complete the Dimensions and entropies section of the TISEAN documentation. This seems to be still a realistic goal.

Currently I have ported d2, av-d2, c2g, c2t along with writing documentation and demos for them. The current state of the tests needs improvement because they rely heavily on external files generated using the corresponding TISEAN programs. Because most of those functions/programs are closely linked I plan to improve on this feature once functions from the entire section are ported.

Currently I am working on c1 which already passes it's test. Once I complete it and write the documentation and demo, the only programs/functions that will need to be ported are boxcount and c2d. Once they are complete I plan to release version 0.2.0.

My elaborated proposal located on the octave wiki states that I planned to also port c2. Although the source code for such a program does exist in the TISEAN package (ver. 3.0.1), it does not seem to be mentioned in the documentation. Furthermore, installing the package on a computer does not give access to this program. Also, it seems to be redundant with other programs in the package. Therefore, I will not port it.

niedziela, 14 czerwca 2015

Improving the code

TISEAN was originally written as a command line set of programs. Because of this all the code is not very portable and many variables are global ones. So far this has been dealt with by creating local variables and extending the number of variables in function calls (in some cases up to 11). This is not optimal for code clarity, ease of maintaince and because many of the variables are passed as values (they are parameters) it also caused a slight slowdown in execution speed.

Due to all of these downsides I have contemplated possible solutions which I will attempt to describe.

Using structs

One idea that came to mind is to pack all of the global variables into a struct and pass the struct to all of the functions and obtain the global variables from this struct. This solution certainly solves the problem of passing so many parameters to functions. However, it does not improve code portability because every *.cc oct-file function needs its own struct. This solution is also problematic because all of the names of global variables have to be referenced now through the struct so center[i][j] would become parameter_struct.center[i][j] (obviously the name of the struct could be as short as p).

Using classes

Another quite simple solution would be to create a class. This class would have data members that were previously the global variables and function members that were the old functions called from old main(). As there are similarities between different TISEAN programs, it could be possible to even create a prototype class and inherit from it.

There are however downsides to this option as well. First of all, Octave code guidelines specifies that classes should be in separate files. This would mean creating 2 more files for each program that was ported using the C++ wrapper. Apart from that, the memory might have to allocated using new/delete, because the preferred method of using the macro OCTAVE_LOCAL_BUFFER might be difficult (or impossible) to apply to this case. This objection can be worked around in other ways, such as using Array classes to allocate the data and then get a pointer to them using fortran_vec().

Summary

Performing the aforementioned code improvement, although helpful, is not critical. Therefore any attempts to implement it will be deferred until after the functions outlined at the beginning of the project are complete.

Timeline update

So far I have been giving progress reports on the TISEAN porting project. This time, however, I would like to also compare the outlined schedule for the project with the actual progress made.

Since the last post I have additionally ported:

xzero
lyap_r
lyap_k
lyap_spec

In one of my first posts I stated that I would like to finish Dimensions and Entropies before the midterm assessment. Currently I have finished up Lyapunov Exponents and I plan to start working on Dimensions and Entropies this week. Since there are 2+ weeks to the Midterm Assessment I believe it is possible to complete all of the goals for this section of the project as planned.

niedziela, 7 czerwca 2015

Analyzing lfo-run

I have written tests that compare lfo-run from TISEAN to the ported version lfo_run. The test that uses amplitude.dat works perfectly, but when I analyzed the results both programs/function gave for henon (Henon Maps) I ran into some problems. I will attempt to describe them.

Input data

The problems occur when analyzing a 1000 element Henon map (henon(1000)). For all of the implementations if I used a simple call with default parameters (m = 2, d= 1) the programs would quit due to a matrix singularity. The problems arose when (m = 4, d= 6) was used. With these parameters the program gave various results for various implementation methods.

It is important to note that the prediction that I was testing tried to predict 1000 future elements (default for all implementations) on the basis of given 1000 elements.

Implementations

There are 3 implementations I used:

The TISEAN implementation (uses lfo-run)
The implementation similar to 1. but compiled as c++ and wrapped in enough code to run as m-file (uses __lfo_run__ and invert_matrix())
The implementation that uses Matrix::solve() method

I tried to find out if maybe method 1. and 2. do not differ due to a bug that was introduced while porting. I therefore ported it twice (the second time to a very rudimentary stage) and both times the same results were encountered. I do not understand why there is a discrepancy between between these two implementations.

Discrepencies

Since the goal of this project is to port TISEAN functions I have compare implementation 1. with 2. and 1. with 3, to see what differences I come across.

A comparison between implementation 1. and 2. results in an error from implementation 2. The function generates about 700 elements (of the default 1000) and then throws an error that the forecast has failed.

The comparison between implementation 1. and 3. is much more fruitful as the results are the same for about 150 elements and then they begin to differ (see Fig. 1.)

Fig. 1 Comparison between the TISEAN implementation and using Matrix::solve() in TISEAN package from Octave

These results can be achieved by cloning the repo, doing make run and running the script:

cd tests/lfo_run/; test_lfo_run

Analysis

Before I give my suggestions for what is the cause for these discrepancies I would like to discuss another interesting discrepancy. This discrepancy is the maximum difference between the solution of the equation system obtained from implementation 2. and 3. When using implementation 3. for the forecast this value was 8.5e-14, but when using implementation 2. for the forecast this difference was 5e-13.

I believe this is because the computational error is accumulated throughout the program. Each new forecast point is dependent on the previous ones. Moreover the Kahan algorithm (compensated summation) is never used in the TISEAN implementations. Even matrix multiplication (as seen e.g. in multiply_matrix()) uses the simple, but error accumulating for(...) sum += vec[i].

As to why implementation 1. and 2. give different results I have two theories: either there still is a bug which I was unable to detect, or some compilation difference (e.g. a linked library) between the TISEAN program written in C (lfo-run) and the TISEAN package function written in C++(__lfo_run__).

Summary

The question that poses itself is whether this warrants a rewriting of other TISEAN function that use the simple summation, or if this problem can be ignored. The authors of TISEAN said in their introduction that blindly using the programs they wrote may result in unintended or even wrong results. Trying to predict 1000 elements of a 1000 element Henon Map using first order local linear prediction might be considered a bad use-case.

Progress report

Since the last post I wrote a tutorial on http://wiki.octave.org/TISEAN_package. I also ported:

lzo_gm
lzo_run
ikeda
lfo_run
lfo_ar
lfo_test
rbf
polynom

During my work I discovered that polynom has similar functions (polynomp, polypar, polyback) which provide extra options for performing polynomial fits. I will not include those functions in the project now, but they have a high priority once I finish all of the functions I outlined for this project.

These newly ported functions aren't completely polished (some need demos and documentation) but they pass tests and don't have memory leaks. Once I clean these functions up and port xzero the last function in this section, I intend to create version 0.1.0 of the package. With this version I intend to branch out the repo to have a 'devel' and a 'stable' branch.

Afterwards, I will add more information to the tutorials on the wiki page.

środa, 27 maja 2015

Milestone

I am happy to announce that I completed the first two sections.

Nonlinear Noise Reduction
From this section I added:

ghkss

Thus deprecating project. It is important to note that data allocation in ghkss (only) is done via new/delete because of the way the function was originally written. In the future this will be replaced with OCTAVE_LOCAL_BUFFER (OLB) macro.

Closer examination has revealed another interesting function in this section: nrlazy, which according to the documentation is similar to lazy. Because of this similarity porting it has low priority.

Phase Space Representation
From this section I added:


mutual
false_nearest
poincare

The function poincare was discovered when further studying the documentation and was omitted in the initial plan, but as it seems important in the documentation it was ported.

The function false_nearest is in a similar situation to ghkss, that is, new/delete is used instead of OLB. This will be improved on in the future.

Since corr does not need to be ported, this section is complete.

Nonlinear Prediction
This next section is well on the way as the following have been ported:


upo
upoembed
lzo_test

The function predict turned out to be essentially the same with lzo_test. This had not been verified so far, but according to the documentation they do essentially the same. The additional option that predict has (flag -v) can be easily replaced using GNU Octave's std when calling lzo_test with parameter r. Therefore porting predict will be most likely unnecessary.

The function upoembed is closely associated with upo. It takes the output of upo and creates a cell array of delay vectors for each orbit produced by upo. It was not mentioned in the original outline, but it is an important function for the package.

The state of upo is not optimal. The original implementation only supported input up to 1e6 data points. This might not be a big problem as calculating upo on a 1e4 henon map takes about 8 seconds, so a 1e6 would take about 800 seconds ~ 15 min. Changing this might be problematic as the main data in the FORTAN program is in a common block (it is stored in a global variable), which cannot contain arrays of variable dimensions. The authors of TISEAN chose 1e6, but because of how the program is written making it unlimited would not be trivial. It might be beneficial to lift this number to 1e8 or 1e9 but one must keep in mind that because of how the FORTRAN program is written it will always allocate a local array of the maximum size possible (currently 1e6 elements). If the maximum input length is lifted to 1e8 or 1e9 the size of the data allocated by Octave and the local copy that the FORTRAN program uses can be a sizable amount of memory to allocate. Moreover it is important to note that each data point will be a real*8 (not a real*4). This brings me to the next point.

FORTRAN data types
This topic has been problematic for me from the very beginning. I had trouble realizing just how the dummy variables work. After some research I found that one can pass -freal-4-real-8 and promote all real*4 to real*8. This is beneficial as the input into a FOTRAN program is passed as doubles. This caused serious issues as I needed to ensure that when I call any TISEAN or SLATEC function/subroutine I needed to use real*4 instead of real*8. The solution I originally used was to copy the input variables into local variables. Apart from eliminating potential bugs and improving code complexity the previously mentioned flag also allows the FORTRAN programs to have the same type of precision expected from GNU Octave programs.

Other things
I spent my time also on other things. I significantly changed the Makefiles, removed all compilation warnings and everywhere except for ghkss and false_nearest moved from new/delete to OCTAVE_LOCAL_BUFFER or some type of Array.

czwartek, 20 sierpnia 2015

Summary of project

What happened to randomize?

Other thoughts

niedziela, 19 lipca 2015

Plans for randomize

czwartek, 9 lipca 2015

Progress report and plans

The plans

piątek, 26 czerwca 2015

Progress report

niedziela, 14 czerwca 2015

Improving the code

Using structs

Using classes

Summary

Timeline update

niedziela, 7 czerwca 2015

Analyzing lfo-run

Input data

Implementations

Discrepencies

Analysis

Summary

Progress report

środa, 27 maja 2015

Milestone

What happened to `randomize`?