The goal of the AutoBayes project is to make statistical data analysis easier and more accessible to scientists by automatically synthesizing efficient data-analysis programs from statistical models, which define what information can validly be extracted from the data. Data analysis can be defined as any process that extracts more abstract information from raw data. It includes such diverse tasks as parameter estimation and curve fitting, clustering and classification, data compression, fusion of heterogeneous data sources, change and anomaly detection in time-series or image data, and image segmentation.
Although there are many approaches to data analysis, statistical data analysis is the only mathematically rigorous approach. In statistical data analysis, a statistical model defines how much information the data contain and thus how much statistically valid information can ultimately be extracted from them. This approach is standard in medical sciences such as epidemiology, where the cost of wrong conclusions can be high, and it is becoming more widespread within the fields relevant to NASA. Unfortunately, developing statistical data-analysis programs is expensive and time-consuming, and it requires expertise at the intersection of computer science, statistics, and the application domain.
AutoBayes uses a notation based on Bayesian networks to specify the statistical model. A Bayesian network is essentially a directed acyclic graph in which the nodes represent random variables, that is, the known (data) or unknown (model parameters) objects of interest, and the edges represent conditional dependency or influence. The network thus encodes the full joint probability distribution over the model. This study has extended the standard notation to handle identically distributed random variables (for example, sensor arrays or image pixels) more efficiently.
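As a minimal sketch of how such a network encodes a joint distribution, the Python fragment below multiplies one conditional probability per node. The two-node network (a hidden class `c` influencing an observation `x`), the probability values, and the names `parents`, `cpt`, and `joint_prob` are illustrative assumptions; they are not AutoBayes notation.

```python
# Hypothetical two-node network: hidden class c -> observed value x.
# Each node maps to the list of its parents (edges point parent -> child).
parents = {"c": [], "x": ["c"]}

# Conditional probability tables, keyed by (value, tuple of parent values).
# The numbers are made up for illustration.
cpt = {
    "c": {(0, ()): 0.3, (1, ()): 0.7},
    "x": {(0, (0,)): 0.9, (1, (0,)): 0.1,
          (0, (1,)): 0.2, (1, (1,)): 0.8},
}


def joint_prob(assignment):
    """Joint probability as the product of P(node | parents(node))."""
    p = 1.0
    for node, node_parents in parents.items():
        parent_values = tuple(assignment[q] for q in node_parents)
        p *= cpt[node][(assignment[node], parent_values)]
    return p


print(round(joint_prob({"c": 1, "x": 1}), 6))  # 0.56
```

Summing `joint_prob` over all four assignments gives 1, which is what it means for the graph and its tables to define a full joint distribution.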
Figure 1 illustrates a simple but typical data-analysis problem, a so-called "mixture problem": given data points assumed to have been generated from several Gaussian distributions with unknown means and variances, summarize each cluster by its most likely mean and variance.
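In symbols, the mixture problem can be written as follows. The notation is an assumption introduced here for illustration: N data points x_i, K classes, class weights \phi_k, and per-class means \mu_k and variances \sigma_k^2. Reading "most likely" as maximum likelihood is likewise only one possible interpretation.

```latex
p(x_i \mid \phi, \mu, \sigma^2)
  = \sum_{k=1}^{K} \phi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2),
  \qquad \sum_{k=1}^{K} \phi_k = 1,
\\
(\hat{\phi}, \hat{\mu}, \hat{\sigma}^2)
  = \arg\max_{\phi, \mu, \sigma^2} \prod_{i=1}^{N} p(x_i \mid \phi, \mu, \sigma^2).
```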
AutoBayes generates programs that apply known statistical techniques, for example, "Expectation Maximization," to solve the given statistical problems. For each technique, the AutoBayes program knows the conditions under which the technique is applicable, the class of data-analysis problems the technique solves, and how the technique is applied to a specific problem. The generation process takes place in the framework of automated theorem proving, guided by the applicability conditions of the statistical techniques. The theorem-proving framework lends some assurance that the generated program is correct.
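To make the named technique concrete, the following is a hand-written sketch of Expectation Maximization for a one-dimensional mixture of Gaussians. It is not output of AutoBayes and does not reproduce its generated code; the function names, the quantile-based initialization, the fixed iteration count, and the variance floor are all assumptions chosen for brevity.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of a Gaussian with mean mu and variance var at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gaussian_mixture(data, k, iterations=100):
    """Estimate weights, means, and variances of a k-component 1-D mixture."""
    n = len(data)
    ordered = sorted(data)
    mean_all = sum(data) / n
    var_all = sum((x - mean_all) ** 2 for x in data) / n
    # Initialization (an assumption): means at evenly spaced quantiles,
    # a shared overall variance, and uniform class weights.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    variances = [var_all] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each class for each data point.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint) or 1e-300  # guard against underflow to zero
            resp.append([p / total for p in joint])
        # M-step: re-estimate the parameters from the weighted data.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-300)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2
                        for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-9)  # floor avoids a collapsed class
    return weights, means, variances


# Usage on synthetic data drawn from two well-separated Gaussians.
rng = random.Random(1)
sample = ([rng.gauss(0.0, 1.0) for _ in range(200)]
          + [rng.gauss(10.0, 1.0) for _ in range(200)])
est_weights, est_means, est_variances = em_gaussian_mixture(sample, k=2)
print(est_weights, est_means, est_variances)
```

A production implementation would normally test for convergence of the likelihood instead of running a fixed number of iterations; the fixed count is used here only to keep the sketch short.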
FY99 saw the completion of the first prototype of the AutoBayes synthesis system. AutoBayes extracts the network from a textual specification (see figure 2 for a possible model of the mixture problem) and currently generates code in MATLAB or C++; other target languages can be added easily. The prototype has been tested on several textbook examples. Researchers expect to apply it to an image-processing problem arising from a project that searches for planets around other stars by looking for the dip in brightness as a planet transits in front of its star.
Point of Contact: B. Fischer
(650) 604-2977
fisch@ptolemy.arc.nasa.gov
Fig. 1. Example problem - mixture of Gaussians.
Fig. 2. Example specifications.