Introduction to Black Box Modeling in Process Industry: Use Your Data to Build Simple Models that Will Work
Black box or experimental modeling is a method for the development of models based on process data.
Since physical modeling is usually very time-consuming, black box modeling is a popular method for gaining insight into the overall (input–output) process behavior. The developed models are usually used to predict future process values or in process control applications.
Important aspects of empirical modeling are data pre-conditioning, model complexity, model linearity and model extrapolation. The various black box or empirical modeling techniques are: partial least squares modeling, time series modeling, neural network modeling, fuzzy modeling and neuro-fuzzy modeling.
Deterministic models are designed based on the chemical and physical balances and mechanisms of the process, and consequently the model describes the internal functional behavior. Black-box models, on the other hand, are designed based on the input–output behavior of the process, and consequently the model describes the overall behavior. A black-box model consists of a certain structure whose parameters are determined from experimental results. Black-box models are therefore often called experimental models. The main properties of a black-box model are its structural characteristics: level of detail, degree of non-linearity, and the way in which the dynamics are composed.
Process behavior is usually non-linear. Whether or not the empirical model to be developed should also be non-linear depends on the operating range in which the model will be used.
If the process is controlled and the operating range is small, a linear process model may be an adequate approximation of reality.
The application of the model will determine whether the model needs to be dynamic or static. For control and prediction type applications, models are usually dynamic.
If the process conditions vary over a wide range, there may be a need for a non-linear empirical model. In case of a dynamic non-linear model, there are a few possibilities for developing such a model, for example, a dynamic neural network or a dynamic fuzzy model.
If the empirical dynamic model is linear, one could use for example a time series model. Different model types are available and will be discussed in a subsequent chapter. If the model is linear and static one could, for example, use partial least squares modeling. State space modeling can be used in its linear or non-linear form, depending on the situation.
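As a minimal sketch of a linear dynamic (time series) model, the snippet below fits a first-order ARX structure, y[k] = a*y[k-1] + b*u[k-1], by ordinary least squares. The model order, the function name `fit_arx` and the synthetic process values (0.8, 0.5) are assumptions chosen for illustration; real plant data would normally call for order selection and validation.

```python
import numpy as np

def fit_arx(u, y):
    """Fit y[k] = a*y[k-1] + b*u[k-1] by ordinary least squares."""
    # Each regressor row holds the previous output and the previous input
    phi = np.column_stack([y[:-1], u[:-1]])
    target = y[1:]
    theta, *_ = np.linalg.lstsq(phi, target, rcond=None)
    return theta  # [a, b]

# Synthetic data from a known first-order process (hypothetical values)
rng = np.random.default_rng(0)
n = 500
u = rng.normal(size=n)
y = np.zeros(n)
for k in range(1, n):
    y[k] = 0.8 * y[k - 1] + 0.5 * u[k - 1] + 0.01 * rng.normal()

a_hat, b_hat = fit_arx(u, y)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

The estimates should land close to the values used to generate the data, because the data is rich in information (a random input) and the noise is small.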
In all cases, the availability of a sufficient number of data points and the quality of the data determine how good the resulting model will be. If industrial process data is collected, one can be almost certain that sections of data will be missing or that faulty data will be included in the data set. In that case, one needs to pay considerable attention to data pre-conditioning.
Modeling steps
The phases in black-box modeling are:
- System analysis - This phase is similar to deterministic modeling. During this phase the goals and the requirements of the model are formulated and the boundaries are determined. Often, it is already necessary to consider the three most important characteristics of a black-box model: level of detail, degree of non-linearity of the process and the order of the dynamics.
- Data conditioning - Black-box modeling is also called experimental modeling: the models are based on the available data, which forms the starting point of the design. Data mining and conditioning are therefore a crucial initial step.
- Key variable analysis - The choice of the input and output variables can be based on process insight, but can also be determined by a sensitivity analysis. It is important that the input space is not too large and that the variables are, as far as possible, mutually independent. The number of variables can be reduced by orthogonalization.
- Model structure design - When a selection for a certain type of black-box model has been made, choices have to be made about handling model-order and non-linearity. The number of parameters is a good measure of the level of detail that will be obtained. The level of detail should be balanced with the amount and the quality of the data available.
- Model identification - In this step the model is fitted to the measured data. Usually, the error between model and reality is minimized.
- Model evaluation - The model is tested by means of special test data sets to determine whether the model has sufficient capacity to predict stationary and dynamic behavior. Most black-box models show good interpolation properties. However, their extrapolation qualities are mostly limited and should be tested if required.
Data preconditioning
After process analysis, data preconditioning is the first important step in the modeling process.
If faulty data is used or data is insufficiently rich in information, a poor model will result.
An important tool that can be used is principal component analysis. This tool gives a good indication of whether there are abnormal points in the data sets. Bad data points should be removed from the data set.
It often happens that collected data is bad. In the case of analyzer readings, for example, the gas chromatograph may have been out of service for some period of time. This situation is treated in the same way as any other in which bad data points have to be removed.
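One possible way to use principal component analysis for spotting abnormal points is to look at the squared prediction error (SPE) of each sample with respect to a low-dimensional PCA model. The sketch below is an illustration under assumptions: the function name `pca_spe`, the synthetic three-variable data set and the injected faulty sample are all hypothetical, and ranking by SPE is only one of several common diagnostics.

```python
import numpy as np

def pca_spe(X, n_components):
    """Squared prediction error of each sample with respect to a PCA model."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # autoscale each variable
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_components].T                  # loadings of the retained components
    residual = Z - Z @ P @ P.T               # part not explained by the PCA model
    return np.sum(residual**2, axis=1)

# Three strongly correlated variables driven by one underlying factor
rng = np.random.default_rng(1)
t = rng.normal(size=200)
X = np.column_stack([
    t + 0.05 * rng.normal(size=200),
    2.0 * t + 0.05 * rng.normal(size=200),
    -t + 0.05 * rng.normal(size=200),
])
X[50] = [0.0, 5.0, 0.0]  # injected faulty sample that breaks the correlation

spe = pca_spe(X, n_components=1)
suspect = int(np.argmax(spe))              # sample that fits the model worst
X_clean = np.delete(X, suspect, axis=0)    # remove it from the data set
print(suspect, spe[suspect] / np.median(spe))
```

In practice one would inspect every sample with an unusually large SPE rather than only the largest, and check the process history before discarding data.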
Selection of Independent Model Variables
The first thing to do is to make a distinction between process input variables and state variables. The process input variables (usually flows) affect the process state variables (usually pressure, level, temperature, concentration) which in turn have an impact on the measured process output such as a quality (for example viscosity). One should then ask oneself the question: what part of the process do I want to model: the relationship between the process inputs and the quality variable or the relationship between the state variables and the measured quality variable or the relationship between the process inputs and the state variables? It would not be logical to model the quality variable as a function of both the process inputs and the state variables.
Furthermore, the state variables are usually not mutually independent, for example temperature, pressure and composition. This means that a model that includes the state variables as additional inputs alongside the process inputs models the effect of the true process inputs multiple times. Therefore a proper selection of the inputs and output(s) of the model is very important.
An effective way to reduce the number of inputs is to transform the input space of mutually dependent variables into a space of independent variables by principal component analysis (PCA).
This space is usually much smaller. This transformation forms the basis of the construction of partial least squares (PLS) models. Many packages for neural-network design also include a PCA toolbox.
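The transformation described above can be sketched with a singular value decomposition. This is a minimal illustration of PCA only, not of PLS; the function name `pca_reduce`, the 95 % variance threshold and the synthetic data (five measured variables driven by two underlying factors) are assumptions.

```python
import numpy as np

def pca_reduce(X, var_to_keep=0.95):
    """Project autoscaled X onto the fewest components covering var_to_keep."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)          # variance fraction per component
    k = int(np.searchsorted(np.cumsum(explained), var_to_keep)) + 1
    k = min(k, len(s))
    scores = Z @ Vt[:k].T                    # new, mutually uncorrelated inputs
    return scores, explained, k

# Five mutually dependent variables generated from two independent factors
rng = np.random.default_rng(4)
T = rng.normal(size=(300, 2))
W = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0, 1.0]])
X = T @ W + 0.05 * rng.normal(size=(300, 5))

scores, explained, k = pca_reduce(X)
print(k, np.round(explained, 3))
```

The score columns can then be used as model inputs in place of the original variables.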
Model Linearity
If possible, develop a linear model with a low number of model parameters for any particular application. Only if this is not possible should one resort to other options.
It is good to realize that a fuzzy model that is used for non-linear modeling is a combination of several linear models. The non-linear relationship between the process inputs and the process output is fitted by a piecewise approximation of a series of linear models. This is attractive if one still would like to retain some feel for the actual meaning of the developed model. Often, for fuzzy models, one linear model is valid for a particular operating region. The overall fuzzy model combines the individual linear models in an intelligent way to describe the nonlinearity in the process. This is one of the reasons that fuzzy models generally provide more insight into the process than neural networks, which we consider to be true black box models.
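The idea of combining local linear models can be sketched as follows. This is an illustrative blend in the style of a Takagi-Sugeno fuzzy model, which is an assumption about the model type; the two rules, their centers, the Gaussian membership functions and the `WIDTH` value are hypothetical and not fitted to any data.

```python
import numpy as np

# Two hypothetical local linear models y = a*x + b, one per operating region
RULES = [
    {"center": 2.0, "a": 1.0, "b": 0.0},     # "low" operating region
    {"center": 8.0, "a": 3.0, "b": -10.0},   # "high" operating region
]
WIDTH = 2.0  # spread of the Gaussian membership functions (assumed)

def fuzzy_predict(x):
    """Blend the local linear models, weighted by membership in each region."""
    x = np.asarray(x, dtype=float)
    # Log-memberships, shifted by their maximum for numerical stability
    logw = np.array([-0.5 * ((x - r["center"]) / WIDTH) ** 2 for r in RULES])
    logw = logw - logw.max(axis=0)
    w = np.exp(logw)
    w = w / w.sum(axis=0)                    # weights sum to one
    local = np.array([r["a"] * x + r["b"] for r in RULES])
    return (w * local).sum(axis=0)

for x in (0.0, 5.0, 10.0):
    print(x, float(fuzzy_predict(x)))
```

Near each center the output follows the corresponding local model almost exactly, which is what keeps the individual rules interpretable.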
Another efficient way of developing non-linear models is to include non-linear terms in the model. In order to do this effectively, one must have some understanding of the relationship between the input and output variables. If, for example, one knows that a variable y depends on x and x², then one could use x and x² as basis variables and develop a linear relationship between y, x and x². This is much more efficient than using y as an output variable and x as an input variable and developing a non-linear model by using, for example, fuzzy logic or neural network modeling.
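A short sketch of this approach: using x and its square as basis variables turns the problem into ordinary linear least squares. The synthetic data and the coefficient values (1.0, 2.0, -0.5) are assumptions made for the example.

```python
import numpy as np

# Synthetic data from a process that is quadratic in x (hypothetical values)
rng = np.random.default_rng(2)
x = np.linspace(0.0, 5.0, 60)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.05 * rng.normal(size=x.size)

# Basis variables: a constant, x and x squared.
# The model is non-linear in x but linear in its parameters.
basis = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
print(np.round(coef, 2))  # estimated constant, x and x-squared coefficients
```

Only three parameters are estimated, and each has a direct interpretation.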
Model extrapolation
Whether a model is good enough for the purpose it has been developed for can be judged by testing the model on a validation data set that contains data within the range of data the model was developed for. Another important issue is model extrapolation: the use of the model for a data set that contains data points that fall outside the region of the data set used to develop the model. If the process behaves in a linear fashion and a linear model is developed, one may expect the model to possess good extrapolation capabilities. If process behavior is non-linear and a fuzzy model was developed, one (linear) rule will be used for extrapolation of the model values outside the operating range.
A linear prediction of a strongly non-linear phenomenon will never be exactly correct.
A neural network, which is a true non-linear model description, may provide better results, although this is not guaranteed. This will depend on many factors, such as the number of data points near the boundary of the operating region for which the model was developed, and the non-linearity that the model tries to fit at the boundary versus the non-linearity of the phenomenon that is modeled.
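The limits of linear extrapolation can be illustrated with a small numerical experiment. The `process` function below is a hypothetical non-linear process, and the ranges 0 to 2 (development) and 2 to 4 (extrapolation) are assumptions chosen only to show the effect.

```python
import numpy as np

def process(x):
    """Hypothetical non-linear process, used only for illustration."""
    return np.exp(0.5 * x)

# Develop a linear model on a limited operating range
x_dev = np.linspace(0.0, 2.0, 50)
slope, intercept = np.polyfit(x_dev, process(x_dev), deg=1)

def linear_model(x):
    return slope * x + intercept

def max_abs_error(x):
    return float(np.max(np.abs(linear_model(x) - process(x))))

err_inside = max_abs_error(np.linspace(0.0, 2.0, 50))   # interpolation
err_outside = max_abs_error(np.linspace(2.0, 4.0, 50))  # extrapolation
print(f"inside: {err_inside:.3f}, outside: {err_outside:.3f}")
```

The error outside the development range grows quickly, which is why extrapolation should be tested explicitly whenever the model will be used there.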
Model evaluation
Model structure, as well as the model parameters, can be empirically identified from plant data. Structural considerations concern the order of the parametric model. The model is usually identified by minimizing the error between process data and model prediction.
For the process industries, low order models will suffice in many instances, especially when the process is kept at a certain operating point.
In most cases, part of the available data is used for model development; this data set is called the training set. Another part is used for model validation and is called the validation set. A criterion should be defined that indicates how good the model fit is.
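A minimal sketch of such a split and of two commonly used fit criteria, the root mean square error (RMSE) and the coefficient of determination (R²), is shown below. The 70/30 random split, the synthetic static data and the straight-line model are assumptions; for dynamic data a split into contiguous time segments is usually more appropriate.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between measurements and predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Fraction of the output variance explained by the model."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic static data (hypothetical process)
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 10.0, size=200)
y = 4.0 + 1.5 * x + 0.2 * rng.normal(size=200)

# Random 70/30 split into a development set and a validation set
idx = rng.permutation(x.size)
n_dev = int(round(0.7 * x.size))
dev, valid = idx[:n_dev], idx[n_dev:]

# Fit on the development set only, then evaluate on the validation set
slope, intercept = np.polyfit(x[dev], y[dev], deg=1)
y_valid_pred = slope * x[valid] + intercept

print(f"RMSE = {rmse(y[valid], y_valid_pred):.3f}")
print(f"R^2  = {r_squared(y[valid], y_valid_pred):.4f}")
```

Reporting the criterion on the validation set rather than on the development set gives a more honest picture of how the model will perform on new data.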
More information about tools that can be used for black box modeling and big data analysis can be found in this article.