A DIALOG PRESENTATION OF CENSUS RESULTS BY MEANS OF THE PROBABILISTIC EXPERT SYSTEM PES

JIŘÍ GRIM

Department of Computer Aided Decision-Making and Control, Institute of Information Theory and Automation, Czechoslovak Academy of Sciences, Pod vodárenskou věží 4, 18208 PRAGUE 8, Czechoslovakia

ABSTRACT

This paper suggests a qualitatively new method of presenting census results by means of the probabilistic expert system PES. The knowledge base of PES includes a statistical model of the original data which reproduces any marginal probability distribution, conditional or unconditional, with a reasonable accuracy depending on the corresponding subpopulation size. The resulting data compression should enable an easy distribution of the final software product - possibly on a single diskette.

1 Introduction

Extensive data sets have become indispensable as a basis for modern decision making in many spheres of social life. One of the most important data sets arises from the census. It is unique in its extreme extent, covering the whole population, and also in its correspondingly extreme costs. Because of its substantial importance the census is periodically repeated in many countries despite strong criticism from the point of view of possible misuse of private data. On the other hand, because of the necessary data protection measures, the availability of information contained in census data is rather limited. In some respects the census yields global values like the number of citizens, but most questions of the census paper suggest a few alternative answers. The numerically answered questions may be discretized by introducing suitable intervals. The census paper may also relate to objects like households, buildings etc. in a complicated way. For this reason we assume in the following that the questionnaire relates to a single subject and that the variables are transformed to discrete type. Possible other subjects of the census are assumed to be treated separately on the basis of the corresponding subsets of variables.

11th European Meeting on Cybernetics and Systems Research '92, pp. 997-1004, Ed.: R. Trappl, World Scientific, Singapore 1992 (Vienna, AT, 21.04.1992-24.04.1992)

For each question (variable) the fundamental result of the census is given by the global relative frequencies of individual answers (values). The table of relative frequencies can be viewed as an estimate of the corresponding unconditional marginal probability distribution and displayed as a histogram. To estimate conditional probability distributions we can compute analogous histograms for different subpopulations, e.g. men, women, nationalities, regions etc. Generally, any combination of answers to different questions can be used as a criterion (condition) for the choice of a subpopulation which, in turn, may be characterized by histograms of the remaining, unspecified questions. Any histogram (conditional distribution) characterizing such a subpopulation formally corresponds to a possible user query and can be estimated by means of relative frequencies - with some accuracy depending on the subpopulation size.

The number of possible queries which can be formulated over a census questionnaire may become exceedingly large. Thus e.g. in the case of 100 questions with 4 alternative answers we can consider 100 unconditional "global" histograms, 39600 conditional histograms corresponding to subpopulations specified by one answer and 125126400 conditional histograms for subpopulations specified by two answers. As the subpopulations specified by three or four answers may still be quite plausible (even if partly empty), it is obvious that only a small portion of the potentially interesting histograms could be printed.

The problem of accessibility of census results can hardly be solved by some kind of choice since potential users may formulate very special and diverse queries. One of the present approaches relies upon technical means. Aggregate results in the form of contingency tables are stored in a high capacity memory which is accessible from a computer network. However, even in this case only small order tables (e.g. 6-10 variables) can be stored because of technical limitations. Thus, any member of the network may directly recall any conditional histogram as long as the number of involved variables does not exceed the size of the corresponding table. Despite technical problems and possibly high costs this "direct" approach could retain its importance in the future.

An alternative solution is enabled by the recently developed probabilistic expert system PES [2,3]. Instead of contingency tables the census results are described in a highly compressed form by a multivariate discrete probability distribution which is included in the knowledge base. Using PES the customer can compute an estimate of any conditional or unconditional histogram without any further contact with the central data base.

2 The probabilistic expert system PES

Considering the probabilistic approach to expert systems we assume that input and output information is expressed in terms of some discrete (finite valued) random variables

$$v_1, v_2, \dots, v_N ; \qquad v_n \in X_n . \tag{1}$$

The certainty degree (truth value, degree of belief) that a random variable $v_n$ has a value $x_n \in X_n$ is assumed to be given by the corresponding probability $P\{v_n = x_n\}$. In this sense the uncertainty of a variable $v_n$ is characterized by a discrete probability distribution

$$P\{v_n = x_n\} = p_n(x_n) ; \quad x_n \in X_n ; \qquad \sum_{x_n \in X_n} p_n(x_n) = 1 \tag{2}$$

which can be viewed as a histogram having $|X_n|$ columns. Because it is both informative and simple, the histogram is the fundamental means of communication between the user and the expert system (ES).

The purpose of an ES usually consists in evaluating some output variables $v_{k+1}, v_{k+2}, \dots, v_N$ (goals) given a knowledge base and some input variables $v_1, v_2, \dots, v_k$ (questions). From a more general probabilistic point of view, we have to derive the distribution of the output vector $v_B = (v_{k+1}, v_{k+2}, \dots, v_N)$

$$P\{v_B = x_B\} = P^\star_B(x_B) ; \quad x_B = (x_{k+1}, x_{k+2}, \dots, x_N) \in X_B ; \quad X_B = X_{k+1} \times X_{k+2} \times \dots \times X_N \tag{3}$$

which corresponds to a given distribution of the input vector $v_A = (v_1, v_2, \dots, v_k)$:

$$P\{v_A = x_A\} = P^\star_A(x_A) ; \quad x_A = (x_1, x_2, \dots, x_k) \in X_A ; \quad X_A = X_1 \times X_2 \times \dots \times X_k . \tag{4}$$

In this case the probabilistic knowledge base is fully described by the system of conditional distributions on $X_B$

$$\Pi_{B|A} = \{ P_{B|A}(x_B | x_A) ; \; x_B \in X_B ; \; x_A \in X_A \} \tag{5}$$

since for any input distribution $P^\star_A(x_A)$ we can write

$$P^\star_B(x_B) = \sum_{x_A \in X_A} P_{B|A}(x_B | x_A) \, P^\star_A(x_A) ; \qquad x_B \in X_B . \tag{6}$$

The system of conditional distributions (5) defines a so-called memoryless information channel with noise, and the formula of complete probability (6) represents an exact inference mechanism for the considered problem. Obviously, for another choice of input and output variables we would need other classes of conditional distributions.

To avoid a difficult direct design of the conditional distributions (5), the knowledge base of PES is defined as a joint probability distribution of the involved variables - in the form of a finite mixture (weighted sum) of product components

$$P(x) = \sum_{m=1}^{M} w_m F(x | m) ; \qquad F(x | m) = \prod_{n=1}^{N} p_n(x_n | m) ; \tag{7}$$

$$\sum_{m=1}^{M} w_m = 1 ; \qquad X = X_1 \times X_2 \times \dots \times X_N .$$

Here $w_m \geq 0$ is the a priori weight of the $m$-th component and $p_n(x_n | m)$ is a conditional discrete distribution of the variable $v_n$. Let us note that this class of distributions is complete in the sense that any multivariate discrete distribution can be expressed in the form (7).

An important feature of the finite mixture (7) is a simple switch-over to any marginal distribution by deleting superfluous terms in the products $F(x | m)$. If we denote by $P_A(x_A)$, $P_B(x_B)$ the marginal distributions of the mixture (7) corresponding to the input and output vectors $v_A$, $v_B$ respectively, we can write

$$P_A(x_A) = \sum_{m=1}^{M} w_m F_A(x_A | m) ; \qquad F_A(x_A | m) = \prod_{n=1}^{k} p_n(x_n | m) ; \tag{8}$$

$$P_B(x_B) = \sum_{m=1}^{M} w_m F_B(x_B | m) ; \qquad F_B(x_B | m) = \prod_{n=k+1}^{N} p_n(x_n | m) . \tag{9}$$

Consequently, for any choice of the input and output variables, the conditional distributions (5) may be computed by the formula

$$P_{B|A}(x_B | x_A) = \frac{P(x)}{P_A(x_A)} = \sum_{m=1}^{M} W_m(x_A) \, F_B(x_B | m) ; \qquad (P_A(x_A) > 0) \tag{10}$$

where $x = (x_A, x_B)$ and $W_m(x_A)$ are the component weights corresponding to a given input vector $x_A \in X_A$:

$$W_m(x_A) = \frac{w_m F_A(x_A | m)}{\sum_{j=1}^{M} w_j F_A(x_A | j)} ; \qquad m = 1, 2, \dots, M . \tag{11}$$

The conditional distribution (10) corresponds to the definite input $v_A = x_A \in X_A$. In the case of an uncertain input represented by a distribution $P^\star_A(x_A)$ (generally $P^\star_A(x_A) \neq P_A(x_A)$) we have to substitute (10) into the formula of complete probability (6):

$$P^\star_B(x_B) = \sum_{x_A \in X_A} P_{B|A}(x_B | x_A) \, P^\star_A(x_A) = \sum_{m=1}^{M} W^\star_m F_B(x_B | m) \tag{12}$$

whereby

$$W^\star_m = \sum_{x_A \in X_A} W_m(x_A) \, P^\star_A(x_A) . \tag{13}$$

Again, any marginal output distribution is easily obtained by reducing the component products $F_B(x_B | m)$ in Eqs. (10), (12).
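The inference formulas (10)-(13) can be illustrated by a minimal Python sketch. It is not the PES implementation: the toy knowledge base (two components, three binary variables), the function names and the nested-list layout `comps[m][n][x]` are assumptions made for illustration only. For the uncertain input of Eq. (13) the sketch additionally assumes that the input distribution factorizes into independent per-variable histograms, which is a special case of the general distribution on $X_A$ considered in the text.

```python
from itertools import product
from math import prod

# Hypothetical toy knowledge base: M = 2 components, N = 3 binary variables.
# weights[m] = w_m, comps[m][n][x] = p_n(x | m).
weights = [0.6, 0.4]
comps = [
    [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]],
    [[0.2, 0.8], [0.3, 0.7], [0.4, 0.6]],
]

def conditional_weights(weights, comps, evidence):
    """Eq. (11): component weights W_m(x_A) for a definite input.

    evidence maps a variable index n to its observed value x_n."""
    joint = [w * prod(comp[n][x] for n, x in evidence.items())
             for w, comp in zip(weights, comps)]
    total = sum(joint)  # marginal probability of the evidence, Eq. (8)
    if total <= 0.0:
        raise ValueError("evidence has zero probability under the model")
    return [j / total for j in joint]

def histogram(mix_weights, comps, n):
    """Univariate marginal of Eq. (10) or (12) for the output variable n."""
    size = len(comps[0][n])
    return [sum(W * comp[n][x] for W, comp in zip(mix_weights, comps))
            for x in range(size)]

def uncertain_weights(weights, comps, input_dists):
    """Eq. (13) for an uncertain input, ASSUMED here to factorize into
    independent histograms input_dists = {n: [probabilities]}."""
    names = sorted(input_dists)
    result = [0.0] * len(weights)
    for values in product(*(range(len(input_dists[n])) for n in names)):
        p_star = prod(input_dists[n][x] for n, x in zip(names, values))
        if p_star == 0.0:
            continue
        W = conditional_weights(weights, comps, dict(zip(names, values)))
        result = [r + p_star * w for r, w in zip(result, W)]
    return result

# Histogram of the third variable in the subpopulation with v_1 = 1.
W = conditional_weights(weights, comps, {0: 1})
print(histogram(W, comps, 2))
```

Note that a query costs only a few multiplications per component and per involved variable; the enumeration over $X_A$ in `uncertain_weights` grows exponentially with the number of uncertain inputs and is shown only because the toy example is small.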

3 Estimation of the knowledge base of PES from data

In practice the components of the mixture (7) may correspond e.g. to different situations or diagnoses and can be designed in cooperation with experts. This approach was applied to a problem of perinatological screening [4,5]. An alternative possibility is to compute the knowledge base directly from data by means of the EM algorithm. This is of particular importance in the case of the census.

We shall assume that the census data

$$S = \{ x^{(1)}, x^{(2)}, \dots, x^{(K)} \} ; \qquad x^{(k)} \in X \tag{14}$$

can be interpreted as a set of independent realizations of the random vector $v$ with some unknown distribution of the form (7). To estimate the unknown parameters $W$, $P$

$$W = \{ w_1, w_2, \dots, w_M \} ; \qquad P = \{ p_n(\cdot | m) ; \; n = 1, 2, \dots, N ; \; m = 1, 2, \dots, M \} \tag{15}$$

the log-likelihood function

$$Q(W, P) = \frac{1}{K} \sum_{k=1}^{K} \log \Big[ \sum_{m=1}^{M} w_m F(x^{(k)} | m) \Big] \tag{16}$$

can be maximized by means of the iterative EM algorithm [1], which is characterized by the following basic iteration equations:

$$q^{(t)}(m | x) = \frac{w_m^{(t)} \prod_{n=1}^{N} p_n^{(t)}(x_n | m)}{\sum_{j=1}^{M} w_j^{(t)} \prod_{n=1}^{N} p_n^{(t)}(x_n | j)} ; \qquad m = 1, 2, \dots, M ; \; x \in S ; \tag{17}$$

$$w_m^{(t+1)} = \frac{1}{|S|} \sum_{x \in S} q^{(t)}(m | x) ; \qquad m = 1, 2, \dots, M ; \tag{18}$$

$$p_n^{(t+1)}(\xi | m) = \frac{1}{\sum_{x \in S} q^{(t)}(m | x)} \sum_{x \in S} \delta(\xi, x_n) \, q^{(t)}(m | x) ; \qquad \xi \in X_n ; \; m = 1, 2, \dots, M ; \; n = 1, 2, \dots, N . \tag{19}$$

Here the delta-function $\delta(\xi, x_n)$ is equal to 1 if $\xi = x_n$ and zero otherwise. The EM algorithm converges monotonically to a local maximum of the function $Q$ in the sense that the corresponding sequence of values $Q^{(t)}$ is nondecreasing.

Let us remark that the number of components $M$ is a fixed input parameter of the EM algorithm and has to be optimized by other means. There are also several other computational problems, partly reported by different authors in the literature. First, the algorithm should be applicable to incomplete data (cf. [1]) to enable the processing of questionnaires with missing answers. Further, there is the problem of a proper choice of initial parameter values, the slow convergence of the algorithm in the final stages of computation, the existence of local maxima and others. Most of these problems have been solved successfully in a recent application of PES to opinion poll data. Despite the relatively small sample sizes which are typical for public opinion research (1500-2500 respondents and 50-200 questions), the accuracy of the computed histograms was very satisfactory. In the case of the census there are additional problems caused by the extreme size of the underlying data set.
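The iteration equations (17)-(19) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used for PES: the data are synthetic records drawn from a hypothetical two-component mixture over four binary questions, missing answers are not handled, the initial parameters are random, and all names (`em_step`, `random_dist`, `history`) are made up for the example.

```python
import random
from math import log, prod

def em_step(data, weights, comps):
    """One EM iteration for a mixture of product components.

    data: list of tuples of category indices; comps[m][n][x] = p_n(x | m).
    Returns the updated parameters and Q of Eq. (16) at the OLD parameters."""
    M, N = len(weights), len(comps[0])
    q_sums = [0.0] * M
    counts = [[[0.0] * len(comps[m][n]) for n in range(N)] for m in range(M)]
    loglik = 0.0
    for x in data:
        joint = [weights[m] * prod(comps[m][n][x[n]] for n in range(N))
                 for m in range(M)]
        total = sum(joint)
        loglik += log(total)
        for m in range(M):
            q = joint[m] / total              # Eq. (17)
            q_sums[m] += q
            for n in range(N):
                counts[m][n][x[n]] += q       # numerator of Eq. (19)
    new_comps = [[[c / q_sums[m] for c in counts[m][n]] for n in range(N)]
                 for m in range(M)]           # Eq. (19)
    new_weights = [s / len(data) for s in q_sums]   # Eq. (18)
    return new_weights, new_comps, loglik / len(data)

# Synthetic data from a hypothetical mixture (2 components, 4 binary questions).
rng = random.Random(0)
true_w = [0.5, 0.5]
true_p = [[[0.9, 0.1]] * 4, [[0.2, 0.8]] * 4]
data = []
for _ in range(500):
    m = rng.choices(range(2), weights=true_w)[0]
    data.append(tuple(rng.choices(range(2), weights=true_p[m][n])[0]
                      for n in range(4)))

def random_dist(size):
    """Strictly positive random histogram, used as an initial estimate."""
    raw = [rng.random() + 0.1 for _ in range(size)]
    return [r / sum(raw) for r in raw]

w = random_dist(2)
comps = [[random_dist(2) for _ in range(4)] for _ in range(2)]
history = []
for _ in range(30):
    w, comps, q_value = em_step(data, w, comps)
    history.append(q_value)
print(history[0], history[-1])   # Q should not decrease between iterations
```

The number of components is fixed in advance here, in line with the remark above, and a different random initialization may lead to a different local maximum.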

4 Tools of information analysis

The probabilistic expert system PES provides some tools, in a certain sense analogous to explaining mechanisms, which give insight into the knowledge base and facilitate the information analysis of variables. The simplest but very useful tool is the comparison of histograms. Since multivariate (conditional) distributions are difficult to display, they are characterized by the set of the corresponding univariate marginals, i.e. histograms. By comparing the histograms of different conditional distributions we can quickly get an idea of the underlying differences. To measure the dissimilarity of histograms we can use e.g. the formula

$$D_0 = \frac{1}{N} \sum_{n=1}^{N} \sum_{x_n \in X_n} \frac{1}{2} \, | P_n(x_n) - P'_n(x_n) | ; \qquad (0 \leq D_0 \leq 1) . \tag{20}$$

This function can be used e.g. to order the variables according to their differentiating power.

Another useful tool arises from the interpretation of the knowledge base in the sense of cluster analysis. The meaning of the estimated components is not given a priori, but the components may be interpreted additionally to define natural clusters in the original data. In this sense the EM algorithm can be viewed as a well-founded method of cluster analysis. By introducing a new artificial variable the results of cluster analysis can be made available to the user. To characterize the mutual overlap of the underlying component distributions we can use the Bhattacharyya distance

$$B(m_1, m_2) = \sum_{x \in X} \sqrt{ q(m_1 | x) \, q(m_2 | x) } \; P(x) ; \qquad q(m | x) = \frac{w_m F(x | m)}{P(x)} . \tag{21}$$

In multidimensional cases the evaluation of this formula is very tedious for obvious reasons. In the case of product components it can be transformed to the following simple form

$$B(m_1, m_2) = \sqrt{ w_{m_1} w_{m_2} } \prod_{n=1}^{N} \Big\{ \sum_{x_n \in X_n} \sqrt{ p_n(x_n | m_1) \, p_n(x_n | m_2) } \Big\} . \tag{22}$$

In order to clarify dependences between the variables we can evaluate the mutual information between an arbitrarily chosen reference variable $v_r$ and any other variable $v_n$. The computation of the informativity of variables is based on the Shannon formula

$$I(X_n, X_r) = H(X_n) - H(X_n | X_r) \tag{23}$$

where $H(X_n)$, $H(X_n | X_r)$ are the respective unconditional and conditional entropies.
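The three tools of this section, Eqs. (20), (22) and (23), can be sketched as follows. The toy mixture, the function names and the data layout `comps[m][n][x]` are illustrative assumptions, not part of PES. The brute-force function evaluates the sum over all of $X$ in Eq. (21) only to check the product form (22) on a small example; the quantity computed there is an overlap measure (often called the Bhattacharyya coefficient, here scaled by the component weights), so larger values mean more overlap. The mutual information is computed in the equivalent form $\sum p \log (p / (p_n p_r))$ and is given in nats.

```python
from itertools import product
from math import log, prod, sqrt

# Hypothetical toy mixture: weights[m] = w_m, comps[m][n][x] = p_n(x | m).
weights = [0.6, 0.4]
comps = [
    [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]],
    [[0.2, 0.8], [0.3, 0.7], [0.4, 0.6]],
]

def dissimilarity(hists_a, hists_b):
    """Eq. (20): mean of the halved absolute differences of N histogram pairs."""
    return sum(0.5 * sum(abs(a - b) for a, b in zip(ha, hb))
               for ha, hb in zip(hists_a, hists_b)) / len(hists_a)

def bhattacharyya(weights, comps, m1, m2):
    """Eq. (22): overlap of components m1 and m2 in product form."""
    return sqrt(weights[m1] * weights[m2]) * prod(
        sum(sqrt(a * b) for a, b in zip(pa, pb))
        for pa, pb in zip(comps[m1], comps[m2]))

def bhattacharyya_brute(weights, comps, m1, m2):
    """Eq. (21): the same quantity by explicit summation over all x in X."""
    total = 0.0
    for x in product(*(range(len(d)) for d in comps[0])):
        F = [prod(c[n][v] for n, v in enumerate(x)) for c in comps]
        P = sum(w * f for w, f in zip(weights, F))
        if P > 0.0:
            q1 = weights[m1] * F[m1] / P
            q2 = weights[m2] * F[m2] / P
            total += sqrt(q1 * q2) * P
    return total

def mutual_information(weights, comps, n, r):
    """Eq. (23) for two different variables n != r, from their pairwise marginal."""
    joint = [[sum(w * c[n][a] * c[r][b] for w, c in zip(weights, comps))
              for b in range(len(comps[0][r]))]
             for a in range(len(comps[0][n]))]
    p_n = [sum(row) for row in joint]
    p_r = [sum(col) for col in zip(*joint)]
    return sum(p * log(p / (p_n[a] * p_r[b]))
               for a, row in enumerate(joint)
               for b, p in enumerate(row) if p > 0.0)

print(bhattacharyya(weights, comps, 0, 1))
print(bhattacharyya_brute(weights, comps, 0, 1))
print(mutual_information(weights, comps, 0, 2))
```

The product form needs one short sum per variable, whereas the explicit summation visits every element of $X$, which is why only the former is practical for a questionnaire with many questions.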

5 Accuracy of the knowledge base

The estimated knowledge base of PES is the source of different marginal distributions, conditional or unconditional. As the true form of these distributions is unknown, we can only compare the distributions derived from the statistical model with the corresponding contingency tables.

In the case of an opinion poll any histogram can be compared with the true relative frequencies since the original data set can be placed on the distribution diskette. Such a comparison is not possible in the case of census data for obvious reasons. Nevertheless, we can estimate the size of the related subpopulation and compute some confidence intervals. A warning can also be issued when the subpopulation size is less than a suitably chosen threshold value.

There are several off-line possibilities to characterize the accuracy of the estimated distributions globally. First, the mean absolute error of unconditional univariate marginal distributions (unconditional histograms) can be computed according to Eq. (20). An analogous formula can be used for any conditional histograms but, immediately, there is the problem of a suitable choice of the condition. Considering only the "first order" conditions we could compute the average value over all variables

$$D_1 = \frac{1}{N} \sum_{r=1}^{N} \sum_{x_r \in X_r} P^\star_r(x_r) \Big[ \frac{1}{N - 1} \sum_{n=1; \, n \neq r}^{N} \sum_{x_n \in X_n} \frac{1}{2} \, | P_{n|r}(x_n | x_r) - P^\star_{n|r}(x_n | x_r) | \Big] ; \tag{24}$$

whereby the asterisk denotes the contingency table values. However, an analogous average for "second order" or "higher order" conditions would probably be computationally infeasible. As the higher order conditions involving several variables are still of importance, the Monte Carlo method or some kind of upper bounds could also be of interest to characterize the accuracy of the statistical model empirically.
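A minimal sketch of the first-order accuracy measure of Eq. (24) is given below. It assumes that the original records are available as tuples of category indices and that the model assigns positive probability to every observed answer; the function names and the tiny data set are hypothetical and serve only to make the averaging explicit.

```python
def model_conditional(weights, comps, n, r, xr):
    """Model histogram of variable n in the subpopulation with v_r = xr."""
    joint = [w * c[r][xr] for w, c in zip(weights, comps)]
    total = sum(joint)  # assumed positive for every observed answer
    return [sum(j * c[n][x] for j, c in zip(joint, comps)) / total
            for x in range(len(comps[0][n]))]

def table_conditional(data, n, r, xr, size):
    """Relative frequencies of variable n among the records with v_r = xr."""
    rows = [x for x in data if x[r] == xr]
    return [sum(1 for x in rows if x[n] == v) / len(rows) for v in range(size)]

def d1(data, weights, comps):
    """Eq. (24): average error of all first-order conditional histograms."""
    N = len(comps[0])
    total = 0.0
    for r in range(N):
        for xr in range(len(comps[0][r])):
            p_star_r = sum(1 for x in data if x[r] == xr) / len(data)
            if p_star_r == 0.0:
                continue  # empty subpopulation contributes nothing
            inner = 0.0
            for n in range(N):
                if n == r:
                    continue
                model = model_conditional(weights, comps, n, r, xr)
                table = table_conditional(data, n, r, xr, len(comps[0][n]))
                inner += 0.5 * sum(abs(a - b) for a, b in zip(model, table))
            total += p_star_r * inner / (N - 1)
    return total / N

# Hypothetical data: every combination of two binary answers exactly once.
data = [(0, 0), (0, 1), (1, 0), (1, 1)]
exact = [[[0.5, 0.5], [0.5, 0.5]]]    # single component matching the data
biased = [[[0.5, 0.5], [0.9, 0.1]]]   # second variable modelled badly
print(d1(data, [1.0], exact), d1(data, [1.0], biased))
```

The cost of this average grows with the square of the number of variables, which suggests why the text regards the analogous second-order and higher-order averages as impractical.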

6 Conclusion

It appears that, compared with rule-based systems, probabilistic expert systems provide two major advantages: more reliable information processing and the theoretically well justified possibility of estimating the knowledge base from data. However, the exact inference mechanism is of little value if the knowledge base itself is not designed exactly. Thus, the application area of probabilistic expert systems logically moves towards large data bases with completely unknown relationships between variables. In these cases the expert system could enable the experts to get insight into the data and to formulate some conclusions, if desirable. Recently this idea was applied with promising results to opinion poll data.

The presentation of census results by means of the probabilistic expert system PES is based on a highly compressed statistical model of the original data. The knowledge base of PES is represented by an estimate of the joint probability distribution of the involved variables - in the form of a finite mixture of product components. The properties of these mixtures enable a simple computation of any conditional or unconditional marginal distribution.

Naturally, the main purpose of an expert system based on the statistical model of some data set is to facilitate the information analysis of the original data. A decision-oriented application is not excluded but is not of primary importance. This change of orientation also had some influence on the interactive tools of the probabilistic expert system PES. Instead of an explaining mechanism, the dependence between variables is of interest, measured in terms of Shannon informativity. Three types of informativity of variables can be computed to clarify their relation to an arbitrarily chosen reference variable, whereby the list of variables is automatically ordered according to the currently obtained informativity values. A simple comparison of histograms for different conditional distributions, the results of cluster analysis and other tools also appear to be very useful.

The class of the considered finite mixtures is complete in the sense that any multivariate discrete distribution can be expressed in this form. The achievable accuracy of estimates is therefore the main limiting feature of the suggested method. In the case of the census the large data set should enable a highly accurate estimation of the distribution mixture but, on the other hand, the computation could become exceedingly time consuming. The final software product is supposed to be placed and distributed on a single diskette.

REFERENCES

1. A.P. Dempster, N.M. Laird and D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm", J. Roy. Statist. Soc. B 39, 1-38 (1977).

2. J. Grim, "A probabilistic expert system", presented at the International Conference SoftStat '89, Heidelberg, 3. - 6. 4. 1989.

3. J. Grim, "Probabilistic expert systems and distribution mixtures", Computers and Artificial Intelligence 9, No. 3, 241-256 (1990).

4. J. Grim and I. Reich, "Test for early recognition of risks during pregnancy based on the probabilistic expert system PES", Seminar of the Czechoslovak Cybernetic Society, Prague, Emauzy, 15. 3. 1990.

5. J. Grim and I. Reich, "Test for early recognition of risks during pregnancy based on the probabilistic expert system PES", software demonstration at the Nationwide Conference on Application of Computers in Gynaecology, Bratislava, 3. - 4. 5. 1990.
