1. Causal structure in data

 

Suppose that (x1, x2, … , xJ) are random variables with a covariance matrix G. Suppose further that they depend on the variables (t1, t2, …) in the following way,

 

                        x1 = p11 t1 + p12 t2 + …

                        x2 = p21 t1 + p22 t2 + …

(1)                    …

                        xJ  = pJ1 t1 + pJ2 t2 + …

 

In vector/matrix notation (1) can be written as x = Pt. (When working with data, this corresponds to the rows of X = TPT.) If the t-variables are mutually uncorrelated, the equations (1) imply that we can write G = PDPT, where D is a diagonal matrix containing the variances of the t's. The number of t-variables equals the number of columns in P. This amounts to stating that the score vectors (ti) in (1) are mutually orthogonal, tiTtj = 0 for i≠j. (Note that this D may be different from the D used when decomposing the data; that D is a scaling matrix that depends on the procedure and the actual choice of scaling.) The coefficients in P (J×A, where A is the number of latent variables) tell us how the variables are related. The following table shows three examples.

 

 

 

           Ex. 1               Ex. 2             Ex. 3
 x       t1   t2   t3        t1   t2   t3        t1   t2
 x1      x    x              x    x              x
 x2      x    x              x    x              x
 x3      x    x                   x    x         x
 x4      x         x              x    x              x
 x5      x         x              x    x              x
 x6      x         x         x         x              x

Table 1. Schematic illustration of sizes of loading coefficients, P.

 

Cells in the table marked by x indicate that the corresponding loading coefficient is different from zero. The unfilled part thus consists of zeros. In the first example all variables have non-zero loadings on the first latent variable; the variables x1-x3 have non-zero loadings on the second latent variable, while x4-x6 have non-zero loadings on the third. We often see this type of structure when working with psychometric data: the first latent variable explains all the variables, while the variables can be grouped in relation to the other latent variables. In the second example each of the six variables can be explained by two latent variables. In the third example the first latent variable explains x1-x3, while the second explains x4-x6. This indicates that the two sets of variables are mutually independent.
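
As a small illustration of the relation G = PDPT, the following sketch (in Python with numpy; the loading values and variances are hypothetical) simulates data according to x = Pt with the loading pattern of Ex. 3 and compares the sample covariance matrix with PDPT.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical loading pattern of Ex. 3: x1-x3 load on t1, x4-x6 on t2
P = np.zeros((6, 2))
P[0:3, 0] = [0.9, 0.8, 0.7]
P[3:6, 1] = [0.6, 0.8, 0.9]

D = np.diag([2.0, 1.0])            # variances of the uncorrelated t's

# Simulate scores and data according to x = P t
N = 100000
T = rng.normal(size=(N, 2)) * np.sqrt(np.diag(D))
X = T @ P.T

G_model = P @ D @ P.T              # G = P D P^T
G_sample = np.cov(X, rowvar=False)
print(np.round(G_sample - G_model, 2))   # close to zero for large N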

            The non-zero coefficients can be identified as the ones that are statistically significant. We can also prescribe certain coefficients to be non-zero and estimate the loadings in relation to this pattern of non-zero loading values. In this case we study how well the loading pattern describes the variables at hand.

            The latent variables are derived as linear combinations of the original variables as shown in the following set of equations,

 

                        t1 = r11 x1 + r21 x2 + … + rJ1 xJ

(7)                    t2 = r12 x1 + r22 x2 + … + rJ2 xJ

                       

 

If the t's are uncorrelated, the equations (7) imply that D = RTGR, where D is the diagonal matrix of the variances of the t's. The equations (7) are useful when we want to study independence and conditional independence among the x-variables.
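
A minimal sketch of the relation D = RTGR, assuming the weights R are chosen as eigenvectors of G (the principal-component choice, which is only one possible choice); the covariance matrix below is hypothetical.

import numpy as np

# Hypothetical covariance matrix G of the x-variables
G = np.array([[4.0, 2.0, 1.0],
              [2.0, 3.0, 0.5],
              [1.0, 0.5, 2.0]])

# Choose the weights R as eigenvectors of G, so that t = R^T x are uncorrelated
eigval, R = np.linalg.eigh(G)

D = R.T @ G @ R                    # D = R^T G R, cf. the equations (7)
print(np.round(D, 10))             # diagonal; the diagonal holds the variances of the t's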

            Sometimes we require a special structure in the coefficient matrices P or R. An example is when we require P to be lower triangular. In that case the equations (1) can be written as

 

                        x1 = p11 t1

                        x2 = p21 t1 + p22 t2

                        x3 = p31 t1 + p32 t2 + p33 t3

(8)                    …

                        xJ  = pJ1 t1 + pJ2 t2 + …

 

This makes interpretation of the loadings simpler. E.g., x1 is proportional to t1, x2 depends on t1 and t2, and so on. Also, e.g., p33 can be interpreted as the part of x3 that cannot be described by x1 and x2. When P is lower triangular, R will also be.
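
One concrete way (an assumption for illustration, not necessarily the method used here) to obtain a lower triangular P is to take the Cholesky factor of G, so that G = PPT with unit-variance scores (D = I). The sketch below checks the interpretation of p33 as the part of x3 that cannot be described by x1 and x2.

import numpy as np

G = np.array([[4.0, 2.0, 1.0],
              [2.0, 3.0, 0.5],
              [1.0, 0.5, 2.0]])

# Lower triangular loadings via the Cholesky factor: G = P P^T (so D = I here)
P = np.linalg.cholesky(G)

# p33^2 equals the conditional variance of x3 given (x1, x2),
# i.e. the variance of the part of x3 not described by x1 and x2
G11, g12, g22 = G[:2, :2], G[:2, 2], G[2, 2]
cond_var = g22 - g12 @ np.linalg.solve(G11, g12)
print(P[2, 2] ** 2, cond_var)      # the two numbers agree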

            The presentation in this section has been based on one block of variables (x1, x2, …) or one block of data X that is being described by latent variables (t1, t2, …) or one block of score data, T. But we will be working with several blocks of X-data. The same considerations (e.g. patterns of non-zeros) and special structures (e.g. a lower triangular loading matrix) apply to each block.

 

 

 

2. Measures of fit and changes

 

The H-principle and PLS are based on a least squares approach. In contrast, many statistical procedures use maximum likelihood, ML, as the estimation method. We shall briefly review the ML approach, because the way we work with the J-divergence (see below) is similar to the ML approach. In the ML approach the log-likelihood function is often used to measure fit and changes in parameter estimates. It is given by

 

(3)                    ℓ(Σ) = c - N ( log|Σ| + tr(Σ-1S) )/2,

 

where c is a constant, Σ the model covariance matrix and S the sample covariance matrix, S=XTX/N. From a theoretical point of view it is convenient to work with the function (3). If Σ1 is the maximum likelihood estimate of Σ under a null (reduced) hypothesis and Σ2 the estimate under a larger alternative, the theory tells us that the difference

 

(4)                    2(ℓ(Σ2) - ℓ(Σ1)) = - N ( log(|Σ2|/|Σ1|) + tr((Σ2-1 - Σ1-1)S) )

 

approximately will follow a χ2-distribution with degrees of freedom equal to the difference between the numbers of parameters under the two hypotheses. (4) is often used to test the effect of increasing the number of parameters by one, comparing (4) to the critical value of the χ2-distribution with one degree of freedom. The disadvantage of this procedure is that the function (3) is unstable when data have reduced rank. It is easy to show examples where arbitrarily small score vectors become significant when using (3) or (4), although they have no predictive ability. It is instructive to look closer at the differential of (3),

 

(5)                    dℓ = - (N/2) tr( (Σ-1 - Σ-1S Σ-1) dΣ )

 

The equation shows that the optimal Σ is one where S is a generalised inverse of Σ-1, Σ-1S Σ-1 = Σ-1.

The equation (5) shows that Σ-1S should be close to the identity matrix I. When we work with reduced rank, we also need ΣS-1 to be close to I. This can be obtained by working with the J-divergence,

 

(6)                    J = tr( ΣS-1 + Σ-1S - 2 I )/2

 

The theory associated with (6) is the same as or similar to that of (3) and (4). It is derived from testing equality of two covariance matrices, see ref 10. It is instructive to look at (6) in terms of the eigenvalues of ΣS-1. If (λi) are the eigenvalues, it is simple to show that

(7)                    J = ½ Σi (1 - λi)2/λi

 

The equation (7) shows that the variance ratios λi between Σ and S should be close to one, and in particular different from zero, if J in (7) is to be small. The differential of J is given by

 

(8)                    dJ = ½ tr( (S-1 - Σ-1S Σ-1) dΣ )

 

If we compare (5) and (8), we see that (8) agrees with (5) up to the factor -N, when the sample inverse S-1 is close to the model inverse Σ-1. The advantage of working with (6) and (8) is that the expressions can also be used in the case of reduced rank in data. It should be noted that the theory is different in the case of reduced rank. Nevertheless, we work with the expressions in the reduced-rank case in the same way as in the full-rank case. Thus, the criteria we use are well motivated also when data are not of full rank.
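
The following sketch evaluates the quantities of this section for a full-rank example: the log-likelihood (3), the test statistic (4) for a nested pair of models, and the J-divergence computed both from (6) and from the eigenvalue form (7). The data and the diagonal model covariance are hypothetical.

import numpy as np
from scipy import stats

def log_lik(Sigma, S, N, c=0.0):
    # (3): c - N(log|Sigma| + tr(Sigma^-1 S))/2
    sign, logdet = np.linalg.slogdet(Sigma)
    return c - N * (logdet + np.trace(np.linalg.solve(Sigma, S))) / 2.0

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
N = X.shape[0]
S = X.T @ X / N                         # sample covariance matrix

Sigma1 = np.diag(np.diag(S))            # reduced (null) model: diagonal covariance
Sigma2 = S                              # larger model: unrestricted ML estimate

# (4): likelihood-ratio statistic, compared with a chi-square critical value
lr = 2.0 * (log_lik(Sigma2, S, N) - log_lik(Sigma1, S, N))
df = 3                                  # the three off-diagonal parameters
print(lr, stats.chi2.ppf(0.95, df))

# (6) and (7): the J-divergence between Sigma1 and S, computed in two ways
Sigma = Sigma1
J_trace = 0.5 * (np.trace(Sigma @ np.linalg.inv(S))
                 + np.trace(np.linalg.solve(Sigma, S)) - 2 * S.shape[0])
lam = np.linalg.eigvals(Sigma @ np.linalg.inv(S)).real
J_eig = 0.5 * np.sum((lam - 1.0) ** 2 / lam)
print(J_trace, J_eig)                   # the two expressions agree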

 

 

3. Sequential estimation of loading coefficients

 

We shall briefly describe the situation when revising estimates of the loading matrix. The algorithm has given a loading matrix P=P0, such that the sample covariance matrix S is approximately S ≈ P0P0T. Here we suppose that the score vectors have been scaled to unit length. The algorithms associated with the H-principle always provide a matrix R=R0 such that R0TP0 = I, where I is the identity matrix with as many diagonal elements as there are score vectors. The task we consider here is to find a simple or significant form of the loading matrix P. The matrix P that we find may contain only significant non-zeros. It may also contain the best estimates according to some pattern of non-zeros, cf. Table 1. Or it may contain a combination of these two, i.e., significant loading values within a given pattern of non-zero loading values.
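
To make the setting concrete, here is a sketch (an assumption for illustration; the H-principle algorithms produce P0 and R0 in their own way) where P0 and R0 are built from a truncated eigendecomposition of S, so that S ≈ P0P0T, R0TP0 = I and S+ ≈ R0R0T.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 6)) @ rng.normal(size=(6, 6))
N = X.shape[0]
S = X.T @ X / N

# Keep A components of the eigendecomposition of S (hypothetical choice A = 3)
A = 3
eigval, V = np.linalg.eigh(S)
idx = np.argsort(eigval)[::-1][:A]
L, V = eigval[idx], V[:, idx]

P0 = V * np.sqrt(L)                # loadings scaled so that S is approx. P0 P0^T
R0 = V / np.sqrt(L)                # then R0^T P0 = I and S+ is approx. R0 R0^T

print(np.round(R0.T @ P0, 10))     # identity matrix of size A
print(np.round(S - P0 @ P0.T, 2))  # small if A components capture most of S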

            We shall use the following notation:

 

            S=P0P0T,        S+=R0R0T,     Σ=PPT,           Σ+=RRT.

 

The matrix R is computed as R=P(PTP)-1. Note that the number of columns in P can be smaller than the number of columns in P0. There are three steps in filling out P:

 

1.      Find the next element in P that should be considered as non-zero.

2.      Estimate the values of all non-zero elements in P.

3.      Judge the improvement and closeness to P0.

 

There are many ways to handle each of the three steps. We shall here consider only one approach for each.

 

Step 1. We are to find a new loading matrix P1, P1=P + c Eij, where Eij has 1 as element (i,j) and zeros elsewhere. It is natural to choose as the next non-zero element the one that gives the maximal increase in [tr(P1P1TS+)-tr(PPTS+)]. If we solve the equation ∂[tr(P1P1TS+)-tr(PPTS+)]/∂c = 0, it is shown in Appendix 1 that a solution is given by

 

(9)      c = - pjTui/(uii)

 

Here pj is the j-th column of P and ui the i-th column of U=S+=(uij). This gives the optimal increase as

 

(10)      [tr(P1P1TS+)-tr(PPTS+)] = (pjTui)2/(uii)

 

The task is thus to find the element in P that gives the maximal value of

 

(11)                  max  (pjTui)2/(uii),        for (i,j) within the pattern of actual or required non-zeros.
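
A sketch of Step 1 under the stated formulas (9)-(11); the inputs P, U = S+ and the boolean pattern of allowed positions are hypothetical.

import numpy as np

def select_next_nonzero(P, U, allowed):
    # Step 1: among the allowed (i, j), pick the one maximising (p_j^T u_i)^2 / u_ii,
    # cf. (11), and return it together with the value c from (9).
    best, best_gain = None, -np.inf
    nrow, ncol = P.shape
    for i in range(nrow):
        for j in range(ncol):
            if not allowed[i, j]:
                continue
            num = P[:, j] @ U[:, i]            # p_j^T u_i
            gain = num ** 2 / U[i, i]          # cf. (10) and (11)
            if gain > best_gain:
                best_gain = gain
                best = (i, j, -num / U[i, i])  # c = - p_j^T u_i / u_ii, cf. (9)
    return best, best_gain

# toy usage with hypothetical inputs
rng = np.random.default_rng(4)
P = rng.normal(size=(6, 2)) * (rng.random((6, 2)) < 0.5)
B = rng.normal(size=(6, 6)); U = B @ B.T       # U plays the role of S+
allowed = (P == 0)                             # positions not yet filled
print(select_next_nonzero(P, U, allowed))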

 

Step 2. The previous step gave a candidate for the next non-zero element. We need to revise the estimates of the present non-zero elements in the light of the new one. If we differentiate J with respect to an element of P, we get from (8)

 

                        ∂J/∂pij = ½ tr( (S-1 - Σ-1S Σ-1)(ei pjT + pj eiT) )

 

Here we need an expression for ∂Σ/∂pij. From Σ = PPT = p1 p1T + p2 p2T + …, we derive ∂Σ/∂pij = ei pjT + pj eiT. Here ei=(0,…,0,1,0,…)T, where the 1 is at the i-th position. If we require ∂J/∂pij = 0, we get

 

tr( S+ pj eiT ) = tr( Σ-1S Σ-1 pj eiT ) = tr( Σ-1S RRT pj eiT ) = tr( Σ-1S R ej eiT )

 

This gives

 

                        uiT pj = (Σ-1S R)ij                   ( the (i,j)-th element of Σ-1S R).

 

If we write out the linear equations, we get

 

(12)                 u1i p1j + u2i p2j + ... + uJi pJj = (Σ-1S R)ij

 

The equation (12) is used to find pj. Some of the p-values in (12) are zero and therefore drop out of the equations. There is one equation for each non-zero element in pj. Therefore (12) gives us a square coefficient matrix that is positive semi-definite. The part of the coefficient matrix U=S+ that is used in (12) will typically be of reduced rank. Therefore, some care must be given to finding the solution pj. Note that both R=P(PTP)-1 and Σ+=RRT depend on P. Thus, a few iterations are necessary to get convergence. The starting values are the previous values of the non-zero elements in P and the value c in (9) for the newest non-zero element. The task of (12) is to provide a small adjustment of the values of the non-zero elements.
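
A sketch of one way to carry out Step 2 under the assumptions above (the non-zero pattern of column j is fixed, P has full column rank, R = P(PTP)-1 and Σ+ = RRT); a least-squares solve is used because the relevant part of U = S+ may have reduced rank.

import numpy as np

def update_column(P, j, S, S_plus, n_iter=5):
    # Step 2: re-estimate the non-zero elements of column j of P from (12),
    # iterating a few times because R = P (P^T P)^-1 depends on P.
    nz = np.flatnonzero(P[:, j])                   # positions of non-zero elements in p_j
    for _ in range(n_iter):
        R = P @ np.linalg.inv(P.T @ P)             # R = P (P^T P)^-1
        Sigma_plus = R @ R.T                       # reduced-rank inverse of P P^T
        rhs = (Sigma_plus @ S @ R)[nz, j]          # right-hand side of (12)
        A = S_plus[np.ix_(nz, nz)]                 # the part of U = S+ entering (12)
        P[nz, j] = np.linalg.lstsq(A, rhs, rcond=None)[0]
    return P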

 

Step 3. The starting point is the expression for the J-divergence, (6), which we need to truncate to fit data with reduced rank. The products ΣS-1 and Σ-1S are both positive definite. They can therefore be written as CCT and CT-1C-1, respectively. This gives

 

                        J = tr( ΣS-1 + Σ-1S - 2 I )/2 = tr( (C - CT-1)(C - CT-1)T )/2

 

When we truncate, we use C ≈ RTP0 and CT-1 ≈ (R0TP)T. This gives

(13)                 J ≈ tr( (RTP0 - (R0TP)T) (RTP0 - (R0TP)T)T )/2 = Σ fij2/2,

 

where F = RTP0 - (R0TP)T. Both RTP0 and (R0TP)T = PTR0 approach the identity matrix of appropriate size, when P is filled out (and R is computed as R=P(PTP)-1). The filling out of P stops when J is sufficiently small or when F can be judged to be a random matrix.
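
Finally, a sketch of the stopping criterion in Step 3; P0 and R0 are constructed as in the sketch at the beginning of this section (a hypothetical PCA-style construction).

import numpy as np

def truncated_J(P, P0, R0):
    # Step 3: the truncated J-divergence (13) used as a stopping criterion
    R = P @ np.linalg.inv(P.T @ P)        # R = P (P^T P)^-1
    F = R.T @ P0 - (R0.T @ P).T           # F = R^T P0 - (R0^T P)^T
    return 0.5 * np.sum(F ** 2)           # J approx. sum of f_ij^2 / 2

# toy check: when P equals P0, both R^T P0 and P^T R0 are the identity and J = 0
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 6))
S = X.T @ X / 50
eigval, V = np.linalg.eigh(S)
V, L = V[:, -3:], eigval[-3:]
P0, R0 = V * np.sqrt(L), V / np.sqrt(L)
print(truncated_J(P0, P0, R0))            # 0.0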