Compositional data

In statistics, compositional data are quantitative descriptions of the parts of some whole, conveying relative information. Measurements involving probabilities, proportions, percentages or ppm can all be thought of as compositional data.

The original definition, given by John Aitchison (1986), has several consequences:

A compositional data point, or composition for short, can be represented by a positive real vector with as many parts as considered. Sometimes, if the total amount is fixed and known, one component of the vector can be omitted.
Because compositions carry only relative information, all of that information is contained in the ratios between components. Consequently, a composition multiplied by any positive constant contains the same information as the original. Proportional positive vectors are therefore equivalent when considered as compositions.

Compositional data can be represented by constant-sum real vectors with positive components, and these vectors form a simplex, defined as

$${\mathcal {S}}^{D}=\left\{\mathbf {x} =[x_{1},x_{2},\dots ,x_{D}]\in \mathbb {R} ^{D}\,\left|\,x_{i}>0,\ i=1,2,\dots ,D;\ \sum _{i=1}^{D}x_{i}=\kappa \right.\right\}.$$

Figure: An illustration of the Aitchison simplex. Here there are 3 parts, $x_{1},x_{2},x_{3}$, which represent values of different proportions. A, B, C, D and E are 5 different compositions within the simplex. A, B and C are all equivalent, and D and E are equivalent.

The sample space $\scriptstyle {\mathcal {S}}^{D}$ is also known as the Aitchison simplex. It turns out that an alternative vector space structure can be defined on the Aitchison simplex, which motivated the development of Aitchison geometry.

Each composition represents an equivalence class. Any two compositions $x,y\in S^{D}$ are said to be equivalent if $y=\lambda x$ for some $\lambda >0$ . For example, if these two compositions were $x=[0.5,0.25,0.25]$ and $y=[50,25,25]$ , they would be equivalent, since one could multiply $x$ by 100 to obtain $y$ .
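The equivalence test described above can be sketched in a few lines of Python. This is a minimal illustration, not a library routine: the function name `are_equivalent` and the tolerance parameter `rel_tol` are assumptions introduced here, and the check simply verifies that every component of $y$ is the same positive multiple of the corresponding component of $x$.

```python
import math

def are_equivalent(x, y, rel_tol=1e-9):
    """Return True if y = lam * x for some lam > 0.

    Hypothetical helper: both inputs must be sequences of strictly
    positive numbers with the same number of parts.
    """
    if len(x) != len(y) or len(x) == 0:
        return False
    if any(v <= 0 for v in x) or any(v <= 0 for v in y):
        return False
    lam = y[0] / x[0]  # candidate scaling factor, taken from the first part
    return all(math.isclose(yi, lam * xi, rel_tol=rel_tol)
               for xi, yi in zip(x, y))

print(are_equivalent([0.5, 0.25, 0.25], [50, 25, 25]))  # True
print(are_equivalent([0.5, 0.25, 0.25], [50, 30, 20]))  # False
```

A relative tolerance is used because proportions are usually stored as floating-point numbers, so exact equality of ratios cannot be relied upon.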

Equivalent compositions can be represented by positive vectors whose components add to a given constant $\scriptstyle \kappa$ . The vector operation assigning the constant sum representative is called closure and is denoted by $\scriptstyle {\mathcal {C}}[\,\cdot \,]$ :

$${\mathcal {C}}[x_{1},x_{2},\dots ,x_{D}]=\left[{\frac {x_{1}}{\sum _{i=1}^{D}x_{i}}},{\frac {x_{2}}{\sum _{i=1}^{D}x_{i}}},\dots ,{\frac {x_{D}}{\sum _{i=1}^{D}x_{i}}}\right],$$

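where $D$ is the number of parts (components) and $[\cdot ]$ denotes a row vector. As written, the closure rescales the components to sum to $\kappa =1$; for any other constant sum, each component is additionally multiplied by $\kappa$.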
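As a rough sketch, the closure operation can be written as a short Python function. The name `closure` and the optional `kappa` argument are assumptions made for this illustration; with the default `kappa=1.0` the function matches the formula above, and other values rescale the result to a different constant sum such as 100.

```python
import math

def closure(x, kappa=1.0):
    """Rescale a strictly positive vector so that its parts sum to kappa.

    Hypothetical helper: kappa=1.0 reproduces the closure formula in the
    text; kappa=100 would give percentages instead.
    """
    if len(x) == 0 or any(v <= 0 for v in x):
        raise ValueError("a composition needs strictly positive parts")
    total = math.fsum(x)  # accurate floating-point sum of all parts
    return [kappa * v / total for v in x]

print(closure([50, 25, 25]))             # [0.5, 0.25, 0.25]
print(closure([10, 30, 60], kappa=100))  # [10.0, 30.0, 60.0]
```

Equivalent compositions map to the same closed vector, which is why the closed form is a convenient representative of each equivalence class.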

Examples

Each data point may correspond to a rock composed of three different minerals; a rock of which 10% is the first mineral, 30% is the second, and the remaining 60% is the third would correspond to the triple [0.1, 0.3, 0.6]; a data set would contain one such triple for each rock in a sample of rocks.
Each data point may correspond to a town; a town in which 35% of the people are Christians, 55% are Muslims, 6% are Jews, and the remaining 4% are others would correspond to the quadruple [0.35, 0.55, 0.06, 0.04]; a data set would correspond to a list of towns.
In chemistry, compositions can be expressed as molar concentrations of each component. As the sum of all concentrations is not determined, the whole composition of D parts is needed and is thus expressed as a vector of D molar concentrations. These compositions can be converted into weight per cent by multiplying each component by the appropriate constant.
In a survey, the proportions of people answering positively to different items can be expressed as percentages. As the total is fixed at 100, the compositional vector of D components can be defined using only D − 1 components, assuming that the remaining component is the percentage needed for the whole vector to add to 100.
In probability and statistics, a partition of the sample space into disjoint events is described by the probabilities assigned to those events. The vector of D probabilities can be considered as a composition of D parts. As they add to one, one probability can be omitted and the composition is still completely determined.
In high-throughput sequencing, the data obtained are count compositions, since the capacity of the machine determines the number of reads observed. These reduce to probabilities of observing a feature given the sequencing depth.
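Several of the examples above note that, when the total is fixed and known, a composition of D parts can be stored using only D − 1 of them. The following sketch shows how the omitted part might be recovered; the function name `complete_composition` and its `total` parameter are assumptions for illustration, and the sample values are the town percentages from the example above.

```python
import math

def complete_composition(known_parts, total=100.0):
    """Append the omitted part so that all parts sum to the known total.

    Hypothetical helper: known_parts holds D - 1 strictly positive
    values, and the remaining part must also come out positive.
    """
    remainder = total - math.fsum(known_parts)
    if any(v <= 0 for v in known_parts) or remainder <= 0:
        raise ValueError("parts must be positive and sum to less than total")
    return list(known_parts) + [remainder]

# Town example: 35%, 55% and 6% are given; the fourth part is implied.
print(complete_composition([35, 55, 6]))  # [35, 55, 6, 4.0]
```

The same function applies to a probability vector by passing `total=1.0`.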

References

Aitchison J. (1986), The Statistical Analysis of Compositional Data, Chapman & Hall; reprinted in 2003, with additional material, by The Blackburn Press.
van den Boogaart K. G., Tolosana-Delgado R. (2013), Analyzing Compositional Data With R, Springer.
Pawlowsky-Glahn V., Egozcue J. J., Tolosana-Delgado R. (2015), Modeling and Analysis of Compositional Data, Wiley.

Software

compositions - R package for Compositional Data Analysis
coda.base - Compositional Data Analysis in R
CoDa.jl - Compositional Data Analysis in Julia

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.