A measure of compactness for dataset analysis

The article is dedicated to finding a measure of compactness and to analyzing a dataset using it. To find the measure of compactness, we search for hidden regularities in a sociology database. A full enumeration over feature subsets is an NP-hard task requiring 2^n combinations; to solve it, we use a heuristic approach based on a binary matrix. The article uses invariant methods to compute the measure of compactness of a dataset. Our survey was taken among Korean and Uzbek people, and we report the results obtained on this dataset.


Introduction
Generalizing ability is among the main indicators characterizing the quality of recognition algorithms [1]. This ability is manifested in determining the affiliation of objects to classes that the algorithm did not see during the learning process. Verification of the compactness hypothesis is the basis of many criteria and methods in the theory of pattern recognition. Thus, in [1] the compactness profile is described for calculating the generalizing ability of families of algorithms having infinite capacity in the VC (Vapnik-Chervonenkis) sense [2]. To determine the affiliation of an arbitrary admissible object to classes when using such families of algorithms, one must store the entire sample in memory. A representative of the family with infinite capacity is the nearest neighbor (NN) algorithm. For practical purposes, when calculating the generalizing ability, it is enough to use local properties (local restrictions) of sample objects [1]. The local constraint in [3] can be considered the compactness index proposed by N. G. Zagoruyko, determined by the number of reference objects of the minimum coverage for which the recognition of the class objects of a fixed sample was correct. In addition to the compactness indicator, candidates for inclusion in the set of local constraints are the number of noise objects, the dimension of the feature space, and the sets of objects of the shells (subsets of boundary objects) of classes [4] under a given metric. Of interest is the limiting value of the dimension above which the compactness index [3] increases. The set of features that defines the limit value is considered informative for the proximity measure used. A dimension above the limit leads to erosion of the similarities between sample objects.
There is a need to introduce a new measure of compactness based on dimensionless quantities with values in [0, 1]. The values of these quantities are required to analyze how the actual structure of the training sample differs (under a given measure of proximity) from the ideal structure for recognition. An ideal structure is one in which the number of reference objects of the minimum coverage equals the number of classes. The measure of compactness can be used to compare metrics and transformations of the attribute space with respect to a "better than" relation on fixed samples of objects. Analysis of the structure of samples is based on the properties of this relation. The analysis technique is focused on quantitative indicators calculated from the results of dividing class objects into disjoint groups [4]. The guarantee of uniqueness of the partition, in the number of groups and the composition of their objects, is the stability of the algorithm used. The influence of noise objects on indicators of the generalizing ability of algorithms has been considered repeatedly in scientific publications. An extensive list of works in [5] provides an overview of various methods for detecting and removing noise objects. Most of these methods are focused on the use of the NN rule. The recognition quality under the NN rule substantially depends on the sensitivity of the metric to the dimension of the attribute space. A change in dimension is associated both with the selection of informative features and with the transition to a description of objects in a space of latent features. In [4, 6], it was proposed to use two types of rules for hierarchical agglomerative grouping of the initial features as a toolkit for the transition to latent features. The first type is focused on sequentially combining two features into one by a nonlinear mapping of their values onto the numerical axis. Grouping according to the rules of the second type is based on the values of a stability criterion of objects, under a given metric, in a two-class recognition problem. For each group of attributes, a generalized assessment of the object is calculated. Methods that implement the two types of hierarchical grouping rules can be identified as nonlinear and linear. Nonlinear methods are invariant to the scale of feature measurements; in linear methods, the invariance property is absent. The sequence in which groups, and latent features based on them, are formed according to the two types of rules determines an ordering by degree of informativeness. The informativeness of a feature is calculated as the extremum of a criterion dividing its values into disjoint intervals, in the form of checking the degree of truth of a hypothesis: the sets of feature values in the descriptions of objects from different classes, with the number of intervals equal to the number of classes, do not intersect one another.
In this paper, non-empty classes (sets) of metrics are considered for which the cluster structures of the training samples coincide (are equivalent) in the number of groups and the composition of the objects included in them. Information on the cluster structure allows sequential selection of the reference objects of the minimum coverage, in each of which a local metric is defined. The method for calculating the weights of local metrics is similar to that used in the FRiS-STOLP method [3]. To evaluate the generalizing ability of NN-type algorithms, it is proposed to use a criterion whose values are calculated depending on the dimension and composition of the feature set, the number of removed noise objects, and the number of reference objects of the minimum coverage. The criterion-based assessment was used to demonstrate the stability of the selection of informative features under cross-validation on random samples. Additionally, this article presents our experimental results.

Methodology. On splitting class objects into disjoint groups
The use of a partially trained sample (PTS) for setting grouping conditions is described in [7]. An example of such a condition is the indication of a subset of pairs of sample objects that, when the sample is split, should not fall into one group. Belonging to disjoint classes serves as a source of additional information for studying the cluster structure under various proximity measures. The main ideas of the method given below are presented in [4]. The aims of splitting class objects into disjoint groups are: calculation and analysis of the compactness values of the class objects and of the sample as a whole, and search for the minimum coverage of the training sample by reference objects.
Vector F serves as additional information for specifying grouping conditions. When determining the minimum number of groups of connected and unrelated class objects, the shell G(E_0, ρ) is used: a subset of the boundary objects of the classes under the given metric, together with a description of the objects in a new space of binary features. To distinguish the class shell, for each S_i ∈ K_j, j = 1, …, l, a sequence ordered by ρ(S_i, S) is constructed:

S_{i_0}, S_{i_1}, …, S_{i_{m−1}}, S_i = S_{i_0}.   (1)

Let S_u be the object closest to S_i in (1) that does not belong to the class K_j. Denote by O(S_i) a neighborhood of radius r_i = ρ(S_i, S_u) centered at S_i that includes all objects for which ρ(S_i, S_t) < r_i, t = 1, …, u − 1. In O(S_i) there is always a nonempty subset of objects defined by (2). By (2), the belonging of objects to the class shell is defined as G(E_0, ρ) = ⋃_{i=1}^{m} O(S_i).
The set of shell objects from K_j ∩ G(E_0, ρ) is denoted by G_j(E_0, ρ) = {c_1, c_2, …, c_d}, d ≥ 1. The value d = 1 uniquely determines the inclusion of all objects of the class in one group. For d ≥ 1 we transform, according to (3), the description of each object S ∈ K_j into a binary vector S̃ = (β_1, β_2, …, β_d). Let a description of the objects of class K_j in the new (binary) attribute space be obtained in this way; θ denotes the number of disjoint groups of objects, and S_i ⋁ S_j, S_i ⋀ S_j denote, respectively, the operations of disjunction and conjunction on the binary features of objects S_i, S_j ∈ K_j. The step-by-step algorithm for splitting the objects of K_j into disjoint groups Ω_1, …, Ω_θ is as follows.
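As an illustration of the shell construction, the Python sketch below marks as boundary (shell) objects every object that serves as the nearest foreign-class neighbor of some object. This particular reading of the garbled definition, the Euclidean stand-in metric, and the name `shell_indices` are our assumptions, not taken from the article.

```python
def shell_indices(X, y):
    """Boundary (shell) objects of a labeled sample.

    Hypothetical reading: an object belongs to the shell if it is the
    nearest foreign-class neighbor of at least one object. Euclidean
    distance stands in for the article's base metric.
    """
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    shell = set()
    for i in range(len(y)):
        # indices of objects from classes other than y[i]
        foreign = [t for t in range(len(y)) if y[t] != y[i]]
        # nearest foreign-class object of S_i joins the shell
        u = min(foreign, key=lambda t: dist(X[i], X[t]))
        shell.add(u)
    return sorted(shell)
```

On two well-separated one-dimensional classes, only the two facing border points are collected, which matches the intuition of a shell as the subset of boundary objects.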
Step 1: θ = 0.
Step 2: Select an object S ∈ Ω; θ = θ + 1; Z = S; Ω_θ = ∅.
Step 3: While {T ∈ Ω | Z ⋀ T ≠ 0} ≠ ∅, choose such a T; Ω = Ω \ {T}; Ω_θ = Ω_θ ∪ {T}; Z = Z ⋁ T.
Step 4: If Ω ≠ ∅ then go to Step 2.
Step 5: End.
Partitioning of the objects of E_0 into disjoint groups according to the algorithm above is used to find the minimum coverage [4] of the training sample by reference objects (standards). Denote by R_S = ρ(S, S̄) the distance from the object S ∈ K_j to the nearest object S̄ of a class opposite to K_j, and by δ the minimum number of disjoint groups of connected and unconnected class objects on E_0.
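The step-by-step grouping can be sketched in Python. Because the pseudocode in the source is partly garbled, we assume the test in Step 3 is "the accumulated disjunction Z shares a 1-bit with the candidate's binary code"; the function name and the list-of-bits representation are ours.

```python
def split_into_groups(B):
    """Partition binary row-vectors into disjoint groups.

    Assumption: two objects are connected when their binary codes
    overlap in at least one 1-bit; a group is the transitive closure
    of this relation, accumulated in Z = Z OR T (Steps 2-4 above).
    """
    remaining = list(range(len(B)))
    groups = []
    while remaining:                      # Step 2: open a new group
        seed = remaining.pop(0)
        group = [seed]
        Z = list(B[seed])
        changed = True
        while changed:                    # Step 3: absorb overlapping objects
            changed = False
            for t in remaining[:]:
                if any(a and b for a, b in zip(Z, B[t])):
                    remaining.remove(t)
                    group.append(t)
                    Z = [a or b for a, b in zip(Z, B[t])]
                    changed = True
        groups.append(group)              # Step 4: repeat while objects remain
    return groups
```

Objects 0 and 1 below overlap in the second bit and land in one group, while object 2 is isolated, so θ = 2 groups result.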
We order the objects of each group Ω_t ∩ K_j, j = 1, …, l, t = 1, …, δ, according to the set of values {R_S}, S ∈ Ω_t. As the measure of proximity between S ∈ K_j, j = 1, …, l, and an arbitrary admissible object S′, the weighted distance over the local metric d(S, S′) = ρ(S, S′)/R_S is used. The decision on whether S′ belongs to one of the classes K_1, …, K_l is made according to rule (4). Following the principle of sequential exclusion used in the search for a coverage, the sample E_0 is divided into two subsets: the set of standards E_s and the control set E_c, E_0 = E_s ∪ E_c. At the beginning of the process E_s = E_0 and E_c = ∅. Sorting by the values from {R_S}, S ∈ Ω_t, t = 1, …, δ, is used to determine the candidate for removal from the reference objects of the group Ω_t. The idea of the selection is to find the minimum number of standards for which the recognition algorithm (4) remains correct (recognizes objects without errors) on E_0. We assume that the numbering of the groups reflects the order |Ω_1| ≥ ⋯ ≥ |Ω_δ| and that reference objects have not yet been selected from the group Ω_δ. Deletion candidates are selected sequentially, starting from the S ∈ Ω_t with the minimum value of R_S. If moving S to E_c violates the correctness of decision rule (4), then S is returned to the set E_s.
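The sequential-exclusion search for a minimal set of standards can be sketched as follows. Here `is_correct` stands in for decision rule (4), which is not reproduced in the text, and the removal order (smallest R_S first) follows the description above; all names are ours.

```python
def minimal_standards(objects, R, is_correct):
    """Greedy sequential exclusion of reference objects (a sketch).

    objects    : ids forming the initial standard set E_s = E_0
    R          : dict id -> distance to the nearest foreign-class object
    is_correct : callable(set) -> bool; True if the decision rule
                 recognizes the whole training sample without errors
    """
    standards = set(objects)
    # candidates with the smallest R are tried for removal first
    for s in sorted(objects, key=lambda o: R[o]):
        standards.discard(s)              # tentatively move s to E_c
        if not is_correct(standards):
            standards.add(s)              # removal broke correctness: restore
    return standards
```

In the toy check below the rule is correct only while object 3 remains a standard, so exactly that object survives the exclusion process.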

Realization of the concept. On compactness measures in supervised recognition problems
Compactness measures are called for in evaluating the generalizing ability of recognition algorithms. When calculating the estimates, the results of the search for and removal of noise objects, the selection of informative feature sets, and the number of reference objects of the minimum coverage of the training sample are used. Consider a method of forming a set of noise objects whose cardinality depends on verification of the conditions proposed below.
Let S_i ∈ K_j, ρ(S_i, S_v) = min over S_t ∉ K_j of ρ(S_i, S_t), and u = |{S_t ∈ K_j | ρ(S_i, S_t) < ρ(S_i, S_v)}|. Denote by W_j (W_j ⊂ K_j) the set of noise objects of class K_j. The object S_i ∈ K_j is included in W_j and is considered noise if condition (5) is satisfied. To analyze the results of dividing the class K_j into disjoint groups, taking into account their number, their representativeness (by the number of objects), and the removal of noise objects, it is proposed to use a structural characteristic: the compactness estimate (6). Obviously, the set of admissible values of Θ_j by (6) lies within [0, 1], with the noise objects excluded from consideration according to (5); the sample-level estimate is given by (7). The values of (6) and (7) indirectly indicate the homogeneity (or heterogeneity) of the structure of the training sample: the closer the groups are to one another in the number of class objects included in them, the closer the values of (6) and (7) are to their upper bounds. Obviously, the number and composition of the noise objects depend both on the value of the parameter λ in (5) and on the set of features in the description of objects. The problem in implementing the computational procedures is to coordinate the selection of informative features with the removal of noise objects. Let the structure of the class objects in the sample E_0 be calculated by the grouping algorithm described above. We denote by Sh(λ, X(k)) the number of noise objects of E_0 determined, depending on the value of λ, according to (5) on the feature set X(k) ⊆ X(n), and by CF the number of reference objects of the minimum coverage of the training sample from which the Sh(λ, X(k)) noise objects have been removed. Since it is impossible to obtain an exact solution to the problem of selecting informative features without enumerating all their combinations while taking the removal of noise objects into account, in practice it is recommended to use various heuristic methods. Regardless of the methods used, the quality of the selection of informative features is proposed to be determined by checking two conditions: when the noise objects Sh(λ, X(k)) are removed from E_0, the indicator of the minimum coverage of the sample by reference objects tends to its maximum allowable value (8); the product of the number of reference objects of the minimum coverage and the dimension of the attribute space tends to a minimum (9). The first condition (8) is needed to assess the compactness of the coverage of the sample by standard objects, the second (9) to assess the complexity of the calculations.
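Condition (5) and estimates (6)-(7) are garbled in the source, so the sketch below encodes one plausible reading: an object is noise when fewer than λ same-class objects lie closer to it than its nearest foreign-class object, and class compactness is a group-size statistic that equals 1 for equally sized groups. Both formulas here are our assumptions, not the article's exact expressions.

```python
def noise_objects(d, y, j, lam):
    """Noise objects of class j under our reading of condition (5).

    d : precomputed pairwise distance matrix (list of lists)
    y : class labels; lam : the threshold parameter lambda
    """
    noise = []
    for i in range(len(y)):
        if y[i] != j:
            continue
        # radius: distance to the nearest foreign-class object
        r = min(d[i][t] for t in range(len(y)) if y[t] != j)
        # same-class objects strictly inside that ball
        u = sum(1 for t in range(len(y))
                if y[t] == j and t != i and d[i][t] < r)
        if u < lam:
            noise.append(i)
    return noise

def class_compactness(group_sizes):
    """One statistic with the stated property (equal groups -> 1):
    (sum n_t)^2 / (delta * sum n_t^2), lying in (0, 1]."""
    n = sum(group_sizes)
    return n * n / (len(group_sizes) * sum(s * s for s in group_sizes))
```

For example, two groups of five objects each give a compactness of exactly 1, while a 9-versus-1 split gives a value well below 1, reflecting a heterogeneous structure.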
To search for informative sets {X(k) | X(k) ⊆ X(n)}, two criteria are proposed. Neither criterion explicitly uses the number CF of reference objects of the minimum coverage. The number of noise objects Sh(λ, X(k)) according to (5) is calculated for a fixed value of λ. This λ, common to all sets X(k) ⊆ X(n), k ≥ 2, is defined by (10). The use of (10) is based on the assumption that the probability of selecting informative feature sets with higher compactness in (8) is close to zero for λ other than (10). The first criterion (in the order of presentation) takes into account the results of covering the sample objects with hyperspheres, allowing for the removal of noise objects; the second uses compactness estimates according to (7), based on the connectivity property over the objects of the class shells. Let O(S_i, X(k)), 1 ≤ k < n, be the neighborhood of the object S_i ∈ E_0 ∩ K_j, j = 1, …, l, defined as O(S_i, X(k)) = {S ∈ K_j | ρ(S_i, S) < ρ(S_i, S̄)}, where S̄ ∉ K_j is the object closest to S_i by the metric ρ(x, y) from the complement of the class K_j over the feature set X(k). We define the estimate of S_i ∈ E_0 on X(k) by (11), where X(k + 1) = X(k) ∪ {x_i}, x_i ∈ X(n). Let P denote a subset of feature indices from X(n); W_j(P) is the set of noise objects of class K_j according to (5) on the feature set indexed by P, at the value of λ calculated according to (10).
Step-by-step selection of informative feature sets using (11) and (12) is implemented as follows.
Step 5: Output P.
Step 6: End.
To compare informative sets obtained by the different criteria, it is recommended to use (8) and (9).
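Criteria (11) and (12) themselves are not recoverable from the text, so the following is only a structural sketch of the step-by-step selection: a generic greedy forward procedure with a pluggable `score` criterion. The function, its parameters, and the stopping rule are our assumptions.

```python
def greedy_feature_selection(n_features, score, max_k=None):
    """Generic greedy forward selection of a feature-index set P.

    score(P) -> float : quality of the set P (higher is better); this
    stands in for the article's criteria (11)-(12), which are garbled.
    """
    P = set()
    best = score(P)
    candidates = set(range(n_features))
    while candidates and (max_k is None or len(P) < max_k):
        # pick the feature whose addition improves the criterion most
        f = max(candidates, key=lambda i: score(P | {i}))
        if score(P | {f}) <= best:
            break                        # no candidate improves the criterion
        P.add(f)
        best = score(P)
        candidates.remove(f)
    return P
```

With a toy criterion that rewards features 0 and 2 but penalizes set size, the procedure stops as soon as no addition helps, mirroring the "output P" step above.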

Computational experiment
A training set E_0 = {S_1, …, S_m} is given, consisting of representatives of l disjoint classes of objects K_1, …, K_l. Each object S_i ∈ E_0 is assumed to be described by n nominal features. We chose a sociology dataset in which the class to be determined is nationality (Uzbek or Korean). The features of the dataset are survey questions, and the feature values are the answers to those questions. The dataset contains 100 objects and 25 features; the last feature is the class label of the object.
We fix one metric from the equivalence class Ψ and take it as the base metric for the computational experiment. When calculating the measure of proximity between objects, the Zhuravlev metric was used as the base, where I, J ⊂ {1, …, 20} are the sets of indices of the quantitative and nominal features, respectively.
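In the form commonly attributed to Zhuravlev, the proximity between two objects counts the features on which they differ, using a threshold ε_i for each quantitative feature and strict inequality for nominal ones. The sketch below assumes this form; the thresholds `eps` and the index sets are placeholders for the article's I and J.

```python
def zhuravlev_distance(a, b, quant, nominal, eps):
    """Mismatch count in Zhuravlev's sense (a sketch).

    quant, nominal : index lists of quantitative and nominal features
    eps            : dict of per-feature thresholds for quantitative ones
    """
    dist = 0
    for i in quant:
        if abs(a[i] - b[i]) >= eps[i]:   # quantitative mismatch
            dist += 1
    for i in nominal:
        if a[i] != b[i]:                 # nominal mismatch
            dist += 1
    return dist
```

Because the result is an integer mismatch count, it is invariant to monotone rescaling of the nominal answer codes, which fits the survey-question features of our dataset.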
First, we determined the shell objects, i.e., the subset of boundary objects. As a result, 64 of the 100 objects in the dataset were taken as shell objects. After that, we built the binary matrix; note that this binary matrix is symmetric with respect to the main diagonal. We then grouped the objects of the classes. Finally, we found the compactness of both classes: 0.8848 and 0.848, respectively. It can be seen that the compactness values of the two classes are almost equal and are of a good grade.