Prioritizing and annotating the dimensions of input data is part of the optional guidance step for each analysis, as introduced in Working with Inspirient > Providing Guidance. For both dimension priorities and annotations, suggested values are provided by the system based on its current understanding of the dataset. In most cases, users only need to review and possibly tweak these suggestions. If an analysis is started via the I’m feeling lucky button, all suggestions are used directly without reconfirming them with the user.
Prioritization of Input Dimensions
Prioritization affects the sort order of results: results derived from higher-priority dimensions are displayed more prominently and are more likely to be included in stories.
The exact effects of low / high priorities are as follows, with the objective of ensuring that overall comprehensiveness across all analytical methods degrades gracefully for very large datasets:
- Dimensions set to the lowest priority setting are ignored during the analysis.
- Dimensions set to the highest priority setting are analyzed preferentially, i.e., they are guaranteed to be evaluated by all applicable analytical methods.
- Other low-priority dimensions may be omitted from certain analytical methods if computational requirements would otherwise exceed the allocated resources.
Client-specific and hardware-specific options may be set by system administrators to fine-tune system behavior.
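The three rules above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the priority scale, the `Dimension` class, the per-dimension `cost`, and the `budget` are hypothetical and do not describe how the system actually allocates resources.

```python
from dataclasses import dataclass

# Hypothetical priority scale; the real settings are not specified in this section.
LOWEST, LOW, NORMAL, HIGH, HIGHEST = range(5)


@dataclass
class Dimension:
    name: str
    priority: int
    cost: int  # assumed relative cost of evaluating this dimension with one method


def select_dimensions(dimensions, budget):
    """Return the names of the dimensions one analytical method would evaluate."""
    # Rule 1: lowest-priority dimensions are ignored entirely.
    candidates = [d for d in dimensions if d.priority != LOWEST]
    # Rule 2: highest-priority dimensions are always evaluated, whatever the budget.
    selected = [d for d in candidates if d.priority == HIGHEST]
    remaining = budget - sum(d.cost for d in selected)
    # Rule 3: the rest are admitted in descending priority while the budget lasts.
    others = sorted(
        (d for d in candidates if d.priority != HIGHEST),
        key=lambda d: -d.priority,
    )
    for d in others:
        if d.cost <= remaining:
            selected.append(d)
            remaining -= d.cost
    return [d.name for d in selected]


dims = [
    Dimension("row_id", LOWEST, 1),
    Dimension("revenue", HIGHEST, 5),
    Dimension("region", NORMAL, 3),
    Dimension("comment", LOW, 4),
]
print(select_dimensions(dims, budget=8))
```

With a smaller budget, only the guaranteed highest-priority dimension remains, which is the graceful degradation the rules aim for.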
Contextualization of Input Dimensions
Annotations allow users to establish the analytical context of input dimensions, for example, to specify whether calculating the sum over a column of numeric values is sensible (e.g., for inventory quantities) or not (e.g., for time-series measurements).
There are four kinds of data annotations:
- Filter – Annotations that filter the input data
- Transformation – Annotations used to carry out a transformation on the input values
- Semantic – Annotations to explicitly communicate column meaning
- Analysis – Annotations that affect the analysis
The full list of supported annotations and their effects is given in the following table.
Annotation | Type | Description | Details |
---|---|---|---|
FILTER_ON | Filter | Filter input table with a given criterion (regular expression accepted) | |
FILTER_ON_DOMINANT_DOMAIN | Filter | Filter table on the most frequent items (accounting for 80% of occurrences) | |
FILTER_ON_DOMINANT_DOMAIN_BY_VALUE | Filter | Filter table on items with largest sum of a given value column (accounting for 80% of accumulated value) | |
FILTER_ON_TOP_3_BY_VALUE | Filter | Filter table on the 3 items with the largest sum of a given value column | |
FILTER_ON_TOP_10_BY_VALUE | Filter | Filter table on the 10 items with the largest sum of a given value column | |
ABC_CLASSIFICATION | Transformation | Classify column into n categories using ABC analysis | |
ANONYMIZE | Transformation | Anonymize items in column with a generated ID value | Annotates a dimension to be anonymized by replacing it with a new column that contains a cryptographically strong hash value for each original value. A look-up table to map hashed values to original values is made available separately to the user who owns the analysis. |
DEEP_DRILL_DOWN | Transformation | Split input table on each item in column | |
DEFAULT_VALUE | Transformation | Transform missing values, i.e., absent or null, to a default value | Annotates a dimension to use a given default value in case no value is present, e.g., {DEFAULT_VALUE(no data available)} |
DEFINE_AS_MISSING | Transformation | Define a value to be treated as missing during analysis | Annotates a dimension to define a value to be treated as missing, e.g., {DEFINE_AS_MISSING(-1:not applicable)} |
DRILL_DOWN | Transformation | Split input table on the most frequent items (accounting for 80% of occurrences) | |
IGNORE_VALUE | Transformation | Ignore specified value(s), i.e., treat as absent or null | Annotates a dimension to ignore certain values by excluding matching rows during analysis, e.g., {IGNORE_VALUE(John Doe)} |
JOINABLE_ID_VALUES | Transformation | Enable joining tables on specified ID values | Annotates a dimension to specify that it can be used to join tables with a corresponding JOINABLE_ID_VALUES annotation |
USE_AS_IS | Transformation | Disable any automated transformations during analysis for this column | |
CATEGORICAL | Semantic | Values representing categorical items | Annotates a dimension as containing categorical values without a natural order (also see ORDINAL) |
DEMOGRAPHIC_VARIABLE | Semantic | Socio-demographic information | Annotates a dimension to be treated as a socio-demographic variable, e.g., when analyzing survey data |
HAS_SUBTOTALS | Semantic | Numeric column contains subtotals | Annotates a dimension as including subtotals |
ID | Semantic | Values representing ID values | Annotates a dimension to be treated as ID values |
IS | Semantic | Values representing selected meaning | |
LESS_IS_BETTER | Semantic | A lower numeric value is better | Annotates a dimension to contain numeric values for which lower values are more desirable in the context of the current analysis |
MAXIMIZABLE | Semantic | Numeric values where the maximum should be considered for all numeric operations | |
MINIMIZABLE | Semantic | Numeric values where the minimum should be considered for all numeric operations | |
MORE_IS_BETTER | Semantic | A higher numeric value is better | Annotates a dimension to contain numeric values for which greater values are more desirable in the context of the current analysis |
NATURAL_LANGUAGE_TEXT | Semantic | Text values that should be considered for natural language processing | Annotates a dimension to be treated as natural language text and to apply corresponding analytical methods |
NOT_CATEGORICAL | Semantic | Values that should be considered for all numeric operations | Annotates a dimension as not containing categorical values |
NOT_SUMMABLE | Semantic | Numeric values that cannot be summed | Annotates a dimension as not containing summable values |
ORDINAL | Semantic | Numeric values representing ordered categorical items | Annotates a dimension as containing numeric categorical values with a natural order (also see CATEGORICAL) |
SUMMABLE | Semantic | Numeric values that can be summed | Annotates a dimension as containing summable values |
SURVEY_DURATION | Semantic | Survey duration indicator | Annotates a dimension to be treated as a survey duration indicator |
SURVEY_INTERVIEWER_ID | Semantic | Survey interviewer identifier | Annotates a dimension to be treated as a survey interviewer ID |
SURVEY_META | Semantic | Survey meta-information | Annotates a dimension to be treated as survey meta-information |
SURVEY_RESPONSE | Semantic | Survey response values | Annotates a dimension to be treated as a survey response |
AGGREGATION_WEIGHT | Analysis | Weight aggregations by values in this dimension, typically used for analysis of survey data to reduce bias | Annotates a dimension to be treated as a weighting variable, e.g., when analyzing survey data |
DEPENDENT_VARIABLE | Analysis | An input variable of interest that should be explained | Annotates a dimension to be treated as a dependent variable in the current analysis. Also known as a target or label in machine learning. |
INDEPENDENT_VARIABLE | Analysis | A control variable that is used to explain effects on a dependent variable | Annotates a dimension to be treated as an independent variable in the current analysis. Also known as a predictor or feature in machine learning. |
OVERRIDE_RESTRICTIONS | Analysis | Disable analysis restrictions for performance optimization | Annotates a dimension to be analyzed without any restrictions that would usually be in place to ensure acceptable runtime when analyzing very large tables. Use with caution! |
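To make the "accounting for 80%" wording of FILTER_ON_DOMINANT_DOMAIN and FILTER_ON_DOMINANT_DOMAIN_BY_VALUE more concrete, the following sketch shows one plausible reading of that rule: keep the top items, in descending order, until their cumulative share reaches 80%. The function names, the row representation as dictionaries, and the handling of ties and of the exact threshold are assumptions, not the system's documented behavior.

```python
from collections import Counter


def dominant_items(values, weights=None, share=0.8):
    """Return the top items whose cumulative share first reaches `share`."""
    if weights is None:
        weights = [1] * len(values)  # plain occurrence counting
    totals = Counter()
    for value, weight in zip(values, weights):
        totals[value] += weight
    grand_total = sum(totals.values())
    dominant, running = [], 0
    for item, total in totals.most_common():  # descending by total
        if running >= share * grand_total:
            break
        dominant.append(item)
        running += total
    return dominant


def filter_rows(rows, column, value_column=None, share=0.8):
    """Keep only rows whose `column` item is dominant (by count or by value)."""
    values = [row[column] for row in rows]
    weights = [row[value_column] for row in rows] if value_column else None
    keep = set(dominant_items(values, weights, share))
    return [row for row in rows if row[column] in keep]


rows = [
    {"customer": "A", "amount": 10},
    {"customer": "B", "amount": 70},
    {"customer": "C", "amount": 14},
    {"customer": "A", "amount": 6},
]
print(filter_rows(rows, "customer", value_column="amount"))
```

Passing no `value_column` corresponds to the occurrence-based variant; passing one corresponds to the by-value variant.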
Advanced users may also prefer to embed these annotations directly in their data, by appending them to column labels enclosed in curly brackets, e.g., {SUMMABLE}.
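As an illustration of the embedded form, the sketch below splits a column label into its plain name and any annotations appended in curly brackets. It is only a sketch under assumptions: the helper `split_label` and the regular expression are hypothetical, and whether several annotations may be appended to one label, or how arguments containing a closing parenthesis are handled, is not specified here.

```python
import re

# Matches {NAME} or {NAME(argument)}; assumes annotation names use A-Z, 0-9 and "_".
ANNOTATION = re.compile(r"\{([A-Z0-9_]+)(?:\(([^)]*)\))?\}")


def split_label(label):
    """Split a column label into (plain name, [(annotation, argument or None), ...])."""
    annotations = [(m.group(1), m.group(2)) for m in ANNOTATION.finditer(label)]
    name = ANNOTATION.sub("", label).strip()
    return name, annotations


for label in [
    "Quantity {SUMMABLE}",
    "Status {DEFINE_AS_MISSING(-1:not applicable)}",
    "Region",
]:
    print(split_label(label))
```

A label without curly brackets is returned unchanged with an empty annotation list, so unannotated columns pass through untouched.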
Best Practices
- Prioritize sparingly, but with confidence – In most cases, it is not necessary to fine-tune the priorities of every dimension of a dataset. It’s more time-efficient to quickly adjust the priorities of the most important dimensions, and then later use tags to filter out less important results.
- Annotate selectively – Annotations help the system to handle the dimensions of a dataset correctly, including in corner cases. In most cases, the correct analytical methods are applied even without annotations. If pressed for time, users can do a quick initial run with the I’m feeling lucky button, check key results for issues, and add only the annotations required to address these issues.
- Re-use prior priorities and annotations – Priorities and annotations of all past analyses are scanned to make the best possible suggestion for the current dataset. This includes datasets from other users (with accounts on the same Inspirient service instance). Suggested priorities and annotations may thus reflect what your co-workers would find appropriate for the data at hand.