Data Processing Layer
Raw data cannot be used immediately for data analysis. Our Data Processing Layer will process the raw data to remove noise, verify its veracity (to prevent participants from submitting fake data), and restructure it via data schemas.
Data Veracity Verification
Given the incentives present in the network, it is easy to foresee bad behaviors such as the submission of fake data. Since no single method is fail-proof, we will employ a hybrid approach to verify the veracity of the data. This includes, but is not limited to, the following (see the sketch after this list):
AI / ML algorithms for anomaly detection (such as Autoencoders, Isolation Forests, Z-Score, etc.)
Assertion checks enforced by data schemas (rejecting data entries that do not meet certain criteria)
Data Provenance Checks
Data Consistency Checks
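For illustration, the sketch below combines two of the detectors listed above: a Z-score filter and an Isolation Forest (via scikit-learn). The threshold, contamination rate, and synthetic data are illustrative assumptions, not the network's actual parameters, and the provenance and consistency checks are omitted.

```python
# A minimal sketch of a hybrid veracity check, assuming tabular numeric
# submissions. Thresholds and the contamination rate are illustrative only.
import numpy as np
from sklearn.ensemble import IsolationForest

def zscore_outliers(X: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Flag rows whose z-score exceeds the threshold in any column."""
    z = (X - X.mean(axis=0)) / X.std(axis=0)
    return (np.abs(z) > threshold).any(axis=1)

def isolation_forest_outliers(X: np.ndarray) -> np.ndarray:
    """Flag rows that an Isolation Forest scores as anomalous."""
    forest = IsolationForest(contamination=0.01, random_state=0)
    return forest.fit_predict(X) == -1

def verify_batch(X: np.ndarray) -> np.ndarray:
    """Accept a row only if no detector flags it (the hybrid principle)."""
    return ~(zscore_outliers(X) | isolation_forest_outliers(X))

# Example: 1,000 honest readings plus two fabricated extremes.
rng = np.random.default_rng(42)
honest = rng.normal(loc=50.0, scale=5.0, size=(1000, 2))
fake = np.array([[500.0, -300.0], [1e4, 1e4]])
batch = np.vstack([honest, fake])
print(f"Accepted {verify_batch(batch).sum()} of {len(batch)} entries")
```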
We will not disclose the details of our approach, in order to prevent it from being reverse-engineered and exploited.
Data Mapping & Schema Management
The volume of information is ever expanding, but not all information is valuable. Since the storage and processing capabilities of the network can never be unlimited, it is important to identify the data fields that are meaningful (and discard the rest) and to include their definitions (i.e., what each data field means, since raw field names may be abbreviated or written in shorthand). This is the role of data mapping and data schemas. Within the network, we refer to the combined output of data mapping and data schemas as data templates.
Just as there are infinite ways of interpreting the same piece of data, there are infinite ways of defining data schemas. Hence, we have decided to decentralize this module, meaning that anyone can participate in defining data schemas and be rewarded if the schema they define is valid and meaningful.
Data Mapping
A data mapping entry consists of the following (see the sketch after this list):
Source File Name (Absolute or Conditional)
Source Field Name (Raw Field Name in Source File)
Target Field Name (Optional)
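For illustration, a single mapping entry might look like the sketch below. The file and field names (exports/health_2024.csv, hr_bpm, heart_rate_bpm) are hypothetical and not part of any official template.

```python
# A hypothetical data mapping entry; all names are invented for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataMapping:
    source_file_name: str                     # absolute path or conditional pattern
    source_field_name: str                    # raw field name in the source file
    target_field_name: Optional[str] = None   # optional standardized name

mapping = DataMapping(
    source_file_name="exports/health_2024.csv",
    source_field_name="hr_bpm",               # abbreviated raw name
    target_field_name="heart_rate_bpm",       # human-readable target name
)
```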
Data Schema
By default, a single file will correspond to a single table / graph. Each data schema specifies the following (see the sketch after this list):
Type of Database (Relational / Graph / Hybrid)
Entity Information
Provenance Information
Table Names
Fields
    Raw Name (Tied to the Source Field Name of the Data Mapping File)
    Canonical Name (The standardized or human-readable name of the data field, optional)
    Data Type
    Definition / Description
    Field Format (Optional)
    Assertions (Optional)
    Data Sensitivity / Privacy Level (Optional, used for removing Personally Identifiable Information)
    IsNullable
    IsKey
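To make the structure above concrete, here is a hypothetical data template expressed as a plain Python dictionary, including an assertion check of the kind described above. All names, values, and the range check are illustrative assumptions; the network's actual template format may differ.

```python
# A hypothetical data schema; names, labels, and the assertion are examples.
schema = {
    "database_type": "relational",          # relational / graph / hybrid
    "entity": "wearable_device",
    "provenance": {"source": "device_export", "version": "1.0"},
    "tables": {
        "heart_rate": {
            "fields": [
                {
                    "raw_name": "hr_bpm",            # ties to the data mapping file
                    "canonical_name": "heart_rate_bpm",
                    "data_type": "int",
                    "description": "Heart rate in beats per minute",
                    "field_format": None,
                    "assertions": lambda v: 20 <= v <= 250,  # reject impossible readings
                    "privacy_level": "low",
                    "is_nullable": False,
                    "is_key": False,
                },
            ],
        },
    },
}

def passes_assertions(field: dict, value) -> bool:
    """Apply a field's assertion check, if one is defined."""
    check = field.get("assertions")
    return check(value) if check else True

field = schema["tables"]["heart_rate"]["fields"][0]
print(passes_assertions(field, 72))    # True: plausible heart rate
print(passes_assertions(field, 9000))  # False: fails the assertion check
```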
For more information, please see the Roles - Data Schema Developer page to learn how you can participate in this meaningful endeavor!
Data Desensitization and Preprocessing
After the veracity of the data and the data fields to be ingested have been verified, the data will undergo desensitization and preprocessing so that the privacy of the data contributor is guaranteed and the dataset is suitable for data analytics. This includes, but is not limited to, the following (see the sketch after this list):
Desensitization (Removing Personally Identifiable Information (PII))
Handling Missing Data (FillNA, FillZero, Mean, etc.)
Data Standardization (For Structured Data Fields)
Structured Data Conversion
    Type Conversion
    One-Hot Encoding
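As a rough sketch of these steps using pandas: the column names, the PII label, and the choice of mean-filling below are hypothetical examples, not prescribed defaults.

```python
# A minimal sketch of desensitization and preprocessing; column names and
# the PII list are illustrative and would come from the data schema.
import pandas as pd

raw = pd.DataFrame({
    "user_email": ["a@x.com", "b@y.com", "c@z.com"],  # PII
    "heart_rate": [72, None, 88],
    "activity":   ["run", "walk", "run"],
})

PII_COLUMNS = ["user_email"]  # derived from the schema's privacy levels

# 1. Desensitization: drop fields marked as personally identifiable.
df = raw.drop(columns=PII_COLUMNS)

# 2. Handle missing data: here, fill with the column mean (FillNA -> Mean).
df["heart_rate"] = df["heart_rate"].fillna(df["heart_rate"].mean())

# 3. Standardize structured numeric fields to zero mean and unit variance.
df["heart_rate"] = (df["heart_rate"] - df["heart_rate"].mean()) / df["heart_rate"].std()

# 4. Structured data conversion: one-hot encode categorical fields.
df = pd.get_dummies(df, columns=["activity"])

print(df)
```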