Big Data normalization for massively parallel processing databases
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
Number of Authors: 2
2017 (English). In: Computer Standards & Interfaces, ISSN 0920-5489, E-ISSN 1872-7018, Vol. 54, no. 2, p. 86-93. Article in journal (Refereed). Published.
Abstract [en]

High-performance querying and ad-hoc querying are commonly viewed as mutually exclusive goals in massively parallel processing databases. Furthermore, there is a contradiction between ease of extending the data model and ease of analysis. The modern 'Data Lake' approach promises extreme ease of adding new data to a data model; however, it is prone to eventually becoming a Data Swamp: an unstructured, ungoverned, and out-of-control Data Lake where, due to a lack of process, standards, and governance, data is hard to find, hard to use, and is consumed out of context. This paper introduces a novel technique, highly normalized Big Data using Anchor modeling, that provides a very efficient way to store information and utilize resources, thereby providing ad-hoc querying with high performance for the first time in massively parallel processing databases. This technique is almost as convenient for expanding the data model as a Data Lake, while it is internally protected from turning into a Data Swamp. A case study of how this approach is used for a Data Warehouse at Avito over a three-year period, with estimates for and results of real data experiments carried out in HP Vertica, an MPP RDBMS, is also presented. This paper extends theses from The 34th International Conference on Conceptual Modeling (ER 2015) (Golov and Rönnbäck 2015) [1]; it is complemented with numerical results about key operating areas of a highly normalized big data warehouse, collected over several (1-3) years of commercial operation. The limitations imposed by using a single MPP database cluster are also described, and a cluster fragmentation approach is proposed.
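The core idea of the Anchor-modeled approach described in the abstract is that every attribute of an entity lives in its own narrow table keyed by the entity's surrogate key, so extending the data model means adding a table rather than altering existing ones. The following is a minimal, hypothetical in-memory sketch of that idea (the class and attribute names are illustrative, not taken from the paper):

```python
# Toy sketch of highly normalized (Anchor-style) storage: one anchor
# entity, with each attribute kept in its own table keyed by the
# anchor's surrogate key. All names here are illustrative assumptions,
# not APIs or schemas from the paper.

from collections import defaultdict


class AnchorModel:
    """In-memory 6NF-style store: one anchor, many attribute tables."""

    def __init__(self, anchor_name):
        self.anchor_name = anchor_name
        self.next_id = 1
        self.anchors = set()                  # surrogate keys
        self.attributes = defaultdict(dict)   # attribute name -> {key: value}

    def new_anchor(self):
        """Create a new entity instance and return its surrogate key."""
        key = self.next_id
        self.next_id += 1
        self.anchors.add(key)
        return key

    def set_attribute(self, attr, key, value):
        # A previously unseen attribute name simply creates a new
        # attribute table; no existing table is modified. This is the
        # Data-Lake-like extensibility the abstract highlights.
        self.attributes[attr][key] = value

    def get(self, key):
        """Ad-hoc 'join' across all attribute tables for one anchor."""
        return {a: t[key] for a, t in self.attributes.items() if key in t}


ads = AnchorModel("Ad")
a1 = ads.new_anchor()
ads.set_attribute("title", a1, "Used bicycle")
ads.set_attribute("price", a1, 120)
# Later, a brand-new attribute is added without touching old data:
ads.set_attribute("region", a1, "Moscow")
print(ads.get(a1))  # {'title': 'Used bicycle', 'price': 120, 'region': 'Moscow'}
```

In a real MPP database such as Vertica, each of these attribute tables would be a physical table co-located (segmented) on the surrogate key, so the per-anchor "join" in `get` becomes a local, highly parallel merge join rather than a data shuffle; this sketch only illustrates the schema-extension property, not the performance machinery.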

Place, publisher, year, edition, pages
2017. Vol. 54, no. 2, p. 86-93.
Keyword [en]
Big Data, MPP, Database, Normalization, Analytics, Ad-hoc, Querying, Modeling, Performance, Data Lake
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Computer and Information Science
Identifiers
URN: urn:nbn:se:su:diva-144754
DOI: 10.1016/j.csi.2017.01.009
ISI: 000401888800004
OAI: oai:DiVA.org:su-144754
DiVA: diva2:1127699
Available from: 2017-07-18. Created: 2017-07-18. Last updated: 2017-07-18. Bibliographically approved.

Open Access in DiVA

No full text

Other links

Publisher's full text

Search in DiVA

By author/editor
Rönnbäck, Lars
By organisation
Department of Computer and Systems Sciences
In the same journal
Computer Standards & Interfaces
Electrical Engineering, Electronic Engineering, Information Engineering
Computer and Information Science
