Big data refers to large volumes of data and datasets that come in diverse forms from multiple sources. Many organisations have recognised the advantages of collecting as much data as possible. Here, Sanchit talks about how organisations can leverage big data analytics to transform terabytes of data into actionable insights and unlock big data’s potential. Read on:
Currently, I am working as a Manager. My responsibilities include:
Helping customers with their ad-hoc requests
Managing multiple engagements
Supporting the Pre-sales team in their pitches to new customers.
Providing technical expertise to colleagues and customers.
How do you reconcile Big Data and relevant data visualisation?
When considering a Big Data solution, it is crucial to keep in mind the architecture of a traditional BI system and how Big Data comes into play. Traditionally, we have been working with structured data coming mainly from RDBMSs (Relational Database Management Systems), loaded into a DWH (Data Warehouse), ready to be analysed and shown to the business users.
Big data solutions allow the system to process higher volumes of more diverse data much faster, making it possible to extract information efficiently and reliably from data that a traditional solution can’t handle. In addition, Big Data permits the hardware infrastructure to grow horizontally, which is more economical and flexible. So, how does Big Data enter this ecosystem? The main architectural concepts are quite similar, but there are significant changes. The main differences are a whole new set of data sources, particularly non-structured ones, and a completely new environment to store and process data.
Tableau uses drivers leveraging the Open Database Connectivity (ODBC) programming standard as a translation layer between SQL and the SQL-like data interfaces provided by these big data platforms. By using ODBC, you can access any data source that supports the SQL standard and implements the ODBC API. For Hadoop, this includes interfaces such as Hive Query Language (HiveQL), Impala SQL, BigSQL and Spark SQL. To achieve the best performance possible, Tableau custom-tunes the SQL it generates and pushes down aggregations, filters, and other SQL operations to the big data platforms. For building visualisations, a direct ODBC connection to Impala or Hive lets Tableau connect and run faster queries in both extract and live connection modes.
Source: https://www.clearpeaks.com/big-data-ecosystem-spark-and-tableau/
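To make the SQL pushdown concrete, here is a minimal Python sketch that queries Hive over the same ODBC mechanism Tableau’s drivers rely on. The pyodbc package, the "HiveDSN" data source name, and the sales table are illustrative assumptions, not part of Tableau itself.

```python
# Minimal sketch: run an aggregation over ODBC against Hive.
# Assumes a Hive ODBC driver is installed and a DSN named "HiveDSN"
# is configured; the "sales" table is hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=HiveDSN", autocommit=True)
cursor = conn.cursor()

# The aggregation and filter are expressed in SQL, so the big data
# platform, not the client, does the heavy lifting ("pushdown").
cursor.execute("""
    SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM sales
    WHERE order_date >= '2021-01-01'
    GROUP BY region
""")

for region, orders, revenue in cursor.fetchall():
    print(region, orders, revenue)

conn.close()
```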
At any point in time, the best architecture is a robust pipeline that can scale up as demand grows.
In modern ingest-and-load design patterns, the destination for raw data of any size or shape is often a data lake: a storage repository that holds a vast amount of data in its native format, whether structured, semi-structured, or unstructured.
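As a rough sketch of that ingest-and-load pattern, the snippet below lands a raw JSON record in an S3-based data lake in its native form, with no upfront schema; the bucket name and key layout are hypothetical.

```python
# Minimal sketch: land raw data in an S3-based data lake as-is.
# The bucket and key layout below are hypothetical.
import json
import boto3

s3 = boto3.client("s3")

record = {"user": "u123", "action": "click", "ts": "2021-02-01T12:00:00Z"}
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/events/2021/02/01/event-0001.json",
    Body=json.dumps(record).encode("utf-8"),
)
```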
Stream data is generated continuously by connected devices and apps, such as social networks, smart meters, home automation, video games, and IoT sensors. This data often arrives as semi-structured records collected via streaming pipelines.
We should still make sure we avoid bringing all data into any visualisation application; the less data you bring in, the faster the response time. Bringing in only relevant information improves query performance and enables you to find the answers you are looking for quickly.
A variety of options exist today for streaming data, including Amazon Kinesis, Storm, Flume, Kafka, and Informatica Vibe Data Stream.
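For a sense of how such a pipeline is fed, here is a minimal sketch using the kafka-python client to publish a semi-structured IoT-style event to Kafka, one of the options above; the broker address, topic name, and event shape are illustrative assumptions.

```python
# Minimal sketch: stream a semi-structured event into Kafka.
# Broker address, topic, and event fields are hypothetical.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# An IoT-style smart-meter reading; fields may vary between devices.
event = {"device_id": "meter-42", "ts": "2021-02-01T12:00:00Z", "kwh": 1.37}
producer.send("smart-meter-readings", event)

producer.flush()  # block until the event is actually delivered
producer.close()
```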
Data lakes also provide optimised processing mechanisms via APIs or SQL-like languages for transforming raw data with “schema on read” functionality. Once data has landed in a data lake, it needs to be ingested and prepared for analysis. Tableau has partners like Informatica, Alteryx, Trifacta, and Datameer that help with this process and work fluidly with Tableau. Alternatively, for self-service data prep, you can use Tableau Prep.
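“Schema on read” is easiest to see in code. The PySpark sketch below reads raw JSON that landed in the lake with no declared schema, infers one at query time, and writes an analysis-ready output; the lake paths and the event_date field are hypothetical.

```python
# Minimal sketch of "schema on read" with PySpark; the lake paths
# and the event_date field are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Spark infers the schema from the raw files as it reads them;
# nothing was declared when the data landed in the lake.
raw = spark.read.json("s3a://my-data-lake/raw/events/")
raw.printSchema()

# Write a prepared, analysis-ready output for downstream tools.
daily = raw.groupBy("event_date").count()
daily.write.mode("overwrite").parquet("s3a://my-data-lake/curated/daily_counts/")
```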
A modern analytics platform like Tableau may be the key to unlocking big data’s potential through discovering insights, but it is still just one of the critical components of a complete big data platform architecture. Putting together an entire big data analytics pipeline can seem like a challenge.
The good news is that you don’t need to build out the whole ecosystem before you get started, nor do you need to integrate every single component for an entire strategy to get off the ground. Tableau fits nicely in the big data paradigm because it prioritises flexibility—the ability to move data across platforms, adjust infrastructure on demand, take advantage of new data types, and enable new users and use cases.
We believe that deploying a big data analytics solution shouldn’t dictate your infrastructure or strategy but should help you leverage the investments you’ve already made, including those with partner technologies within the broader big data ecosystem.
Storage and processing
Hadoop allows for low-cost storage and data archival, offloading old historical data from the data warehouse (DWH) into online cold stores in a modern analytics architecture. It is also used for IoT, data science, and unstructured analytics use cases. Tableau provides direct connectivity to all the major Hadoop distributions: Cloudera via Impala, Hortonworks via Hive, and MapR via Apache Drill.
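As an illustration of that offload pattern, the PySpark sketch below copies an old warehouse slice into low-cost Hadoop storage as partitioned Parquet; the JDBC URL, credentials, table, and paths are hypothetical, and a matching JDBC driver must be on Spark’s classpath.

```python
# Minimal sketch: offload cold warehouse data to Hadoop as Parquet.
# JDBC URL, credentials, table, and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dwh-offload").getOrCreate()

# Pull the historical slice out of the warehouse over JDBC.
old_orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dwh-host:5432/dwh")
    .option("dbtable", "(SELECT * FROM orders WHERE order_year < 2015) t")
    .option("user", "etl")
    .option("password", "secret")
    .load()
)

# Land it in the cold store, partitioned by year for cheap pruning later.
old_orders.write.partitionBy("order_year").parquet("hdfs:///cold/orders/")
```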
Snowflake is one example of a cloud-native SQL-based enterprise data warehouse with a native Tableau connector.
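For reference, querying Snowflake from code follows the same account/warehouse model the Tableau connector uses; here is a minimal sketch with the official snowflake-connector-python package, where every credential and object name is a placeholder.

```python
# Minimal sketch: query Snowflake with snowflake-connector-python.
# All credentials and object names below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="analyst",
    password="secret",
    warehouse="ANALYTICS_WH",  # compute scales independently of storage
    database="SALES",
    schema="PUBLIC",
)

cur = conn.cursor()
cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```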
Object stores, such as Amazon Web Services Simple Storage Service (S3), and NoSQL databases with flexible schemas can also be used as data lakes. Tableau supports Amazon’s Athena data service for connecting to Amazon S3 and offers various connectivity options for NoSQL databases. Examples of NoSQL databases that are often used with Tableau include, but are not limited to, MongoDB, Datastax, and MarkLogic.
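To show what querying an S3-backed lake looks like, here is a minimal boto3 sketch against Amazon Athena, the same serverless SQL service Tableau’s Athena connector talks to; the database, table, and results bucket are hypothetical.

```python
# Minimal sketch: query S3-resident data with Amazon Athena via boto3.
# The database, table, and results bucket are hypothetical.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

run = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "lake_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# Poll until the query finishes, then fetch the result rows.
qid = run["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=qid)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```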
The data science and engineering platform Databricks offers data processing on Spark, a popular engine for both batch-oriented and interactive, scale-out data processing. Through a native connector to Spark, you can visualise the results of complex machine learning models from Databricks in Tableau.
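A minimal sketch of that pattern, assuming hypothetical columns, paths, and a pre-existing analytics database: score data with a Spark ML model and persist the predictions as a SQL-visible table that Tableau can reach through its Spark connector.

```python
# Minimal sketch: score data with Spark ML and save predictions
# where a BI tool can query them. Columns, paths, and the
# "analytics" database are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("score-for-tableau").getOrCreate()

df = spark.read.parquet("s3a://my-data-lake/curated/customers/")
assembler = VectorAssembler(
    inputCols=["tenure", "monthly_spend"], outputCol="features"
)
train = assembler.transform(df)

model = LinearRegression(
    featuresCol="features", labelCol="lifetime_value"
).fit(train)
scored = model.transform(train).select("customer_id", "prediction")

# Saving as a SQL-visible table makes the scores queryable from BI tools.
scored.write.mode("overwrite").saveAsTable("analytics.clv_predictions")
```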
Query acceleration
How fast is the interactive SQL? SQL, after all, is the conduit for business users who want to use big data for faster, more repeatable KPI dashboards and exploratory analysis. This need for speed has driven the adoption of faster databases leveraging in-memory and massively parallel processing (MPP) technology like Exasol and MemSQL, Hadoop-based stores like Kudu, and technologies that enable faster queries through preprocessing, like Vertica. Together with SQL-on-Hadoop engines like Apache Impala, Hive LLAP, Presto, Phoenix, and Drill, and OLAP-on-Hadoop technologies like AtScale, Jethro Data, and Kyvos Insights, these query accelerators are further blurring the lines between traditional warehouses and the world of big data.
While no two enterprise architectures are the same, noting recurring patterns and what deployments share in common can help you strategise your own big data analytics platform. Here is what we’ve observed consistently in successful big data analytics architectures:
1. A storage layer: Your data strategy may necessitate multiple storage environments, but it should accommodate structured, semi-structured, and unstructured data.
2. Server and serverless compute engines: Some compute engines for data preparation and analytics, others for querying. The dynamic nature of serverless computing allows for more flexibility and elasticity, as there is no need to pre-allocate resources.
3. Support for volume, velocity and variety: This applies not just to the data itself, but also to its growing complexity and the number of use cases, some of which are yet to be discovered.
4. The right tool for the job: It’s essential to adapt the components of your architecture to address your unique data strategy, but it’s also critical to remain agile as business needs change.
5. Enterprise-level governance and security: While we haven’t gone into much detail in these areas, security and governance are foundational for ensuring scalability and proper use of your data.
6. Cost consciousness: Take cost into account when considering the necessary power and flexibility for your big data architecture. The cloud affords a lot of elasticity for growth, but you’ll want to consider the financial implications of your data storage and processing, concurrency, latency, analytics use cases, etc.
Source: https://www.tableau.com/sites/default/files/2021-02/EN_tableau_big_data_overview_whitepaper.pdf