
#1 2020-09-14 02:42:59

From: Germany, Uhldingen-Muhlhofen
Registered: 2020-09-14
Posts: 1

Build Reliable Data Lakes with Open Delta Lake

Create a central source of truth for data science, machine learning, and analytics.
This overview walks users through how to collect data into a data lake to serve different data use cases.
Learn how to ingest data into your data lake, manage its ETL and security, and enable access for downstream data teams working on ML and BI.

The Challenge.

BEFORE
Data silos from traditional data warehouses that cannot handle unstructured data, so additional systems are needed.
Complexity and cost of transferring data between multiple disparate data systems.
Proprietary data formats prevent direct data access with other tools and increase lock-in risk.
Non-SQL use cases require new copies of data for data science and machine learning.
Performance bottlenecks with data throughput slow down data team agility and productivity.
Increased cost and governance challenges from managing multiple copies of data and security models.

The Solution.

AFTER
Modern data lakes handling all structured and unstructured data in a central repository.
Cost effective pipelines to progressively refine reliable data through data lake tables.
Open data formats ensure data is accessible across all tools and teams, reducing lock-in risk.
SQL and ML together on your data lake with a single copy of data.
Fast data for downstream streaming analytics, data science exploration, and model training.
Build once, access many times across use cases for a consolidated administration and self-service.
Build and scale reliable data lakes.
Start with an open data strategy and open technologies.
Your data is a strategic asset.
Whether for decision making, key business processes, or direct revenue generation, data should be managed carefully.
The last thing you want is to have it locked inside a proprietary system or a closed data format that leaves you vulnerable to vendor pricing, contracts, or technology decisions.
Open data lakes ensure your data is always accessible, unlike traditional data warehouses.
Delta Lake is an open source storage layer that adds data reliability and performance to your data lake. It is built with open data formats and open APIs, and is hosted by the Linux Foundation.

Data can also be directly accessed with different tools and technologies.

Collect all the data in your company together.
Data shouldn’t be siloed in applications, databases, or file storage.
Start with a broad set of data ingestion capabilities to easily populate your data lake, including partner data integrations, auto loader from blob storage, idempotent copy command, and data source APIs.
Leverage the right approach for your architecture to land raw operational data from your systems into your central data lake on cost-effective cloud storage, without compromising data reliability or security.

Ensure data reliability for production data lakes.
Your data lake needs to be reliable in order for it to be trusted by downstream data scientists and data analysts.

Delta Lake is an open source storage layer for your existing data lake. It uses versioned Apache Parquet™ files and a transaction log to keep track of all data commits, which enables many reliability capabilities.
Maintain data integrity, even with multiple data pipelines concurrently reading and writing data to your data lake, with ACID transactions.

Ensure data types are correct and required columns are present with Schema Enforcement, and update these requirements over time with Schema Evolution.
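
To make the Schema Enforcement and Schema Evolution idea a bit more concrete, here is a rough sketch in plain Python. It is only an illustration of the concept and does not use the Delta Lake API; the names `SCHEMA`, `enforce_schema`, and `evolve_schema` and the example columns are made up for this post.

```python
from typing import Any, Dict

# Hypothetical table schema (illustration only): column name -> required type.
SCHEMA: Dict[str, type] = {"user_id": int, "country": str}


def enforce_schema(row: Dict[str, Any], schema: Dict[str, type]) -> Dict[str, Any]:
    """Return the row unchanged if it matches the schema, else raise ValueError."""
    missing = set(schema) - set(row)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    extra = set(row) - set(schema)
    if extra:
        raise ValueError(f"unexpected columns: {sorted(extra)}")
    for column, expected in schema.items():
        if not isinstance(row[column], expected):
            raise ValueError(f"column {column!r} expects {expected.__name__}")
    return row


def evolve_schema(schema: Dict[str, type], column: str, column_type: type) -> Dict[str, type]:
    """Return a new schema with one added column; the old schema is left untouched."""
    if column in schema:
        raise ValueError(f"column {column!r} already exists")
    return {**schema, column: column_type}
```

A write with a wrong type or a missing column is rejected before it reaches the table, and adding a column is an explicit step rather than something that happens by accident.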

Data lifecycle management for your data lake.
As your data lake grows, it becomes increasingly important to manage its data lifecycle.
Update, Merge, and Delete data from your data lake with DML commands, such as for GDPR compliance when user records need to be removed from all tables, or as part of a Change Data Capture process.

Revert to previous data versions with Time Travel for auditing, roll back, or reproducibility, such as for supporting the needs of downstream data teams or for ETL troubleshooting.
Maintain data lake performance and hygiene with optimize and vacuum commands, and manage your data across Azure and AWS for multi-cloud strategies.
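
The Time Travel idea above can be pictured as an append-only commit log that is replayed up to a chosen version. The following is a toy sketch in plain Python, not Delta Lake's actual implementation or API; the class `VersionedTable` and its methods are hypothetical names used only for illustration.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


class VersionedTable:
    """Toy append-only commit log; the list index of a commit is its version."""

    def __init__(self) -> None:
        self._log: List[Dict[str, List[Row]]] = []

    def commit(self, adds: Optional[List[Row]] = None,
               removes: Optional[List[Row]] = None) -> int:
        """Record one commit and return its version number."""
        self._log.append({"add": list(adds or []), "remove": list(removes or [])})
        return len(self._log) - 1

    def snapshot(self, version: Optional[int] = None) -> List[Row]:
        """Replay the log up to `version` (latest if None) and return the rows."""
        if version is None:
            version = len(self._log) - 1
        if not 0 <= version < len(self._log):
            raise IndexError(f"no such version: {version}")
        rows: List[Row] = []
        for entry in self._log[: version + 1]:
            rows = [r for r in rows if r not in entry["remove"]]
            rows.extend(entry["add"])
        return rows


table = VersionedTable()
v0 = table.commit(adds=[{"user_id": 1}, {"user_id": 2}])
v1 = table.commit(removes=[{"user_id": 2}])  # e.g. a user record deletion
current = table.snapshot()     # state after the delete
as_of_v0 = table.snapshot(v0)  # earlier state, before the delete
```

Because commits are only ever appended, reading an older version is just a matter of stopping the replay earlier, which is what makes auditing and rollback straightforward in this model.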

Enterprise ready administration, controls, and security.
Democratizing data access requires granular security controls and automation.
Without them, platform teams are forced into a losing decision: either make data openly accessible to everyone (and risk data security), or lock all data down (and stifle business productivity).
Architect a unified cloud security posture with a broad portfolio of administrative capabilities, including IAM roles, Access Controls, Encryption, and Audit Logging.

Databricks keeps data in your own cloud infrastructure account, not in a vendor owned account, ensuring your control of your data.
Leverage APIs and monitoring to scale administrative workflows and operations.

Databricks scales to multiple petabytes of data for sensitive business-critical use cases, with certifications including HIPAA, SOC2, PCI, and more.

Progressively and continuously refine data.
You can now leverage a “medallion” model to progressively refine your raw data into business-level aggregates as it streams between different data quality tables.
Raw data initially lands into a bronze table, which is then filtered, cleaned, and augmented into a silver table.
A final gold table holds business-level aggregates that can be readily accessed by business analysts and data scientists.
This multi-hop data processing approach brings many benefits, including data quality checkpoints for fault recovery, simple gold table reprocessing for new business logic or data, and a continuous data flow for complete and recent data.
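
As a rough picture of the bronze, silver, and gold hops described above, here is a small plain-Python sketch. It is a toy under stated assumptions: the record fields (`country`, `amount`) and the functions `to_silver` and `to_gold` are invented for this example, and real pipelines would run on tables rather than in-memory lists.

```python
from collections import defaultdict
from typing import Any, Dict, List


def to_silver(bronze: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Filter out incomplete raw records and normalize the rest."""
    silver = []
    for rec in bronze:
        if rec.get("amount") is None or not rec.get("country"):
            continue  # drop records missing a required field
        silver.append({
            "country": rec["country"].strip().upper(),
            "amount": float(rec["amount"]),
        })
    return silver


def to_gold(silver: List[Dict[str, Any]]) -> Dict[str, float]:
    """Build a business-level aggregate: total amount per country."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in silver:
        totals[rec["country"]] += rec["amount"]
    return dict(totals)


# Raw records as they might land in a bronze table (made-up sample data).
bronze = [
    {"country": " de ", "amount": "10.5"},
    {"country": "DE", "amount": 4.5},
    {"country": None, "amount": 3},
    {"country": "US", "amount": None},
    {"country": "us", "amount": 2},
]
gold = to_gold(to_silver(bronze))
```

Keeping the hops separate means the gold step can be rerun with new business logic from the silver data without touching the raw bronze records.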

Empower analysts with more complete and recent data.
SQL reporting, dashboarding, and BI can now be powered by the more complete and recent data offered by data lakes, given their faster data loading, increased data type flexibility, and cost effective cloud blob storage.
The “medallion” data refinement model gives data analysts the flexibility to not only consume business data aggregates, but also drill into the raw, unprocessed data for deeper analysis, such as for investigating anomalies.
Data analysts previously would have to switch to other systems and technologies to access these details.
BI visualization tools like Tableau, Power BI, Looker, or any other ODBC/JDBC compatible tool can be easily connected, and SQL analysis can easily expand into data science with the same data, platform, and governance.

Self-service for data scientists and machine learning engineers.
With complete, reliable, and secure data available in your data lake, your data teams are now ready to run exploratory data science experiments and build production ready machine learning models.
Integrated cloud-based collaborative notebooks with Python, Scala, and SQL make it easy for teams to share analysis and results.

Databricks Connect lets teams attach their preferred IDE or notebook, such as IntelliJ, Eclipse, PyCharm, RStudio, Visual Studio, Zeppelin, Jupyter, or other custom applications.
And manage the end-to-end machine learning lifecycle with MLflow for experiment tracking, model registry, production deployment and more.

Connect data applications to drive business operations.
Improve key business processes with the most complete business insights into your customers, products, markets, and more.
Data applications can leverage your data lake to power a wide variety of industry use cases.
Whether it’s personalizing customer experiences in media, optimizing prices in retail, fighting fraud in financial services, or drug discovery in life sciences, complete and reliable data in your data lake can power dozens of different streaming applications throughout your business.

Delta Lake is open format and open source, meaning that your data lake can be openly accessed by all of your applications and tools for all your business needs.

Migrate slow legacy systems to a modern cloud data lake.
You may already have a legacy data warehouse or an on-premises Hadoop data lake that cannot meet the growing demands of your data teams, with complex operations, unreliable data, or performance bottlenecks causing data initiatives to fail.
Migrate to a scalable, managed cloud data platform to increase productivity, cut costs, and create more value from your data.
Databricks has worked with many customers as part of their cloud journey to move workloads, transfer data, and manage change.

Customer Stories.

Comcast’s journey to building an agile data and AI platform at scale with Databricks

Learn about Comcast’s data and machine learning infrastructure built on the Databricks Unified Data Analytics Platform.
Comcast processes petabytes of content and telemetry data, with millions of transactions a second.
Their data lake with Delta Lake is used to train their ML models and improve the customer experience of their Emmy Award-winning service.

Ready to Get Started.
Sign up for your free account.

