<?xml version="1.0" encoding="UTF-8"?> <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"> <channel><title>OnDataEngineering</title><description>A collaborative site for independent, critical and technical thinking on the use cases, architectural patterns and technologies relating to the transformation and preparation of data for exploitation.</description><link>https://ondataengineering.github.io/</link><atom:link href="https://ondataengineering.github.io/feed.xml" rel="self" type="application/rss+xml" /> <item><title>The Mid Week News 18/09/2019</title><link>https://ondataengineering.github.io/blog/2019/09/18/the-mid-week-news/</link><pubDate>Wed, 18 Sep 2019 07:30:00 +0000</pubDate> <description> <p>Right - news time again.</p> <p>Remember, you can get daily news updates from our twitter feed (<a href="https://twitter.com/OnDataEng">@OnDataEng</a>)… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-calcite/">Apache Calcite</a> 1.21 is out - you might never have heard of it, but it’s probably being used by many of the data tools you use on a daily basis for query parsing and optimization - <a href="https://calcite.apache.org/news/2019/09/11/release-1.21.0/">link</a></li> </ul> <p>Other technology news:</p> <ul> <li>Cloudera have announced the release of Cloudera Streams Management - bundling their Kafka management console and replication tool - <a href="https://blog.cloudera.com/announcing-the-general-availability-of-cloudera-streams-management/">link</a></li> <li>Interested in Apache Samza? Hear about LinkedIn’s journey with it - <a href="https://www.infoq.com/news/2019/09/linkedin-apache-samza/">link</a></li> <li>From the ever reliable The Morning Paper - Procella, YouTube’s unified OLTP/OLAP (HTAP) database - <a href="https://blog.acolyer.org/2019/09/11/procella/">link</a></li> <li><a href="/technologies/influxdb/">InfluxDBCloud</a> (their time series database as a service 
on AWS, Azure and GCP) has hit 2.0 and has now gone serverless - <a href="https://www.influxdata.com/blog/influxdb-cloud-2-0-launches-as-a-serverless-platform-for-time-series-data/">link</a></li> <li><a href="/technologies/google-cloud-dataproc/">Google Cloud DataProc</a> now supports (in alpha) running your Spark jobs on Kubernetes (GKE) rather than YARN VMs - <a href="https://cloud.google.com/blog/products/data-analytics/alpha-access-to-cloud-dataproc-jobs-on-gke">link</a></li> <li>More on <a href="/technologies/google-cloud-dataproc/">Google Cloud DataProc</a> Spark on GKE from ZDNet - <a href="https://www.zdnet.com/article/google-announces-alpha-of-cloud-dataproc-for-kubernetes/">link</a></li> <li><a href="/technologies/qubole-data-service/">Qubole Data Service</a> is now available on <a href="/tech-vendors/google-cloud-platform/">Google Cloud Platform</a> if you’re looking for a cloud agnostic Hadoop as a service offering (that runs on Google Cloud) - <a href="https://www.qubole.com/blog/announcing-general-availability-of-qubole-on-google-cloud/">link</a></li> <li>Interested in replicating data between <a href="/technologies/apache-kafka">Kafka</a> clusters - Cloudera have a post on MirrorMaker 2 which is based on <a href="/technologies/apache-kafka/kafka-connect/">Kafka Connect</a> - <a href="https://blog.cloudera.com/a-look-inside-kafka-mirrormaker-2/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/09/18/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 11/09/2019</title><link>https://ondataengineering.github.io/blog/2019/09/11/the-mid-week-news/</link><pubDate>Wed, 11 Sep 2019 07:30:00 +0000</pubDate> <description> <p>Apologies - we’ve been off on holiday again, hence the radio silence. 
But we’re back, and with a big old news bump.</p> <p>Remember, you can get daily news updates from our twitter feed (<a href="https://twitter.com/OnDataEng">@OnDataEng</a>)… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/amazon-emr/">Amazon EMR</a> release 5.26 is out, with even better Spark performance</li> <li>And Amazon have also announced <a href="/technologies/amazon-emr/">Amazon EMR</a> 6.0, with support for Hadoop 3.1 and running Spark jobs in Docker containers</li> <li><a href="/technologies/apache-orc/">Apache ORC</a> 1.6 is out if you’re looking for columnar data storage on HDFS</li> <li><a href="/technologies/greenplum/">Greenplum</a> 6.0 is finally out if you’re looking for a mature shared nothing MPP database</li> <li><a href="/technologies/apache-carbondata/">Apache CarbonData</a> 1.6 is out if you’re looking for indexed storage of data on HDFS with support for batch inserts and updates</li> <li>Version 0.5 of the <a href="/technologies/apache-nifi/registry/">NiFi Registry</a> is out if you’re looking to configuration manage your NiFi flows</li> <li>Version 0.4 of <a href="/technologies/apache-myriad/">Apache Myriad</a> is out</li> <li><a href="/technologies/zenko/cloudserver/">Zenko CloudServer</a> has just released version 8.2</li> </ul> <p>Other technology news:</p> <ul> <li>Are you running an <a href="/technologies/apache-solr/">Apache Solr</a> version prior to 5.0 - if so there’s an XML bomb attack - <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-12401">link</a></li> <li>ApacheIoTDB - the Apache open source time series database focusing on IoT use cases has its first official release @ 0.8 - <a href="https://iotdb.apache.org/">link</a></li> <li>From the ever excellent The Morning Paper, a review of a paper that used “the TPC-H benchmark to assess Redshift, Redshift Spectrum, Athena, Presto, Hive, and Vertica to find out what works best and the trade-offs 
involved” - <a href="https://blog.acolyer.org/2019/08/30/choosing-a-cloud-dbms/">link</a></li> <li><a href="/technologies/elastic-cloud/">Elastic Cloud</a> is now available on <a href="/tech-vendors/microsoft-azure">Azure</a> - <a href="https://www.elastic.co/blog/elasticsearch-service-on-elastic-cloud-now-available-on-microsoft-azure">link</a></li> <li><a href="/technologies/confluent-open-source">Confluent Schema Registry</a> is now available as a cloud service in Confluent Cloud - <a href="https://www.confluent.io/blog/confluent-cloud-schema-registry-generally-available">link</a></li> <li>Looking for an open source <a href="/tech-categories/object-stores/">object store</a> - Datanami have the latest on MinIO - <a href="https://www.datanami.com/2019/08/26/minio-enjoying-role-in-emerging-cloud-architecture/">link</a></li> <li>From Datanami, <a href="/tech-vendors/cloudera/">Cloudera</a>’s Q2 results are better than expected - <a href="https://www.datanami.com/2019/09/06/cloudera-rebounds-in-q2-beats-expectations/">link</a></li> <li>StreamSets have announced StreamSetsTransformer - a graphical tool for creating <a href="/technologies/apache-spark/">Apache Spark</a> pipelines that’s part of their DataOps Platform - <a href="https://streamsets.com/blog/streamsets-transformer-is-here/">link</a></li> <li>Using <a href="/technologies/google-cloud-storage/">Google Cloud Storage</a> with <a href="/technologies/apache-hadoop">Hadoop</a> - Google have a new version of their Cloud Storage Connector for Hadoop out with a bunch of performance improvements and locking for directory modifications - <a href="https://cloud.google.com/blog/products/data-analytics/new-release-of-cloud-storage-connector-for-hadoop-improving-performance-throughput-and-more">link</a></li> <li>ApacheDolphinScheduler has just been accepted into the Apache Incubator - originally called Easy Scheduler, donated by Analysys, it’s a tool for distributed ETL scheduling - <a 
href="https://cwiki.apache.org/confluence/display/INCUBATOR/DolphinSchedulerProposal">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/09/11/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 14/08/2019</title><link>https://ondataengineering.github.io/blog/2019/08/14/the-mid-week-news/</link><pubDate>Wed, 14 Aug 2019 07:30:00 +0000</pubDate> <description> <p>It’s time for the mid week news again. Remember, you can get daily news updates from our twitter feed (<a href="https://twitter.com/OnDataEng">@OnDataEng</a>)… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li>Looking for a Kafka alternative, <a href="/technologies/pravega/">Pravega</a> has just had a 0.5 release</li> <li><a href="/technologies/apache-ranger/">Apache Ranger</a> 2.0 is out - be interesting to see what the convergence roadmap with ApacheSentry is</li> <li><a href="/technologies/hue/">Hue</a> 4.5 is out, with a bunch of improved SQL integrations and Kubernetes/Docker support</li> </ul> <p>Other technology news:</p> <ul> <li>From Datanami, (shock/horror) why Data Catalogs are important, and what some of the options are - <a href="https://www.datanami.com/2019/08/07/data-catalogs-seen-as-difference-makers-in-big-data/">link</a></li> <li>If you’re learning <a href="/technologies/apache-kafka/">Apache Kafka</a>, Confluent now have an entire site of tutorials for you - <a href="https://www.confluent.io/blog/announcing-apache-kafka-tutorials">link</a></li> <li>From <a href="/tech-vendors/amazon-web-services/">Amazon Web Services</a>, deploying data lakes using AWSLakeFormation - <a href="https://aws.amazon.com/blogs/big-data/building-securing-and-managing-data-lakes-with-aws-lake-formation/">link</a></li> <li>Via InfoQ, details on how Badoo handle 20 billion events per day, starring LiveStreamingDaemon (an open source replacement for Facebook’s Scribe), ORC, HDFS, Exasol, Spark and 
CubeDB (an in memory multi-key counter store) - <a href="https://www.infoq.com/news/2019/08/badoo-20-billion-events-per-day">link</a></li> <li>Couple from Datanami - firstly their latest thoughts on the future of <a href="/technologies/apache-hadoop">Hadoop</a> - <a href="https://www.datanami.com/2019/08/12/re-imagining-big-data-in-a-post-hadoop-world/">link</a></li> <li>And also from Datanami, an update on Lucidworks who provide <a href="/technologies/apache-solr/">Apache Solr</a> based products, and who’ve just secured a new funding round - <a href="https://www.datanami.com/2019/08/12/investors-search-for-lucidworks/">link</a></li> <li>More information on AWSLakeFormation, this time from InfoQ - <a href="https://www.infoq.com/news/2019/08/aws-lake-formation-ga/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/08/14/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 07/08/2019</title><link>https://ondataengineering.github.io/blog/2019/08/07/the-midweek-news/</link><pubDate>Wed, 07 Aug 2019 07:30:00 +0000</pubDate> <description> <p>Another big dump of new releases and product news this week. 
Remember, you can get daily news updates from our twitter feed (<a href="https://twitter.com/OnDataEng">@OnDataEng</a>)… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/elasticsearch/">Elasticsearch</a> and <a href="/technologies/elasticsearch-hadoop/">Elasticsearch-Hadoop</a> 7.3 are now out</li> <li>Alpha 2 of <a href="/technologies/elastic-cloud/">Elastic Cloud</a> on Kubernetes is also out - <a href="https://www.elastic.co/blog/announcing-elastic-cloud-on-kubernetes-eck-0-9-0-alpha-2">link</a></li> <li><a href="/technologies/cloudera-cdh">Cloudera Enterprise</a> 6.3 is out, with upgrades to Kafka, HBase, Impala and Kudu</li> <li><a href="/technologies/cloudera-altus/director/">Cloudera Altus Director</a> is also up to 6.3</li> <li><a href="/technologies/confluent-enterprise/">Confluent Enterprise</a> and <a href="/technologies/confluent-open-source/">Confluent Open Source</a> are up to 5.3, with support for Kubernetes and Role Based Access Control</li> <li><a href="/technologies/apache-beam/">Apache Beam</a> 2.14 is out</li> <li><a href="/technologies/apache-accumulo/">Apache Accumulo</a> 2.0 is out, with a new fluent API, Bulk Import API, custom table stats, prioritisable scan executors, and of course, official Docker support</li> <li><a href="/technologies/amazon-emr/">Amazon EMR</a> 5.25 is out, with a couple of Spark performance improvements</li> <li><a href="/technologies/streamsets-data-collector/">StreamSets Data Collector</a> 3.10 is out, with support for ingesting data directly from NiFi</li> <li>Delta Lake, the open sourced <a href="/technologies/databricks-delta/">Databricks Delta</a>, is up to 0.3 - <a href="https://databricks.com/blog/2019/08/02/announcing-delta-lake-0-3-0-release.html">link</a></li> </ul> <p>Other technology news:</p> <ul> <li>From ZDNet, DGraph - an open source graph database written in Go - has just received a funding round - <a 
href="https://www.zdnet.com/article/you-can-go-your-own-graph-database-way-dgraph-secures-115m-to-pursue-its-opinionated-path/">link</a></li> <li>If you’re interested in Brooklin, the open source tool from LinkedIn for moving streaming data around, InfoQ have a presentation for you - <a href="https://www.infoq.com/presentations/linkedin-streams-brooklin/">link</a></li> <li><a href="/tech-vendors/mapr/">MapR</a>’s long running will they/won’t they saga has come to an end with their purchase by HPE - Datanami ask why - <a href="https://www.datanami.com/2019/08/05/what-hpe-sees-in-mapr-technologies/">link</a></li> <li>From Cloudera, best practice for deploying <a href="/technologies/apache-hadoop/">Apache Hadoop</a>, including your Linux configuration - <a href="https://blog.cloudera.com/how-to-deploy-apache-hadoop-clusters-like-a-boss/">link</a></li> <li>An intro to Prefect, an evolution of <a href="/technologies/apache-airflow/">Apache Airflow</a> to support modern data applications - open source with commercial backing - <a href="https://medium.com/the-prefect-blog/why-not-airflow-4cfa423299c4">link</a></li> <li>Looks like <a href="/tech-vendors/microsoft-azure/">Microsoft</a> has purchased BlueTalon, which could mean better fine grained access control over your data ecosystem in Azure - <a href="https://blogs.microsoft.com/blog/2019/07/29/microsoft-acquires-bluetalon-simplifying-data-privacy-and-governance-across-modern-data-estates/">link</a></li> <li>Amazon have announced PartiQL, an evolution of SQL and SQL++ for querying relational, NoSQL and other structured data - <a href="https://aws.amazon.com/blogs/opensource/announcing-partiql-one-query-language-for-all-your-data/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/08/07/the-midweek-news/</guid> </item> <item><title>The Mid Week News 31/07/2019</title><link>https://ondataengineering.github.io/blog/2019/07/31/the-mid-week-news/</link><pubDate>Wed, 31 
Jul 2019 07:30:00 +0000</pubDate> <description> <p>Apologies - forgot to say I was going on holiday! Apologies for the lack of updates, and the bigger than usual update this week. Remember, you can get daily news updates from our twitter feed (<a href="https://twitter.com/OnDataEng">@OnDataEng</a>)… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/greenplum/">Greenplum</a> is up to release 5.12</li> <li><a href="/technologies/apache-knox/">Apache Knox</a> 1.3 is out, with support for <a href="/technologies/hue">Hue</a>, <a href="/technologies/cloudera-manager/">Cloudera Manager</a> and <a href="/technologies/apache-livy/">Apache Livy</a></li> <li><a href="/technologies/apache-solr/">Apache Solr</a> 8.2 is out</li> <li><a href="/technologies/cloudera-data-science-workbench/">Cloudera Data Science Workbench</a> 1.6 is out, with support for multiple editors including Jupyter, RStudio and PyCharm</li> <li><a href="/technologies/elastic-cloud/">Elastic Cloud</a> Enterprise 2.3 is out, with support for role based access control</li> </ul> <p>Other technology news:</p> <ul> <li>From Cloudera, details on YuniKorn, their new universal resource scheduler that can underpin both Kubernetes and Hadoop YARN resource scheduling - <a href="https://blog.cloudera.com/blog/2019/07/yunikorn-a-universal-resource-scheduler/">link</a></li> <li>From Datanami, thoughts on <a href="/tech-categories/object-stores/">Object Stores</a> and <a href="/tech-categories/hadoop-compatible-filesystems">ScaleOutFilesystems</a> - <a href="https://www.datanami.com/2019/07/17/object-and-scale-out-file-systems-fill-hadoop-storage-void/">link</a></li> <li>There’s a bunch of vulnerabilities announced for <a href="/technologies/apache-storm/">Apache Storm</a> - CVE-2018-11779; CVE-2018-1320 and CVE-2019-0202</li> <li><a href="/technologies/microsoft-azure-data-lake-store">Azure Data Lake Store</a> now supports Azure Blob Storage API as 
well as an HDFS compatible API - <a href="https://azure.microsoft.com/en-gb/blog/silo-busting-2-0-multi-protocol-access-for-azure-data-lake-storage/">link</a></li> <li>From Datanami, Ascend has launched with an Autonomous Dataflow Service - looks interesting, if anyone has any thoughts please share - <a href="https://www.datanami.com/2019/07/19/ascend-launches-from-stealth-with-19m/">link</a></li> <li>Redis now has a module that turns it into a <a href="/tech-categories/time-series-databases/">Time Series Database</a> - <a href="https://redislabs.com/blog/redistimeseries-ga-making-4th-dimension-truly-immersive/">link</a></li> <li><a href="/technologies/cloudera-cdh">CDH</a> now supports and will include <a href="/technologies/apache-phoenix/">Apache Phoenix</a> for SQL over HBase - <a href="https://blog.cloudera.com/blog/2019/07/apache-phoenix-for-cdh/">link</a></li> <li>Google Big Query now has a new user interface, GIS functions, persistent UDFs and more - <a href="https://cloud.google.com/blog/products/data-analytics/new-persistent-user-defined-functions-increased-concurrency-limits-gis-and-encryption-functions-and-more">link</a></li> <li>From the ODBMS Industry Watch blog, an interview on LeanXcale, a hybrid Transactional Analytical (HTAP) database - I’m still not sure what to make of these, thoughts welcomed - <a href="http://www.odbms.org/blog/2019/07/on-leanxcale-database-interview-with-patrick-valduriez-and-ricardo-jimenez-peris/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/07/31/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 17/07/2019</title><link>https://ondataengineering.github.io/blog/2019/07/17/the-mid-week-news/</link><pubDate>Wed, 17 Jul 2019 07:30:00 +0000</pubDate> <description> <p>News news news again. 
Remember, you can get daily news updates from our twitter feed (<a href="https://twitter.com/OnDataEng">@OnDataEng</a>)… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-kudu/">Apache Kudu</a> 1.10 is out, with table backup/restore, metadata sync with Hive Metastore, and native fine-grained authentication via ApacheSentry</li> <li><a href="/technologies/hortonworks-dataplane/data-lifecycle-manager/">Data Lifecycle Manager</a> (part of the DataPlane Platform) is up to 1.5 if you’re looking for a tool to replicate HDFS and Hive data between clusters</li> <li><a href="/technologies/hortonworks-dataplane/data-analytics-studio/">Data Analytics Studio</a> (part of the DataPlane Platform) is up to 1.3 if you’re looking for a tool to run and diagnose performance issues with Hive queries</li> </ul> <p>Other technology news:</p> <ul> <li><a href="/tech-vendors/cloudera/">Cloudera</a> have announced the licensing model for the new company - TLDR, they’re sticking with Apache and AGPL licences, sticking with the Apache foundation, and the ex-Cloudera commercial components will all be open sourced - <a href="http://vision.cloudera.com/our-commitment-to-open-source-software/">link</a></li> <li>From the Starburst blog, part sales pitch, but a good case for separation of storage and compute and keeping your architecture open - <a href="https://www.starburstdata.com/technical-blog/the-power-of-optionality-in-big-data/">link</a></li> <li>LinkedIn have open sourced Brooklin, their tool for replicating streaming data between streaming data stores and/or databases - we’ve added this to our list of <a href="/tech-categories/data-ingestion/">Data Ingestion</a> technologies - <a href="https://engineering.linkedin.com/blog/2019/brooklin-open-source">link</a></li> <li>Databricks Runtime 5.5 is out - <a 
href="https://databricks.com/blog/2019/07/16/announcing-databricks-runtime-5-5-and-runtime-5-5-for-machine-learning.html">link</a></li> <li>This looks really interesting - from Datanami, an intro to Dagster, an open source tool for creating data applications using a functional paradigm, with support for a range of languages and integrations out of the box - <a href="https://www.datanami.com/2019/07/16/dagster-emerges-to-simplify-data-app-development/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/07/17/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 10/07/2019</title><link>https://ondataengineering.github.io/blog/2019/07/10/the-mid-week-news/</link><pubDate>Wed, 10 Jul 2019 07:30:00 +0000</pubDate> <description> <p>It’s time for our weekly dose of the news. Remember, you can get daily news updates from our twitter feed (<a href="https://twitter.com/OnDataEng">@OnDataEng</a>)… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-arrow/">Apache Arrow</a> 0.14 is out</li> </ul> <p>Other technology news:</p> <ul> <li>From the ever excellent The Morning Paper - a SIGMOD’19 paper from <a href="/technologies/apache-beam">Beam</a>, <a href="/technologies/apache-calcite">Calcite</a> and <a href="/technologies/apache-flink">Flink</a> experts on unifying SQL over tables and streams - <a href="https://blog.acolyer.org/2019/07/03/one-sql-to-rule-them-all/">link</a></li> <li>Via The Register, it looks like Netezza is finally dead - discontinued by IBM - <a href="https://www.theregister.co.uk/2019/07/03/rip_netezza_ibms_fpgapowered_data_warehousing_dream/">link</a></li> <li>From The Register this time - it looks like <a href="/tech-vendors/mapr/">MapR</a> has missed its deadline for sale - <a href="https://www.theregister.co.uk/2019/07/04/mapr_misses_deadline_for_sale/">link</a></li> <li>From InfoQ, details on LinkedIn’s 
Brooklin that they use for moving streaming data around - apparently being open sourced soon - <a href="https://www.infoq.com/news/2019/07/brooklin-data-streaming-service">link</a></li> <li>Looking for an <a href="/technologies/apache-kafka/">Apache Kafka</a> alternative - Apache Pulsar has hit 2.4 - <a href="https://github.com/apache/pulsar/releases/tag/v2.4.0">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/07/10/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 03/07/2019</title><link>https://ondataengineering.github.io/blog/2019/07/03/the-mid-week-news/</link><pubDate>Wed, 03 Jul 2019 07:30:00 +0000</pubDate> <description> <p>News time again! Quieter this week. Remember, you can get daily news updates from our twitter feed (<a href="https://twitter.com/OnDataEng">@OnDataEng</a>)… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/alluxio/">Alluxio</a> 2.0 is out if you’re looking for a distributed analytics storage layer / cache</li> <li><a href="/technologies/apache-druid">Apache Druid</a> 0.15 is out if you’re looking for real time OLAP / star schema queries on real time data</li> <li><a href="/technologies/scality-ring/">Scality RING</a> 8 is out - object storage with support for hybrid cloud</li> </ul> <p>Other technology news:</p> <ul> <li>From Datanami - entity resolution and Senzing. 
If you haven’t heard of either it’s well worth a read - <a href="https://www.datanami.com/2019/06/27/inside-jeff-jonas-big-plan-to-democratize-entity-resolution/">link</a></li> <li>From ZDNet - an update on data catalogs, and specifically Waterline Data - <a href="https://www.zdnet.com/article/multi-cloud-data-catalogs-the-easy-way-using-metadata-and-machine-learning-by-waterline-data/">link</a></li> <li>Matt Turck has his 2019 Data &amp; AI thoughts out, along with his technology landscape picture that’s always worth a look - <a href="https://mattturck.com/data2019/">part 1</a>; <a href="https://mattturck.com/2019trends/">part 2</a></li> <li>Interested in a deep dive on the <a href="/technologies/apache-kylin/">Apache Kylin</a> OLAP engine - their blog has you covered - <a href="http://kylin.apache.org/blog/2019/07/01/deep-dive-real-time-olap/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/07/03/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 26/06/2019</title><link>https://ondataengineering.github.io/blog/2019/06/26/the-mid-week-news/</link><pubDate>Wed, 26 Jun 2019 07:30:00 +0000</pubDate> <description> <p>It’s time for our weekly news summary again, and it’s a bit of a bumper week. 
Remember, you can get daily news updates from our twitter feed (<a href="https://twitter.com/OnDataEng">@OnDataEng</a>)… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/greenplum/">Greenplum</a> 5.20 is out</li> <li><a href="/technologies/apache-kafka/">Apache Kafka</a> 2.3 is out</li> <li><a href="/technologies/apache-calcite/">Apache Calcite</a> 1.20 is out</li> <li><a href="/technologies/elasticsearch/">Elasticsearch</a> 7.2 is out</li> </ul> <p>Other technology news:</p> <ul> <li>More from Datanami on <a href="/tech-vendors/mapr/">MapR</a>, who are apparently close to finding a buyer for the company - <a href="https://www.datanami.com/2019/06/18/mapr-says-its-close-to-deal-to-sell-company/">link</a></li> <li><a href="/technologies/hue">Hue</a> now supports <a href="/technologies/apache-atlas/">Apache Atlas</a> as a metadata catalog backend - potentially worth a look if you’re an HDP user - <a href="http://gethue.com/realtime-catalog-search-with-hue-and-apache-atlas/">link</a></li> <li>From ZDNet, Data.world (an enterprise knowledge graph vendor) has bought Capsenta, who have a relational database to knowledge graph virtualization bridge - <a href="https://www.zdnet.com/article/data-world-joins-forces-with-capsenta-to-bring-knowledge-graph-based-data-management-and-consumer-grade-ui-to-the-enterprise/">link</a></li> <li>DatabricksRuntime 5.4 is out, including DataBricksConnect, auto optimisation of table layouts and support for the AWS Glue metastore - <a href="https://databricks.com/blog/2019/06/20/announcing-databricks-runtime-5-4.html">link</a></li> <li>AWSGlue now supports workflows - <a href="https://aws.amazon.com/about-aws/whats-new/2019/06/aws-glue-now-provides-workflows-to-orchestrate-etl-workloads/">link</a></li> <li>Qubole think that using <a href="/technologies/amazon-s3/">Amazon S3</a> Select (where filters on S3 queries are run server side) can speed up <a 
href="/technologies/apache-spark/">Apache Spark</a> processing by 2.9x - <a href="https://www.qubole.com/blog/amazon-s3-select-integration/">link</a></li> <li>Bloor have a Market Update on hybrid real-time operational/transactional and analytic processing - <a href="https://www.bloorresearch.com/research/hybrid-real-time-data-processing/">link</a></li> <li>SwiftStack have announced a Data Analytics storage solution built on SwiftStackStorage, SwiftStack1Space for multi-cloud federation and <a href="/technologies/alluxio/">Alluxio</a> for in memory caching and federation over HDFS - <a href="https://www.swiftstack.com/blog/2019/06/24/multi-cloud-data-lake-%e2%80%93-evolving-from-descriptive-to-predictive-to-cognitive-analytics/">link</a></li> <li><a href="/tech-vendors/cloudera/">Cloudera</a> have just run a preview of CDP, which they’re calling an Enterprise Data Cloud - <a href="http://vision.cloudera.com/cloudera-provides-first-look-at-cloudera-data-platform-the-industrys-first-enterprise-data-cloud/">link</a></li> <li>Datanami have their latest views on the future of <a href="/technologies/apache-hadoop">Hadoop</a> - <a href="https://www.datanami.com/2019/06/24/hitting-the-reset-button-on-hadoop/">link</a></li> <li>Oracle have rationalised their analytics products under a single brand - ZDNet has the scoop - <a href="https://www.zdnet.com/article/oracle-analytics-honing-18-products-down-to-a-single-brand/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/06/26/the-mid-week-news/</guid> </item> <item><title>DataKitchen DataOps Platform</title><link>https://ondataengineering.github.io/technologies/datakitchen-devops-platform/</link><pubDate>Tue, 25 Jun 2019 08:00:00 +0000</pubDate> <description> <p>Platform to enable adoption of DataOps practices for data engineering, science, and analytic teams. 
These DataOps practices combine ideas in Agile development, DevOps, statistical process control, data science model deployment, and test data automation through a series of steps in a collaborative workflow. Within DataKitchen's product, users start with "Kitchens." Each Kitchen represents a place to do work: production environments, development sandboxes, etc. Kitchens are a collection of the data, data stores, tools (ETL, data science, visualization), code or configuration used by those tools, git branch, and the necessary servers and software. These collections can be created, merged, or shut down. A Kitchen can also be a current environment already available in the organization. Kitchens can be individual or shared across groups. When working in a Kitchen, team members create and run "Recipes." Each Recipe is a directed graph of steps. A Recipe represents the workflow pipeline used to deliver analytics: acquire data, transform data, call a machine learning model, and visualize data. A Recipe utilizes the tools that DataKitchen's customers already own. As Recipes are running, tests are embedded in the Recipe to detect values, ranges, distribution, frequency, implied and enforced integrity, and other business based checks on data or processing. "Order" metadata is created that is not just about lineage and descriptors of the data and jobs, but also includes statistics such as wall-clock time, processing requirements, test data outputs, and more. Alerts are delivered if selected tests fail. If not found, data and process errors reduce the business users' trust. DataKitchen's customers see a meaningful reduction in the number of data errors, incorrect results, and late deliveries. Another DataKitchen goal is to reduce the time it takes to move changes from development into production. When business customers request new changes, DataKitchen can continually and automatically deploy those changes. 
A challenge is to make sure that those changes do not cause regressions, functional errors, or performance problems in production. The embedded Recipe tests serve a dual role in resolving this problem. In production environments, those tests provide surveillance and alerts, but in development, those tests make sure that any change in code or configuration does not cause a problem further down the pipeline. DataKitchen allows users on different teams and locations to collaborate. One method is to use "Ingredients" to create sharable, reusable, component services. Multiple Recipes can call Ingredients, and they have a standard, function-like API. The platform was designed and implemented for secure multi-tenant, multi-cloud, and multi-environment deployments. Finally, interfacing with DataKitchen is supported by a user interface, command line or APIs. This is important because DataKitchen can accept data and metadata from other processes you may already have in place. The DataKitchen DataOps Platform is a commercial product, available as a managed service with optional on-prem agent installation, and was first released in 2014.</p> <!-- Tech Category metadata --> <h2>Technology Information</h2> <table> <tbody> <tr><td>Type</td><td>Commercial</td></tr> <tr><td>Last Updated</td><td>June 2019</td></tr> </tbody> </table> <h2 id="product-links">Product Links</h2> <ul> <li><a href="https://www.datakitchen.io/platform.html">https://www.datakitchen.io/platform.html</a> - homepage</li> <li><a href="https://datakitchen.readme.io/docs">https://datakitchen.readme.io/docs</a> - documentation</li> </ul> <h2 id="dataops-links">DataOps Links</h2> <ul> <li><a href="https://www.dataopsmanifesto.org/">https://www.dataopsmanifesto.org/</a> - DataOps Manifesto</li> <li><a href="https://www.datakitchen.io/dataops-cookbook-main.html">https://www.datakitchen.io/dataops-cookbook-main.html</a> - DataOps Cookbook</li> <li><a 
href="https://paper.li/datakitchen_io/1525109892#/">https://paper.li/datakitchen_io/1525109892#/</a> - DataOps News</li> <li><a href="https://medium.com/data-ops">https://medium.com/data-ops</a> - DataOps Blog</li> <li><a href="https://www.youtube.com/playlist?list=PLVbsAdgZXvtyy6HVKCP0HChjCcq2oW3eK">https://www.youtube.com/playlist?list=PLVbsAdgZXvtyy6HVKCP0HChjCcq2oW3eK</a> - DataOps Videos</li> <li><a href="https://www.slideshare.net/datakitchen99">https://www.slideshare.net/datakitchen99</a> - DataOps SlideShare</li> </ul> <h2 id="news">News</h2> <ul> <li><a href="https://www.datakitchen.io/blog">https://www.datakitchen.io/blog</a> - DataKitchen blog</li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/technologies/datakitchen-devops-platform/</guid> </item> <item><title>The Mid Week News 19/06/2019</title><link>https://ondataengineering.github.io/blog/2019/06/19/the-mid-week-news/</link><pubDate>Wed, 19 Jun 2019 07:30:00 +0000</pubDate> <description> <p>Once again with the weekly news - will it ever stop! Remember, you can get daily news updates from our twitter feed (<a href="https://twitter.com/OnDataEng">@OnDataEng</a>)… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-atlas/">Apache Atlas</a> 1.2 is out with a bunch of minor updates to the 1.x line. 
2.0 came out last month if you’re looking for the latest version</li> <li><a href="/technologies/amazon-emr/">Amazon EMR</a> 5.24 is out, with new versions of Flink, Presto and Hue, and Spark performance improvements</li> <li><a href="/technologies/apache-bigtop/">Apache Bigtop</a> 1.4 is out, with a Hadoop bump to 2.8.5 and a new integration test framework</li> </ul> <p>Other technology news:</p> <ul> <li>From Datanami - Attunity and WANdisco LiveMigrator, and how they can help you migrate your data to the cloud - <a href="https://www.datanami.com/2019/06/12/great-cloud-migration-opens-data-opportunities/">link</a></li> <li>GridGain, commercial vendors for <a href="/technologies/apache-ignite/">Apache Ignite</a>, now include a Hadoop Data Lake Accelerator in their commercial products, giving in-memory bi-directional caching of your Hadoop data - <a href="https://www.gridgain.com/resources/blog/gridgain-data-lake-accelerator-released-today">link</a></li> <li>Datanami have some thoughts on Qubole Quantum, their new serverless query engine based on <a href="/technologies/presto/">Presto</a> - <a href="https://www.datanami.com/2019/06/13/serverless-sql-engine-targets-cloud-analytics/">link</a></li> <li>Starburst have announced that <a href="/technologies/presto/">Presto</a> now supports <a href="/technologies/databricks-delta/">Databricks Delta</a>, the newly open sourced data storage for Spark and Hadoop that supports ACID transactions - <a href="https://www.starburstdata.com/technical-blog/starburst-presto-databricks-delta-lake-support/">link</a></li> <li><a href="/tech-vendors/mapr/">MapR</a> have an update (that doesn’t say anything particularly new) on their financial situation - <a href="https://mapr.com/blog/mapr-update-june-13/">link</a></li> <li>From Datanami, how Teradata has moved from hardware to software, and from perpetual licences to subscription licences - <a
href="https://www.datanami.com/2019/06/14/teradata-turns-40-takes-off-gloves-readies-for-a-fight/">link</a></li> <li>From Solutions Review, a summary of Gartner’s recent cool vendors for Data Management (DataDDO, DataKitchen, Panoply) - <a href="https://solutionsreview.com/data-management/gartner-names-3-cool-vendors-in-data-management-for-2019/">link</a></li> <li>Databricks have announced Databricks Connect, a library that allows seamless execution of <a href="/technologies/apache-spark">Spark</a> code from notebooks, IDEs and custom apps, with execution happening on a remote cluster - <a href="https://databricks.com/blog/2019/06/14/databricks-connect-bringing-the-capabilities-of-hosted-apache-spark-to-applications-and-microservices.html">link</a></li> <li>From ZDNet, <a href="/technologies/microsoft-azure-data-lake-store">Azure Data Lake Store</a> and <a href="/technologies/microsoft-azure-blob-storage">Azure Blob Storage</a> are now both supported by Okera if you’re looking for more granular data access controls - <a href="https://www.zdnet.com/article/azure-data-lake-storage-gets-okera-security-and-governance-platform-support/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/06/19/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 12/06/2019</title><link>https://ondataengineering.github.io/blog/2019/06/12/the-mid-weeks-news/</link><pubDate>Wed, 12 Jun 2019 07:30:00 +0000</pubDate> <description> <p>Time for the weekly news again. 
Remember, you can get daily news updates from our twitter feed (<a href="https://twitter.com/OnDataEng">@OnDataEng</a>)… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/streamsets-data-collector/">StreamSets Data Collector</a> 3.9 is out</li> <li><a href="/technologies/apache-beam/">Apache Beam</a> 2.13 is out</li> </ul> <p>Other technology news:</p> <ul> <li>From Datanami - looks like both Tom Reilly and Mike Olson will be retiring from <a href="/tech-vendors/cloudera/">Cloudera</a> this summer - <a href="https://www.datanami.com/2019/06/06/cloudera-ceo-reilly-to-retire-after-poor-1q-results/">link</a></li> <li>Qubole have a new serverless <a href="/technologies/presto/">Presto</a> offering out called Quantum - pay per query and let them manage the Presto cluster - <a href="https://www.qubole.com/blog/technical-overview-quantum-serverless-engine/">link</a></li> <li>Interested in Hadoop <a href="/technologies/apache-hadoop/ozone/">Ozone</a> - the project has a new blog post on securing it - <a href="https://blogs.apache.org/ozonesecurity/entry/security-in-apache-hadoop-ozone">link</a></li> <li>Cloudera have a bunch of performance metrics out relating to Erasure Coding in <a href="/technologies/apache-hadoop/hdfs/">HDFS</a> - <a href="https://blog.cloudera.com/blog/2019/06/hdfs-erasure-coding-in-production/">link</a></li> <li>From Datanami, an excellent article on the current state of the <a href="/technologies/apache-hadoop">Hadoop</a> market, and in particular <a href="/tech-vendors/cloudera/">Cloudera</a> and <a href="/tech-vendors/mapr/">MapR</a> - <a href="https://www.datanami.com/2019/06/10/hadoop-struggles-and-bi-deals-whats-going-on/">link</a></li> <li>Bloor have a new paper out on Test Data Management, looking at the different methodologies - <a href="https://www.bloorresearch.com/research/test-data-management-2/">link</a></li> </ul> </description> <guid
isPermaLink="true">https://ondataengineering.github.io/blog/2019/06/12/the-mid-weeks-news/</guid> </item> <item><title>The Mid Week News 05/06/2019</title><link>https://ondataengineering.github.io/blog/2019/06/05/the-mid-week-news/</link><pubDate>Wed, 05 Jun 2019 07:30:00 +0000</pubDate> <description> <p>If you’ve been following us on twitter (<a href="https://twitter.com/OnDataEng">@OnDataEng</a>) then you’ll have seen all this news already.</p> <p>If not - here’s your new weekly news… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/qubole-data-service/">Qubole Data Service</a> R56 is out for AWS, Azure and Oracle</li> <li>Hortonworks (Cloudera) <a href="/technologies/hortonworks-dataplane/data-steward-studio/">Data Steward Studio</a> 1.5 is out</li> <li><a href="/technologies/apache-storm/">Apache Storm</a> has a big 2.0 release out</li> </ul> <p>Other technology news:</p> <ul> <li>From the code972 blog, thoughts on the AWS Open Distro for Elasticsearch from someone who knows what they’re talking about. 
See also his posts on why you shouldn’t use AWS Elasticsearch Service, and running <a href="/technologies/elasticsearch/">Elasticsearch</a> on Kubernetes - <a href="https://code972.com/blog/2019/03/116-dont-confuse-awss-open-distro-for-elasticsearch-with-altruism">link</a></li> <li>Amazon Managed Streaming for Apache Kafka is now generally available - <a href="https://aws.amazon.com/about-aws/whats-new/2019/05/amazon_managed_streaming_for_apache_kafka_amazon_msk_is_now_generally_available/">link</a>; <a href="https://www.infoq.com/news/2019/06/aws-managed-kafka-ga/">InfoQ</a></li> <li>I always thought their tech looked pretty interesting, but it looks like <a href="/tech-vendors/mapr/">MapR</a> are having financial troubles, although MapR have published a response - <a href="https://www.datanami.com/2019/05/30/after-funding-falls-through-mapr-seeks-a-buyer-to-avoid-shut-down/">Datanami</a>; <a href="https://mapr.com/blog/an-update-from-mapr/">MapR</a></li> <li>Are you into or looking at <a href="/technologies/apache-beam/">Apache Beam</a>? They’ve got a bunch of new Katas out if you’re looking to learn it - <a href="https://beam.apache.org/blog/2019/05/30/beam-kata-release.html">link</a></li> <li>From Qubole - scaling <a href="/technologies/apache-hive/">Apache Hive</a> through the use of a dedicated cluster for running HiveServer2 - <a href="https://www.qubole.com/blog/increase-scalability-of-hiveserver2/">link</a></li> <li>From Datanami - the latest on Snowflake, the cloud native analytical database - <a href="https://www.datanami.com/2019/05/29/snowflake-rides-cloud-wave-to-great-heights/">link</a></li> <li>From Qubole, and normally I’d not link to something like this, but this feels like a reasonable stab at the difference between a Data Lake and a Data Warehouse - <a href="https://www.qubole.com/blog/data-lakes-vs-data-warehouses/">link</a></li> <li>From Datanami - how IBM is Turning DB2 into an AI database - <a
href="https://www.datanami.com/2019/06/03/how-ibm-is-turning-db2-into-an-ai-database/">link</a></li> <li>From Cloudera, part 2 of their blog on the introduction of attribute based access control to <a href="/technologies/apache-solr/">Apache Solr</a> - <a href="https://blog.cloudera.com/blog/2019/06/cdh6-2-cloudera-search-attribute-based-access-control-part-2/">link</a></li> <li>From Elastic - how to secure your <a href="/technologies/elasticsearch/">Elasticsearch</a> cluster using all the security features now available for free - <a href="https://www.elastic.co/blog/tips-to-secure-elasticsearch-clusters-for-free-with-encryption-users-and-more">link</a></li> <li>Snowflake on <a href="/tech-vendors/google-cloud-platform/">Google Cloud Platform</a> has been announced - <a href="https://cloud.google.com/blog/products/data-analytics/announcing-snowflake-on-google-cloud-platform">link</a></li> <li>From Solutions Review, cloud data integrator Matillion has secured a bunch of VC funding. Anyone have any experience with this, and happy to write us a tech summary? <a href="https://solutionsreview.com/data-integration/matillion-nabs-series-c-funding-for-cloud-data-warehouse-integration/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/06/05/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 29/05/2019</title><link>https://ondataengineering.github.io/blog/2019/05/29/the-mid-week-news/</link><pubDate>Wed, 29 May 2019 07:30:00 +0000</pubDate> <description> <p>It’s that time of the week again, so let’s look at the news.</p> <p>Next week however, we’re going to try something slightly different. We’ll do the news daily (ish) over our twitter account (<a href="https://twitter.com/OnDataEng">@OnDataEng</a>) for those that like Twitter, with this weekly post then summarising it for those that like RSS/newsletters, with the format of this weekly post likely to change over time. 
<!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-nifi/registry/">NiFi Registry</a> has hit 0.4</li> </ul> <p>Other technology news:</p> <ul> <li>Cloudera have a post on Conjunctive access control groups in <a href="/technologies/apache-solr">Solr</a> document level access control - <a href="https://blog.cloudera.com/blog/2019/05/cdh6-2-cloudera-search-attribute-based-access-control-part-1/">link</a></li> <li>Firstly from Confluent - <a href="/tech-categories/schema-registries/">Schema Registries</a> and why you need them, and an intro to <a href="/technologies/confluent-open-source">Confluent Schema Registry</a> - <a href="https://www.confluent.io/blog/schemas-contracts-compatibility">link</a></li> <li>And as a follow up from Confluent - 17 ways to mess up your <a href="/technologies/confluent-open-source">Confluent Schema Registry</a> - <a href="https://www.confluent.io/blog/17-ways-to-mess-up-self-managed-schema-registry">link</a></li> <li>From The Morning Paper - improving memory compression by compressing objects not cache lines - <a href="https://blog.acolyer.org/2019/05/24/zippads/">link</a></li> <li>Via Solutions Review, Syncsort DMX Change Data Capture is now Syncsort Connect CDC - <a href="https://blog.syncsort.com/2019/05/big-data/connect-cdc-real-time-streaming-data/">link</a>; <a href="https://www.syncsort.com/en/products/Connect-CDC">home page</a></li> <li>Autoscale is now in preview with <a href="/technologies/azure-hdinsight/">Azure HDInsight</a> - <a href="https://azure.microsoft.com/en-gb/blog/drive-higher-utilization-of-azure-hdinsight-clusters-with-autoscale/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/05/29/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 22/05/2019</title><link>https://ondataengineering.github.io/blog/2019/05/22/the-mid-week-news/</link><pubDate>Wed, 22 May 2019 07:30:00 
+0000</pubDate> <description> <p>Once again, it’s time for the mid week news… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-atlas/">Apache Atlas</a> has hit 2.0</li> <li><a href="/technologies/apache-avro/">Apache Avro</a> has hit 1.9</li> <li><a href="/technologies/apache-solr/">Apache Solr</a> has hit 8.1</li> <li><a href="/technologies/greenplum/">Greenplum</a> has hit 5.19</li> <li><a href="/technologies/elasticsearch/">Elasticsearch</a> and <a href="/technologies/elasticsearch-hadoop/">Elasticsearch Hadoop</a> have been bumped to 6.8 and 7.1</li> <li><a href="/technologies/greenplum/">Greenplum</a> have more details on their 6.0 release</li> </ul> <p>Other technology news:</p> <ul> <li>Let’s call this out explicitly, but the 6.8 and 7.1 Elastic releases are solely to make security (TLS and RBAC, which were open sourced not so long ago) free, presumably in response to the new Elastic Open Distro - <a href="https://www.elastic.co/blog/security-for-elasticsearch-is-now-free">link</a></li> <li>Also from Elastic, they’ve announced Elastic Cloud on Kubernetes (ECK) - <a href="https://www.elastic.co/blog/introducing-elastic-cloud-on-kubernetes-the-elasticsearch-operator-and-beyond">link</a></li> <li>And from DataStax, they’ve announced Constellation, their <a href="/technologies/apache-cassandra">Cassandra</a> as a service offering - <a href="https://www.datastax.com/2019/05/datastax-announces-constellation-a-cloud-native-data-platform">link</a>; <a href="https://www.zdnet.com/article/datastax-constellation-apache-cassandra-as-a-service-announced/">ZDNet</a></li> <li>From ZDNet, the future of SAP Hana - <a href="https://www.zdnet.com/article/the-future-of-sap-hana/">link</a></li> <li>Latest updates from Microsoft on Cosmos DB - <a href="https://azure.microsoft.com/en-gb/blog/a-cosmonaut-s-guide-to-the-latest-azure-cosmos-db-announcements/">link</a></li> <li>And from Datanami - how 
ScyllaDB (an Apache Cassandra-compatible database) handles both OLAP and OLTP - <a href="https://www.datanami.com/2019/05/14/scylladb-gives-cohabitation-of-olap-and-oltp-a-shot/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/05/22/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 15/05/2019</title><link>https://ondataengineering.github.io/blog/2019/05/15/the-mid-week-news/</link><pubDate>Wed, 15 May 2019 07:30:00 +0000</pubDate> <description> <p>It feels like deja-vu, but it’s time for the news again… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-oozie/">Apache Oozie</a> is up to 0.4-alpha</li> <li><a href="/technologies/qubole-data-service/">Qubole Data Service</a> is up to R56 for AWS</li> </ul> <p>Other technology news:</p> <ul> <li>From the <a href="/technologies/apache-flink/">Apache Flink</a> blog - using temporal tables in their Streaming SQL and Table API - <a href="https://flink.apache.org/2019/05/14/temporal-tables.html">link</a></li> <li>From Elastic - field aliases in <a href="/technologies/elasticsearch/">Elasticsearch</a> - <a href="https://www.elastic.co/blog/introducing-field-aliases-in-elasticsearch">link</a></li> <li>Also from Elastic this week - common performance issues with <a href="/technologies/elasticsearch/">Elasticsearch</a> and how to fix them - <a href="https://www.elastic.co/blog/advanced-tuning-finding-and-fixing-slow-elasticsearch-queries">link</a></li> <li>From Cloudera - information on MirrorMaker 2, replicating data between <a href="/technologies/apache-kafka">Kafka</a> instances based on Kafka Connect - <a href="https://blog.cloudera.com/blog/2019/05/kafka-replication-the-case-for-mirrormaker-2/">link</a></li> <li>And also from Cloudera, a couple of <a href="/technologies/apache-hadoop">Hadoop</a> / <a href="/technologies/apache-hive">Hive</a> posts on managing small files 
and partitions in Hive - <a href="https://blog.cloudera.com/blog/2019/05/small-files-big-foils-addressing-the-associated-metadata-and-application-challenges/">small files</a>; <a href="https://blog.cloudera.com/blog/2019/05/partition-management-in-hadoop/">partition management</a></li> <li>From Microsoft - introducing Azure SQL Database Edge - <a href="https://azure.microsoft.com/en-gb/blog/azure-sql-database-edge-enabling-intelligent-data-at-the-edge/">link</a>; <a href="https://www.infoq.com/news/2019/05/Azure-SQL-Database-Edge">InfoQ</a></li> <li>And more from Microsoft - new performance and capability fixes for Azure SQL Data Warehouse - <a href="https://azure.microsoft.com/en-gb/blog/azure-sql-data-warehouse-releases-new-capabilities-for-performance-and-security/">link</a></li> <li>ZDNet and Datanami have updates on Confluent Cloud - <a href="https://www.zdnet.com/article/confluent-makes-apache-kafka-cloud-native/">ZDNet</a>; <a href="https://www.datanami.com/2019/05/13/kafka-in-the-cloud-who-needs-clusters-anyway/">Datanami</a></li> <li>From LinkedIn, an introduction to Ambry - their “distributed, highly available and horizontally scalable immutable object store optimized to store and serve media” - <a href="https://engineering.linkedin.com/blog/2019/05/introducing-data-compaction-in-ambry">link</a></li> <li>From Confluent, a couple of posts on managing multi-clusters with Confluent Control Centre and consuming data from <a href="/technologies/apache-kafka">Kafka</a> - <a href="https://www.confluent.io/blog/dawn-of-kafka-devops-managing-multi-cluster-kafka-connect-and-ksql-with-confluent-control-center">managing multi-clusters</a>; <a href="https://www.confluent.io/blog/apache-kafka-data-access-semantics-consumers-and-membership">data access</a></li> <li>And lastly, from Influx Data - getting started with <a href="/technologies/influxdb/">InfluxDB</a> 2.0 and announcing their InfluxDB Cloud 2.0 Beta Free Tier - <a 
href="https://www.influxdata.com/blog/getting-started-with-influxdb-2-0-scraping-metrics-running-telegraf-querying-data-and-writing-data/">getting started</a>; <a href="https://www.influxdata.com/blog/announcing-influxdb-cloud-2-0-beta-free-tier/">influxdb cloud free tier</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/05/15/the-mid-week-news/</guid> </item> <item><title>Google Cloud Composer</title><link>https://ondataengineering.github.io/technologies/google-cloud-composer/</link><pubDate>Tue, 14 May 2019 00:00:00 +0000</pubDate> <description> <p>Managed workflow orchestration service built on Apache Airflow that's designed for running data integration tasks on a repeated schedule. Built on a microservice architecture: the Airflow database and web server run on App Engine, with access protected using Identity-Aware Proxy (an enterprise security model that enables employees to work from untrusted networks without the use of a VPN), while the scheduler, executor and worker nodes run on Kubernetes Engine. Integrated with Cloud Storage for staging DAGs, plugins and data dependencies, and with Stackdriver for real-time logging and monitoring of Airflow service and workflow logs. Manageable via web interfaces (Cloud Platform Console and Airflow), a command line interface (Cloud SDK), or RPC and REST APIs. Allows custom Airflow plugins and Python dependencies from the Python Package Index to be installed. 
Priced at an hourly rate (charged per minute) based on the size of a Cloud Composer environment, which is in addition to any Kubernetes Engine, Compute Engine or Persistent Disk and network egress charges.</p> <!-- Tech Category metadata --> <h2>Technology Information</h2> <table> <tbody> <tr><td>Other Names</td><td>Cloud Composer</td></tr> <tr><td>Type</td><td>Commercial</td></tr> <tr><td>Last Updated</td><td>May 2019 - v1.3</td></tr> </tbody> </table> <h2>Related Technologies</h2> <table> <tbody> <tr><td>Packages</td><td><a href="/technologies/apache-airflow/">Apache Airflow</a></td></tr> </tbody> </table> <h2 id="links">Links</h2> <ul> <li><a href="https://cloud.google.com/composer/">https://cloud.google.com/composer/</a> - homepage</li> <li><a href="https://cloud.google.com/composer/docs/concepts/versioning/composer-versions">https://cloud.google.com/composer/docs/concepts/versioning/composer-versions</a> - bundle services version list</li> <li><a href="https://cloud.google.com/composer/docs/release-notes">https://cloud.google.com/composer/docs/release-notes</a> - release notes</li> <li><a href="https://cloud.google.com/composer/docs/">https://cloud.google.com/composer/docs/</a> - documentation</li> <li><a href="https://github.com/GoogleCloudPlatform/python-docs-samples/">https://github.com/GoogleCloudPlatform/python-docs-samples/</a> - sample Composer workflows</li> </ul> <h2 id="news">News</h2> <p>See <a href="/tech-vendors/google-cloud-platform/">Google Cloud Platform</a> updates</p> </description> <guid isPermaLink="true">https://ondataengineering.github.io/technologies/google-cloud-composer/</guid> </item> <item><title>Apache Airflow</title><link>https://ondataengineering.github.io/technologies/apache-airflow/</link><pubDate>Tue, 14 May 2019 00:00:00 +0000</pubDate> <description> <p>A workflow management system designed for orchestrating repeated data integration tasks on a schedule, with workflows configured in Python as a Directed Acyclic Graph (DAG) of 
tasks. A scheduler is responsible for identifying tasks to be run, with an executor responsible for determining where tasks should run (with support for local execution or remote execution using Celery, Dask, Mesos and Kubernetes, with the ability to define custom executors). Supports periodic execution of workflows (based on a schedule interval), sensor operators (that wait until some condition is true, e.g. a file exists), automatic retry of failed tasks, catchup of historic task executions, task templating, triggers and complex dependencies, shared connection configuration, configurable job parallelism, variables that can be configured through the UI, re-usable sub DAGs and experimental support for data lineage with integration to Apache Atlas. Packaged with a wide variety of prebuilt 'operators' for data integration, covering databases (MySQL, PostgreSQL, Oracle), Hadoop (Hive, Pig, Sqoop) and cloud services (Amazon Web Services, Google Cloud Platform and Microsoft Azure services), with the ability to write your own. Comes with command line and web interfaces to manage and monitor workflows and perform administrative actions on the environment, plus an experimental REST API. Persists workflow management state and operational metadata in either a MySQL or PostgreSQL relational database, which can be queried using SQL via the web interface to create simple charts. Includes a security model with support for a range of authentication methods including LDAP, Kerberos (limited), OAuth and Google Authentication. Originally developed at Airbnb and donated to the Apache Foundation's incubator program in June 2015. Under active development with a wide range of contributors. 
Commercial support is available from a variety of vendors who distribute it as a standalone managed service (Astronomer and Google), to run on Kubernetes (Astronomer), or part of wider managed data service offering (Qubole).</p> <!-- Tech Category metadata --> <h2>Technology Information</h2> <table> <tbody> <tr><td>Other Names</td><td>Airflow</td></tr> <tr><td>Vendors</td><td><a href="/tech-vendors/apache/">The Apache Software Foundation</a>, <a href="/tech-vendors/google-cloud-platform/">Google Cloud Platform</a>, Astronomer</td></tr> <tr><td>Type</td><td>Commercial Open Source</td></tr> <tr><td>Last Updated</td><td>May 2019 - 1.10</td></tr> </tbody> </table> <h2>Related Technologies</h2> <table> <tbody> <tr><td>Is packaged by</td><td><a href="/technologies/qubole-data-service/">Qubole Data Service</a>, <a href="/technologies/google-cloud-composer/">Google Cloud Composer</a></td></tr> </tbody> </table> <h2 id="release-history">Release History</h2> <table> <tbody> <tr> <td>version</td> <td>release date</td> <td>release links</td> <td>release comment</td> </tr> <tr> <td>1.7</td> <td>2016-03-28</td> <td><a href="https://cwiki.apache.org/confluence/display/AIRFLOW/Announcements#Announcements-March28,2016">announcement</a></td> <td> </td> </tr> <tr> <td>1.8</td> <td>2017-03-19</td> <td><a href="https://cwiki.apache.org/confluence/display/AIRFLOW/Announcements#Announcements-March19,2017">announcement</a></td> <td> </td> </tr> <tr> <td>1.9</td> <td>2018-01-02</td> <td><a href="https://cwiki.apache.org/confluence/display/AIRFLOW/Announcements#Announcements-Jan2,2018">announcement</a></td> <td> </td> </tr> <tr> <td>1.10</td> <td>2018-08-20</td> <td><a href="https://cwiki.apache.org/confluence/display/AIRFLOW/Announcements#Announcements-Aug20,2018">announcement</a></td> <td> </td> </tr> </tbody> </table> <h2 id="links">Links</h2> <ul> <li><a href="https://airflow.apache.org">https://airflow.apache.org</a> - documentation</li> <li><a 
href="https://github.com/jghoman/awesome-apache-airflow">https://github.com/jghoman/awesome-apache-airflow</a> - a curated list of resources about Apache Airflow (incubating)</li> </ul> <h2 id="news">News</h2> <ul> <li><a href="https://cwiki.apache.org/confluence/display/AIRFLOW/Announcements">https://cwiki.apache.org/confluence/display/AIRFLOW/Announcements</a> - announcements</li> <li><a href="https://github.com/apache/incubator-airflow/releases">https://github.com/apache/incubator-airflow/releases</a> - details of releases</li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/technologies/apache-airflow/</guid> </item> <item><title>The Mid Week News 08/05/2019</title><link>https://ondataengineering.github.io/blog/2019/05/08/the-mid-week-news/</link><pubDate>Wed, 08 May 2019 07:30:00 +0000</pubDate> <description> <p>It’s news time again, and we’ve used a real date this time… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-drill/">Apache Drill</a> has hit 1.16</li> <li><a href="/technologies/amazon-emr/">Amazon EMR</a> is up to 5.23</li> <li><a href="/technologies/amazon-s3/">Amazon S3</a> now supports <a href="https://aws.amazon.com/blogs/aws/new-amazon-s3-batch-operations/">Batch Operations</a></li> </ul> <p>Other technology news:</p> <ul> <li>ZDNet have all the data updates from <a href="/tech-vendors/microsoft-azure/">Microsoft</a> Build - <a href="https://www.zdnet.com/article/microsoft-builds-its-data-story-in-the-cloud-and-at-the-edge/">link</a></li> <li>From Microsoft, details on the latest updates to Azure Data Factory and Azure SQL Data Warehouse - <a href="https://azure.microsoft.com/en-gb/blog/analytics-in-azure-remains-unmatched-with-new-innovations/">link</a></li> <li>From Google, how Google Cloud DataPrep (based on Trifacta) helps you manage Data Quality - <a
href="https://cloud.google.com/blog/products/gcp/improving-data-quality-for-machine-learning-and-analytics-with-cloud-dataprep">link</a></li> <li>Bloor have all the latest on Teradata Vantage, and their split of storage and compute - <a href="https://www.bloorresearch.com/2019/05/teradata-vantage/">link</a></li> <li>From Datanami - some people have got very large cloud bills - <a href="https://www.datanami.com/2019/05/02/cloud-analytics-proving-costly-for-some/">link</a></li> <li><a href="/tech-vendors/google-cloud-platform/">Google</a> have some best practices for data governance in the cloud - <a href="https://cloud.google.com/blog/products/data-analytics/principles-and-best-practices-for-data-governance-in-the-cloud">link</a></li> <li>Microsoft have announced high performance C# and F# <a href="/technologies/apache-spark">Spark</a> libraries - <a href="https://www.infoq.com/news/2019/04/microsoft-net-apache-spark">link</a></li> <li>eBay have announced Beam, an open source distributed RDF store - <a href="https://github.com/eBay/beam">link</a></li> <li>Confluent have a post on optimising Kafka Streams applications - <a href="https://www.confluent.io/blog/optimizing-kafka-streams-applications">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/05/08/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 31/04/2019</title><link>https://ondataengineering.github.io/blog/2019/05/01/the-mid-week-news/</link><pubDate>Wed, 01 May 2019 07:30:00 +0000</pubDate> <description> <p>One big announcement this week, but otherwise it’s another very quiet one… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-beam/">Apache Beam</a> has hit 2.12</li> </ul> <p>Other technology news:</p> <ul> <li>The big news this week is that Databricks have open sourced <a href="/technologies/databricks-delta/">Databricks Delta</a>, their transactional data 
lake storage layer that runs over HDFS/S3 - <a href="https://databricks.com/blog/2019/04/24/open-sourcing-delta-lake.html">announcement</a>; <a href="https://delta.io">home page</a>; <a href="https://www.datanami.com/2019/04/24/databricks-donates-delta-code-to-open-source">Datanami</a>; <a href="https://www.zdnet.com/article/a-standard-for-storing-big-data-apache-spark-creators-release-open-source-delta-lake/">ZDNet</a></li> <li>Datanami have an interesting article on Data Pipeline Automation, with a range of tools referenced that we should probably look at at some point - <a href="https://www.datanami.com/2019/04/24/data-pipeline-automation-the-next-step-forward-in-dataops">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/05/01/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 24/04/2019</title><link>https://ondataengineering.github.io/blog/2019/04/24/the-mid-week-news/</link><pubDate>Wed, 24 Apr 2019 07:30:00 +0000</pubDate> <description> <p>It’s a light week for the news, but let’s have a quick look anyway… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li>None this week</li> </ul> <p>Other technology news:</p> <ul> <li>ODBMS Industry Watch has an interview with Merv Adrian of Gartner on his views of the current database market, Hadoop and the Cloud - <a href="http://www.odbms.org/blog/2019/04/on-the-database-market-interview-with-merv-adrian/">link</a></li> <li>A nice deeply technical blog post from the <a href="/technologies/pravega/">Pravega</a> team on how it provides performant support for events of any size - <a href="http://blog.pravega.io/2019/04/22/events-big-or-small-bring-them-on/">link</a></li> <li>From Cloudera - fine grained access control on Impala over Kudu - <a href="https://blog.cloudera.com/blog/2019/04/fine-grained-authorization-with-apache-kudu-and-impala/">link</a></li> <li>From the Knoldus blog - are Knowledge Graphs 
(see our <a href="/tech-categories/graph-databases/">Graph Databases</a> tech category) the future of Data Lakes - <a href="https://blog.knoldus.com/are-knowledge-graphs-the-future-of-data-lakes/">links</a></li> <li>Microsoft have open sourced Data Accelerator for <a href="/technologies/apache-spark/">Apache Spark</a>, a tool for simplifying creating streaming pipelines - <a href="https://cloudblogs.microsoft.com/opensource/2019/04/16/microsoft-open-sources-data-accelerator-for-apache-spark/">link</a></li> <li>Google have a bunch of updates they’ve announced for <a href="/technologies/google-cloud-dataproc/">Google Cloud DataProc</a> - <a href="https://cloud.google.com/blog/products/data-analytics/new-open-source-tools-in-cloud-dataproc-process-data-at-cloud-scale">link</a></li> <li>From Elastic, what’s new in Lucene 8 - <a href="https://www.elastic.co/blog/whats-new-in-lucene-8">link</a></li> <li>From DataBricks, using Spark with <a href="/technologies/amazon-s3">S3</a> Select - <a href="https://databricks.com/blog/2019/04/17/running-peta-scale-spark-jobs-on-object-storage-using-s3-select.html">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/04/24/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 17/04/2019</title><link>https://ondataengineering.github.io/blog/2019/04/17/the-mid-week-news/</link><pubDate>Wed, 17 Apr 2019 07:30:00 +0000</pubDate> <description> <p>Tick tock, tick tock - it’s time for the news again, and there’s a big bunch of announcements this week… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/elasticsearch/">Elasticsearch</a> and <a href="/technologies/elasticsearch-hadoop/">Elasticsearch-Hadoop</a> have hit the big 7.0</li> <li><a href="/technologies/elastic-cloud/">Elastic Cloud</a> has hit 2.2</li> <li><a href="/technologies/azure-hdinsight/">Azure HDInsight</a> has hit 4.0 (Hadoop 3.x)</li> 
<li><a href="/technologies/qubole-data-service/">Qubole Data Service</a> now supports GCP</li> <li><a href="/technologies/amazon-emr/">Amazon EMR</a> has hit 5.22</li> <li><a href="/technologies/apache-flink/">Apache Flink</a> has hit 1.8</li> <li><a href="/technologies/apache-myriad/">Apache Myriad</a> has hit 0.3</li> <li><a href="/technologies/apache-kylin/">Apache Kylin</a> has a 3.0-alpha out</li> <li><a href="/technologies/influxdb/">InfluxDB</a> has hit 2.0 alpha 8</li> </ul> <p>Other technology news:</p> <ul> <li>Cloudera Flow Management and Cloudera Edge Management (part of the new Cloudera DataFlow, nee <a href="/technologies/hortonworks-data-flow">HDF</a>) are now generally available - <a href="http://vision.cloudera.com/announcing-the-general-availability-of-cloudera-flow-management-and-cloudera-edge-management/">link</a></li> <li>Also from Cloudera, an update on Workload XM, and its beautiful visualisations of Spark jobs - <a href="https://blog.cloudera.com/blog/2019/04/demystifying-spark-jobs-to-optimize-for-cost-and-performance/">link</a></li> <li>There’s a bunch of stuff out of the recent Google NEXT conference (thanks to longtime friend of the site Jeff for the heads-up) - <a href="https://cloud.google.com/blog/products/data-analytics/google-cloud-smart-analytics-accelerates-your-business-transformation">announcement</a>; <a href="https://www.datanami.com/2019/04/10/google-cloud-unveils-slew-of-new-data-management-and-analytics-services/">Datanami view</a> <ul> <li>Cloud Data Fusion (beta) - a managed version of Cask Data Application Platform (CDAP) - an open source framework for building data analytic applications, which Google acquired back in May 2018</li> <li>BigQuery BI Engine (beta) - in memory analysis service that accelerates query response times from BigQuery for Looker, Google Data Studio, and Google Sheets</li> <li>Connected Sheets (beta) - ability to access BigQuery through Google Sheets</li> <li>Data Catalog (beta) - managed metadata 
management service</li> </ul> </li> <li>ZDNet have a couple of good summaries of the open source partnership announcements at Google NEXT, including their thoughts on what this means for open source - <a href="https://www.zdnet.com/article/google-cloud-next-19-postmortem-thomas-kurian-wants-the-right-people/">link</a>; <a href="https://www.zdnet.com/article/google-cloud-gives-open-source-data-vendors-a-break-will-that-save-open-source/">link</a></li> <li>Datanami have a story on a new high performance massively parallel in memory graph database from Cray alumni called Trovares xGT - <a href="https://www.datanami.com/2019/04/15/startup-trovares-brings-hpc-to-graph-analytics/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/04/17/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 10/04/2019</title><link>https://ondataengineering.github.io/blog/2019/04/10/the-mid-week-news/</link><pubDate>Wed, 10 Apr 2019 07:30:00 +0000</pubDate> <description> <p>Where does time go? 
It’s time for the news again… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-arrow/">Apache Arrow</a> 0.13 is out</li> <li><a href="/technologies/greenplum/">Greenplum</a> 5.18 is out</li> <li><a href="/technologies/cloudera-altus/director/">Cloudera Altus Director</a> 6.2 is out</li> </ul> <p>Other technology news:</p> <ul> <li>Google have decided to partner with a bunch of Open Source ISVs such as Elastic and Confluent rather than create their own offerings based on the open source products - <a href="https://www.datanami.com/2019/04/09/google-extends-olive-branch-to-open-source-tech/">Datanami</a>; <a href="https://www.theregister.co.uk/2019/04/09/google_cloud_keynote/">The Register</a>; <a href="https://www.elastic.co/blog/elastic-and-google-team-up-to-bring-a-more-native-elasticsearch-service-experience-on-google-cloud">Elastic</a>; <a href="https://www.confluent.io/blog/announcing-confluent-cloud-for-apache-kafka-native-service-on-google-cloud-platform">Confluent</a></li> <li><a href="/technologies/databricks-delta/">Databricks Delta</a> has gone GA - <a href="https://databricks.com/blog/2019/04/04/announcing-databricks-runtime-5-3.html">blog</a>; <a href="https://www.datanami.com/2019/04/08/how-databricks-keeps-data-quality-high-with-delta/">Datanami</a></li> <li>From Datanami - Apache Spark is Great, But Not Perfect - <a href="https://www.datanami.com/2019/04/03/apache-spark-is-great-but-its-not-perfect/">link</a></li> <li>From ZDNet - where is Confluent Going? 
- <a href="https://www.zdnet.com/article/where-is-confluent-going/">link</a></li> <li>Microsoft are crowing about their performance, price and security for their Azure analytics services again - <a href="https://azure.microsoft.com/en-gb/blog/want-to-evaluate-your-cloud-analytics-provider-here-are-the-three-questions-to-ask/">link</a></li> <li>The Register has an update on Teradata’s move to the cloud - <a href="https://www.theregister.co.uk/2019/04/08/teradata_vantage_as_a_service/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/04/10/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 03/04/2019</title><link>https://ondataengineering.github.io/blog/2019/04/03/the-mid-week-news/</link><pubDate>Wed, 03 Apr 2019 07:30:00 +0000</pubDate> <description> <p>Some bigish releases this week. Let’s delve into the news… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/cloudera-cdh/">CDH</a> 6.2 is out, including new releases of <a href="/technologies/cloudera-manager/">Cloudera Manager</a> and <a href="/technologies/cloudera-navigator/">Cloudera Navigator</a></li> <li><a href="/technologies/apache-impala/">Apache Impala</a> 3.2 is out</li> <li><a href="/technologies/apache-kafka/">Apache Kafka</a> 2.2 is out</li> <li><a href="/technologies/hue/">Hue</a> 4.4 is out</li> <li><a href="/technologies/elasticsearch/">Elasticsearch</a> 7.0 RC1 is out</li> <li>Confluent <a href="/technologies/confluent-open-source/">Open Source</a> and <a href="/technologies/confluent-enterprise">Enterprise</a> have hit 5.2</li> <li>Hortonworks DataPlane <a href="/technologies/hortonworks-dataplane/data-lifecycle-manager/">Data Lifecycle Manager</a> and <a href="/technologies/hortonworks-dataplane/data-steward-studio/">Data Steward Studio</a> are both up to 1.4</li> </ul> <p>Other technology news:</p> <ul> <li>Cloudera have announced Cloudera DataFlow 
(their new name for <a href="/technologies/hortonworks-data-flow">HDF</a>), with a couple of new products - <a href="http://vision.cloudera.com/introducing-cloudera-edge-management-and-cloudera-flow-management/">link</a></li> <li>MapR have announced key new features in their next <a href="/technologies/mapr-converged-data-platform/">MapR Converged Data Platform</a> release, including support for Spark and Drill on Kubernetes, and support for their filesystem as a Kubernetes persistent volume - <a href="https://mapr.com/products/whats-new/compute-storage/">announcement</a>; <a href="https://mapr.com/blog/deploying-native-spark-and-drill-applications-in-kubernetes-just-got-easier/">blog post</a>; <a href="https://www.theregister.co.uk/2019/04/02/mapr_kubernetes_launch/">The Register</a>; <a href="https://www.zdnet.com/article/mapr-brings-tight-integration-with-kubernetes/">ZDNet</a></li> <li>Cloudera’s annual report is out, and The Register have their summary - <a href="https://www.theregister.co.uk/2019/04/01/cloudera_annual_report/">link</a></li> <li>Datanami have an interview with Doug Cutting - <a href="https://www.datanami.com/2019/04/01/heres-what-doug-cutting-says-is-hadoops-biggest-contribution">link</a></li> <li>From The Morning Paper - Calvin - fast distributed transactions for partitioned database systems - <a href="https://blog.acolyer.org/2019/03/29/calvin-fast-distributed-transactions-for-partitioned-database-systems/">link</a></li> <li><a href="/technologies/microsoft-azure-blob-storage">Azure Blob Storage</a> lifecycle management is generally available - <a href="https://azure.microsoft.com/en-gb/blog/azure-blob-storage-lifecycle-management-now-generally-available/">link</a></li> <li>Also on <a href="/technologies/microsoft-azure-blob-storage">Azure Blob Storage</a>, high performance block blobs are now available across all storage tiers - <a href="https://azure.microsoft.com/en-gb/blog/high-throughput-with-azure-blob-storage/">link</a></li> <li>Amazon 
have announced a new Glacier Deep Archive storage class for <a href="/technologies/amazon-s3/">Amazon S3</a> - <a href="https://aws.amazon.com/blogs/aws/new-amazon-s3-storage-class-glacier-deep-archive/">link</a></li> <li><a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-0212">CVE-2019-0212</a> - <a href="/technologies/apache-hbase/">Apache HBase</a> 2.x incorrectly applied Kerberos authorization to users of the HBase REST server</li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/04/03/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 27/03/2019</title><link>https://ondataengineering.github.io/blog/2019/03/27/the-mid-week-news/</link><pubDate>Wed, 27 Mar 2019 07:30:00 +0000</pubDate> <description> <p>Wow, yet another week has passed. Let’s peek at the news… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-calcite/">Apache Calcite</a> is up to 1.19</li> <li><a href="/technologies/apache-nifi/minifi/">Apache NiFi MiNiFi C++</a> is up to 0.6</li> <li><a href="/technologies/elasticsearch/">Elasticsearch</a> and <a href="/technologies/elasticsearch-hadoop/">Elasticsearch-Hadoop</a> have hit 6.7</li> </ul> <p>Other technology news:</p> <ul> <li>There’s (always) more Cloudera (Hortonworks) information, this time on their new data platform from the recent DataWorks Summit - <a href="https://www.theregister.co.uk/2019/03/20/cloudera_data_platform_details/">The Register summary</a>; <a href="https://www.theregister.co.uk/2019/03/21/cloudera_mick_hollison_interview/">The Register interview</a>; <a href="https://www.zdnet.com/article/the-new-cloudera-hortonworks-hadoop-100-open-source-50-boring/">ZDNet</a></li> <li>On the continuing story of open source licences, the latest take from ZDNet - <a href="https://www.zdnet.com/article/open-source-growing-pains-is-open-core-the-answer/">link</a></li> <li>From Confluent, doing 
watermarks and triggers in <a href="/technologies/apache-kafka/kafka-streams/">Kafka Streams</a>, which doesn’t believe in them - <a href="https://www.confluent.io/blog/kafka-streams-take-on-watermarks-and-triggers">link</a></li> <li>And again from Confluent, distributed tracing of <a href="/technologies/apache-kafka">Kafka</a> based solutions using Zipkin - <a href="https://www.confluent.io/blog/importance-of-distributed-tracing-for-apache-kafka-based-applications">link</a></li> <li>Azure Premium <a href="/technologies/microsoft-azure-blob-storage">Blob Storage</a> is now generally available - <a href="https://azure.microsoft.com/en-gb/blog/azure-premium-block-blob-storage-is-now-generally-available/">link</a></li> <li>Short thoughts from Datanami on <a href="/technologies/greenplum/">Greenplum</a> 6 - <a href="https://www.datanami.com/2019/03/20/pivotal-extends-greenplum-postgresql-on-cloud-foundry/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/03/27/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 20/03/2019</title><link>https://ondataengineering.github.io/blog/2019/03/20/the-mid-week-news/</link><pubDate>Wed, 20 Mar 2019 07:30:00 +0000</pubDate> <description> <p>Ok, let’s scour the interwebs for the latest news: <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-kudu">Apache Kudu</a> is up to 1.9</li> <li><a href="/technologies/apache-solr/">Apache Solr</a> has hit the big 8.0</li> <li><a href="/technologies/greenplum/">Greenplum</a> has a 6.x beta out</li> <li><a href="/technologies/hortonworks-data-flow/">Hortonworks DataFlow</a> is up to 3.4</li> <li><a href="/technologies/qubole-data-service/">Qubole Data Service</a> is now up to R55 for Azure and Oracle clouds</li> <li>Hortonworks <a href="/technologies/schema-registry/">Schema Registry</a> is up to 0.7</li> <li><a 
href="/technologies/streamsets-data-collector/">StreamSets Data Collector</a> is up to 3.8</li> </ul> <p>Other technology news:</p> <ul> <li>From Pivotal, a reminder that <a href="/technologies/greenplum/">Greenplum</a> is still out there and awesome - <a href="https://content.pivotal.io/blog/pivotal-greenplum-postgres">link</a></li> <li>From The Register, a report on Cloudera cloud strategy - <a href="https://www.theregister.co.uk/2019/03/14/cloudera_enterprise_data_cloud/">link</a></li> <li>Azure Databricks now supports <a href="/technologies/databricks-delta/">Delta</a>, GitHub integration and deployment in Azure virtual networks - <a href="https://azure.microsoft.com/en-gb/blog/azure-databricks-vnet-injection-devops-version-control-and-delta-availability/">link</a></li> <li>And on <a href="/technologies/databricks-delta/">Databricks Delta</a>, there’s a new blog post on performing upserts - <a href="https://databricks.com/blog/2019/03/18/efficient-upserts-into-data-lakes-with-databricks-delta.html">link</a></li> <li>You can now migrate Oracle ETL jobs to AWS Glue - <a href="https://aws.amazon.com/about-aws/whats-new/2019/03/aws-schema-conversion-tool-adds-support-for-migrating-oracle-etl/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/03/20/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 13/03/2019</title><link>https://ondataengineering.github.io/blog/2019/03/13/the-mid-week-news/</link><pubDate>Wed, 13 Mar 2019 07:30:00 +0000</pubDate> <description> <p>Roll up for this week’s news… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-beam/">Apache Beam</a> has hit 2.11</li> <li><a href="/technologies/amazon-emr/">Amazon EMR</a> is up to 5.21</li> </ul> <p>Other technology news:</p> <ul> <li>This is the big news of the week - AWS have forked <a href="/technologies/elasticsearch/">Elasticsearch</a> (although they 
claim it’s not a fork) as the Open Distro for Elasticsearch, and have added in a bunch of alerting, SQL, security, audit and fine-grained access control features - <a href="https://opendistro.github.io/for-elasticsearch/">homepage</a> <ul> <li>From AWS, their rationale in a couple of blog posts - <a href="https://aws.amazon.com/blogs/aws/new-open-distro-for-elasticsearch/">announcement</a>; <a href="https://aws.amazon.com/blogs/opensource/keeping-open-source-open-open-distro-for-elasticsearch/">rationale</a></li> <li>Response from Elastic - <a href="https://www.elastic.co/blog/on-open-distros-open-source-and-building-a-company">link</a></li> <li>Analysis from a bunch of sites - <a href="https://www.theregister.co.uk/2019/03/12/aws_elasticsearch_distro/">The Register</a>; <a href="https://www.datanami.com/2019/03/12/search-war-unfolding-for-control-of-elasticsearch/">Datanami</a>; <a href="https://www.influxdata.com/blog/aws-intends-for-their-new-project-to-be-an-elasticsearch-fork/">Influx</a></li> <li>And we’ve posted these previously, but some background on the recent noise about cloud providers killing open source - <a href="https://www.infoq.com/articles/will-cloud-computing-kill-open-source">InfoQ</a>; <a href="https://www.datanami.com/2018/12/24/cloud-backlash-grows-as-open-source-gets-less-open/">Datanami</a>; <a href="https://www.influxdata.com/blog/copyleft-and-community-licenses-are-not-without-merit-but-they-are-a-dead-end/">Influx</a>; <a href="https://www.confluent.io/blog/license-changes-confluent-platform">Confluent</a></li> </ul> </li> <li>From ZDNet, an excellent article on standardisation coming to <a href="/tech-categories/graph-databases/">Graph</a> and <a href="/tech-categories/rdf-databases/">RDF</a> - <a href="https://www.zdnet.com/article/graph-data-standardization-its-just-a-graph-making-gravitational-waves-in-the-real-world/">link</a></li> <li>Starburst (the commercial product based on <a href="/technologies/presto/">Presto</a>) has a new 
major release - <a href="https://www.starburstdata.com/technical-blog/announcing-starburst-enterprise-302e-with-mission-control/">link</a></li> <li>Via DZone, a pretty good article (because it doesn’t talk about features) on choosing between <a href="/technologies/apache-solr">Solr</a> and <a href="/technologies/elasticsearch/">Elasticsearch</a> - <a href="https://opensourceconnections.com/blog/2019/02/28/stop-worrying-solr-elasticsearch/">link</a></li> <li>Datanami has a couple of blogs on Spark and why it’s been successful and is growing - <a href="https://www.datanami.com/2019/03/11/what-makes-apache-spark-sizzle-experts-sound-off/">link</a>; <a href="https://www.datanami.com/2019/03/08/a-decade-later-apache-spark-still-going-strong/">link</a></li> <li>Solr has a new vulnerability - <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-0192">CVE-2019-0192</a></li> <li>Azure Premium <a href="/technologies/microsoft-azure-blob-storage">Blob Storage</a> is in public preview - <a href="https://azure.microsoft.com/en-gb/blog/azure-premium-blob-storage-public-preview/">link</a></li> <li>Couple of updates from the Hue blog - <a href="http://gethue.com/self-service-impala-sql-query-troubleshooting/">Impala query explain plans and troubleshooting</a> and <a href="http://gethue.com/hue-in-docker/">Hue in Docker</a></li> <li>And finally, from the Flink blog, how to integrate <a href="/technologies/apache-flink">Flink</a> and Prometheus - <a href="https://flink.apache.org/features/2019/03/11/prometheus-monitoring.html">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/03/13/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 06/03/2019</title><link>https://ondataengineering.github.io/blog/2019/03/06/the-mid-week-news/</link><pubDate>Wed, 06 Mar 2019 07:30:00 +0000</pubDate> <description> <p>Ok, after the bumper two week edition last week, it’s slim pickings this week… <!--more--></p> 
<p>Technology updates (details are on the relevant technology pages):</p> <ul> <li>None</li> </ul> <p>Other technology news:</p> <ul> <li>From Datanami, can <a href="/tech-categories/object-stores/">Object Stores</a> now compete with <a href="/technologies/apache-hadoop/hdfs/">HDFS</a> for analytic workloads - <a href="https://www.lightbend.com/blog/a-glimpse-at-the-future-of-apache-spark-30-with-deep-learning-and-kubernetes">link</a></li> <li>Following on from recent <a href="/technologies/greenplum/">Greenplum</a> support for cloud object storage, they have an article on deploying Greenplum in the cloud - <a href="https://greenplum.org/deploying-in-the-cloud/">link</a></li> <li>Amazon have a custom <a href="/technologies/apache-spark">Spark</a> output committer that’s optimised for working with <a href="/technologies/amazon-s3/">Amazon S3</a> storage - <a href="https://aws.amazon.com/blogs/big-data/improve-apache-spark-write-performance-on-apache-parquet-formats-with-the-emrfs-s3-optimized-committer/">link</a></li> <li>If you’re an Ambari user (for example you’re running HDP), here’s how to use <a href="/technologies/hue/">Hue</a> with your cluster - <a href="http://gethue.com/configure-ambari-hdp-with-hue/">link</a></li> <li>From Datanami - are stronger data protections coming to the US following GDPR in the EU - <a href="https://www.datanami.com/2019/02/28/surviving-the-coming-data-governance-wave">link</a></li> <li>Via DZone, a video presentation of thoughts on what might be in <a href="/technologies/apache-spark">Spark</a> 3.0 - <a href="https://www.lightbend.com/blog/a-glimpse-at-the-future-of-apache-spark-30-with-deep-learning-and-kubernetes">link</a></li> <li>And finally, from the Apache Incubator, there’s a new proposal for DataSketches, a project from Yahoo for providing highly efficient approximations with mathematical guarantees to expensive complex computations on streaming data - <a 
href="https://wiki.apache.org/incubator/DataSketchesProposal">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/03/06/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 27/02/2019</title><link>https://ondataengineering.github.io/blog/2019/02/27/the-mid-week-news/</link><pubDate>Wed, 27 Feb 2019 07:30:00 +0000</pubDate> <description> <p>Right, we’re back from our short break, so let’s catch up on the news… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-nifi">NiFi</a> has hit 1.9</li> <li><a href="/technologies/databricks-delta/">Databricks Delta</a> has new features</li> <li><a href="/technologies/elasticsearch/">Elasticsearch</a> 7.0 beta 1 is now out</li> <li><a href="/technologies/greenplum/">Greenplum</a> is up to 5.16</li> <li><a href="/technologies/influxdb/">InfluxDB</a> 2.0 alpha 4 is out</li> </ul> <p>Other technology news:</p> <ul> <li>Azure have new announcements relating to their Data Services, including Azure Data Explorer (now added to our <a href="/tech-vendors/microsoft-azure">Azure</a> vendor page) - <a href="https://azure.microsoft.com/en-us/blog/individually-great-collectively-unmatched-announcing-updates-to-3-great-azure-data-services/">link</a></li> <li>Forrester have a new Cloud Hadoop/Spark Platforms Wave out (now included in our <a href="/tech-categories/hadoop-distributions/">Hadoop Distributions</a> page), and Cloudera are crowing about their position - <a href="https://www.forrester.com/report/The+Forrester+Wave+Cloud+HadoopSpark+Platforms+Q1+2019/-/E-RES142663">Forrester</a>; <a href="http://vision.cloudera.com/clouderas-and-hortonworks-data-platform-in-the-cloud-named-among-leaders-in-new-forrester-wave/">Cloudera</a></li> <li>From the <a href="/technologies/apache-flink/">Apache Flink</a> blog - Blink, a fork from Alibaba that improves batch processing performance that’s being folded 
back in - <a href="https://flink.apache.org/news/2019/02/13/unified-batch-streaming-blink.html">link</a></li> <li>More from the <a href="/technologies/apache-flink/">Apache Flink</a> blog - best practices in monitoring - <a href="https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html">link</a></li> <li>There’s some stuff doing the rounds on the success of <a href="/technologies/apache-arrow/">Apache Arrow</a> - <a href="https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces46">Apache</a>; <a href="https://www.zdnet.com/article/apache-arrow-the-little-data-accelerator-that-could/">ZDNet</a></li> <li>DB-Engines (the great database ranking site), has updated the way it handles multi-model databases - <a href="https://db-engines.com/en/blog_post/80">link</a></li> <li>From InfoQ, a presentation from WePay on their use of Debezium to stream MySQL database changes into Google BigQuery - <a href="https://www.infoq.com/presentations/wepay-database-streaming">link</a></li> <li>Amazon have been doing some work so that Spark better handles node loss - <a href="https://aws.amazon.com/blogs/big-data/spark-enhancements-for-elasticity-and-resiliency-on-amazon-emr/">link</a></li> <li>LinkedIn have a writeup from their community event on the future of Hadoop - <a href="https://engineering.linkedin.com/blog/2019/02/the-present-and-future-of-apache-hadoop--a-community-meetup-at-l">link</a></li> <li>Qlik have acquired Attunity - <a href="https://www.zdnet.com/article/qlik-to-acquire-attunity-for-560m/">ZDNet</a></li> <li>Google have acquired Alooma - <a href="https://cloud.google.com/blog/topics/inside-google-cloud/google-announces-intent-to-acquire-alooma-to-simplify-cloud-migration">Google</a>; <a href="https://www.datanami.com/2019/02/21/google-doubles-down-on-cloud-data-migration/">Datanami</a>; <a href="https://www.theregister.co.uk/2019/02/20/google_buys_alooma/">The Register</a></li> <li>Uber have open sourced AresDB, a “GPU-Powered 
Open Source, Real-time Analytics Engine”, now included in our <a href="/tech-categories/analytical-databases/">Analytical Databases</a> list - <a href="https://eng.uber.com/aresdb/">link</a></li> <li>Solr has a CVE advisory out - <a href="http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=2017-3164">CVE-2017-3164</a> - SSRF issue in Apache Solr</li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/02/27/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 13/02/2019</title><link>https://ondataengineering.github.io/blog/2019/02/13/the-mid-week-news/</link><pubDate>Wed, 13 Feb 2019 07:30:00 +0000</pubDate> <description> <p>I’m off next week, so no news then, and when I’m back we’ll see if we can kick start some more content… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-beam/">Apache Beam</a> has hit 2.10</li> <li><a href="/technologies/apache-solr/">Apache Solr</a> has hit 7.7</li> <li><a href="/technologies/cloudbreak/">Cloudbreak</a> is up to 2.9</li> <li><a href="/technologies/qubole-data-service/">Qubole Data Service</a> is up to R55 on AWS</li> <li>Hortonworks <a href="/technologies/hortonworks-dataplane/streams-messaging-manager/">Streams Messaging Manager</a> is up to 1.2</li> <li><a href="/technologies/influxdb/">InfluxDB</a> has its 2.0 alpha release out</li> <li><a href="/technologies/pravega/">Pravega</a> is up to 0.4</li> </ul> <p>Other technology news:</p> <ul> <li>LinkedIn have more on Cruise Control, their open source product for managing <a href="/technologies/apache-kafka">Kafka</a> clusters - <a href="https://engineering.linkedin.com/blog/2019/02/introducing-kafka-cruise-control-frontend">link</a></li> <li>Something interesting from Forrester on Data Governance, and where this might be going - <a href="https://go.forrester.com/blogs/data_governance_takes_a_turn_toward_ambient/">link</a></li> <li>We should have a 
look at this at some point - Datanami have an article on Lentiq EdgeLake, a fully federated lake product/architecture - <a href="https://www.datanami.com/2019/02/07/lentiq-launches-edgelake-with-some-fanfare/">link</a></li> <li>Qubole have an update on Rubix, their open source distributed object store cache, on using it with Spark - <a href="https://www.qubole.com/blog/increase-apache-spark-performance-with-rubix-distributed-cache/">link</a></li> <li>From the ever excellent Juku.IT - how to manage your data in a multi-cloud strategy - <a href="https://www.juku.it/isnt-it-time-to-rethink-your-cloud-strategy/">link</a></li> <li>At the risk of propagating Cloudera adverts - their view on why Gartner have positioned them furthest for “Completeness of Vision” in the recent Magic Quadrant for Data Management Solutions for Analytics - <a href="http://vision.cloudera.com/three-takeaways-from-gartners-2019-magic-quadrant-for-data-management-solutions-for-analytics/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/02/13/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 06/02/2019</title><link>https://ondataengineering.github.io/blog/2019/02/06/the-mid-week-news/</link><pubDate>Wed, 06 Feb 2019 07:30:00 +0000</pubDate> <description> <p>Another week rolls round, so let’s do the news… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/cloudera-data-science-workbench/">Cloudera Data Science Workbench</a> is up to 1.5</li> <li><a href="/technologies/elasticsearch/">Elasticsearch</a> has hit 6.5, along with <a href="/technologies/elasticsearch-hadoop/">Elasticsearch Hadoop</a></li> <li><a href="/technologies/elastic-cloud/">Elastic Cloud</a> is up to 2.1</li> <li><a href="/technologies/apache-accumulo/">Apache Accumulo</a> has its second 2.0 alpha release out</li> </ul> <p>Other technology news:</p> <ul> <li>As well as the previous 
announcements on the single Hadoop product line, Cloudera have now announced Cloudera DataFlow (CDF) - the future of HDF in the new Cloudera - <a href="http://vision.cloudera.com/cloudera-dataflow/">link</a></li> <li><a href="/technologies/presto/">Presto</a> now has an open source foundation behind it, the Presto Software Foundation - <a href="https://www.starburstdata.com/technical-blog/the-presto-software-foundation/">link</a></li> <li><a href="/technologies/apache-airflow">Apache Airflow</a> has now graduated from the Apache Incubator - <a href="https://wiki.apache.org/incubator/January2019">link</a></li> <li><a href="/technologies/databricks-delta/">Databricks Delta</a> now supports querying data as it was at any point in history - <a href="https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html">link</a></li> <li>Confluent have (another) post on why event driven architectures are the future - <a href="https://www.confluent.io/blog/journey-to-event-driven-part-1-why-event-first-thinking-changes-everything">link</a></li> <li>Google have announced their own Kubernetes integration for <a href="/technologies/apache-spark/">Apache Spark</a> for Google Cloud Platform - <a href="https://cloud.google.com/blog/products/data-analytics/data-analytics-meet-containers-kubernetes-operator-for-apache-spark-now-in-beta">link</a></li> <li>Solutions Review have their analysis of the latest Gartner Magic Quadrant for Data Management Solutions for Analytics - <a href="https://solutionsreview.com/data-management/whats-changed-2019-gartner-magic-quadrant-for-data-management-solutions-for-analytics/">link</a></li> <li>From Starburst, their thoughts on the separation of compute and storage for analytics - <a href="https://www.starburstdata.com/technical-blog/art-of-abstraction/">link</a></li> <li>From The Register, Databricks has another £250m in funding, including from Microsoft - <a 
href="https://www.theregister.co.uk/2019/02/05/databricks_series_e_250m/">link</a></li> <li>Bloor have updated their Graph Database Market Update - <a href="https://www.bloorresearch.com/technology/graph-databases/">link</a></li> <li>GridGain have announced that they now have a support offering for <a href="/technologies/apache-ignite/">Apache Ignite</a> alongside their GridGain offerings - <a href="https://www.gridgain.com/resources/blog/gridgain-introduces-first-support-offering-apacher-ignitetm-users">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/02/06/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 30/01/2019</title><link>https://ondataengineering.github.io/blog/2019/01/30/the-mid-week-news/</link><pubDate>Wed, 30 Jan 2019 07:30:00 +0000</pubDate> <description> <p>News time again, and it’s a quiet week… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-hadoop/">Apache Hadoop</a> has hit 3.2</li> <li><a href="/technologies/influxdb/">InfluxDB</a> has a 2.0 alpha release out</li> </ul> <p>Other technology news:</p> <ul> <li>From the ever excellent The Morning Papers - offloading MPP query processing to the network switch - <a href="https://blog.acolyer.org/2019/01/28/the-case-for-network-accelerated-query-processing/">link</a></li> <li>Confluent has raised another $125m - <a href="https://www.confluent.io/blog/confluent-raises-a-125m-series-d-funding-round">blog</a></li> <li>And on the subject of Confluent, ZDNet have a good piece on them - <a href="https://www.zdnet.com/article/confluent-shows-open-source-paradigm-shifts-cloud-and-commercial-success-can-all-co-exist/">link</a></li> <li>From The Register, the number of malicious attacks on Hadoop clusters is on the rise - <a href="https://www.theregister.co.uk/2019/01/24/hadoop_malware_attack/">link</a></li> <li>And from Bloor, their latest paper on 
comparative costs of different Data Integration platforms based on customer surveys (requires payment for commercial use) - <a href="https://www.bloorresearch.com/research/comparative-costs-and-uses-for-data-integration-platforms-4th-edition/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/01/30/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 23/01/2019</title><link>https://ondataengineering.github.io/blog/2019/01/23/the-mid-week-news/</link><pubDate>Wed, 23 Jan 2019 07:30:00 +0000</pubDate> <description> <p>Right, time for the news again… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-arrow/">Apache Arrow</a> has hit 0.12</li> <li><a href="/technologies/greenplum/">Greenplum</a> has hit 5.16</li> </ul> <p>Other technology news:</p> <ul> <li><a href="/technologies/hudi/">Hudi</a>, which we looked at recently and which was only submitted to the Apache Incubator two weeks ago, has now entered incubation - <a href="http://incubator.apache.org/projects/hudi.html">link</a></li> <li>I’d not considered this before, but Alluxio have a post on using <a href="/technologies/alluxio/">Alluxio/Tachyon</a> (though it would equally apply to other in-memory <a href="/tech-categories/hadoop-compatible-filesystems">Hadoop Compatible Filesystems</a>) to reduce the performance impact of using object stores or remote (cloud) storage - <a href="https://www.alluxio.com/blog/deploying-big-data-workloads-on-object-storage-without-performance-penalty">link</a></li> <li>From Datanami, 2019 Big Data trends - <a href="https://www.datanami.com/2019/01/21/10-big-data-trends-to-watch-in-2019/">link</a></li> <li>Bloor have their latest <a href="/tech-categories/graph-databases/">Graph</a> and <a href="/tech-categories/rdf-databases/">RDF</a> database market update out - <a 
href="https://www.bloorresearch.com/research/graph-database-market-update-2019/">link</a></li> <li>From ZDNet, looks like MariaDB now have a unified product supporting both AX analytics and TX transactional nodes under a common database engine - <a href="https://www.zdnet.com/article/mariadb-unifies-its-platform/">link</a></li> <li>From The Register, an update on Teradata, their strategy and new CEO - <a href="https://www.theregister.co.uk/2019/01/16/teradata_new_ceo/">link</a></li> <li>From The Morning Papers - SageDB, a database “where learned models pervade every aspect of a database system” - <a href="https://blog.acolyer.org/2019/01/16/sagedb-a-learned-database-system/">link</a></li> <li>From Confluent - testing <a href="/technologies/apache-kafka/kafka-streams/">Kafka Streams</a> - <a href="https://www.confluent.io/blog/stream-processing-part-2-testing-your-streaming-application">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/01/23/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 16/01/2019</title><link>https://ondataengineering.github.io/blog/2019/01/16/the-mid-week-news/</link><pubDate>Wed, 16 Jan 2019 07:30:00 +0000</pubDate> <description> <p>We’re a week into our new year, and it’s time for the news again… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-flume/">Apache Flume</a> is up to 1.9 (over a year since its last release)</li> <li><a href="/technologies/apache-kylin/">Apache Kylin</a> is up to 2.6</li> <li><a href="/technologies/streamsets-data-collector/">StreamSets Data Collector</a> is up to 3.7</li> </ul> <p>Other technology news:</p> <ul> <li>Datanami have the lowdown on the new <a href="/tech-vendors/cloudera/">Cloudera</a> roadmap - <a href="https://www.datanami.com/2019/01/10/cloudera-unveils-cdp-talks-up-enterprise-data-cloud/">link</a></li> <li><a 
href="/tech-vendors/amazon-web-services/">Amazon Web Services</a> has a new NoSQL Document DB service compatible with MongoDB - <a href="https://aws.amazon.com/about-aws/whats-new/2019/01/amazon-documentdb-with-mongodb-compatibility-generally-available/">announcement</a>; <a href="https://aws.amazon.com/blogs/aws/new-amazon-documentdb-with-mongodb-compatibility-fast-scalable-and-highly-available/">blog</a></li> <li>Parquet performance with Hive has got a boost from <a href="/technologies/cloudera-cdh">CDH</a> 6.0 with support for vectorisation - <a href="https://blog.cloudera.com/blog/2018/12/faster-swarms-of-data-accelerating-hive-queries-with-parquet-vectorization/">link</a></li> <li>Hortonworks are continuing to push federation capabilities in <a href="/technologies/apache-hive">Hive</a> - <a href="https://hortonworks.com/blog/query-federation-with-hive/">JDBC federation with push down</a>; <a href="https://hortonworks.com/blog/introducing-hive-kafka-sql/">Kafka federation</a></li> <li>And Hortonworks are also talking about their better <a href="/technologies/apache-hive">Hive</a> and <a href="/technologies/apache-spark">Spark</a> integration via their Apache Hive Warehouse Connector - <a href="https://hortonworks.com/blog/hive-warehouse-connector-use-cases/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/01/16/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 09/01/2019</title><link>https://ondataengineering.github.io/blog/2019/01/09/the-mid-week-news/</link><pubDate>Wed, 09 Jan 2019 07:30:00 +0000</pubDate> <description> <p>Hello - and welcome back. I hope you’ve had a good break. 
Let’s see what’s happened whilst we’ve been gone… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li>Cloudera have released v6.1 of <a href="/technologies/cloudera-cdh/">CDH</a>, <a href="/technologies/cloudera-manager/">Cloudera Manager</a>, <a href="/technologies/cloudera-navigator/">Cloudera Navigator</a> and <a href="/technologies/cloudera-altus/director/">Cloudera Altus Director</a></li> <li><a href="/technologies/apache-calcite/">Apache Calcite</a> has hit 1.18</li> <li><a href="/technologies/apache-drill/">Apache Drill</a> has hit 1.15</li> <li><a href="/technologies/apache-knox/">Apache Knox</a> has hit 1.2</li> <li><a href="/technologies/amazon-emr/">Amazon EMR</a> is up to 5.20</li> <li><a href="/technologies/hortonworks-dataplane/data-analytics-studio/">Hortonworks Data Analytics Studio</a> has hit 1.2</li> <li><a href="/technologies/opentsdb/">OpenTSDB</a> has hit 2.4</li> <li><a href="/technologies/elasticsearch/">Elasticsearch</a> has a 7.0-alpha2 release out if you’re brave/interested</li> <li><a href="/technologies/apache-datafu/">Apache DataFu</a> is up to 1.5</li> </ul> <p>Other technology news:</p> <ul> <li>Alibaba has acquired Data Artisans, the backers of <a href="/technologies/apache-flink/">Apache Flink</a> - <a href="https://www.datanami.com/2019/01/08/alibaba-acquires-apache-flink-backer-data-artisans">Datanami</a></li> <li>The <a href="/tech-vendors/cloudera/">Cloudera</a> + <a href="/tech-vendors/hortonworks/">Hortonworks</a> merger is now complete - <a href="http://vision.cloudera.com/the-new-cloudera/">Cloudera</a>; <a href="https://hortonworks.com/blog/welcome-brand-new-cloudera/">Hortonworks</a>; <a href="https://www.theregister.co.uk/2019/01/07/cloudera_hortonworks_merger_completed/">The Register</a>; <a href="https://www.zdnet.com/article/cloudera-and-hortonworks-merger-closes-quo-vadis-big-data/">ZDNet</a>; <a 
href="https://www.datanami.com/2019/01/03/merger-with-hortonworks-complete-cloudera-looks-to-future/">Datanami</a></li> <li>It’s a new year, so everyone has their review of 2018 and look forward to 2019 articles out - <a href="https://www.zdnet.com/article/data-crystal-balls-looking-glasses-and-boiling-frogs-reviewing-2018-predicting-2019/">ZDNet</a>; <a href="https://www.zdnet.com/article/big-data-2019-cloud-redefines-the-database-and-machine-learning-runs-it/">ZDNet again</a>; <a href="https://www.zdnet.com/article/predictions-for-2019-in-data-analytics-and-ai/">And again ZDNet</a>; <a href="https://www.datanami.com/2019/01/02/industry-speaks-big-data-prognostications-for-2019/">Datanami</a></li> <li><a href="/technologies/hudi/">Hudi</a>, which we looked at recently, has just been submitted to the Apache Incubator - <a href="https://wiki.apache.org/incubator/HudiProposal">link</a></li> <li>Cloudera has a new edition of <a href="/technologies/cloudera-data-science-workbench/">Cloudera Data Science Workbench</a> coming that runs over Kubernetes (instead of Hadoop) and is targeted at Machine Learning in the cloud, called Cloudera Machine Learning - <a href="https://www.cloudera.com/about/news-and-blogs/press-releases/2018-12-05-cloudera-announces-preview-of-cloud-native-machine-learning-platform-to-accelerate-the-industrialization-of-ai.html">announcement</a>; <a href="https://www.zdnet.com/article/cloudera-machine-learning-release-takes-cloud-native-path/">ZDNet view</a></li> <li>There’s more on the move to more restrictive open source licences to combat the exploitation of open source technologies by cloud providers - <a href="https://www.datanami.com/2018/12/24/cloud-backlash-grows-as-open-source-gets-less-open/">Datanami</a>; <a href="https://www.influxdata.com/blog/copyleft-and-community-licenses-are-not-without-merit-but-they-are-a-dead-end/">Influx</a></li> <li>Via DBEngines - a proposal for GQL, a new standard graph query language - <a 
href="https://db-engines.com/en/blog_post/78">link</a></li> <li><a href="/technologies/apache-airflow">Apache Airflow</a> is now a top level project - <a href="https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces44">link</a></li> <li>Apache Gearpump (a streaming service based on a micro-service Actor model) which was incubating has now been retired - <a href="http://incubator.apache.org/projects/gearpump">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2019/01/09/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 19/12/2018</title><link>https://ondataengineering.github.io/blog/2018/12/19/the-mid-week-news/</link><pubDate>Wed, 19 Dec 2018 07:30:00 +0000</pubDate> <description> <p>Right - the last news before Christmas. We’ll be back in the new year - hope you all have a lovely holiday (whatever you’re doing)… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/hortonworks-data-platform/">Hortonworks Data Platform</a> has hit 3.1</li> <li>Hortonworks <a href="/technologies/hortonworks-dataplane/data-lifecycle-manager/">Data Lifecycle Manager</a> and <a href="/technologies/hortonworks-dataplane/data-steward-studio/">Data Steward Studio</a> have both hit 1.3</li> <li><a href="/technologies/apache-beam/">Apache Beam</a> has hit 2.9</li> <li><a href="/technologies/apache-solr/">Apache Solr</a> has hit 7.6</li> <li><a href="/technologies/elastic-cloud/">Elastic Cloud</a> has hit 2.0</li> <li><a href="/technologies/greenplum/">Greenplum</a> has hit 5.15</li> </ul> <p>Other technology news:</p> <ul> <li>It’s in the HDP 3.1 and HDF 3.3 release links, but we’ll call this out separately. 
As part of these releases, you can now run Hive and Druid queries over Kafka queues - <a href="https://hortonworks.com/blog/democratizing-analytics-within-kafka-three-new-access-patterns/">link</a></li> <li>And again from Hortonworks - when to use HBase vs Hive vs Druid - <a href="https://hortonworks.com/blog/big-data-processing-engines-which-one-do-i-use-part-1/">link</a></li> <li>And one more from Hortonworks - <a href="/technologies/apache-hive">Hive</a> (LLAP + Tez) is now twice as fast as part of HDP 3.0 due to support for materialized views, SQL constraints, query result caching and vectorization - <a href="https://hortonworks.com/blog/2x-faster-bi-interactive-queries-with-hdp-3-0/">link</a></li> <li>Under the covers with <a href="/technologies/mapr-converged-data-platform/">MapR Converged Data Platform</a> - <a href="https://mapr.com/blog/mapr-data-platform-under-the-hood/">link</a></li> <li>From ZDNet - the rise of Kubernetes for Big Data, and thoughts on Ozone and the Hortonworks Open Data Initiative - <a href="https://www.zdnet.com/article/the-rise-of-kubernetes-epitomizes-the-move-from-big-data-to-flexible-data/">link</a></li> <li>From Cloudera, tracking and assessing <a href="/technologies/apache-impala/">Apache Impala</a> performance - <a href="https://blog.cloudera.com/blog/2018/12/assessment-of-apache-impala-performance-using-cloudera-manager-metrics-part-1-of-3/">link</a></li> <li>GraphIt - a new DSL for specifying graph algorithms that delivers huge performance increases - <a href="https://www.datanami.com/2018/12/10/graphit-promises-big-speedup-in-graph-processing/">Datanami</a>; <a href="https://blog.acolyer.org/2018/12/12/graphit-a-high-performance-graph-dsl/">The Morning Paper</a></li> <li>From Juku.IT - thoughts on AWS Outposts - their new hybrid cloud - <a href="https://www.juku.it/the-cloud-is-more-hybrid-now/">link</a></li> <li>This is worth reading from InfoQ - will Cloud Computing Kill Open Source - does the use of open source software 
by the big cloud providers without contributing back destroy the commercial models for investing in open source - <a href="https://www.infoq.com/articles/will-cloud-computing-kill-open-source">link</a></li> <li>And on the same note, Confluent have updated the licence for their open source <a href="/technologies/confluent-open-source/">Confluent Open Source</a> components to prevent provisioning of them as a service - <a href="https://www.confluent.io/blog/license-changes-confluent-platform">link</a></li> <li>From ZDNet - IBM have a new multi-cloud strategy - <a href="https://www.zdnet.com/article/ibm-bets-on-a-multi-cloud-future/">link</a></li> <li>Last one from Hortonworks today - mounting storage inside YARN containers using CNCF Container Storage Interface (CSI) drivers - <a href="https://hortonworks.com/blog/open-hybrid-architecture-running-stateful-containers-on-yarn/">link</a></li> <li><a href="/technologies/apache-arrow">Arrow</a> has a new expression compiler and execution kernel - <a href="https://arrow.apache.org/blog/2018/12/05/gandiva-donation/">link</a></li> <li>YugaByte DB (a multi-model db), and running Presto queries over it - <a href="https://blog.yugabyte.com/presto-on-yugabyte-db-interactive-olap-sql-queries-made-easy-facebook/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2018/12/19/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 12/12/2018</title><link>https://ondataengineering.github.io/blog/2018/12/12/the-mid-week-news/</link><pubDate>Wed, 12 Dec 2018 07:30:00 +0000</pubDate> <description> <p>Apologies for the radio silence - I’ve been ill. 
But to make up for it, it’s a monster bumper news this week… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-bigtop/">Apache BigTop</a> has hit 1.3</li> <li><a href="/technologies/apache-flink/">Apache Flink</a> has hit 1.7</li> <li><a href="/technologies/apache-gobblin/">Apache Gobblin</a> has hit 0.14</li> <li><a href="/technologies/apache-ignite/">Apache Ignite</a> has hit 2.7</li> <li><a href="/technologies/apache-impala/">Apache Impala</a> is up to 3.1</li> <li><a href="/technologies/apache-kafka/">Apache Kafka</a> has hit 2.1</li> <li><a href="/technologies/apache-parquet/">Apache Parquet</a> has hit a 1.11 release of its Map Reduce implementation</li> <li><a href="/technologies/cloudera-cdh/">CDH</a> and <a href="/technologies/cloudera-manager/">Cloudera Manager</a> are up to 5.16; <a href="/technologies/cloudera-navigator/">Cloudera Navigator</a> to 2.15</li> <li><a href="/technologies/elasticsearch/">Elasticsearch</a> has hit 6.5, along with <a href="/technologies/elasticsearch-hadoop/">Elasticsearch Hadoop</a></li> <li><a href="/technologies/greenplum/">Greenplum</a> has hit 5.14</li> <li><a href="/technologies/hortonworks-data-flow/">Hortonworks Data Flow</a> is up to 3.3</li> <li><a href="/technologies/qubole-data-service/">Qubole</a> has hit R54</li> <li><a href="/technologies/streamsets-data-collector/">StreamSets Data Collector</a> has hit 3.6</li> <li><a href="/technologies/apache-hadoop/ozone/">Apache Hadoop Ozone</a> is up to Ozone 0.3 alpha</li> </ul> <p>Other technology news:</p> <ul> <li>Some Apache Incubator updates <ul> <li>Quickstep (a high performance database engine) has been retired from the incubator due to lack of activity</li> <li>Iceberg (the file based table store) from Netflix and IoTDB (a time series db) have both been accepted into the Incubator</li> <li>Griffin (the Data Quality Service platform built on Apache Hadoop and Apache Spark) has 
graduated</li> </ul> </li> <li>Hortonworks have a blog post on Ambari 2.7 and why it’s great (months after its release) - <a href="https://hortonworks.com/blog/whats-great-apache-ambari-2-7/">https://hortonworks.com/blog/whats-great-apache-ambari-2-7/</a></li> <li><a href="/technologies/elasticsearch/">Elasticsearch</a> now supports Kubernetes deployments, with Helm charts available from Elastic - <a href="https://www.elastic.co/blog/alpha-helm-charts-for-elasticsearch-kibana-and-cncf-membership">link</a></li> <li>Hortonworks have more on their cloud journey for <a href="/technologies/apache-hadoop/">Hadoop</a> and <a href="/technologies/hortonworks-data-platform/">HDP</a> - <a href="https://hortonworks.com/blog/open-hybrid-architecture-why-participating-in-the-cloud-native-computing-foundation/">link</a></li> <li>Microsoft have more features announced for <a href="/technologies/microsoft-azure-data-lake-store/">Azure Data Lake Store</a> Gen2 - <a href="https://azure.microsoft.com/en-us/blog/azure-data-lake-storage-gen2-preview-more-features-more-performance-better-availability/">link</a></li> <li>Also from Microsoft, a comprehensive guide for the information they’ve published on <a href="/technologies/azure-hdinsight/">HDInsight</a> - <a href="https://azure.microsoft.com/en-us/blog/get-up-to-speed-with-azure-hdinsight-the-comprehensive-guide/">link</a></li> <li>Amazon have a monster pile of AWS announcements: <ul> <li>For <a href="/technologies/amazon-s3/">Amazon S3</a> we have <a href="https://aws.amazon.com/about-aws/whats-new/2018/11/s3-glacier-api-simplification/">full Glacier integration</a>; <a href="https://aws.amazon.com/about-aws/whats-new/2018/11/s3-object-lock/">object lock</a>; <a href="https://aws.amazon.com/about-aws/whats-new/2018/11/s3-glacier-deep-archive/">Glacier Deep Archive</a>; <a href="https://aws.amazon.com/about-aws/whats-new/2018/11/s3-batch-operations/">batch operations</a>; <a 
href="https://aws.amazon.com/about-aws/whats-new/2018/11/introducing-amazon-s3-block-public-access/">blocking public access</a> and <a href="https://aws.amazon.com/about-aws/whats-new/2018/11/s3-intelligent-tiering/">Intelligent-Tiering</a></li> <li>Amazon Textract is a new OCR service - <a href="https://aws.amazon.com/about-aws/whats-new/2018/11/introducing-amazon-textract-now-in-preview-easily-extract-text-and-data-from-virtually-any-document/">link</a></li> <li>Amazon Lake Formation is a new service for setting up an S3 data lake, populating it with data from a range of source systems, and then securing the data - <a href="https://aws.amazon.com/about-aws/whats-new/2018/11/announcing-aws-lake-formation/">link</a></li> <li>Amazon Timestream is a new Time Series database - <a href="https://aws.amazon.com/about-aws/whats-new/2018/11/announcing-amazon-timestream/">link</a></li> <li>You can now pause and resume EC2 instances backed by EBS - <a href="https://aws.amazon.com/about-aws/whats-new/2018/11/amazon-ec2-now-lets-you-pause-and-resume-your-workloads/">link</a></li> <li>There’s a new Kafka managed service (Managed Streaming for Kafka - MSK) - <a href="https://aws.amazon.com/about-aws/whats-new/2018/11/introducing-amazon-managed-streaming-for-kafka-in-public-preview/">link</a></li> <li>And you can now get Lustre filesystems as a service on AWS - <a href="https://aws.amazon.com/about-aws/whats-new/2018/11/amazon-fsx-lustre/">link</a></li> </ul> </li> <li>And from Azure this week, you can now get MariaDB as a managed service - <a href="https://azure.microsoft.com/en-gb/blog/announcing-the-general-availability-of-azure-database-for-mariadb/">link</a></li> <li>RedHat have acquired NooBaa - an object storage solution - <a href="https://www.juku.it/red-hat-acquires-noobaa-beefing-up-its-storage-portfolio/">link</a></li> <li>More from Hortonworks on Ozone - <a href="https://hortonworks.com/blog/open-hybrid-architecture-o3-the-new-rocket-ship/">link</a></li> <li>From 
Datanami - an article on Pachyderm, an interesting alternative to Hadoop - <a href="https://www.datanami.com/2018/11/20/inside-pachyderm-a-containerized-alternative-to-hadoop/">link</a></li> <li>Samza 1.0 is out - <a href="https://engineering.linkedin.com/blog/2018/11/samza-1-0--stream-processing-at-massive-scale">link</a>; <a href="https://www.zdnet.com/article/real-time-data-processing-just-got-more-options-linkedin-releases-apache-samza-1-0-streaming/">ZDNet</a></li> <li>From Datanami - looks like Cloudera have a new ML platform coming - <a href="https://www.datanami.com/2018/12/05/cloudera-gives-a-peek-at-future-ml-platform/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2018/12/12/the-mid-week-news/</guid> </item> <item><title>Hudi</title><link>https://ondataengineering.github.io/technologies/hudi/</link><pubDate>Thu, 15 Nov 2018 00:00:00 +0000</pubDate> <description> <p>Spark library for managing tabular structured data on Hadoop that supports atomic transactions, near real time ingestion and querying, incremental reading of data for further processing and upserts, updates & deletes. Data is stored in HDFS, with a folder for each table partition, and with data files chunked by Hadoop block size (with each chunk allocated a unique fileid). Supports two storage mechanisms - Copy on Write (maintains data as a Parquet file for each chunk that's re-written for updates and deletes) and Merge on Read (also maintains data as a Parquet file, however new data for a chunk is written to an Avro delta file, with an async background compaction process to merge all new delta files into the Parquet file for a chunk). 
Data is queryable via Hive, Presto and SparkSQL via custom InputReaders through three views - Read Optimised (only queries Parquet files), Real Time (queries both Parquet and Avro delta files, merging in the deltas at query time) and Incremental (only reads Avro files to provide new data since a given commit). Supports strongly consistent atomic transactional commits (with a commit log (the timeline) used to prevent data from being queried until it is committed, and with support for automatically rolling back failed commits and the ability to manually rollback specific commits) and read isolation (all data filenames include the commit id meaning data files are never modified once committed, with a cleanup process to remove old redundant files). Compactions are non blocking, lock free and asynchronous, with pluggable strategies for prioritising compactions. All records must have a unique key, with a key lookup (either via a bloom filter or an external HBase table) used to identify updates and identify which chunk that update should be applied to. Also pluggable to support alternative storage formats to Parquet and Avro if required. Spark APIs include support for incremental reads, bulk inserts, upserts and Spark SQL, alongside integration with Hive and Presto (including a Hive Metadata sync tool that incrementally pushes table and partition metadata to the Hive metastore for Hive and Presto), a CLI, the ability to generate Graphite metrics and a number of utilities (including the ability to stream data from Kafka and Sqoop into Hudi). Created at Uber where it's used in production, and open sourced in December 2016. 
Name stands for Hadoop Upserts anD Incrementals.</p> <!-- Tech Category metadata --> <h2>Technology Information</h2> <table> <tbody> <tr><td>Other Names</td><td>Hoodi</td></tr> <tr><td>Type</td><td>Open Source</td></tr> <tr><td>Last Updated</td><td>November 2018</td></tr> </tbody> </table> <h2 id="links">Links</h2> <ul> <li><a href="https://uber.github.io/hudi/index.html">https://uber.github.io/hudi/index.html</a> - homepage and documentation</li> <li><a href="https://github.com/uber/hudi">https://github.com/uber/hudi</a> - source code</li> <li><a href="https://conferences.oreilly.com/strata/strata-ny-2018/public/schedule/detail/70937">https://conferences.oreilly.com/strata/strata-ny-2018/public/schedule/detail/70937</a>; <a href="https://conferences.oreilly.com/strata/strata-ca-2017/public/schedule/detail/56511">https://conferences.oreilly.com/strata/strata-ca-2017/public/schedule/detail/56511</a> - intro presentations</li> <li><a href="https://eng.uber.com/hoodie/">https://eng.uber.com/hoodie/</a>; <a href="https://eng.uber.com/uber-big-data-platform/">https://eng.uber.com/uber-big-data-platform/</a> - blog posts</li> </ul> <h2 id="news">News</h2> <ul> <li><a href="https://eng.uber.com/tag/hoodie/">https://eng.uber.com/tag/hoodie/</a> - blog posts</li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/technologies/hudi/</guid> </item> <item><title>The Mid Week News 14/11/2018</title><link>https://ondataengineering.github.io/blog/2018/11/14/the-mid-week-news/</link><pubDate>Wed, 14 Nov 2018 07:30:00 +0000</pubDate> <description> <p>It’s Wednesday again, which means it’s time for the news again… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/greenplum/">Greenplum</a> has hit 5.13</li> <li><a href="/technologies/apache-spark/">Apache Spark</a> has hit 2.4</li> <li><a href="/technologies/influxdb/">InfluxDB</a> has hit 1.7</li> <li><a 
href="/technologies/amazon-emr/">Amazon EMR</a> has hit 5.18</li> </ul> <p>Other technology news:</p> <ul> <li>MapR have announced their Clarity program aimed at disaffected Cloudera and Hortonworks customers - <a href="https://mapr.com/blog/the-mapr-clarity-program-is-your-clear-path-to-ai-hybrid-and-multi-cloud-containers-and-operational-analytics/">link</a>; <a href="https://www.datanami.com/2018/11/07/mapr-targets-cloudera-hortonworks-customers-with-clarity-release/">Datanami view</a></li> <li>Uber have a great article on the evolution of their 100+ Petabyte Big Data Platform - <a href="https://eng.uber.com/uber-big-data-platform/">link</a>; <a href="https://www.infoq.com/news/2018/11/uber-big-data-evolution">InfoQ view</a></li> <li>Airbnb have a post on how they use <a href="/technologies/apache-druid/">Apache Druid</a> - <a href="https://medium.com/airbnb-engineering/druid-airbnb-data-platform-601c312f2a4c">link</a></li> <li>Netflix have donated Iceberg - their on-disk structured table store - to the Apache foundation - <a href="https://wiki.apache.org/incubator/IcebergProposal">link</a></li> <li>And on the subject of <a href="/technologies/apache-druid/">Druid</a> - how to configure it to use Minio (and more generally any object store) for deep storage - <a href="https://cleanprogrammer.net/how-to-configure-druid-to-use-minio-as-deep-storage/">link</a></li> <li>MapR have started a series on scale out multi-purpose data platforms (and why theirs is the best) - <a href="https://mapr.com/blog/in-search-of-a-data-platform/">link</a></li> <li>RedHat have announced AMQ Streams - a Kafka distribution that runs on OpenShift - <a href="https://www.redhat.com/en/blog/announcing-red-hat-amq-streams-apache-kafka-red-hat-openshift">link</a>; <a href="https://www.datanami.com/2018/11/12/red-hat-adds-kafka-streaming-to-openshift/">Datanami view</a></li> <li><a href="/technologies/apache-kafka/">Apache Kafka</a> now supports 200K partitions per cluster - <a 
href="https://blogs.apache.org/kafka/entry/apache-kafka-supports-more-partitions">link</a></li> <li>First of a few from Datanami - a view on Cloud Warehouses - <a href="https://www.datanami.com/2018/11/08/whats-driving-the-cloud-data-warehouse-explosion/">link</a></li> <li>And also from Datanami - Talend have bought Stitch - <a href="https://www.datanami.com/2018/11/07/21615/">link</a></li> <li>Azure Event Hubs now has a Kafka compatible end point API - <a href="https://azure.microsoft.com/en-us/blog/announcing-the-general-availability-of-azure-event-hubs-for-apache-kafka/">link</a></li> <li>Azure SQL Data Warehouse has a bunch of new features out - <a href="https://azure.microsoft.com/en-us/blog/azure-sql-data-warehouse-taking-scalability-security-and-manageability-to-new-heights/">link</a>; <a href="https://azure.microsoft.com/en-gb/blog/row-level-security-is-now-supported-for-azure-sql-data-warehouse/">row level security blog</a></li> <li>And we close with a couple of <a href="/technologies/apache-hive/">Apache Hive</a> security announcements - <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=2018-11777">CVE-2018-11777</a>: Blocking local resource access in HiveServer2 and <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=2018-1314">CVE-2018-1314</a>: Hive explain query not being authorized</li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2018/11/14/the-mid-week-news/</guid> </item> <item><title>Databricks Delta</title><link>https://ondataengineering.github.io/technologies/databricks-delta/</link><pubDate>Thu, 08 Nov 2018 00:00:00 +0000</pubDate> <description> <p>Storage layer for tabular structured data within the Databricks Unified Analytics Platform that supports ACID transactions and data skipping. Data is persisted to Amazon S3 or Azure Blob Storage as Parquet files with metadata stored in a Hive Metastore, and includes full integration with Spark Structured Streaming and Spark SQL. 
Supports batch appends, overwrites, updates, upserts and deletes and streaming appends or overwrites, with new data written as new delta files (with changes collapsed during reads) supported by a transaction log. Allows multiple writers to simultaneously modify a dataset, and ensures readers are always presented with a consistent view through the use of snapshots. Includes support for a number of SQL management extensions, including viewing the transaction history (describe history), accessing previous versions of datafiles (by timestamp or version), collapsing delta files to improve performance (optimize) and removing old files left around to support snapshotted reads (vacuum). Supports performant reads through standard Hive partitioning (including support for partition pruning) and data skipping (reducing data read based on recorded min/max values for data files, which can be enhanced by z ordering data). Also supports views over tables and backward compatible schema changes, including support for auto addition of new fields based on input data. 
Currently in preview, having been first announced in October 2017.</p> <!-- Tech Category metadata --> <h2>Technology Information</h2> <table> <tbody> <tr><td>Type</td><td>Commercial</td></tr> <tr><td>Last Updated</td><td>April 2019</td></tr> </tbody> </table> <h2 id="updates">Updates</h2> <ul> <li>2019-02-27 - <a href="https://databricks.com/blog/2019/02/19/new-databricks-delta-features-simplify-data-pipelines.html">blog post</a></li> <li>2019-04-09 - <a href="https://databricks.com/blog/2019/04/04/announcing-databricks-runtime-5-3.html">GA in Databricks Runtime 5.3</a></li> </ul> <h2 id="links">Links</h2> <ul> <li><a href="https://databricks.com/product/databricks-delta">https://databricks.com/product/databricks-delta</a> - homepage</li> <li><a href="https://docs.databricks.com/delta/index.html">https://docs.databricks.com/delta/index.html</a> - documentation</li> <li><a href="https://databricks.com/blog/2017/10/25/databricks-delta-a-unified-management-system-for-real-time-big-data.html">https://databricks.com/blog/2017/10/25/databricks-delta-a-unified-management-system-for-real-time-big-data.html</a> - initial blog post</li> <li><a href="https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html">https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html</a> - data skipping and z-ordering blog post</li> <li><a href="https://databricks.com/blog/2018/07/19/simplify-streaming-stock-data-analysis-using-databricks-delta.html">https://databricks.com/blog/2018/07/19/simplify-streaming-stock-data-analysis-using-databricks-delta.html</a> - streaming analysis blog post</li> </ul> <h2 id="news">News</h2> <ul> <li><a href="https://databricks.com/blog/category/company/product">https://databricks.com/blog/category/company/product</a> - Databricks product blog</li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/technologies/databricks-delta/</guid> 
</item> <item><title>The Mid Week News 07/11/2018</title><link>https://ondataengineering.github.io/blog/2018/11/07/the-mid-week-news/</link><pubDate>Wed, 07 Nov 2018 07:30:00 +0000</pubDate> <description> <p>No technology updates this week, but a bunch of news… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li>None this week</li> </ul> <p>Other technology news:</p> <ul> <li>From Hortonworks - <a href="/technologies/apache-hadoop/">Hadoop</a> is thriving! <a href="https://hortonworks.com/blog/apache-hadoop-thriving/">link</a></li> <li>And again from Hortonworks - more on their Open Hybrid Architecture - <a href="https://hortonworks.com/blog/open-hybrid-architecture-bringing-cloud-native-to-on-premises/">link</a></li> <li>From Cloudera, how to protect yourself from the recent Hadoop malware attacks (and how trivially easy this is with Altus obviously) - <a href="http://blog.cloudera.com/blog/2018/11/protecting-hadoop-clusters-from-malware-attacks/">link</a></li> <li>From Google, an intro to <a href="/technologies/google-cloud-dataproc/">Cloud DataProc</a> - their Hadoop as a service offering - <a href="https://cloud.google.com/blog/products/data-analytics/run-apache-spark-and-apache-hadoop-workloads-with-flexibility-and-predictability-with-cloud-dataproc">link</a></li> <li>And again from Google - a comparison of <a href="/tech-categories/object-stores/">object stores</a> and <a href="/tech-categories/hadoop-compatible-filesystems/">HDFS</a> - <a href="https://cloud.google.com/blog/products/storage-data-transfer/hdfs-vs-cloud-storage-pros-cons-and-migration-tips">link</a></li> <li>From Datanami - RockSet, a new SQL cloud service with automatic ingest and indexing of data, including streaming data, has emerged from stealth - <a href="https://www.datanami.com/2018/11/01/rockset-sql-cloud-service-emerges-from-stealth/">link</a></li> <li>And again from Datanami, there’s a new major version of Dremio out, and I’m hoping to do 
a tech summary on this shortly - <a href="https://www.datanami.com/2018/10/30/dremio-fleshes-out-data-platform/">link</a></li> <li>I’m not sure where I came across this, but I’ve added LeoFS to our <a href="/tech-categories/object-stores/">object stores</a> page - <a href="https://leo-project.net/leofs/">link</a></li> <li>Amazon Redshift is now 3.5x faster! <a href="https://aws.amazon.com/blogs/big-data/performance-matters-amazon-redshift-is-now-up-to-3-5x-faster-for-real-world-workloads/">link</a></li> <li>From Databricks, you can now do pivots in <a href="/technologies/apache-spark/spark-sql/">Spark SQL</a> - <a href="https://databricks.com/blog/2018/11/01/sql-pivot-converting-rows-to-columns.html">link</a></li> <li>DZone have a RefCard on temporal support in Oracle, SQL Server and MariaDB - <a href="https://dzone.com/refcardz/temporal-data-processing">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2018/11/07/the-mid-week-news/</guid> </item> <item><title>Thoughts On Schema Registries</title><link>https://ondataengineering.github.io/blog/2018/11/02/thoughts-on-schema-registries/</link><pubDate>Fri, 02 Nov 2018 07:45:00 +0000</pubDate> <description> <p>So I want to spend the next couple of weeks thinking about how you store structured data in unstructured data stores (specifically <a href="/tech-categories/streaming-data-stores/">Streaming Data Stores</a>, <a href="/tech-categories/hadoop-compatible-filesystems/">Hadoop Compatible File Systems</a> and <a href="/tech-categories/object-stores/">Object Stores</a>).</p> <p>This week, we’re looking at <a href="/tech-categories/streaming-data-stores/">Streaming Data Stores</a> <!--more--></p> <p>We need to look at this because messages in streaming data stores are just collections of bytes, with all serialisation and de-serialisation done in the client, effectively meaning we’re doing schema on read/write. 
So the client needs to know what schema to use to serialise or de-serialise the data.</p> <p>With batch data on disk you could use self describing data - i.e. you embed the schema with the data using something like <a href="/technologies/apache-avro/">Avro</a>, however that doesn’t work for streaming data as it’s just not practical to embed the schema into every message, so your schema has to live somewhere else.</p> <p>You could embed the schema into your code, and with a bit of care you can have a shared schema that multiple jobs use. However this doesn’t allow anything outside of your code base to re-use this schema - if you’re using third party tools that don’t integrate with your configuration management tool (e.g. NiFi or StreamSets), then you have no way of sharing that schema.</p> <p>So the final option is to use a dedicated schema registry - a tool that will manage your schemas for you, and serve them up over an API based on the topic name that can be used by clients to fetch the schema when they want to read and write data. You will however need to make sure that the client or tool you’re using supports the schema registry. As an added bonus, they support schema compatibility checks, in that if you say you want your schemas to be forward or backward compatible (or both), it will validate that any changes to schemas conform to this.</p> <p>And for your schema registry you basically have two options at the moment.</p> <p>The first is the Confluent Schema Registry, part of the <a href="/technologies/confluent-open-source/">Confluent Open Source</a> bundle of <a href="/technologies/apache-kafka/">Apache Kafka</a>. It supports Avro schemas, and is integrated into Kafka APIs, Kafka Connect, Kafka Streams, NiFi and StreamSets.</p> <p>The second is a more recent addition, with Hortonworks’ open source <a href="/technologies/schema-registry/">Schema Registry</a> tool. 
It does broadly the same thing, supporting Avro schemas, with integration into the Kafka APIs plus NiFi and their Streaming Analytics Manager. And it’s designed to be more general purpose, with support for serving templates, machine learning models or business rules mooted for the future.</p> <p>Next week we’ll talk about schema management for data in <a href="/tech-categories/hadoop-compatible-filesystems/">Hadoop Compatible File Systems</a> and <a href="/tech-categories/object-stores/">Object Stores</a> and the <a href="/technologies/apache-hive/hive-metastore/">Hive Metastore</a>, which fulfils a similar function, but with some extra complications.</p> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2018/11/02/thoughts-on-schema-registries/</guid> </item> <item><title>Schema Registries</title><link>https://ondataengineering.github.io/tech-categories/schema-registries/</link><pubDate>Fri, 02 Nov 2018 07:30:00 +0000</pubDate> <description> <p>Our list of and information on schema registries, including the Hive Metastore, the Confluent and Hortonworks Schema Registries, and alternatives to these.</p> <!-- Tech Category metadata --> <h2>Category Definition</h2> <!-- Tech Vendor metadata --> <p>Tools that support the definition, management and serving of <a href="/tech-categories/data-storage-formats/">Data Storage Format</a> schemas for use in the serialisation and de-serialisation of data, primarily with <a href="/tech-categories/streaming-data-stores/">Streaming Data Stores</a>. Will support an API (and often a web user interface) for managing and retrieving schemas, and will often support schema evolution checks (ensuring changes are forward or backward compatible or both), and an SDK that integrates with clients to allow structured data to be read and written directly. May also support serving of the libraries required to perform serialisation/de-serialisation, and high availability configurations. 
<!--more--></p> <h2 id="open-source-streaming-data-schema-registries">Open Source Streaming Data Schema Registries</h2> <table> <tbody> <tr> <td><a href="/technologies/confluent-open-source/">Confluent Schema Registry</a></td> <td>Central definition of schemas for reading and writing from/to Kafka topics, with support for a range of technologies (including the Kafka APIs, Kafka Connect, Kafka Streams, NiFi and StreamSets)</td> </tr> <tr> <td><a href="/technologies/schema-registry/">Hortonworks Schema Registry</a></td> <td>Central definition of Avro schemas for use in NiFi, Kafka Producers/Consumers and Streaming Analytics Manager</td> </tr> <tr> <td>Avro Schema Registry</td> <td>Compatible with Confluent’s Schema Registry API, but re-implemented in Ruby backed by Postgres - <a href="https://github.com/salsify/avro-schema-registry">https://github.com/salsify/avro-schema-registry</a></td> </tr> <tr> <td>Landoop Schema Registry UI</td> <td>Web based user interface for the Confluent Schema Registry - <a href="https://github.com/Landoop/schema-registry-ui">https://github.com/Landoop/schema-registry-ui</a></td> </tr> </tbody> </table> <h2 id="streaming-data-schema-registry-alternatives">Streaming Data Schema Registry Alternatives</h2> <p>Schemas can of course be managed and maintained in your configuration management tool, however these will not be available outside of your code base (e.g. 
to third party tools such as NiFi or StreamSets).</p> <p>Cloudera have a three part blog post on how to roll your own schema management tool for <a href="/technologies/apache-kafka/">Kafka</a> using a Kafka topic to store your schemas - <a href="http://blog.cloudera.com/blog/2018/07/robust-message-serialization-in-apache-kafka-using-apache-avro-part-1/">part1</a>; <a href="http://blog.cloudera.com/blog/2018/07/robust-message-serialization-in-apache-kafka-using-apache-avro-part-2/">part2</a>; <a href="http://blog.cloudera.com/blog/2018/08/robust-message-serialization-in-apache-kafka-using-apache-avro-part-3/">part3</a></p> <h2 id="hive-metastore">Hive Metastore</h2> <p>The <a href="/technologies/apache-hive/hive-metastore/">Hive Metastore</a> fulfils a similar function for data stored in <a href="/tech-categories/hadoop-compatible-filesystems/">Hadoop Compatible File Systems</a> and <a href="/tech-categories/object-stores/">Object Stores</a>, however serves a wider range of table metadata (including how it’s structured on disk), and doesn’t include some features like schema lifecycle management.</p> <p><a href="https://www.slideshare.net/Hadoop_Summit/sharing-metadata-across-the-data-lake-and-streams-103204119">This presentation</a> from Hortonworks describes their view of the future of the Hive Metastore, including its separation from Hive and integration with the schema registry.</p> </description> <guid isPermaLink="true">https://ondataengineering.github.io/tech-categories/schema-registries/</guid> </item> <item><title>The Mid Week News 31/10/2018</title><link>https://ondataengineering.github.io/blog/2018/10/31/the-mid-week-news/</link><pubDate>Wed, 31 Oct 2018 07:30:00 +0000</pubDate> <description> <p>OK - let’s do this week’s news… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-nifi/">Apache NiFi</a> has hit 1.8</li> <li><a href="/technologies/apache-beam/">Apache Beam</a> has 
hit 2.8</li> <li><a href="/technologies/apache-kudu/">Apache Kudu</a> is up to 1.8</li> </ul> <p>Other technology news:</p> <ul> <li>Following on from our thinking on Kubernetes, Confluent have a post on running <a href="/technologies/apache-kafka/">Kafka</a> on Kubernetes - <a href="https://www.confluent.io/blog/apache-kafka-kubernetes-could-you-should-you">link</a></li> <li>From InfoQ, an interview on <a href="/technologies/azure-hdinsight/">HDInsight</a> - <a href="https://www.infoq.com/news/2018/10/HDInsight-Chatterjee">link</a></li> <li>Dremio 3.0 is out and ZDNet have a write up - <a href="https://www.zdnet.com/article/dremio-3-0-adds-catalog-containers-enterprise-features/">link</a></li> <li>From Datanami, it looks like there’s a bot out targeting unsecured <a href="/technologies/apache-hadoop/">Hadoop</a> clusters - <a href="https://www.datanami.com/2018/10/29/latest-bot-targets-hadoop-clusters/">link</a></li> <li>Still more on the Cloudera Hortonworks merger, this time Datanami looking at the product roadmap - <a href="https://www.datanami.com/2018/10/24/new-cloudera-plots-a-course-toward-a-unified-future/">link</a></li> <li>Pinot, the open source realtime distributed OLAP datastore from LinkedIn, has been accepted into the Apache Incubator - <a href="http://incubator.apache.org/projects/pinot.html">http://incubator.apache.org/projects/pinot.html</a></li> <li>IoTDB (a new massive scale IoT time series DB) and Sharding Sphere (middleware for distributed databases) have both been submitted to the Apache Incubator - <a href="https://wiki.apache.org/incubator/IoTDBProposal">IoTDB</a>; <a href="https://wiki.apache.org/incubator/ShardingSphereProposal">ShardingSphere</a></li> <li><a href="/technologies/apache-impala/">Apache Impala</a> has two security announcements (fixed in 3.0.1) - <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=2018-11785">CVE-2018-11785</a> and <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=2018-11792">CVE-2018-11792</a></li> 
<li>More from Data Artisans on stateful processing with <a href="/technologies/apache-flink/">Apache Flink</a> - <a href="https://data-artisans.com/blog/stateful-stream-processing-apache-flink-state-backends">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2018/10/31/the-mid-week-news/</guid> </item> <item><title>The Mid Week News 24/10/2018</title><link>https://ondataengineering.github.io/blog/2018/10/24/the-mid-week-news/</link><pubDate>Wed, 24 Oct 2018 07:30:00 +0000</pubDate> <description> <p>It’s Wednesday, which means it’s time to catch up on the news… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-carbondata/">Apache CarbonData</a> is up to 1.5</li> <li><a href="/technologies/greenplum/">Greenplum</a> has hit 5.12</li> <li><a href="/technologies/hortonworks-data-platform-search/">HDP Search</a> has a 4.0 release</li> <li><a href="/technologies/hue/">Hue</a> has hit 4.3</li> </ul> <p>Other technology news:</p> <ul> <li>There are more views on the Cloudera - Hortonworks merger, this time with a more negative view, including a post from MapR brilliantly titled “Two Wrongs Don’t Make A Right” - <a href="https://mapr.com/blog/two-wrongs-dont-make-a-right/">MapR</a>; <a href="https://www.bloorresearch.com/2018/10/cloudera-and-hortonworks-what-do-i-think/">Bloor Research</a>; <a href="https://www.datanami.com/2018/10/18/is-hadoop-officially-dead/">Datanami</a></li> <li>Teradata have been talking about their new strategy and there’s commentary - <a href="https://www.zdnet.com/article/is-teradatas-march-to-cloud-and-commodity-its-destiny/">ZDNet</a>; <a href="https://www.datanami.com/2018/10/17/inside-teradatas-audacious-plan-to-consolidate-analytics/">Datanami</a></li> <li>And a view on SAP, this time from The Register - <a href="https://www.theregister.co.uk/2018/10/18/sap_q3_reults/">link</a></li> <li>The Register have a summary of 
the latest Gartner Object and Distributed File Storage Magic Quadrant - <a href="https://www.theregister.co.uk/2018/10/23/eight_suppliers_move_in_gartners_latest_object_mq/">link</a></li> <li>This was interesting - from Cloudera a post on the work they do to manage and rationalise third party library dependencies in Hadoop - <a href="http://blog.cloudera.com/blog/2018/10/third-party-libraries-in-c6/">link</a></li> <li>More from InfoQ - best practices for deploying <a href="/technologies/apache-kafka/">Kafka</a> - <a href="https://www.infoq.com/articles/apache-kafka-best-practices-to-optimize-your-deployment">link</a></li> <li>There’s a big post on the internals of <a href="/technologies/pravega">Pravega</a>, the Kafka challenger - <a href="http://blog.pravega.io/2018/10/17/pravega-internals/">link</a></li> <li>And <a href="/technologies/azure-hdinsight/">HDInsight</a> now supports caching of object storage using the RubiX framework open sourced by Qubole - <a href="https://azure.microsoft.com/en-us/blog/apache-spark-speedup-with-hdinsight-io-cache/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2018/10/24/the-mid-week-news/</guid> </item> <item><title>The Future of Hadoop...</title><link>https://ondataengineering.github.io/blog/2018/10/19/the-future-of-hadoop/</link><pubDate>Fri, 19 Oct 2018 07:30:00 +0000</pubDate> <description> <p>Ok - this is our last post on Hadoop for the time being, so let’s try and predict the future… <!--more--></p> <p>We started talking about the future of Hadoop when we looked at Hortonworks’ Open Hybrid Architecture Initiative <a href="/blog/2018/09/21/thoughts-on-hortonworks-open-hybrid-architecture-initiative/">a few weeks ago</a>. 
In that we talked about the current positioning of Hadoop as a multi-purpose storage and compute platform for analytics, the move towards the separation of compute and storage, and the containerisation of Hadoop and the workloads that run on it.</p> <p>I’m going to try to not repeat any of that, but today I do want to imagine what Hadoop might look like in 3 to 5 years’ time. And we’ll base this on the assumption that separation of storage and compute is going to happen, i.e. services will not assume or need these to be co-located.</p> <p>So, in no particular order, some predictions:</p> <p>HDFS (or more specifically the HCFS specification) will become the standard for accessing volumes of data for analytics. It’s almost there already, but there are a few kinks to be worked out with accessing data in object store technologies, and maybe even remote HDFS clusters.</p> <p>HDFS will evolve into a multi-purpose clustered storage platform. This is obviously the direction being driven by Ozone, giving HDFS a generic underlying block storage layer on which HDFS, object store and other access routes (streaming data, NoSQL etc.) can be layered. This is a pretty competitive market however, not only are all the big storage vendors playing in this space, but MapR have this tech already - <a href="/technologies/mapr-fs/">MapR-FS</a> supports HDFS, NFS, S3 (via an embedded Minio instance), Kafka (via <a href="/technologies/mapr-es/">MapR-ES</a>) and NoSQL document and wide column (via <a href="/technologies/mapr-db/">MapR-DB</a>) APIs - and there’s a bunch of specialist storage and database technology vendors in this space.</p> <p>YARN as a standard for resource management for compute jobs will go, replaced by Kubernetes. This is starting to happen already - Spark and Flink already support Kubernetes, with jobs executed through the deployment of a set of transient dockerised processes. 
Kubernetes will need to beef up its schedulers a bit, but it will be the primary standard that cluster processing frameworks use to orchestrate their jobs. And note that this isn’t running Hadoop on Kubernetes (that some people like BlueData and Robin can deliver today) - this is about running individual transient compute jobs on Kubernetes rather than a persistent Hadoop cluster (that then does its own local resource management).</p> <p>YARN as a standard for resource management of long lived processes will die, probably even before it gets going. There have been efforts to make long lived services on Hadoop (HBase, Kafka etc.) integrate with YARN for a while (through Hoya, Koya to <a href="/technologies/apache-slider/">Apache Slider</a>), and the latest version of YARN has support for long lived processes built in. But why integrate all the persistent services that run on Hadoop (Solr, Impala, Kudu, HBase etc.) with YARN when you can just deploy them using containers and let a container management layer handle the resource management? And note that most of these long running services are pretty much good to run on docker containers today, as long as those that require HDFS are happy running over external object storage.</p> <p>YARN as a cluster resource manager will (probably) go. The reason I say probably is that you’ll still be able to buy a physical Hadoop platform, that will support multi-purpose storage (with an HCFS interface) and compute management (with a Kubernetes interface). However, will this be an evolution of YARN under the covers, or as I suspect, will the brave decision be made to ditch it and just integrate a Kubernetes distribution?</p> <p>And that leaves (for me) the most important and valuable part of Hadoop - the shared services - metadata, security, audit, data management. 
They’re still immature, but my hope is that these become a set of standard integration points that a wide range of analytical tools use to enable centralised management and governance of your analytics ecosystem. And (finally) there will be some standardisation of the tooling in this space - whether it’s Sentry or Ranger, Atlas or Navigator, or DataPlane remains to be seen.</p> <p>So what does that mean Hadoop will be in five years’ time? It’ll be a collection of analytical platforms and tools, bound together by some common standards for storage, compute, security, management and governance, that can be individually deployed in the public cloud or on premises, with the option of using local cloud tech or an out of the box physical cluster on premises.</p> <p>And an offering of a wide range of analytical tools that integrate together, where management, maintenance and governance costs are minimised, and that can be deployed wherever and however you want sounds pretty good to me.</p> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2018/10/19/the-future-of-hadoop/</guid> </item> <item><title>Thoughts On Hadoop Service Providers</title><link>https://ondataengineering.github.io/blog/2018/10/17/thoughts-on-hadoop-service-providers/</link><pubDate>Wed, 17 Oct 2018 07:45:00 +0000</pubDate> <description> <p>So last week we looked at a bunch of cloud Hadoop service providers. Time to recap and espouse some thoughts… <!--more--></p> <p>Deploying Hadoop is hard and complex, especially if you’re looking to make it secure, stable and implement a bunch of best practice stuff. 
And wouldn’t it be nice if we could automate the process and make it repeatable, allowing us to create clusters when needed and tear them down when we’re done?</p> <p><a href="/technologies/amazon-emr/">Amazon EMR</a>, <a href="/technologies/azure-hdinsight/">Azure HDInsight</a> and <a href="/technologies/google-cloud-dataproc/">Google Cloud Dataproc</a> are the big cloud providers’ offerings in this space, and other cloud providers have similar offerings. Fundamentally they allow you to programmatically specify and create a Hadoop cluster with one or more services pre-installed. They all support a bunch of standard stuff - some sort of selection of the Hadoop services you want pre-installed, a bunch of automatic configuration of Hadoop, streamlined usage of cloud storage (with encryption), custom bootstrap actions etc., and are all priced as a premium on the costs for the raw storage and compute that you consume.</p> <p>But using any of these is going to tie you into those cloud platforms and their distributions of Hadoop. EMR and Cloud Dataproc both run their own distributions (although with some exploitation of <a href="/technologies/apache-bigtop/">Apache BigTop</a>) - Google’s is more limited (Spark, MapReduce, Pig &amp; Hive), EMR much broader (adding a range of other techs, including Flink, Presto, TensorFlow, Hue and Zeppelin). HDInsight is based on the Hortonworks Data Platform, and is the broadest of the lot adding in Kafka, Storm, Hive LLAP and (with an enterprise security add on) Ranger security. 
And although you’re tied into the cloud vendor, this is not necessarily a bad thing - you get full integration with their security, audit and management tools, and if you’re going all in with a cloud vendor this can make a lot of sense.</p> <p>The alternative is you align yourself to an “independent” Hadoop vendor - you’re still aligning yourself to a distribution, but you now have the freedom to deploy it wherever’s most appropriate - on premises, or on whichever cloud vendor works for you. Your options here were (primarily) <a href="/tech-vendors/cloudera/">Cloudera</a>, <a href="/tech-vendors/hortonworks/">Hortonworks</a> and <a href="/tech-vendors/mapr">MapR</a>, however Cloudera and Hortonworks have just announced they’re planning to merge. And all these options have tooling to programmatically deploy Hadoop on cloud infrastructure, giving a similar experience to the cloud vendor offerings. For Cloudera it’s <a href="/technologies/cloudera-altus/director/">Director</a>, Hortonworks have <a href="/technologies/cloudbreak/">Cloudbreak</a> and MapR the MapR Orbit Cloud Suite.</p> <p>But all these options are targeted at Hadoop administrators - a lot is automated, but you still need to have a pretty deep understanding of Hadoop, you’re still responsible for managing (starting, stopping and scaling) your cluster (although EMR, Dataproc and Cloudbreak have some support for auto scaling), and you’ll need to be comfortable customising your cluster through bootstrap scripts.</p> <p><a href="/technologies/qubole-data-service/">Qubole Data Service</a> feels like it’s targeting a slightly different market - in that it tries to automate as much of the cluster management as possible. You still need to spec your cluster and select the cloud infrastructure you want to run it on, but it will then manage it for you - automatically starting, stopping and scaling it based on the current workload to make sure your cloud infrastructure costs are being minimised. 
And it also works hard to provide a much richer user interface, allowing analysts to manage their data in the cloud (including ingesting it and pushing it back to a wide range of cloud databases) and giving them rich query/job editors and Zeppelin notebooks. It feels more like the vision of a Hadoop managed service.</p> <p>Which leaves us with <a href="/technologies/cloudera-altus/">Cloudera Altus</a>. This feels like Cloudera’s attempt at a slightly higher level service. Unlike Qubole however, it has a range of offerings targeted at slightly different user communities, but it feels like it’s trying to differentiate itself from the traditional cloud Hadoop offerings.</p> <p>As always - see our <a href="/tech-categories/hadoop-distributions/">Hadoop Distributions</a> page for our full list of on premises and cloud based Hadoop distributions.</p> <p>Right - that’s it for today. One more post this week on Hadoop and then we’ll move on to something new.</p> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2018/10/17/thoughts-on-hadoop-service-providers/</guid> </item> <item><title>The Mid Week News 17/10/2018</title><link>https://ondataengineering.github.io/blog/2018/10/17/the-mid-week-news/</link><pubDate>Wed, 17 Oct 2018 07:30:00 +0000</pubDate> <description> <p>It’s news time again… <!--more--></p> <p>Technology updates (details are on the relevant technology pages):</p> <ul> <li><a href="/technologies/apache-accumulo/">Apache Accumulo</a> has a 2.0 alpha release out</li> <li>Hortonworks DataPlane <a href="/technologies/hortonworks-dataplane/data-lifecycle-manager/">Data Lifecycle Manager</a> has hit 1.2</li> <li>Hortonworks DataPlane <a href="/technologies/hortonworks-dataplane/streams-messaging-manager/">Streams Messaging Manager</a> has hit 1.1</li> </ul> <p>Other technology news:</p> <ul> <li>Hortonworks have a couple of posts on <a href="/technologies/apache-hadoop/ozone/">Ozone</a> - <a 
href="https://hortonworks.com/blog/apache-hadoop-ozone-object-store-overview/">overview</a>; <a href="https://hortonworks.com/blog/apache-hadoop-ozone-object-store-architecture/">architecture</a></li> <li>Confluent have another post on how event driven architectures are the future, but there’s some interesting stuff there on on-demand data stores that could be interesting from an analytics point of view - <a href="https://www.confluent.io/blog/event-driven-2-0">link</a></li> <li>From Data Artisans - checkpointing and Kafka offsets with <a href="/technologies/apache-flink/">Apache Flink</a> - <a href="https://data-artisans.com/blog/how-apache-flink-manages-kafka-consumer-offsets">link</a></li> <li>A nice intro on ingestion of data into Hadoop using <a href="/technologies/streamsets-data-collector/">StreamSets</a> - <a href="https://streamsets.com/blog/modernizing-hadoop-ingest-beyond-flume-and-sqoop/">link</a></li> <li>Looks like Snowflake has just secured a bunch more funding - <a href="https://www.theregister.co.uk/2018/10/11/tricorn_snowflake_pockets_450m_in_another_massive_funding_round/">link</a></li> <li><a href="/technologies/google-cloud-storage/">Google Cloud Storage</a> is seeing some small changes - <a href="https://cloud.google.com/blog/products/storage-data-transfer/store-it-analyze-it-back-it-up-cloud-storage-updates-bring-new-replication-options">link</a></li> <li>From Datanami, an update on Teradata strategy - <a href="https://www.datanami.com/2018/10/09/new-teradata-focuses-on-answers-not-analytics/">link</a></li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/blog/2018/10/17/the-mid-week-news/</guid> </item> <item><title>Qubole Data Service</title><link>https://ondataengineering.github.io/technologies/qubole-data-service/</link><pubDate>Fri, 12 Oct 2018 00:00:00 +0000</pubDate> <description> <p>Hadoop as a managed service over AWS, Azure, Google Cloud Platform and Oracle Cloud. 
Supports Airflow, Hadoop, Presto and Spark cluster types, automatic management (starting, stopping and scaling) of clusters based on workload, automatic shared Hive metastores within accounts, role based access control (to accounts, clusters and UI/API functionality, with Hive authorisation to manage access to data), connectivity to external databases (Data Stores), labelling of clusters and routing of commands by label (allowing graceful cluster upgrades), custom node bootstrap commands, encryption, auditing, data caching (on AWS only, via the open source Rubix project) and ODBC/JDBC drivers. Has a rich web based user interface that supports exploration of data (in Hadoop, object stores and connected external databases), a command composer (supporting Hive, Presto, Pig, Shell, Spark and Workflow commands) with auto completion and command history, parameterisable command templates, data management (import, export and upload), a visual query builder (Smart Query), Zeppelin based notebooks (including publication of public read only notebook views), command schedulers, cluster management and a range of usage and cluster metrics and graphs. Also supports a REST API. Priced per hour based on the cloud infrastructure being used, which is in addition to any cloud vendor costs. 
Launched in 2013.</p> <!-- Tech Category metadata --> <h2>Technology Information</h2> <table> <tbody> <tr><td>Other Names</td><td>QDS</td></tr> <tr><td>Type</td><td>Commercial</td></tr> <tr><td>Last Updated</td><td>May 2019 - R56 (AWS/Azure/Oracle)</td></tr> </tbody> </table> <h2>Related Technologies</h2> <table> <tbody> <tr><td>Packages</td><td><a href="/technologies/apache-airflow/">Apache Airflow</a>, <a href="/technologies/apache-hadoop/">Apache Hadoop</a>, <a href="/technologies/apache-hive/">Apache Hive</a>, <a href="/technologies/apache-pig/">Apache Pig</a>, <a href="/technologies/apache-spark/">Apache Spark</a>, <a href="/technologies/apache-sqoop/">Apache Sqoop</a>, <a href="/technologies/apache-tez/">Apache Tez</a>, <a href="/technologies/apache-zeppelin/">Apache Zeppelin</a>, <a href="/technologies/presto/">Presto</a>, TensorFlow</td></tr> </tbody> </table> <h2 id="release-history">Release History</h2> <table> <tbody> <tr> <td>version</td> <td>release date</td> <td>release links</td> <td>release comment</td> </tr> <tr> <td>R54</td> <td>2018-11-29</td> <td><a href="https://www.qubole.com/blog/release-54/">blog post</a>; <a href="https://docs.qubole.com/en/latest/release-notes/releasenotesR54/index.html">AWS</a>; <a href="https://docs.qubole.com/en/latest/release-notes/releasenotes-AzureR54/index.html">Azure</a>; <a href="https://docs.qubole.com/en/latest/release-notes/releasenotes-OracleR54/index.html">Oracle</a></td> <td> </td> </tr> <tr> <td>R55 (AWS)</td> <td>2019-02-06</td> <td><a href="https://docs.qubole.com/en/latest/release-notes/releasenotesR55/index.html">release notes</a></td> <td> </td> </tr> <tr> <td>R55 (Azure)</td> <td>2019-03-15</td> <td><a href="https://docs.qubole.com/en/latest/release-notes/releasenotes-AzureR55/index.html">release notes</a></td> <td> </td> </tr> <tr> <td>R55 (Oracle)</td> <td>2019-03-15</td> <td><a href="https://docs.qubole.com/en/latest/release-notes/releasenotes-OracleR55/index.html">release notes</a></td> <td> </td> 
</tr> <tr> <td>GCP</td> <td>2019-04-10</td> <td><a href="https://www.qubole.com/blog/qubole-google-deliver-unified-user-experience/">announcement</a>; <a href="https://www.qubole.com/blog/technical-overview-of-qubole-on-gcp/">tech overview</a></td> <td> </td> </tr> <tr> <td>R56 (AWS)</td> <td>2019-05-08</td> <td><a href="https://docs.qubole.com/en/latest/release-notes/releasenotesR56/index.html">release notes</a></td> <td> </td> </tr> <tr> <td>R56 (Oracle)</td> <td>2019-05-29</td> <td><a href="https://docs.qubole.com/en/latest/release-notes/releasenotes-OracleR56/index.html">release notes</a></td> <td> </td> </tr> <tr> <td>R56 (Azure)</td> <td>2019-05-30</td> <td><a href="https://docs.qubole.com/en/latest/release-notes/releasenotes-AzureR56/index.html">release notes</a></td> <td> </td> </tr> </tbody> </table> <h2 id="links">Links</h2> <ul> <li><a href="https://www.qubole.com/">https://www.qubole.com/</a> - homepage</li> <li><a href="https://docs.qubole.com/en/latest/">https://docs.qubole.com/en/latest/</a> - docs</li> <li><a href="https://docs.qubole.com/en/latest/admin-guide/osversionsupport.html">https://docs.qubole.com/en/latest/admin-guide/osversionsupport.html</a> - supported component versions</li> <li><a href="https://rubix.readthedocs.io/en/latest/">https://rubix.readthedocs.io/en/latest/</a> - Rubix</li> </ul> <h2 id="news">News</h2> <ul> <li><a href="https://www.qubole.com/blog/">https://www.qubole.com/blog/</a> - Qubole blog</li> <li><a href="https://docs.qubole.com/en/latest/release-notes/index.html">https://docs.qubole.com/en/latest/release-notes/index.html</a> - release notes</li> </ul> </description> <guid isPermaLink="true">https://ondataengineering.github.io/technologies/qubole-data-service/</guid> </item> <item><title>Google Cloud DataProc</title><link>https://ondataengineering.github.io/technologies/google-cloud-dataproc/</link><pubDate>Thu, 11 Oct 2018 00:00:00 +0000</pubDate> <description> <p>Service for dynamically provisioning Hadoop clusters 
on Google Compute Engine based on a single standard set of Hadoop services. Supports selection of virtual machines (including custom machine types and machines with GPUs), usage of custom VM images, a claimed cluster startup time of less than 90 seconds, local storage and HDFS filesystem, programmatic execution of jobs, workflows (parameterisable operations that create clusters, run jobs and then delete the cluster), manual and automatic scaling, initialisation actions (to install extra services or run scripts, with a set of open source actions available), optional components (automatic addition of extra services), automatic deletion of clusters (based on time, usage or idleness), integration with Stackdriver Logging and Monitoring and encryption of data in HDFS and Cloud Storage. Manageable via the Google Cloud Console Web UI and SDK plus an RPC and REST API. Priced at an hourly rate (charged per second) based on the specification of the VMs being used, which is in addition to any Compute Engine or Persistent Disk charges.</p> <!-- Tech Category metadata --> <h2>Technology Information</h2> <table> <tbody> <tr><td>Other Names</td><td>Google DataProc, DataProc</td></tr> <tr><td>Type</td><td>Commercial</td></tr> <tr><td>Last Updated</td><td>October 2018 - v1.3</td></tr> </tbody> </table> <h2>Related Technologies</h2> <table> <tbody> <tr><td>Packages</td><td><a href="/technologies/apache-hadoop/">Apache Hadoop</a>, <a href="/technologies/apache-hive/">Apache Hive</a>, <a href="/technologies/apache-pig/">Apache Pig</a>, <a href="/technologies/apache-spark/">Apache Spark</a>, <a href="/technologies/apache-tez/">Apache Tez</a></td></tr> </tbody> </table> <h2 id="links">Links</h2> <ul> <li><a href="https://cloud.google.com/dataproc/">https://cloud.google.com/dataproc/</a> - homepage</li> <li><a href="https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions">https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions</a> - bundle 
services version list</li> <li><a href="https://cloud.google.com/dataproc/docs/release-notes">https://cloud.google.com/dataproc/docs/release-notes</a> - release notes</li> <li><a href="https://cloud.google.com/dataproc/docs/">https://cloud.google.com/dataproc/docs/</a> - documentation</li> <li><a href="https://github.com/GoogleCloudPlatform/dataproc-initialization-actions">https://github.com/GoogleCloudPlatform/dataproc-initialization-actions</a> - open source initialization actions</li> </ul> <h2 id="news">News</h2> <p>See <a href="/tech-vendors/google-cloud-platform/">Google Cloud Platform</a> updates</p> </description> <guid isPermaLink="true">https://ondataengineering.github.io/technologies/google-cloud-dataproc/</guid> </item> </channel> </rss>