How do you handle the thermocline of your data lake?

Big Data has a fancy term to describe an architecture for data: data lake. Without getting into an in-depth discussion about the precise definition of “data lake,” I have found surprising analogies to actual, real-world lakes in both ecology and environmental science. One of these is the term thermocline.

What are “thermoclines”?

In a lake, a thermocline is a layer that separates the water above from the water below. (A much more precise definition is on Wikipedia.) The characteristics most interesting to discussions of Big Data are:

  • The water above the thermocline is much warmer and is more ecologically vibrant — it has more organisms.

  • The water below the thermocline is cold and still. Sunlight does not penetrate. The water has more oxygen because fewer organisms have existed to consume it. There may be more nutrients because dead organisms may sink to this layer and remain unconsumed.

  • In colder climates, the thermocline disappears in the winter. The warm water above and the cold water below mix together. The cold water below is more nutrient- and oxygen-rich, and when it surfaces, it leads to vibrant blooms of phytoplankton, as in the Arctic or Antarctica.

The hidden nutrients in your data lake

What we want are the phytoplankton blooms. In data lakes, the equivalent is blossoms of innovation and insight that strike a data team and the whole organization.

How do we create the conditions for these blooms? By cyclically dissolving the naturally occurring thermoclines that exist in every data lake. To do this, we must understand how thermoclines arise and how to dissolve them.

The “80/20 rule”

In data lakes, teams access 20% of the data 80% of the time. This power-law distribution arises from three factors:

  1. Leaders are loss averse, repeatedly asking the same questions.

  2. The organization “remembers” how losses occurred in the past. Data scientists fall into the trap of tracing and tracking only those sources.

  3. The physics of organizational memory narrows the data that data scientists bother to investigate.

The narrowing compounds. Data teams access only 20% of the data, attempt to answer only 20% of the valuable questions, and, within that data, look only at the 20% relevant to assumed loss factors. As a result, teams realize only a tiny portion of the potential value.

Repeatedly asking and answering the same questions also leads to diminishing returns. Knowing immediately when the watched pot boils may be immensely valuable. But other research has value too.
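
One way to check whether the 80/20 pattern holds in a particular data lake is to measure it from an access log. The sketch below is only an illustration under stated assumptions: the function name `access_concentration`, the dataset names, and the access counts are all hypothetical, and a real log would come from whatever query or audit history your platform keeps.

```python
from collections import Counter


def access_concentration(access_log, share=0.8):
    """Return the fraction of distinct datasets that accounts for `share` of accesses.

    `access_log` is any iterable of dataset names, one entry per access.
    Only datasets that appear in the log are counted.
    """
    counts = Counter(access_log)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("access_log is empty")

    running = 0
    # Walk datasets from most to least accessed until `share` is covered.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running / total >= share:
            return rank / len(counts)
    return 1.0


# Hypothetical access counts for ten datasets (100 accesses in total).
hypothetical_counts = {
    "orders": 50, "customers": 30, "returns": 5, "reviews": 4,
    "shipping": 3, "support_tickets": 3, "referrals": 2,
    "surveys": 1, "social_mentions": 1, "store_visits": 1,
}
log = [name for name, n in hypothetical_counts.items() for _ in range(n)]

fraction = access_concentration(log)
print(f"{fraction:.0%} of datasets receive 80% of accesses")
```

A result near 20% suggests a strong thermocline; a much higher figure suggests that attention is already spread more evenly.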

Living in ancestral fear

Imagine standing watch on a wall of a settlement in the savanna.

  • Of prime importance, of course, is identifying threats to the tribe.

  • Based on historical precedent, the leaders of the tribe have learned not to look to factors such as threats to water sources, but instead focus on predators.

  • The only species of interest are lions.

  • Because lions are not flying creatures, those who guard the walls scan the horizon at ground level, and not the skies above.

The tribe has focused its resources on a narrow set of problems!

In my experience, many organizational leaders focus on dashboards the same way that settlements patrol their walls. And dashboards can be seen as a living repository of trauma — every loss experienced by an organization becomes a monitor.

Missed opportunities

There is extreme value in the organizational memory of trauma. And the dynamics behind loss aversion, reflexive conditioning, and organizational memory are beyond the scope of this post. To be clear, I am not attempting to persuade organizations to somehow “embrace fear” and throw away their dashboards! “Yes, And!”

When an organization is driven solely by loss aversion questions, it loses the opportunity for radical and revolutionary innovation. Loss aversion can only ever be linear and evolutionary.

Radical and revolutionary innovation for data teams looks a lot like basic research, as opposed to the applied research of loss aversion.

Basic research requires a shift in consciousness. A data team must step out of the emotion of fear into an emotional detachment that allows the cognitive mind to function unhindered.

  • Applied research in loss aversion would be, “What shopping cart contents do prospective customers have when they abandon?” This question is immediately relevant to the organization and is concretely actionable.

  • Basic research that seeks knowledge for its own sake would be, “What do customers post on Instagram after receiving their orders?” Whether this is relevant to the organization is speculative at best. It may or may not be actionable. Yet the insights may be an opportunity for radical growth.

Removing sunlight

In natural lakes, sunlight creates the thermocline. Water near the surface is warm because of sunlight, and deeper water is cool because sunlight fades with depth. This temperature gradient creates a partition.

In the winter, the reduction of sunlight removes the temperature difference — all water is cold, regardless of depth. So the thermocline disappears.

In data lakes, the attention and focus created by loss aversion questions and the impact of organizational memory lead to 20% of data being “warm.”

In data lakes, we can eliminate the thermocline by eliminating sunlight. Remove the focus on loss aversion questions and push back against the focusing effect of organizational memory. Again, this is not a call to abandon the 80/20 work altogether. Data teams should create regular opportunities to cycle between loss aversion research and basic research. Take three days per month or a week per quarter for this work.

Every organization has its optimal balance for loss-aversion vs. more open-ended curiosity. Be deliberate and strategic in choosing that balance. Many factors influence this choice.

Become curious

  1. Ask open-ended questions divorced from critical organizational objectives. Know for the sake of knowing.

  2. Frame questions in terms of compare and contrast.

  3. Prevent data scientists from accessing the tables or files they regularly use. Force them to become familiar with new datasets.

All three are required together to break out of the 80/20 trap.
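
For the third step, it helps to know which datasets count as "regularly used" before a curiosity sprint starts. The following is a minimal sketch, assuming you can export a list of dataset names (a catalog) and an access log for one analyst or one team. The function name `split_warm_cold`, the 80% cutoff, and the example data are all hypothetical, and the code only produces the two lists; how access is actually withheld depends on your platform.

```python
from collections import Counter


def split_warm_cold(catalog, access_log, warm_share=0.8):
    """Split `catalog` into (warm, cold) dataset lists.

    Warm datasets are the most-accessed ones that together cover roughly
    `warm_share` of all accesses. Everything else, including datasets that
    never appear in `access_log`, is cold.
    """
    counts = Counter(access_log)
    total = sum(counts[name] for name in catalog)
    if total == 0:
        return [], list(catalog)

    # Most-accessed first; ties keep their catalog order (sorted is stable).
    ranked = sorted(catalog, key=lambda name: counts[name], reverse=True)

    warm, cold = [], []
    running = 0
    for name in ranked:
        if running / total < warm_share:
            warm.append(name)
        else:
            cold.append(name)
        running += counts[name]
    return warm, cold


# Hypothetical catalog and access log for one analyst.
catalog = ["orders", "customers", "returns", "reviews", "clickstream_raw"]
access_log = ["orders"] * 6 + ["customers"] * 2 + ["returns"] + ["reviews"]

warm, cold = split_warm_cold(catalog, access_log)
print("Withhold during the sprint:", warm)
print("Explore during the sprint:", cold)
```

Running this per analyst rather than per team keeps the exercise closer to the step as written, since each person's "warm" set can differ.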

Open-ended questions dissolve blocks

When asking open-ended questions, a common approach is hackathons. Hackathons are a mistake: they are too open-ended, rarely yielding value.

Choose a topic such as, “Who do our customers talk to?” This question is somewhat bounded, yet broad enough for an infinitely diverse set of research.

The objective of the open-ended topic is two-fold:

  • Engender curiosity and a sense of mystery about the organization’s ecosystem.

  • Create enough overlap between different data scientists that they can cross-pollinate after the sprint.

Cross-pollination is where data science innovation takes place.

How is this not like the other?

Comparing and contrasting is extremely powerful.

  • Comparing and contrasting a single object automatically involves digging into historical data.

  • Commonalities and differences are intellectually itchy. Good data scientists want to know, “Why?”

Historical data, especially data instrumented from in-house systems, suffers from semantic and concept drift. Resolving that drift is ripe ground for innovation blooms.

The question “Why?” almost certainly points outside the organization. Data scientists must ponder wider industry, economic, and global trends. Attempts to answer these questions will create work for product managers and product software engineers: they are held responsible for improving the data and instrumentation of their products.

The annoying wheel gets the grease

Preventing data scientists from using the tables or files they use all the time will be controversial. Remind them that the point is to get them to peek into the data they don’t use.

Expect them to file tickets with the data team. After all, sunlight is the best disinfectant.

These phytoplankton blooms are where it’s at

Thoughts and feedback, please!

Translated from: https://medium.com/@jennyckwan/how-do-you-handle-the-thermocline-of-your-data-lake-e585cd940572
