How can data science improve products?

What are predictive models?

How do you go from insight to prototype to production application?

This is an excerpt from “Applied Data Science,” a Yhat whitepaper about data science teams and how companies apply their insights to the real world. You’ll learn how successful data science teams are composed, how they operate, and which tools and technologies they use.

We discuss the byproducts of data science and their implications beyond analysts’ laptops, and we answer the question of what to do with predictive models once they’re built. Lastly, we inspect the post-model-building process to highlight the most common pitfalls we see companies encounter when applying data science work to live data problems in day-to-day business functions and applications.

Describing data science

In an increasingly digital economy, businesses are racing to build operational knowledge around the vast amounts of data they produce each day. And with data now at the center of almost every business function, developing practices for working with data is critical regardless of your company’s size or industry.

“Data science,” one of many recently popularized terms floating amid the buzzwords and big data hoopla, is a field concerned with the extraction of knowledge from data. Practitioners, aptly named “data scientists,” are those charged with solving complex and sophisticated problems related to data, usually employing a highly diversified blend of scientific and technical tools as well as deep business and domain expertise.

“What distinguishes data science itself from the tools and techniques is the central goal of deploying effective decision-making models to a production environment.” -John Mount & Nina Zumel, Practical Data Science with R

The central goal of data science

As is the case with any analytical project, the central goal in data science is to produce practical and actionable insights to improve the business. That is to say, data scientists overcome complexities involved in data to empower businesses to make better operational decisions, optimize processes, and improve products and services used by customers and non-technical employees.

Profile of a typical data science project

Project scope and definition

In broad strokes, a data science project begins with some question, need, or goal in mind, held with varying degrees of focus. Accordingly, a data scientist’s primary task at the start of a new project is to refine the goal and develop concrete project objectives.

Analysts will first conduct a preliminary survey of the data, applying domain knowledge to develop a clear and succinct problem definition to serve as the principal object of study.

Identify relevant data sets

With a narrow and expressive definition of the problem, data scientists can begin to evaluate different data sets to identify which variables are likely to be relevant to the problem they are trying to solve. Evaluating which data sets should be used for the project, however, is not an activity performed in isolation. Most companies have numerous data sets, each highly diverse in shape, composition and size. Analysts may or may not be familiar with a particular data source, how to query it, where it comes from, what it describes or even that it exists.

For these reasons, quantitative analysts are usually working in proximity to or in direct collaboration with engineers, marketers, operations teams, product managers, and other stakeholders to gain a robust and intimate understanding of the data sources at their disposal.

Cross-functional collaboration

Collaboration at this stage is valuable not only for identifying which data are relevant to a problem but also for ensuring the ultimate viability of any resulting solution. Hybrid teams composed of stakeholders in separate functions produce a deeper collective understanding of both the problem and the data at the center of any project. How a data set is created and stored, how often it changes, and how reliable it is are critical details that can make or break the feasibility of a data product.

For example, consider a new credit-scoring algorithm that is more accurate than previous methods but relies on data no longer sold by the credit bureau. Such circumstances are common today given that data sets are so diverse and subject to frequent change. By incorporating interdepartmental expertise in the early stages of model development, companies dramatically reduce the risk of pursuing unanswerable questions and ensure data scientists are focusing attention on the most suitable data sets.

Model-building

After firming up the project’s definition and completing a preliminary survey of the data, analysts enter the model-building phase of the analytics lifecycle. The notion of “model” is often obscure and can be difficult to define, even for those well versed in data science vocabulary.

A statistical model, in short, is an abstract representation of some relationship between variables in data. In other words, a model describes how one or more independent (explanatory) variables relate to one or more dependent variables. A simple linear regression model might, for example, describe the relationship between years of education (X) and personal income (y).
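
To make the education and income example concrete, here is a minimal sketch in Python of fitting a simple linear regression by ordinary least squares. The data values and the function name `fit_simple_linear_regression` are invented for illustration and do not come from the whitepaper.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical observations: years of education (X) and income in thousands (y).
years_of_education = [10, 12, 14, 16, 18]
income_thousands = [28, 37, 41, 50, 54]

intercept, slope = fit_simple_linear_regression(years_of_education, income_thousands)

# Use the fitted relationship to estimate income for 15 years of education.
predicted_income = intercept + slope * 15
print(f"income = {intercept:.2f} + {slope:.2f} * years; estimate at 15 years: {predicted_income:.2f}")
```

The fitted slope is the model's summary of the relationship: the estimated change in income associated with one additional year of education in this toy data.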

But linear regression is far from the only way to represent the relationships in data, and identifying the right algorithms and machine learning methods for your problem is largely an exploratory exercise. Data scientists apply knowledge of the business and advanced research skills to identify those algorithms and methods most likely to be effective for solving a problem. Many and perhaps most data science studies are bound up with solving some combination of clustering, regression, classification, and/or ranking problems. And within each of these categories are numerous algorithms that may or may not be suitable for tackling a given problem.

To that end, the model-building phase is characterized by rigorous testing of different algorithms and methods drawing from one or more of these problem classes (i.e. clustering, regression, classification, and ranking) with the ultimate goal being to identify the “best” way to model some underlying business phenomenon. “Best,” importantly, will take on a different meaning depending on the problem, the data, and the situational nuances tied to the project. For example, the “best” way to model the quality of the Netflix recommendation system is very different from the “best” way to model the quality of a credit-scoring algorithm.
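
One common way to test competing methods is to score each candidate on data it was not trained on. The sketch below assumes a regression problem, mean squared error as the definition of “best,” and a simple 3-fold cross-validation split. All names and data are hypothetical, and a real project would choose a metric that fits the business problem.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline candidate: always predict the average of the training targets."""
    avg = mean(ys)
    return lambda x: avg


def fit_line(xs, ys):
    """Second candidate: simple least-squares linear regression."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return lambda x: y_bar + slope * (x - x_bar)


def cross_validated_mse(fit, xs, ys, k=3):
    """Average squared error on held-out points across k folds."""
    n = len(xs)
    squared_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th point is held out
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        predict = fit(train_x, train_y)
        squared_errors += [(ys[i] - predict(xs[i])) ** 2 for i in test_idx]
    return mean(squared_errors)


# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0]

candidates = {"mean_baseline": fit_mean, "linear_regression": fit_line}
scores = {name: cross_validated_mse(fit, xs, ys) for name, fit in candidates.items()}
best = min(scores, key=scores.get)  # lowest held-out error wins under this metric
print(scores, "->", best)
```

Swapping the scoring function changes which candidate wins, which is the sense in which “best” depends on the problem.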

Actionable data science & applications in operations

When a data science project progresses beyond the model-building phase, the core question is how best to take advantage of the insights produced. This is a critical juncture, and one that ultimately determines the practical ROI of your data science investment.

Extracting value from data is like any other value chain. Companies expend resources to convert raw material—in this case data—into valuable products and services suitable for the market.

“A data product provides actionable information without exposing decision makers to the underlying data or analytics. Examples include: movie recommendations, weather forecasts, stock market predictions, production process improvements, health diagnoses, flu trend predictions, and targeted advertising.” -Mark Herman, et al., Field Guide to Data Science

As is the case with any value chain, a product gains value as it progresses from one lifecycle stage to the next. Therefore, the manner in which activities in the chain are carried out is important as it often impacts the system’s value.

Consider the product recommendations example again—our goal is to increase average order size for shoppers on our website by recommending other products users will find relevant.

Data science lifecycle steps:
  1. Refine The Problem Definition

  2. Survey The Raw Material And Evaluate Which Data To Include In The Model

  3. Rigorously Test Modeling Techniques

  4. Identify A Winning Modeling Strategy For Implementation

  5. Integrate Recommendations Into The Website To Influence Customers
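
As a rough illustration of what the final step might serve to shoppers, the sketch below recommends products that were most often bought together with the items in a cart. The whitepaper does not specify a modeling technique. Co-purchase counting, the order data, and the function names here are all assumptions chosen for brevity.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_co_purchase_counts(orders):
    """Count how often each pair of products appears in the same order."""
    counts = defaultdict(Counter)
    for order in orders:
        for a, b in permutations(set(order), 2):
            counts[a][b] += 1
    return counts


def recommend(counts, cart, top_n=2):
    """Return up to top_n products most often co-purchased with items in the cart."""
    scores = Counter()
    for item in cart:
        for other, n in counts.get(item, {}).items():
            if other not in cart:
                scores[other] += n
    # Sort by score (descending), then by name so ties are resolved predictably.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [product for product, _ in ranked[:top_n]]


# Hypothetical order history.
orders = [
    ["tent", "sleeping_bag", "lantern"],
    ["tent", "sleeping_bag"],
    ["tent", "stove"],
    ["lantern", "batteries"],
]

counts = build_co_purchase_counts(orders)
print(recommend(counts, ["tent"]))
```

A function like `recommend` only creates value once it is wired into the live website, which is the point the lifecycle's final step makes.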

Common sense indicates that progressing through step four without achieving step five falls short of the objective. But, sadly, this is a common scenario among companies developing data science capabilities. Similarly, it is often the case that hypotheses are disproved only after companies have invested substantial time and effort engineering large-scale analytics implementations for models which later prove to be suboptimal or entirely invalid.
