Probabilistic Programming in Python
Published: 2019-05-11

Learn about probabilistic programming in this guest post by Osvaldo Martin, a researcher at The National Scientific and Technical Research Council (CONICET).

Bayesian Inference

Bayesian statistics is conceptually very simple; we have the knowns and the unknowns; we use Bayes’ theorem to condition the latter on the former. If we are lucky, this process will reduce the uncertainty about the unknowns.

Generally, we refer to the knowns as data and treat them as constants, and we refer to the unknowns as parameters and treat them as probability distributions. In more formal terms, we assign probability distributions to unknown quantities. Then, we use Bayes’ theorem to transform the prior probability distribution p(θ) into a posterior distribution p(θ | y).

Although conceptually simple, fully probabilistic models often lead to analytically intractable expressions. For many years, this was a real problem and was probably one of the main issues that hindered the wide adoption of Bayesian methods.
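For reference, here is Bayes’ theorem in standard notation, with θ standing for the parameters and y for the data (the symbols are an assumption made here for illustration):

```latex
p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)},
\qquad
p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta
```

The integral in the denominator, the marginal likelihood, is usually the term that has no closed-form solution once the model goes beyond a few textbook cases.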

The arrival of the computational era and the development of numerical methods that, at least in principle, can be used to solve any inference problem have dramatically transformed the practice of Bayesian data analysis.

Probabilistic Programming: the Inference-Button

We can think of these numerical methods as universal inference engines or, as Thomas Wiecki, one of the core developers of PyMC3, likes to call them, the inference-button. The possibility of automating the inference process has led to the development of probabilistic programming languages (PPLs), which allow for a clear separation between model creation and inference.

In the PPL framework, users specify a full probabilistic model by writing a few lines of code, and then inference follows automatically. It is expected that probabilistic programming will have a major impact on data science and other disciplines by enabling practitioners to build complex probabilistic models in a less time-consuming and less error-prone way.

I think one good analogy for the impact that programming languages can have on scientific computing is the introduction of the Fortran programming language more than six decades ago. While Fortran has lost its shine nowadays, at one time, it was considered to be very revolutionary.

Arnold Reinhold [CC BY-SA 2.5], via

For the first time, scientists moved away from computational details and began focusing on building numerical methods, models, and simulations in a more natural way. In a similar fashion, we now have PPLs, which hide from users the details of how probabilities are manipulated and how inference is performed, allowing them to focus on model specification and the analysis of results.

In this article, you will learn how to use PyMC3 to define and solve models. We will treat the inference-button as a black box that gives us proper samples from the posterior distribution. The methods we will be using are stochastic, and so the samples will vary every time we run them.

However, if the inference process works as expected, the samples will be representative of the posterior distribution and thus we will obtain the same conclusion from any of those samples.
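To make the idea of "different samples, same conclusion" concrete, here is a minimal sketch of a stochastic sampler run twice with different seeds. It uses random-walk Metropolis, not the NUTS sampler that PyMC3 uses, because Metropolis fits in a few lines. The data (1 head in 4 coin flips), the uniform prior, the step size, and all function names are assumptions chosen for illustration.

```python
import math
import random


def log_posterior(theta, heads, tails):
    """Unnormalized log posterior of a coin's bias under a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


def metropolis(heads, tails, draws=20000, tune=1000, step=0.2, seed=0):
    """Random-walk Metropolis; returns `draws` samples after `tune` warm-up steps."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, heads, tails)
    samples = []
    for i in range(tune + draws):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, heads, tails)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if i >= tune:
            samples.append(theta)
    return samples


# Hypothetical data: 1 head and 3 tails.
run_a = metropolis(heads=1, tails=3, seed=1)
run_b = metropolis(heads=1, tails=3, seed=2)
mean_a = sum(run_a) / len(run_a)
mean_b = sum(run_b) / len(run_b)
print(f"run A mean: {mean_a:.3f}, run B mean: {mean_b:.3f}")
```

The individual draws in the two runs differ, but both means should land close to 1/3, which is the mean of the exact posterior, Beta(2, 4), for this toy dataset.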

PyMC3 primer

PyMC3 is a Python library for probabilistic programming. The latest version at the moment of writing is 3.6. PyMC3 provides a very simple and intuitive syntax that is easy to read and close to the syntax used in statistical literature to describe probabilistic models. PyMC3’s base code is written using Python, and the computationally demanding parts are written using NumPy and Theano.

What is Theano?

Theano is a Python library that was originally developed for deep learning and allows us to define, optimize, and evaluate mathematical expressions involving multidimensional arrays efficiently. The main reason PyMC3 uses Theano is because some of the sampling methods, such as NUTS, need gradients to be computed, and Theano knows how to compute gradients using what is known as automatic differentiation.
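As a rough illustration of what automatic differentiation means, the sketch below computes an exact derivative by applying the chain rule one operation at a time. Theano works on symbolic expression graphs; this sketch instead uses forward-mode differentiation with dual numbers, a simpler variant, so it should not be read as Theano's implementation. The Dual class, the log helper, and the counts in the likelihood are all made up for this example.

```python
import math


class Dual:
    """A value paired with its derivative with respect to one input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def log(x):
    # Chain rule: (log u)' = u' / u
    return Dual(math.log(x.value), x.deriv / x.value)


def log_likelihood(theta, heads=1, tails=3):
    """Coin-flip log-likelihood; the counts are made-up illustration data."""
    return heads * log(theta) + tails * log(1 + (-1) * theta)


# Seed the input with derivative 1 to differentiate with respect to theta.
result = log_likelihood(Dual(0.35, 1.0))
print(result.value, result.deriv)
```

The derivative agrees with the hand-derived expression heads / θ - tails / (1 - θ), and no finite-difference approximation is involved.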

Also, Theano compiles Python code to C code, and hence PyMC3 is really fast. This is all the information about Theano that we need in order to use PyMC3. If you still want to learn more about it, start with the official Theano tutorial.

Note

You may have heard that Theano is no longer developed, but that’s no reason to worry. PyMC devs will take over Theano maintenance, ensuring that Theano will keep serving PyMC3 for several years to come.

At the same time, PyMC devs are moving quickly to create the successor to PyMC3. This will probably be based on TensorFlow as a backend, although other options are being analyzed as well. You can read more about this at the following blog post: .

Flipping coins the PyMC3 way

We’ll be generating some fictional data, so you can assume that we know the true value of θ, called theta_real, in the following code. Of course, for a real dataset, we will not have this knowledge:

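import numpy as np
from scipy import stats
import pymc3 as pm  # used by the model code further down
import arviz as az  # used for plotting further down

np.random.seed(123)
trials = 4
theta_real = 0.35  # unknown value in a real experiment
data = stats.bernoulli.rvs(p=theta_real, size=trials)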

Model specification

Now that we have the data, we need to specify the model. This is done by specifying the likelihood and the prior using probability distributions. For the likelihood, we will use the binomial distribution with n = 1 and p = θ, and for the prior, a beta distribution with the parameters α = β = 1.

A beta distribution with such parameters is equivalent to a uniform distribution in the interval [0, 1]. We can write the model using the following mathematical notation:

$$
\begin{aligned}
\theta &\sim \operatorname{Beta}(\alpha = 1,\ \beta = 1) \\
y &\sim \operatorname{Bin}(n = 1,\ p = \theta)
\end{aligned}
$$

This statistical model has an almost one-to-one translation to PyMC3:

with pm.Model() as our_first_model:
    θ = pm.Beta('θ', alpha=1., beta=1.)
    y = pm.Bernoulli('y', p=θ, observed=data)
    trace = pm.sample(1000, random_seed=123)

The first line of the code creates a container for our model. Everything inside the with-block will be automatically added to our_first_model. You can think of this as syntactic sugar to ease model specification as we do not need to manually assign variables to the model. The second line specifies the prior. As you can see, the syntax follows the mathematical notation closely.
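The automatic registration described above can be imitated with a plain Python context manager. The sketch below is a toy, not PyMC3's actual implementation: ToyModel and ToyVariable are hypothetical classes invented here to show how an object created inside a with-block can find the enclosing model and attach itself to it.

```python
class ToyModel:
    """Hypothetical stand-in for a model container; not PyMC3's class."""

    _stack = []  # models whose with-block is currently open

    def __init__(self):
        self.variables = {}

    def __enter__(self):
        ToyModel._stack.append(self)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        ToyModel._stack.pop()
        return False  # do not swallow exceptions

    @classmethod
    def current(cls):
        if not cls._stack:
            raise RuntimeError("no model on the context stack")
        return cls._stack[-1]


class ToyVariable:
    """Hypothetical random variable that registers itself on creation."""

    def __init__(self, name):
        self.name = name
        ToyModel.current().variables[name] = self


with ToyModel() as toy_model:
    theta = ToyVariable("theta")
    y = ToyVariable("y")

print(list(toy_model.variables))  # ['theta', 'y']
```

Creating a ToyVariable outside any with-block raises an error in this toy, because there is no model for it to attach to.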

Note

Please note that we’ve used the name θ twice, first as a Python variable and then as the first argument of the Beta function; using the same name is a good practice to avoid confusion. The θ variable is a random variable; it is not a number, but an object representing a probability distribution from which we can compute random numbers and probability densities.

The third line specifies the likelihood. The syntax is almost the same as for the prior, except that we pass the data using the observed argument. In this way, we tell PyMC3 that we want to condition the unknowns on the knowns (data). The observed values can be passed as a Python list, a tuple, a NumPy array, or a pandas DataFrame.

Now, we are finished with the model’s specification! Pretty neat, right?

Pushing the inference button

The last line is the inference button. We are asking for 1,000 samples from the posterior and will store them in the trace object. Behind this innocent line, PyMC3 has hundreds of oompa loompas, singing and baking a delicious Bayesian inference just for you! Well, not exactly, but PyMC3 is automating a lot of tasks. If you run the code, you will get a message like this:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [θ]
100%|██████████| 3000/3000 [00:00<00:00, 3695.42it/s]

The first and second lines tell us that PyMC3 has automatically assigned the NUTS sampler (one inference engine that works very well for continuous variables), and has used a method to initialize that sampler. The third line says that PyMC3 will run two chains in parallel, so we can get two independent samples from the posterior for the price of one.

The exact number of chains is computed taking into account the number of processors in your machine; you can change it using the chains argument for the sample function. The next line is telling us which variables are being sampled by which sampler.

For this particular case, this line is not adding any new information because NUTS is used to sample the only variable we have, θ. However, this is not always the case, as PyMC3 can assign different samplers to different variables. This is done automatically by PyMC3 based on the properties of the variables, with the aim of choosing a suitable sampler for each variable. Users can manually assign samplers using the step argument of the sample function.

Finally, the last line is a progress bar, with several related metrics indicating how fast the sampler is working, including the number of iterations per second. If you run the code, you will see the progress-bar get updated really fast. Here, we are seeing the last stage when the sampler has finished its work.

The numbers are 3000/3000, where the first number is the running sample number (this starts at 1), and the last is the total number of samples. You will notice that we asked for 1,000 samples, but PyMC3 is computing 3,000 samples. For each chain, 500 samples are used to auto-tune the sampling algorithm (NUTS, in this example).

This sample will be discarded by default. We also have 1,000 productive draws per-chain, thus a total of 3,000 samples are generated. The tuning phase helps PyMC3 provide a reliable sample from the posterior. We can change the number of tuning steps with the tune argument of the sample function.
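The bookkeeping above can be checked with a few lines of arithmetic. This is only a sketch of the counting, using the numbers from the run described here (1,000 draws, 500 tuning steps, 2 chains); the function name is made up for the example.

```python
def total_iterations(draws, tune, chains):
    """Iterations shown in the progress bar: (tune + draws) for every chain."""
    return (tune + draws) * chains


draws, tune, chains = 1000, 500, 2  # values from the run described above

print(total_iterations(draws, tune, chains))  # 3000
print(draws * chains)                         # 2000 draws kept after tuning
```

With four chains instead of two, the same formula gives 6,000 iterations, of which 4,000 would be kept.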

Summarizing the posterior

Generally, the first task we will perform after sampling from the posterior is to check what the results look like. The plot_trace function from ArviZ is ideally suited to this task:

az.plot_trace(trace)

By using az.plot_trace, we get two subplots for each unobserved variable. The only unobserved variable in our model is θ.

Notice that y is an observed variable representing the data; we do not need to sample that because we already know those values. Thus, in the above figure, we have two subplots. On the left, we have a Kernel Density Estimation (KDE) plot; this is like the smooth version of the histogram. On the right, we get the individual sampled values at each step during the sampling. From the trace plot, we can visually get the plausible values from the posterior.
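To see why a KDE is described as a smooth version of the histogram, here is a hand-rolled Gaussian KDE. It is a sketch of the general technique, not the estimator ArviZ uses internally. The samples are drawn from a Beta(2, 4) distribution as a stand-in for posterior draws of θ, and the bandwidth is picked by hand; both are assumptions for illustration.

```python
import math
import random


def gaussian_kde(samples, bandwidth):
    """Return a density function: the average of Gaussian bumps, one per sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density


# Stand-in for posterior samples of theta (illustration only).
rng = random.Random(123)
samples = [rng.betavariate(2, 4) for _ in range(2000)]

density = gaussian_kde(samples, bandwidth=0.05)  # bandwidth picked by hand

# Riemann sum over a grid wide enough to cover all the bumps.
grid = [i / 100 for i in range(-50, 151)]
area = sum(density(x) for x in grid) * 0.01
print(f"area under the estimated curve: {area:.3f}")
```

Each sample contributes one small bump, so the curve is high where samples cluster and integrates to roughly 1, like a normalized histogram without bin edges.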

If this article piqued your interest in Bayesian analysis, you can check out the book by Osvaldo Martin. A step-by-step guide that follows a modern, practical, and computational approach, it is a must-read for students, data scientists, developers, and researchers alike who want to get started with probabilistic programming.
