This document discusses integrating a custom Poll Mode Driver (PMD) into DPDK to offload packet-processing tasks to multicore or application-specific integrated circuit (ASIC) hardware. It presents a use case in which intrusion detection system (IDS) and IPsec processing are offloaded to the multicore hardware, which also decrypts encrypted SSL traffic. Developing a custom PMD allows the multicore/ASIC to be used as a line card or load balancer while avoiding the extra overhead, frame reprocessing, and increased latency that come with using network connections between DPDK and the hardware. Performance is improved by connecting the multicore/ASIC directly to DPDK, which removes that overhead and provides a high-throughput interface.
The document discusses Pregel, a graph-parallel processing platform developed at Google for large-scale graph processing. Pregel is inspired by the bulk synchronous parallel (BSP) model and uses a vertex-centric programming model where computation is viewed as messages passed between graph vertices. In Pregel, applications run as a series of supersteps where vertices can update themselves and pass messages to other vertices, with global synchronization between supersteps. This model is better suited for graph problems compared to more general data-parallel systems.
This document describes how the Linux Terminal Server Project (LTSP) works, allowing low-cost computers to be used as workstations in graphical or text mode over a network. It explains the boot process of thin clients and the configuration of the LTSP server, including TFTP, DHCP, and NFS, to provide resources to the clients and let them boot and access files over the network. It also details the configuration options in the lts.conf file for customization.
Integrating Existing C++ Libraries into PySpark with Esther Kundin (Databricks)
Bloomberg's Machine Learning/Text Analysis team has developed many machine learning libraries for fast real-time sentiment analysis of incoming news stories. These models were developed using smaller training sets, implemented in C++ for minimal latency, and are currently running in production. To facilitate backtesting our production models across our full data set, we needed to be able to parallelize our workloads while using the actual production code.
We also wanted to integrate the C++ code with PySpark and use it to run our models. In this talk, I will discuss some of the challenges we faced, decisions we made, and other options when dealing with integrating existing C++ code into a Spark system. The techniques we developed have been used successfully by our team multiple times and I am sure others will benefit from the gotchas that we were able to identify.
jBPM6 provides a more powerful and flexible API for managing business processes compared to jBPM5. The key changes and features in jBPM6 include the following (a short usage sketch follows the list):
1) A new RuntimeManager that hides session management complexity and allows configuring session handling on a per-request, per-process-instance, or singleton basis across multiple nodes.
2) CDI integration and remoting capabilities that allow injecting and accessing jBPM services like the TaskService and KieSession through REST or JMS.
3) Clustering support through technologies like Apache Zookeeper and Helix for high availability, load balancing, and distributed timers across a cluster of nodes.
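To make the RuntimeManager change concrete, here is a minimal sketch of the per-process-instance strategy using the public jBPM6/KIE APIs. It is not taken from the presentation; the process file hello.bpmn and the process id com.sample.hello are hypothetical placeholders.

import org.kie.api.io.ResourceType;
import org.kie.api.runtime.KieSession;
import org.kie.api.runtime.manager.RuntimeEngine;
import org.kie.api.runtime.manager.RuntimeEnvironment;
import org.kie.api.runtime.manager.RuntimeEnvironmentBuilder;
import org.kie.api.runtime.manager.RuntimeManager;
import org.kie.api.runtime.manager.RuntimeManagerFactory;
import org.kie.api.task.TaskService;
import org.kie.internal.io.ResourceFactory;
import org.kie.internal.runtime.manager.context.ProcessInstanceIdContext;

public class PerProcessInstanceExample {
    public static void main(String[] args) {
        // Build a runtime environment from a process definition (hello.bpmn is a placeholder).
        RuntimeEnvironment environment = RuntimeEnvironmentBuilder.Factory.get()
                .newDefaultInMemoryBuilder()
                .addAsset(ResourceFactory.newClassPathResource("hello.bpmn"), ResourceType.BPMN2)
                .get();

        // Per-process-instance strategy: every process instance gets its own KieSession,
        // and the RuntimeManager hides all session bookkeeping.
        RuntimeManager manager = RuntimeManagerFactory.Factory.get()
                .newPerProcessInstanceRuntimeManager(environment);

        // No process instance exists yet, so request an engine with an empty instance context.
        RuntimeEngine engine = manager.getRuntimeEngine(ProcessInstanceIdContext.get());
        KieSession ksession = engine.getKieSession();
        TaskService taskService = engine.getTaskService(); // human tasks go through the same engine

        ksession.startProcess("com.sample.hello"); // hypothetical process id

        // Return the engine to the manager and shut down cleanly.
        manager.disposeRuntimeEngine(engine);
        manager.close();
    }
}

Switching to the per-request or singleton strategy is a one-line change (newPerRequestRuntimeManager or newSingletonRuntimeManager), which is exactly the flexibility the new API is meant to provide.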
The document summarizes the Linux memory management architecture. It has two parts: an architecture-independent memory model and an implementation for a specific architecture. The memory model divides the virtual address space into pages and uses page tables to map virtual to physical addresses. It also describes data structures, such as page table entries and region descriptors, that track memory mappings.
How to Actually Tune Your Spark Jobs So They Work - Ilya Ganelin
This document summarizes a USF Spark workshop that covers Spark internals and how to optimize Spark jobs. It discusses how Spark works with partitions, caching, serialization and shuffling data. It provides lessons on using less memory by partitioning wisely, avoiding shuffles, using the driver carefully, and caching strategically to speed up jobs. The workshop emphasizes understanding Spark and tuning configurations to improve performance and stability.
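As a rough illustration of those lessons (this is not code from the workshop), the sketch below uses Spark's Java API with made-up data: it sets an explicit partition count, prefers reduceByKey over groupByKey to shrink the shuffle, persists only the reused result, and pulls just a small sample back to the driver.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;
import java.util.Arrays;

public class TuningSketch {
    public static void main(String[] args) {
        JavaSparkContext ctx = new JavaSparkContext(new SparkConf().setAppName("tuningSketch"));

        // Partition wisely: choose an explicit partition count instead of the default.
        JavaPairRDD<String, Integer> events = ctx.parallelizePairs(Arrays.asList(
                new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3)), 4);

        // Avoid expensive shuffles of raw records: reduceByKey combines values map-side
        // before shuffling, unlike groupByKey, which ships every value across the network.
        JavaPairRDD<String, Integer> totals = events.reduceByKey((x, y) -> x + y);

        // Cache strategically: persist only what is reused, and allow spilling to disk.
        totals.persist(StorageLevel.MEMORY_AND_DISK());

        // Use the driver carefully: prefer take()/count() over collect() on large results.
        System.out.println(totals.take(2));

        ctx.stop();
    }
}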
Android uses cgroups to monitor system memory usage via the Low Memory Killer daemon and to group processes for effective CPU sharing. Cgroups are used to create mount points for memory and CPU control groups. The LMK daemon uses cgroups to receive memory pressure events and kill processes as needed. Init.rc uses cgroups to create groups for real-time and background tasks and assign CPU shares. Android further groups processes by scheduling policy for scheduling priorities.
Why FLOSS is a Java developer's best friend: Dave Gruber (JAX London)
The explosion of new open source projects is changing the game for today's Java developers. With literally hundreds of thousands of FOSS projects underway, the opportunity to leverage open source to deliver the trifecta (faster/better/cheaper) has never been better. In this session we will explore tools and resources that can help you navigate the vast world of open source projects, in addition to sharing tips and tricks that will help you narrow the field so you can quickly get to the right projects for your next application.
Spark SQL Deep Dive @ Melbourne Spark Meetup (Databricks)
This document summarizes a presentation on Spark SQL and its capabilities. Spark SQL allows users to run SQL queries on Spark, including HiveQL queries with UDFs, UDAFs, and SerDes. It provides a unified interface for reading and writing data in various formats. Spark SQL also allows users to express common operations like selecting columns, joining data, and aggregation concisely through its DataFrame API. This reduces the amount of code users need to write compared to lower-level APIs like RDDs.
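To show how concise the DataFrame API is, here is a small sketch in the same Java/SQLContext style as the examples later in this document; it is not from the talk, and the people.json path simply follows Spark's bundled example resources.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

public class DataFrameSketch {
    public static void main(String[] args) {
        JavaSparkContext ctx = new JavaSparkContext(new SparkConf().setAppName("dfSketch"));
        SQLContext sqlContext = new SQLContext(ctx);

        // Read a JSON file into a DataFrame; the schema is picked up from the data.
        DataFrame people = sqlContext.read().json("examples/src/main/resources/people.json");

        // Filter, group, and aggregate in a few lines instead of hand-written RDD code.
        DataFrame avgAgeByName = people
                .filter(col("age").gt(13))
                .groupBy(col("name"))
                .agg(avg(col("age")).alias("avgAge"));

        avgAgeByName.show();
        ctx.stop();
    }
}

The same result could also be obtained with a plain SQL query after registering the DataFrame as a temporary table, as the later examples in this document do.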
Spark machine learning & deep learning - hoondong kim
Spark Machine Learning and Deep Learning deep dive.
Covers scenarios that use Spark in hybrid setups with other data analytics tools (Microsoft R on Spark, TensorFlow/Keras with Spark, scikit-learn with Spark, etc.).
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee (Spark Summit)
This document discusses Apache Zeppelin, an open-source notebook for interactive data analytics. It provides an overview of Zeppelin's features, including interactive notebooks, multiple backends, interpreters, and a display system. The document also covers Zeppelin's adoption timeline, from its origins as a commercial product in 2012 to becoming an Apache Incubator project in 2014. Future projects involving Zeppelin like Helium and Z-Manager are also briefly described.
Big Data visualization with Apache Spark and Zeppelin - prajods
This presentation gives an overview of Apache Spark and explains the features of Apache Zeppelin (incubating). Zeppelin is an open source tool for data discovery, exploration, and visualization. It supports REPLs for shell, Spark SQL, Spark (Scala), Python, and Angular. This presentation was given on the Big Data Day at the Great Indian Developer Summit, Bangalore, April 2015.
Recommendation engines are everywhere these days, telling us which products to buy on Amazon, which movies to watch on Netflix, which courses to take on Coursera, and on and on. This presentation describes the collaborative filtering and content-based recommendation engines at Jane.com, Inc. magazine's fastest-growing e-commerce company of 2015.
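The presentation covers Jane.com's own engines; as a generic, hedged illustration of the collaborative-filtering idea (not their implementation), the sketch below trains a matrix-factorization recommender with Spark MLlib's ALS in Java. The ratings file path and its "userId,productId,rating" format are assumptions.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class CollaborativeFilteringSketch {
    public static void main(String[] args) {
        JavaSparkContext ctx = new JavaSparkContext(new SparkConf().setAppName("cfSketch"));

        // Parse "userId,productId,rating" lines into MLlib Rating objects (hypothetical file).
        JavaRDD<Rating> ratings = ctx.textFile("data/ratings.csv").map(line -> {
            String[] p = line.split(",");
            return new Rating(Integer.parseInt(p[0]), Integer.parseInt(p[1]), Double.parseDouble(p[2]));
        });

        // Train a matrix-factorization model: rank 10, 10 iterations, regularization 0.01.
        MatrixFactorizationModel model = ALS.train(JavaRDD.toRDD(ratings), 10, 10, 0.01);

        // Recommend 5 products for user 42 (an arbitrary example id).
        for (Rating r : model.recommendProducts(42, 5)) {
            System.out.println(r.product() + " -> " + r.rating());
        }
        ctx.stop();
    }
}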
Machine learning & security. Detect atypical behaviour in logs - Alexander Melnychuk
This document discusses using machine learning techniques to analyze log files and detect anomalous behavior. It mentions steps like cleaning log files, preparing and clustering the data, creating a "picture of normality", and using machine learning models to detect anomalies. Various types of anomalous behaviors that could be detected are also briefly mentioned such as DDoS attacks, fuzzing, suspicious financial transactions, and toll fraud.
Golang Project Guide from A to Z: From Feature Development to Enterprise Appl... - Kyuhyun Byun
This comprehensive presentation offers a deep dive into Go language development methodologies, covering projects of all scales. Whether you're working on a small prototype or a large-scale enterprise application, this guide provides valuable insights and best practices.
Key topics covered:
Distinguishing between small and large projects in Go
Code patterns for small, feature-focused projects
Comparison of Handler and HandlerFunc approaches
Enterprise application design using Domain Driven Design (DDD)
Detailed explanations of architectural layers: Presenter, Handler, Usecase, Service, Repository, and Recorder
NoSQL (DynamoDB) modeling techniques
Writing effective test code and using mocking tools like 'counterfeiter'
Essential tools for production-ready applications: APM, error monitoring, metric collection, and logging services
This presentation is ideal for Go developers of all levels, from beginners looking to structure their first projects to experienced developers aiming to optimize large-scale applications. It provides practical advice on code organization, testing strategies, and operational considerations to help you build robust, maintainable Go applications.
Whether you're starting a new project or looking to improve an existing one, this guide offers valuable insights into Go development best practices across different project scales and complexities.
The way to set up the Spring framework for web - EunChul Shin
This presentation is about the Spring framework.
In it, I want to talk about how to set up the Spring framework for the web in a tidy way.
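As a hedged sketch of what a tidy, annotation-only web setup can look like (the presentation's own configuration may differ), the snippet below wires Spring MVC without web.xml; the base package com.example.web is a placeholder.

import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.EnableWebMvc;
import org.springframework.web.servlet.support.AbstractAnnotationConfigDispatcherServletInitializer;

@Configuration
@EnableWebMvc
@ComponentScan(basePackages = "com.example.web") // hypothetical base package for controllers
class WebConfig {
}

public class WebAppInitializer extends AbstractAnnotationConfigDispatcherServletInitializer {
    @Override
    protected Class<?>[] getRootConfigClasses() {
        return null; // no separate root application context in this sketch
    }

    @Override
    protected Class<?>[] getServletConfigClasses() {
        return new Class<?>[] { WebConfig.class }; // the DispatcherServlet loads WebConfig
    }

    @Override
    protected String[] getServletMappings() {
        return new String[] { "/" }; // map the dispatcher to the application root
    }
}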
14.
Spark SQL application (in Java)
• (DataFrame example 2) Text File: Specifying the Schema

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import java.util.ArrayList;
import java.util.List;

SparkConf sparkConf = new SparkConf().setAppName("dataFrame");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(ctx);

// Load the text file as an RDD of lines.
JavaRDD<String> people = ctx.textFile("examples/src/main/resources/people.txt");

// Build the schema programmatically from a space-separated list of column names.
String schemaString = "name age";
List<StructField> fields = new ArrayList<StructField>();
for (String fieldName : schemaString.split(" ")) {
    fields.add(DataTypes.createStructField(fieldName, DataTypes.StringType, true));
}
StructType schema = DataTypes.createStructType(fields);

// Convert each line ("name, age") into a Row matching the schema.
JavaRDD<Row> rowRDD = people.map(new Function<String, Row>() {
    public Row call(String record) throws Exception {
        String[] fields = record.split(",");
        return RowFactory.create(fields[0], fields[1].trim());
    }
});

// Apply the schema to the RDD of Rows to obtain a DataFrame.
DataFrame peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema);
peopleDataFrame.show();        // 1: prints the DataFrame contents
peopleDataFrame.printSchema(); // 2: prints the schema (both columns as nullable strings)
ctx.stop();
15.
Spark SQL application (in Java)
• (DataFrame example 3) Text File: Inferring the Schema (JavaBean)

import java.io.Serializable;

// JavaBean whose getters and setters define the columns Spark SQL will infer.
public class Person implements Serializable {
    private String name;
    private int age;

    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public int getAge() {
        return age;
    }
    public void setAge(int age) {
        this.age = age;
    }
}
16.
Spark SQL application (in Java)
• (DataFrame example 3) Text File: Inferring the Schema (JavaBean)

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SparkConf sparkConf = new SparkConf().setAppName("dataFrame");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(ctx);

// Parse each line ("name, age") into a Person bean.
JavaRDD<Person> people = ctx.textFile("examples/src/main/resources/people.txt").map(
    new Function<String, Person>() {
        public Person call(String line) throws Exception {
            String[] parts = line.split(",");
            Person person = new Person();
            person.setName(parts[0]);
            person.setAge(Integer.parseInt(parts[1].trim()));
            return person;
        }
    });

// The schema is inferred by reflection from the Person bean's properties.
DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class);

// Register the DataFrame as a temporary table so it can be queried with SQL.
schemaPeople.registerTempTable("people");
DataFrame teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19");
teenagers.show();        // 1: prints the matching rows
teenagers.printSchema(); // 2: prints the inferred schema (name: string, age: int)
ctx.stop();