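This course guides you, someone new to data analysis yet keenly interested in it, through the complete process of using R: from collecting data, through exploratory analysis and interpretation, to text mining that uncovers meaning hidden beneath the data and invisible to the naked eye. The course is designed for people who have a basic understanding of R and want more hands-on practice with analysis; we hope that by the end you will be more comfortable with R as a rich analysis tool. Using the Apple Daily charity donation dataset, you will learn how to parse web pages from scratch and write a crawler that collects information automatically; how to handle the data flexibly once you have it, cleaning, merging, and exploring it; and how to use existing packages for text mining and text parsing. Step by step we will walk through a real data analysis workflow, processing, observing, and deconstructing the data, to see which factors influence people's decisions to donate and how those findings are dug out of the data.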
14. 续日 · Go 夜读 · Introduction to Excelize, an Open-Source Go Base Library · 2022.11.03
Design and Implementation of Selected Modules (1)
Figure: Excelize Technical Architecture Diagram
Component labels in the diagram:
File Process Unit, Data Processor, Data Validation, Crypto Class, Unit Converter, Table / Auto Filter, Pivot Table, Conditional Format
Chart / Picture, 2D / 3D, Cluster/Stack/Area, Bar/Cone/Pie, Bubble/Scatter/Line, Combo / Props, Time System
Workbook / Worksheet, Visibility, Properties, Header / Footer, Search, Page Layout, Row / Column, Alternate Content, View Properties, Data Protection, Streaming I/O
Cell, Data Types, Merge Range, Hyperlink, Formula, Cell Style, SST, Rich Text, Comments, Style Index, Calc Chain / Cache
File Format Processor, OPC Processor, Meta Processor, Relations Parser, Embeddings, Media, Markup Language, VBA Script
Style Process Unit, Border, Fonts, Freeze Panes, Height/Width, Color System, Number Format
Runtime Model, Model Components, Validator, Calculate Engine, Formula Lexer / Parser, Generator
15. Design and Implementation of Selected Modules (2): Document Object Model
<?xml version="1.0" encoding="utf-8"?>
<Person>
  <Name>Tom</Name>
  <Email where="home">
    <Addr>tom@example.com</Addr>
  </Email>
</Person>
type Person struct {
	Name  string
	Email struct {
		Where string `xml:"where,attr"`
		Addr  string
	}
}
encoding/xml
// data holds the XML document shown above.
var p Person
if err := xml.Unmarshal([]byte(data), &p); err != nil {
	fmt.Println(err)
}
fmt.Printf("%+v\n", p)
// {Name:Tom Email:{Where:home Addr:tom@example.com}}
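The struct mapping on this slide also works in the other direction. The following is a minimal sketch, not taken from the talk, that decodes the document and encodes the resulting value again; the helper name `roundTrip` and the constant `data` are introduced here for illustration only.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Person mirrors the struct shown on the slide.
type Person struct {
	Name  string
	Email struct {
		Where string `xml:"where,attr"`
		Addr  string
	}
}

// data is the sample document from the slide.
const data = `<?xml version="1.0" encoding="utf-8"?>
<Person>
  <Name>Tom</Name>
  <Email where="home">
    <Addr>tom@example.com</Addr>
  </Email>
</Person>`

// roundTrip (hypothetical helper) decodes doc into a Person and
// marshals the value back to XML.
func roundTrip(doc string) (Person, string, error) {
	var p Person
	if err := xml.Unmarshal([]byte(doc), &p); err != nil {
		return p, "", err
	}
	out, err := xml.Marshal(p)
	return p, string(out), err
}

func main() {
	p, out, err := roundTrip(data)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v\n", p)
	fmt.Println(out)
}
```

Note that the re-encoded document drops the XML declaration and the original whitespace, so the bytes differ from the input even though the data model is the same.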
16. Design and Implementation of Selected Modules (3): XML Finite State Machine
Figure: XML finite state machine.
States: 0 (start), start tag, NAME, TEXT, equal, value, end value, end tag, COMMENT, version.
Transition labels: blank/enter, letter, digit, <, ?, ?>, !--, -->, =, " ", ' ', >, /, blank.
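To make the state-machine idea concrete, here is a deliberately simplified sketch of a scanner with four states. It is not the machine from the figure and not how encoding/xml is implemented: the state names, the `token` type, and `scan` are all assumptions made for illustration, and attributes, self-closing tags, CDATA, and `>` inside quoted values are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

type state int

const (
	inText     state = iota // outside markup, collecting character data
	inTag                   // after '<', collecting a start or end tag
	inProcInst              // after "<?", until "?>"
	inComment               // after "<!--", until "-->"
)

// token is a hypothetical token type: a kind plus its raw text.
type token struct {
	Kind  string // "text", "start", "end", "procinst", "comment"
	Value string
}

// scan walks the input byte by byte and switches state on delimiters.
func scan(input string) []token {
	var toks []token
	var buf strings.Builder
	st := inText

	// emit flushes the buffer as one token, skipping whitespace-only runs.
	emit := func(kind string) {
		if v := strings.TrimSpace(buf.String()); v != "" {
			toks = append(toks, token{Kind: kind, Value: v})
		}
		buf.Reset()
	}

	for i := 0; i < len(input); i++ {
		rest := input[i:]
		switch st {
		case inText:
			switch {
			case strings.HasPrefix(rest, "<!--"):
				emit("text")
				st, i = inComment, i+3 // skip the rest of "<!--"
			case strings.HasPrefix(rest, "<?"):
				emit("text")
				st, i = inProcInst, i+1 // skip the '?'
			case input[i] == '<':
				emit("text")
				st = inTag
			default:
				buf.WriteByte(input[i])
			}
		case inTag:
			if input[i] == '>' {
				kind := "start"
				if s := buf.String(); strings.HasPrefix(s, "/") {
					kind = "end"
					buf.Reset()
					buf.WriteString(s[1:])
				}
				emit(kind)
				st = inText
			} else {
				buf.WriteByte(input[i])
			}
		case inProcInst:
			if strings.HasPrefix(rest, "?>") {
				emit("procinst")
				st, i = inText, i+1
			} else {
				buf.WriteByte(input[i])
			}
		case inComment:
			if strings.HasPrefix(rest, "-->") {
				emit("comment")
				st, i = inText, i+2
			} else {
				buf.WriteByte(input[i])
			}
		}
	}
	if st == inText {
		emit("text")
	}
	return toks
}

func main() {
	for _, t := range scan(`<?xml version="1.0"?><a><!-- hi -->x</a>`) {
		fmt.Printf("%-8s %q\n", t.Kind, t.Value)
	}
}
```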
17. Design and Implementation of Selected Modules (4)
Figure: call graph of Unmarshal in encoding/xml.
Functions shown: Unmarshal, NewDecoder, Decoder.unmarshal, Decoder.switchToReader, unmarshalPath, unmarshalAttr, unmarshalInterface, unmarshalTextInterface, Decoder.RawToken, Decoder.pushElement, Decoder.pushNs.
Source files of encoding/xml: marshal.go, typeinfo.go, xml.go, read.go, example & test.
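The top of that call graph can be observed from user code. In current Go releases, `xml.Unmarshal` is a thin wrapper that builds a decoder over the byte slice and calls `Decode`; the sketch below shows both entry points side by side. The `Person` type and the two helper functions are hypothetical names used only for this example.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// Person is a minimal target type for the example.
type Person struct {
	Name string
}

// viaUnmarshal uses the convenience function.
func viaUnmarshal(data []byte) (Person, error) {
	var p Person
	err := xml.Unmarshal(data, &p)
	return p, err
}

// viaDecoder spells out the equivalent steps: NewDecoder, then Decode.
func viaDecoder(data []byte) (Person, error) {
	var p Person
	err := xml.NewDecoder(bytes.NewReader(data)).Decode(&p)
	return p, err
}

func main() {
	data := []byte(`<Person><Name>Tom</Name></Person>`)
	a, errA := viaUnmarshal(data)
	b, errB := viaDecoder(data)
	fmt.Println(a, errA)
	fmt.Println(b, errB)
}
```

The decoder form is the one to reach for when the input is an `io.Reader` (a file or a ZIP entry, for example) rather than a byte slice already in memory.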
18. Design and Implementation of Selected Modules (5): Go Language XML Parser
type Decoder struct {
	Strict         bool
	AutoClose      []string
	Entity         map[string]string
	CharsetReader  func(charset string, input io.Reader) (io.Reader, error)
	DefaultSpace   string
	r              io.ByteReader
	t              TokenReader
	buf            bytes.Buffer
	saved          *bytes.Buffer
	stk            *stack
	free           *stack
	needClose      bool
	toClose        Name
	nextToken      Token
	nextByte       int
	ns             map[string]string
	err            error
	line           int
	offset         int64
	unmarshalDepth int
}
encoding/xml:xml.go
Token types: StartElement, EndElement, CharData, Comment, ProcInst, Directive
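The six token types listed above are what `Decoder.Token` returns. As a small illustration, assuming nothing beyond the standard library, the following sketch walks a document and records the kind of each token; `tokenKinds` is a name invented for this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// tokenKinds (hypothetical helper) reads doc token by token and returns
// a readable label for every token except whitespace-only character data.
func tokenKinds(doc string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	var kinds []string
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return kinds, nil
		}
		if err != nil {
			return kinds, err
		}
		switch t := tok.(type) {
		case xml.ProcInst:
			kinds = append(kinds, "ProcInst:"+t.Target)
		case xml.Directive:
			kinds = append(kinds, "Directive")
		case xml.Comment:
			kinds = append(kinds, "Comment")
		case xml.StartElement:
			kinds = append(kinds, "StartElement:"+t.Name.Local)
		case xml.EndElement:
			kinds = append(kinds, "EndElement:"+t.Name.Local)
		case xml.CharData:
			if s := strings.TrimSpace(string(t)); s != "" {
				kinds = append(kinds, "CharData:"+s)
			}
		}
	}
}

func main() {
	kinds, err := tokenKinds(`<?xml version="1.0"?><!DOCTYPE a><!-- c --><a>hi</a>`)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, k := range kinds {
		fmt.Println(k)
	}
}
```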
19. Design and Implementation of Selected Modules (6)
Go Datatypes | XML Datatypes
string | anyType, ENTITY, ID, IDREF, NCName, NMTOKEN, Name, anyURI, duration, language, normalizedString, string, token, xml:lang, xml:space, xml:base, xml:id
[]string | ENTITIES, IDREFS, NMTOKENS, NOTATION
xml.Name | QName
[]byte | base64Binary, hexBinary, unsignedByte
bool | boolean
byte | byte
float64 | decimal, double, float
int64 | int, integer, long, negativeInteger, nonNegativeInteger, nonPositiveInteger, positiveInteger, short
uint64 | unsignedInt, unsignedLong, unsignedShort
time.Time | date, dateTime, gDay, gMonth, gMonthDay, gYear, gYearMonth, time
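Figure: XML Schema built-in datatype hierarchy.
Ur types: anyType, anySimpleType (all complex types derive from anyType).
Built-in primitive types: duration, dateTime, time, date, gYearMonth, gYear, gMonthDay, gDay, gMonth, boolean, base64Binary, hexBinary, float, double, anyURI, QName, NOTATION, decimal, string.
Built-in derived types:
string -> normalizedString -> token -> language, Name, NMTOKEN
Name -> NCName -> ID, IDREF, ENTITY
NMTOKEN -> NMTOKENS, IDREF -> IDREFS, ENTITY -> ENTITIES (derived by list)
decimal -> integer -> nonPositiveInteger -> negativeInteger
integer -> long -> int -> short -> byte
integer -> nonNegativeInteger -> unsignedLong -> unsignedInt -> unsignedShort -> unsignedByte
nonNegativeInteger -> positiveInteger
Legend: ur types, built-in primitive types, built-in derived types, complex types; derived by restriction, derived by list, derived by extension or restriction.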
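As a rough illustration of the Go side of this mapping, the sketch below decodes a few scalar values into typed struct fields with encoding/xml. The `Record` type and element names are made up for the example, and the XSD types in the comments are the nearest counterparts rather than something the decoder enforces. One caveat worth stating: `time.Time` is parsed as RFC 3339 text, so an `xs:dateTime` without a time zone would be rejected.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"time"
)

// Record is a hypothetical type; each field notes its closest XSD type.
type Record struct {
	Title   string    `xml:"title"`   // xs:string
	Active  bool      `xml:"active"`  // xs:boolean
	Count   int64     `xml:"count"`   // xs:long
	Size    uint64    `xml:"size"`    // xs:unsignedLong
	Ratio   float64   `xml:"ratio"`   // xs:double
	Created time.Time `xml:"created"` // xs:dateTime (RFC 3339 form only)
}

const sample = `<record>
  <title>report</title>
  <active>true</active>
  <count>-42</count>
  <size>1024</size>
  <ratio>0.5</ratio>
  <created>2022-11-03T20:00:00Z</created>
</record>`

// decodeRecord parses doc into a Record.
func decodeRecord(doc string) (Record, error) {
	var r Record
	err := xml.Unmarshal([]byte(doc), &r)
	return r, err
}

func main() {
	r, err := decodeRecord(sample)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v\n", r)
}
```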
20. Design and Implementation of Selected Modules (7): Entity, Namespace & Ser/Deserialize Idempotence
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE person[
  <!ENTITY name "Tom">
  <!ENTITY email "tom@example.com">
]>
<person>
  <name>&name;</name>
  <address>&email;</address>
</person>
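Go's encoding/xml does not expand entities declared in a DOCTYPE; it returns the declaration as a `Directive` token and, in strict mode, rejects references to entities it does not know. A caller can supply the replacements through the `Decoder.Entity` map. The sketch below shows this for the document above; the `Person` type and the `decode` helper are assumptions for illustration, and the entity values are copied by hand rather than parsed from the DOCTYPE.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

const doc = `<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE person[
  <!ENTITY name "Tom">
  <!ENTITY email "tom@example.com">
]>
<person>
  <name>&name;</name>
  <address>&email;</address>
</person>`

// Person is a hypothetical target type for the sample document.
type Person struct {
	Name    string `xml:"name"`
	Address string `xml:"address"`
}

// decode parses doc, resolving custom entities from the given map.
func decode(doc string, entities map[string]string) (Person, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	d.Entity = entities
	var p Person
	err := d.Decode(&p)
	return p, err
}

func main() {
	// Without the map, the reference to &name; is a syntax error.
	_, err := decode(doc, nil)
	fmt.Println("without entity map:", err)

	p, err := decode(doc, map[string]string{
		"name":  "Tom",
		"email": "tom@example.com",
	})
	fmt.Printf("with entity map: %+v %v\n", p, err)
}
```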
<?xml version="1.0" encoding="utf-8"?>
<person
    xmlns="http://example.com/default"
    xmlns:m="http://example.com/main"
    xmlns:h="http://example.com/home"
    xmlns:w="http://example.com/work">
  <name>Tom</name>
  <m:email h:addr="HOME" w:addr="WORK" />
</person>
type Person struct {
	XMLName xml.Name `xml:"http://example.com/default person"`
	Name    string   `xml:"name"`
	Email   struct {
		XMLName  xml.Name `xml:"http://example.com/main email"`
		HomeAddr string   `xml:"http://example.com/home addr,attr"`
		WorkAddr string   `xml:"http://example.com/work addr,attr"`
	} // TAG NOT HERE: `xml:"email"`
}
Tag layout: xml:"<Namespace> <Local Name>"
<person xmlns="http://example.com/default">
  <name>Tom</name>
  <email xmlns="http://example.com/main"
      xmlns:home="http://example.com/home"
      home:addr="HOME"
      xmlns:work="http://example.com/work"
      work:addr="WORK"></email>
</person>
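The last XML block on this slide appears to be what encoding/xml produces when the namespaced document is decoded into `Person` and encoded again: the data survives, but the original prefixes (`m`, `h`, `w`) do not. A minimal sketch that reproduces the round trip follows; `roundTrip` and `hasAttrPrefix` are helper names invented for this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Person is the struct from the slide.
type Person struct {
	XMLName xml.Name `xml:"http://example.com/default person"`
	Name    string   `xml:"name"`
	Email   struct {
		XMLName  xml.Name `xml:"http://example.com/main email"`
		HomeAddr string   `xml:"http://example.com/home addr,attr"`
		WorkAddr string   `xml:"http://example.com/work addr,attr"`
	}
}

const input = `<?xml version="1.0" encoding="utf-8"?>
<person
    xmlns="http://example.com/default"
    xmlns:m="http://example.com/main"
    xmlns:h="http://example.com/home"
    xmlns:w="http://example.com/work">
  <name>Tom</name>
  <m:email h:addr="HOME" w:addr="WORK" />
</person>`

// roundTrip decodes doc and marshals the result back to XML.
func roundTrip(doc string) (Person, string, error) {
	var p Person
	if err := xml.Unmarshal([]byte(doc), &p); err != nil {
		return p, "", err
	}
	out, err := xml.Marshal(p)
	return p, string(out), err
}

// hasAttrPrefix reports whether doc has an addr attribute written
// with the given namespace prefix.
func hasAttrPrefix(doc, prefix string) bool {
	return strings.Contains(doc, " "+prefix+":addr=")
}

func main() {
	p, out, err := roundTrip(input)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(p.Name, p.Email.HomeAddr, p.Email.WorkAddr)
	fmt.Println(out)
	fmt.Println("input uses h: prefix: ", hasAttrPrefix(input, "h"))
	fmt.Println("output uses h: prefix:", hasAttrPrefix(out, "h"))
}
```

For a library that must write back files other tools will read, this loss of prefixes is one reason to keep the original root-element attributes around instead of relying on the marshaler alone.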
21. Design and Implementation of Selected Modules (8): Ser/Deserialize Idempotence
encoding/xml:xml.go
type Token interface{}
type EndElement struct {
	Name Name
}
type Name struct {
	Space, Local string
}
type Attr struct {
	Name  Name
	Value string
}
type StartElement struct {
	Name Name
	Attr []Attr
}
// getRootEleAttr extract root element attributes by
// given XML decoder.
func getRootEleAttr(d *xml.Decoder) []xml.Attr {
	tokenIdx := 0
	for {
		token, _ := d.Token()
		if token == nil {
			break
		}
		switch startElement := token.(type) {
		case xml.StartElement:
			tokenIdx++
			if tokenIdx == 1 {
				return startElement.Attr
			}
		}
	}
	return nil
}
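A short usage sketch for `getRootEleAttr` follows. It repeats the function from the slide so the program is self-contained; the wrapper `rootAttrs` and the sample root element are assumptions for illustration. The point to notice is how the decoder reports namespace declarations: the default `xmlns` comes back with an empty `Space`, while `xmlns:mc` comes back with `Space` set to `xmlns` and `Local` set to the prefix.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// getRootEleAttr extracts root element attributes by given XML decoder
// (copied from the slide).
func getRootEleAttr(d *xml.Decoder) []xml.Attr {
	tokenIdx := 0
	for {
		token, _ := d.Token()
		if token == nil {
			break
		}
		switch startElement := token.(type) {
		case xml.StartElement:
			tokenIdx++
			if tokenIdx == 1 {
				return startElement.Attr
			}
		}
	}
	return nil
}

// rootAttrs (hypothetical wrapper) runs getRootEleAttr over a string.
func rootAttrs(doc string) []xml.Attr {
	return getRootEleAttr(xml.NewDecoder(strings.NewReader(doc)))
}

const sheet = `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet
    xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006">
  <sheetData/>
</worksheet>`

func main() {
	for _, a := range rootAttrs(sheet) {
		fmt.Printf("space=%q local=%q value=%q\n", a.Name.Space, a.Name.Local, a.Value)
	}
}
```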
25. Design and Implementation of Selected Modules (12)
XSD: XML Schema Definition Process
Figure: XSD processing pipeline and XML Schema component model.
Pipeline: cmd, parser (SAX Parser), NSResolver, proto, generator (Language Code).
Parser handlers: Attribute, Attribute Group, ComplexType, Element, Enumeration, FractionDigits, Pattern, SimpleType, Schema, List, Length, Import, Include, Group, Restriction, TotalDigits, Union, WhiteSpace, MaxLength, MinLength, MinExclusive.
Proto types (FieldName, FieldType): Attribute, Attribute Group, ComplexType, Element, Group, SimpleType.
Schema contains: notation declarations, attribute declarations, type definitions, element declarations, attribute group definitions, model group definitions, identity-constraint definitions.
Named components (each has two additional properties, name and target namespace):
Notation Declaration: system identifier, public identifier
Element Declaration: scope, value constraint, nillable, substitution group affiliation, substitution group exclusions, disallowed substitutions, abstract
Simple Type Declaration: facets, final, variety
Attribute Declaration: scope, value constraint
Identity-constraint Declaration: identity-constraint category, selector, fields, referenced key
Complex Type Declaration: derivation method, final, abstract, prohibited substitutions
Model Group Definition
Attribute Group Definition
Un-named components:
Attribute Use: required, value constraint
Wildcard: namespace constraint, process contents
Particle: min occurs, max occurs
Model Group: compositor
Relationship labels between components: type definition, base type definition, content type, term, particles, model group, attribute uses, attribute wildcard, attribute definitions.
https://github.com/xuri/xgen
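The sketch below is not xgen's code or API. It only illustrates the SAX-style (streaming) approach named on the slide, using the standard library: walk the tokens of a small schema and collect the `name` attribute of every declaration in the XML Schema namespace. The sample schema, `xsdNS`, and `collectDeclarations` are assumptions made for this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

const xsdNS = "http://www.w3.org/2001/XMLSchema"

// schema is a tiny hand-written XSD used only for this example.
const schema = `<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="person" type="personType"/>
  <xs:complexType name="personType">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:ID"/>
  </xs:complexType>
</xs:schema>`

// collectDeclarations streams through doc and groups declared names by
// the local name of the XSD element that declares them.
func collectDeclarations(doc string) (map[string][]string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	found := map[string][]string{}
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return found, nil
		}
		if err != nil {
			return nil, err
		}
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Space != xsdNS {
			continue
		}
		for _, a := range start.Attr {
			if a.Name.Space == "" && a.Name.Local == "name" {
				kind := start.Name.Local
				found[kind] = append(found[kind], a.Value)
			}
		}
	}
}

func main() {
	decls, err := collectDeclarations(schema)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("element:    ", decls["element"])
	fmt.Println("complexType:", decls["complexType"])
	fmt.Println("attribute:  ", decls["attribute"])
}
```

A real generator would additionally track nesting (which element belongs to which complex type) and resolve type references before emitting code.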
26. Design and Implementation of Selected Modules (13)
Figure: package structure.
Common Package Parts: Package Relationships, Core Properties, Digital Signatures
Specific Format Parts: Office Document, Part Relationships, XML Part, XML Part, Part, Rels, etc.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet
    xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006">
  <dimension ref="B2"/>
  <sheetViews>
    <sheetView tabSelected="1" workbookViewId="0" />
  </sheetViews>
  <sheetFormatPr baseColWidth="10" defaultRowHeight="16" />
  <sheetData>
    <row r="2">
      <c r="B2">
        <v>123</v>
      </c>
    </row>
  </sheetData>
  <pageMargins left="0.7" right="0.7" />
</worksheet>
    A     B     C
1
2         123
3
4
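To connect the worksheet XML with the grid, here is a minimal sketch that decodes the part with encoding/xml and lists each cell reference with its value. The `worksheet` struct is a hypothetical, heavily reduced model written for this example; it is not Excelize's internal type, and it ignores cell types, shared strings, and styles.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// worksheet models only the pieces of the part used in this example.
type worksheet struct {
	Dimension struct {
		Ref string `xml:"ref,attr"`
	} `xml:"dimension"`
	Rows []struct {
		R     int `xml:"r,attr"`
		Cells []struct {
			Ref string `xml:"r,attr"`
			V   string `xml:"v"`
		} `xml:"c"`
	} `xml:"sheetData>row"`
}

const sheetXML = `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet
    xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006">
  <dimension ref="B2"/>
  <sheetData>
    <row r="2">
      <c r="B2">
        <v>123</v>
      </c>
    </row>
  </sheetData>
</worksheet>`

// cellValues returns a map from cell reference (for example "B2")
// to the raw text of its <v> element.
func cellValues(doc string) (map[string]string, error) {
	var ws worksheet
	if err := xml.Unmarshal([]byte(doc), &ws); err != nil {
		return nil, err
	}
	cells := map[string]string{}
	for _, row := range ws.Rows {
		for _, c := range row.Cells {
			cells[c.Ref] = c.V
		}
	}
	return cells, nil
}

func main() {
	cells, err := cellValues(sheetXML)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(cells)
}
```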
27. Open Source Status (1): Performance
Library | Runtime | Time Cost (s) | Memory Usage (MB)
Excelize 2.6.1 Stream Mode | go1.18.1 darwin/amd64 | 7.27 | 62
Excelize 2.6.1 | go1.18.1 darwin/amd64 | 28.96 | 5,126
tealeg/xlsx 3.2.4@1a18367 | go1.18.1 darwin/amd64 | 44.93 | 4,746
unidoc/unioffice v2.2.0@4702ef2 | go1.18.1 darwin/amd64 | 16.43 | 2,883
xlsxWriter RELEASE_1.2.8 | python 2.7 | 144.89 | 1,083
Apache POI XSSF 4.1.2 Streaming | java version 12.0.1 | 15.00 | 425
PhpSpreadsheet 1.23.0 @05d08fe | PHP 8.1.4 Zend Engine v4.1.4 | 101.15 | 3,014
SheetJS/js-xlsx 0.18.5 | NodeJS 18.1.0 | 14.27 | 2,120
Less is better for both metrics.
The charts compare the performance of major open-source Excel libraries, covering Go, Python, Java, PHP, and NodeJS, when generating a 102400 × 50 plain-text matrix on a personal computer (2.6 GHz 6-Core Intel Core i7, 16 GB 2667 MHz DDR4, 500 GB SSD, macOS Monterey 12.3.1).
Benchmark script: https://github.com/xuri/excelize-benchmark
Excelize benchmark report: https://xuri.me/excelize/zh-hans/performance.html
28. Open Source Status (2)
Figure: Excelize GitHub Star History, 2016 to 2022 (vertical axis from 0 to 14,000 stars), and Excelize Global Contributors.
* As of November 2022: 13,000 GitHub stars and more than 350 contributors from over 50 countries and regions, more than 140 of whom contributed code.