DATANOTE
DOCUMENT-ENTITY EXPLORATION PLATFORM
Datanote is a desktop app that extracts and
visualizes relationships between entities
cited in documents (.pdf, .doc, etc.) and
in other sources such as databases or
web pages.
"I’m a product
manager. I like
gardening, cinema
and sky diving."
"Cutaneous
administration of
unicornamycin might
transform the subject
into a unicorn."
Because named entities are
unique, they are a versatile
metric for establishing the nature
of documents as well as the
recurring patterns among the
entities themselves.
Fed with the right input our
visual cortex can become a
powerful analysis system.
Graph visualization helps us
perceive relationships
between entities from the
micro to the macro level.
To exploit this effect we
project entities onto a 2D
plane using co-citation scores
as a distance metric.
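
The slides do not give the scoring formula, so the following is only a minimal sketch of the idea, written in Python for brevity rather than taken from Datanote's code. It assumes a co-citation score is the number of documents that cite both entities, and that the distance is 1 / (1 + score); the function names and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(documents):
    """Count, for each unordered entity pair, the documents citing both."""
    counts = Counter()
    for entities in documents:
        # set() ignores repeats inside one document;
        # sorted() gives every pair a single canonical order.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts

def distance(counts, a, b):
    """Assumed mapping: the more documents two entities share, the closer they are."""
    pair = tuple(sorted((a, b)))
    return 1.0 / (1.0 + counts[pair])

# Hypothetical entity lists, one per document.
documents = [
    ["gardening", "cinema", "sky diving"],
    ["gardening", "cinema"],
    ["unicornamycin", "unicorn"],
]
counts = cocitation_counts(documents)
print(distance(counts, "gardening", "cinema"))   # co-cited twice, so closest
print(distance(counts, "gardening", "unicorn"))  # never co-cited, so farthest
```

A 2D layout step (for example a force-directed layout or multidimensional scaling) could then take these pairwise distances as input; which one Datanote uses is not stated in the slides.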
But different kinds of questions may
call for different ways to interact with
and explore the knowledge graph.
For this reason, multiple interfaces are
being developed in Datanote.
Warning: working prototype,
UI subject to change.
Concept.
What are the use cases?
Your Industry
Which A are linked with B?
What if you built your own
extraction model using
your company data?
Human Resources
What entities are associated
with a candidate? Or a
school, a company, a skill?
Market Intelligence
What terms are mentioned
with my brand? With my
competitor? And in their
job offers?
Fraud detection
Who is mentioned in some
PDF reports? What are the
links between accounts or
phone numbers?
Under the hood
Datanote is a stand-alone application
built with Electron.
In the prototype, text extraction is
performed locally using Node
modules and data is stored in an
embedded OrientDB database.
Datanote is designed with i18n
support in mind: not only for
display but also for entity
identification.
This approach makes it possible
to process mixed-language
sources such as web pages and
social media.
This design also allows humans
to improve the model by adding
new words.
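
As a rough illustration of word-list identification across languages, here is a minimal sketch in Python (not Datanote's own code). It assumes a lexicon that maps surface forms in any language to a single entity id; `LEXICON`, `add_word`, `find_entities` and the ids are hypothetical names, and only single-word forms are handled.

```python
import re

# Hypothetical lexicon: surface forms in any language map to one entity id.
LEXICON = {
    "gardening": "hobby:gardening",
    "jardinage": "hobby:gardening",  # French
    "cinema": "hobby:cinema",
}

def add_word(lexicon, surface_form, entity_id):
    """Let a human extend the model with a new word for an entity."""
    lexicon[surface_form.lower()] = entity_id

def find_entities(text, lexicon):
    """Return the ids of known single-word forms, in order of appearance."""
    tokens = re.findall(r"\w+", text.lower())
    return [lexicon[token] for token in tokens if token in lexicon]

# A mixed-language sentence is matched against the same lexicon.
print(find_entities("J'aime le jardinage and cinema", LEXICON))

# Adding a German word makes it recognisable from then on.
add_word(LEXICON, "Kino", "hobby:cinema")
print(find_entities("Kino!", LEXICON))
```

Because the lookup never asks which language a token belongs to, a source that mixes languages needs no special handling in this sketch.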
For now Datanote uses
its own datasets, some
of which are open-source
at github.com/datagica
Support for external
data sources and
models will probably
be requested, and so is
planned.
Using pre-defined lists of
words works for certain
cases but what about
unknown data?
For complex entities,
Datanote uses pattern
matching models to
recognise human names,
phone and IBAN numbers,
addresses, emails, spoken
languages, etc.
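
To make the pattern-matching idea concrete, here is a minimal Python sketch for two of the entity types above. The regular expressions are simplified assumptions, not Datanote's actual models; the only standard piece is the IBAN mod-97 checksum, used to discard look-alike strings. All function names are hypothetical.

```python
import re

# Simplified, assumed patterns; real-world models would be stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IBAN_RE = re.compile(
    r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"
)

def iban_is_valid(candidate):
    """Apply the IBAN mod-97 check to a candidate string."""
    compact = candidate.replace(" ", "").upper()
    # Move country code and check digits to the end, then map
    # letters to numbers (A=10 ... Z=35) and test the remainder.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1

def find_complex_entities(text):
    """Return emails and checksum-valid IBANs found in the text."""
    ibans = [
        match.group(0)
        for match in IBAN_RE.finditer(text)
        if iban_is_valid(match.group(0))
    ]
    return {"email": EMAIL_RE.findall(text), "iban": ibans}

sample = (
    "Contact alice@example.com, account GB82 WEST 1234 5698 7654 32 "
    "or GB83 WEST 1234 5698 7654 32."
)
print(find_complex_entities(sample))
```

The second account number in the sample has the right shape but fails the checksum, so only the first one is reported. Names, addresses and spoken languages would need richer models than a single regular expression.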
Feature roadmap
Smarter extraction models? Our own
machine-learned models?
Complex knowledge graph
interrogation in Gremlin or natural
language?
Allow people to use their own "better"
models? Third party API cloud models?
Support other DBs for data storage? A
full-featured search system? Chatbot API?
Slack integration?
A model or datasource plugin marketplace?
With a commission system for us?
Jupyter extension for data scientists? Web
platform to publish read-only notebooks?
What should be done first?
On a more personal note:
project status
Datanote is a side project with no funding and
thus is progressing rather slowly, stopping at
times.
As I do not wish to see it disappear I am in the
process of open-sourcing it bit by bit.
But maybe it could be monetized? What would
be the market and the business model then?
That is still an open question.
Thank you!
Questions?
julian.bilcke@datagica.com
github.com/datagica
