際際滷s and notes (which can be shown below the slides) from a recent presentation I gave, outlining how some of the tools and approaches being developed in bio/clinical informatics could assist with data analysis and integration in crises such as the Haiti earthquake. This is a straw man: I can see arguments for and against the approach, so I'm throwing it out for comment in the hope that others can help me refine it to the point where it could be useful.
1. Helping Haiti - a semantic web approach to crisis information management
a different translational informatics project
Simon Twigger, Ph.D.
2. Questions of interest
Has anyone done any expression studies using congenic rats?
What tissue is this gene expressed in?
What expression data is known for SD (aka SD/NHsd, Harlan Sprague Dawley, Sprague Dawley) rats?
Are any of these genes associated with my phenotype?
Has this gene been seen in the brain?
What rat expression studies have been done on Mammary Cancer (aka breast neoplasms/breast cancer/cancer of the breast, breast carcinoma...)?
5. GEO + GMiner + OBA
[Pipeline diagram: GEO records → create annotation jobs & queue up (Q-Out, RabbitMQ) → 1..n annotation workers index text at OBA → results parsed and queued (Q-In) → results saved to the GMiner database]
11. URGENT Christopher Frecynet is still alive under
his house. 64 Rue Nord Alexis.(RUELLE NAZON,
AVENUE POUPELARD
12. Mirna Nazaire lives in P-A-P at Bizoton 6#12.
Entire neighborhood without food. People are
dying.
13. French hospital is now open and ready to receive
the wounded at the french lycee in rue
marcadieux bourdon
14. Questions of interest
Which hospitals are open?
Who is in trouble?
Does anyone have any tents?
Where are the open roads?
Any information on Person ABC?
What help is needed?
15. Who needs this info?
Aid Agencies, Non-Governmental Organizations
Red Cross, UN, etc.
Military & other relief suppliers
Individuals in Haiti
Donors - matching needs to offers
etc.
22. Mirna Nazaire lives in P-A-P at Bizoton 6#12.
Entire neighborhood without food. People are
dying.
#haiti#need food#nameMirna Nazaire
lives in#locPAP at Bizoton 6
#12#infoneighborhood w/o food. People dying
23. French hospital is now open and ready to receive
the wounded at the french lycee in rue
marcadieux bourdon
#haiti #offering hospital rooms #loc french
lycee in rue marcadieux bourdon #num 30+
#info French hospital is open and ready 2
receive
34. Inference
[Diagram: a has_trapped property whose domain is the Trapped class]
The ontology asserts that any thing that has a has_trapped property is a member of the Trapped class.
35. Inference
(Same diagram, now with an example tweet)
Tweet:123 #haiti #trapped 5 people #loc Pap
36. Inference
(Same diagram; graph nodes "Tweet 123" and "5 people" are linked by has_trapped)
Tweet:123 #haiti #trapped 5 people #loc Pap
37. Inference
(Build slide, identical to 36)
40. RDF Graph
[Graph: tweet:8550350793 — twitter id → 8550350793; has need → Insulin; has contact → maison des anges; has location → 18.588724,-72.275065; has latitude → 18.588724; has longitude → -72.275065]
42. RDF Graph
[Graph: tweet:7953197721 — twitter id → 7953197721; has location → Delmas; has need → Insulin; has need → medication]
43. (Both graphs shown together)
[Graph: tweet:7953197721 and tweet:8550350793 side by side, each with the properties above]
44. (The graphs self-assemble around the shared node)
[Graph: tweet:7953197721 and tweet:8550350793 joined through their common "has need → Insulin" object]
48. Using MD5 Hashes
simon twigger → f6f12de7192d1a5d903c016ecb5b3a0c
haiti loc info → 26e7c844f0c80a8860d6835591117639
49. Using MD5 Hashes
(Build slide, identical to 48)
50. Using MD5 Hashes
rt @baybe_doll: #haiti #need help #name mr. bernard jean louis #loc lumiere evangelical chapel at rue midway 22 in carrefour #contact phone # 3778-2506
Full text → 3d609759195d03a059baca1e063be4eb [3 matches]
Hashtags (contact haiti loc name need) → b767eeb9c16e74bfb22ee6ec0998a670 [13 matches]
Keywords (help bernard jean louis lumiere evangelical chapel rue midway carrefour ...) → d4dc5272669dee93721b4c005307cfc7 [4 matches]
52. Who is using tweet data?
http://haiti.ushahidi.com
http://haiti.sahanafoundation.org/
http://swift.ushahidi.com/
http://haiti.managingnews.com/
54. How to integrate?
[Diagram: a scattered set of "Topic" bubbles]
55. How to integrate?
[Diagram: many more scattered "Topic" bubbles]
56. How to integrate?
[Diagram: the same scattered "Topic" bubbles]
57. Timeline of incident reports at haiti.ushahidi.com
January 12th - February 4th 2010
58. (Animation build of the same timeline)
#3: We build databases that help researchers get access to and use rat data. Here’s a selection of problems that many rat researchers face, trying to answer questions from masses of data that are too voluminous to read, hard to get to, inconsistently organized and hard to integrate.
#4: NCBI’s Gene Expression Omnibus has a lot of relevant data, either as text or raw data. Can we start to capture some of this information in an informatically-tractable fashion using ontologies and the OBA tools at the National Center for Biomedical Ontology in an annotation pipeline? The red boxes highlight some concepts of interest - the rat strains and tissues being used in this experiment. A human can read these and know what's going on, but what about a computer?
#5: We built a pipeline to take snippets of text from GEO records, fire them off into a queue and have them annotated by various ontologies at NCBO. The results are returned to another queue and loaded into the database. We then do a manual review of the automated annotations (not shown here) using a customized curation interface.
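The queue-based shape of that pipeline can be sketched in Ruby. This is a minimal sketch, not the real pipeline: a thread-safe in-memory Queue stands in for RabbitMQ, and `annotate` is a hypothetical stand-in for the NCBO OBA call with an illustrative term list.

```ruby
# Hypothetical stand-in for the OBA annotator: match ontology terms
# against a snippet of GEO record text. Term list is illustrative.
ONTOLOGY_TERMS = ['SS', 'kidney', 'Sprague Dawley'].freeze

def annotate(text)
  ONTOLOGY_TERMS.select { |term| text.downcase.include?(term.downcase) }
end

q_out = Queue.new  # jobs heading out to the annotation workers
q_in  = Queue.new  # results coming back for parsing and saving

# Queue up annotation jobs, one per GEO text snippet
snippets = ['Expression in SS rat kidney', 'Sprague Dawley liver study']
snippets.each { |s| q_out << s }

# 1..n annotation workers: pull a job, annotate it, push the result
workers = 2.times.map do
  Thread.new do
    until q_out.empty?
      text = q_out.pop(true) rescue break  # non-blocking pop
      q_in << { text: text, annotations: annotate(text) }
    end
  end
end
workers.each(&:join)

# Drain the results queue, as if saving into the GMiner database
results = []
results << q_in.pop until q_in.empty?
```

The point of the queues is that the number of workers can be scaled independently of the producer; RabbitMQ adds persistence and distribution on top of the same pattern.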
#6: Initial results focusing on GEO rat datasets have provided a lot of great information and allowed us to create some handy navigational interfaces to the data, enabling queries that were not possible on any other site. Want to find expression data for the SS rat kidney? Click the terms and the datasets appear.
#8: Here’s a different area of need - the Haiti earthquake from mid January.
#10: This is the type of information that started flowing across Twitter very soon after the earthquake hit.
#12: This has valuable information but again, there is a lot of it, it's unstructured, and it's hence hard for a computer to pull out actionable data.
#13: Here are the questions facing organizations and individuals in Haiti and around the world providing support.
#14: Lots of people need the information, but pulling it out of plain-text tweets is hard.
#15: We already have somewhat structured information in biological databases, etc.; there is still a lot of free text, but at least we know what's being talked about, which makes interpretation somewhat easier. Nothing like this existed in the twitterverse until...
#16: The UC Boulder team came up with TweakTheTweet (TtT) as a way to structure tweets so that more information can be extracted from them.
#23: Based on the tags, we can pull out information - but it is still relatively unstructured. What is a 'need'? Does something tagged as 'Need' on this site mean the same as 'Need' on another site? Is the Loc a lat/long, a house address, etc.? Can we use ontologies, as we did for the biological data in GMiner, to add structure and facilitate interpretation? Do these ontologies exist? No.
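Pulling the tagged fields out of a TtT tweet is mostly a matter of splitting on the tag names. A minimal Ruby sketch, assuming space-separated tags; the tag list is only a subset of the TtT syntax and `parse_ttt` is a name of my choosing:

```ruby
# Illustrative subset of TweakTheTweet tags
TTT_TAGS = %w[need offering name loc num info contact].freeze

def parse_ttt(tweet)
  fields = {}
  # Split on the TtT hashtags; the text following each tag, up to the
  # next tag, is that tag's value.
  parts = tweet.split(/#(#{TTT_TAGS.join('|')})\b/)
  parts.shift # discard leading text such as "#haiti"
  parts.each_slice(2) { |tag, value| fields[tag.to_sym] = value.to_s.strip }
  fields
end
```

Note how a stray `#12` in an address survives inside the `#loc` value because it matches none of the tag names; a real parser would still need to handle the no-space variant (`#nameMirna ...`) seen in some tweets.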
#24: Here are two of the ontologies we are using in GMiner - they list concepts of interest related to inbred Rat strains and mouse anatomy. These form the controlled vocabularies of relevant facts that we use to go looking in the plain text of a GEO record. Ontologies provide a more structured format for the concepts of interest and go beyond keyword lists as a way to organize and analyze annotated data.
#25: OWL ontologies have two main types of thing - Classes (things you care about) and Properties (attributes of the things you care about)
#26: The data is expressed in triples - a subject (the thing we are talking about), a predicate (the type of information we are talking about, the property it possesses) and an object (the information we know about the subject). Here are some examples relating to me...
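The subject-predicate-object pattern can be shown with a plain Ruby struct; all of the identifiers here are hypothetical, standing in for the examples on the slide:

```ruby
# A triple is just subject-predicate-object; a graph is a set of them.
Triple = Struct.new(:subject, :predicate, :object)

TRIPLES = [
  Triple.new('person:simon', 'has_name',       'Simon Twigger'),
  Triple.new('person:simon', 'has_occupation', 'bioinformatician')
].freeze

# Querying the graph is just filtering on any of the three positions
def objects_of(triples, subject, predicate)
  triples.select { |t| t.subject == subject && t.predicate == predicate }
         .map(&:object)
end
```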
#27: Created a TweakTheTweet ontology using Protege and RDF/OWL.
#28: tag_terms are used to store the potential text matches in tweets - english and french (and more to come?)
#30: The ontology describes logical structures for the data. If Tweet 123 has a 'has_trapped' property associated with it, the ontology can be used to infer that Tweet 123 is also part of the 'Trapped' class of tweets. We don't have to specify this in our dataset; the ontology enables it to happen.
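A toy version of this domain-based inference, to make the mechanics concrete: `DOMAINS` encodes the single assertion from the slide (the domain of has_trapped is the Trapped class), and everything else is illustrative. A real reasoner does this, and much more, directly from the OWL.

```ruby
# If a property's domain is class C, any subject using that property
# can be inferred to be an instance of C.
DOMAINS = { 'has_trapped' => 'Trapped' }.freeze

def infer_classes(statements)
  inferred = Hash.new { |h, k| h[k] = [] }
  statements.each do |subject, predicate, _object|
    klass = DOMAINS[predicate]
    inferred[subject] << klass if klass && !inferred[subject].include?(klass)
  end
  inferred
end
```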
#34: I put together a simple Ruby on Rails site at http://tweetneed.org that has been grabbing tweets since around the 19th of January using the main TtT hashtags. It's certainly not a complete set, but I've been using it as a platform for exploring the data and developing approaches to filter the tweets and extract useful information.
#35: Parse the tweet into useful information, pulling out as much useful data as possible - we now have lat and long as specific fields, etc., and each piece of data is expressed as a triple: subject (the tweet), predicate (the property of interest) and the value. These triples can be dumped out as RDF (N-Triples) and placed into a triple store.
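Dumping parsed fields as N-Triples is mostly string formatting. A minimal sketch; the namespace URI and predicate names are placeholders, not the ontology's real URIs:

```ruby
# Serialize parsed tweet fields as N-Triples lines. BASE is a made-up
# namespace; real output would use the TtT ontology's URIs.
BASE = 'http://example.org/ttt#'

def to_ntriples(tweet_id, fields)
  fields.map do |predicate, value|
    "<#{BASE}tweet/#{tweet_id}> <#{BASE}has_#{predicate}> \"#{value}\" ."
  end.join("\n")
end
```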
#36: RDF data is a graph of nodes and edges - nodes are the subjects and objects, the edges correspond to the predicates in the RDF.
#37: This other tweet can be parsed to extract its relevant data - this one also contains Insulin as a need.
#38: The RDF generated for this tweet also corresponds to a graph of nodes and edges.
#39: Graphs can self-assemble based on shared properties, plus inference and Reasoners can be used to infer new class membership and organize the data in other ways as needed.
#41: A feature of Twitter is that people can retweet an existing tweet. This is good in that the retweet will probably reach a different audience than the original; however, in a crisis this results in a lot of duplicated data that has to be filtered through. Here's just one tweet from the tweetneed.org database that appears 31 times over a period of 8 hours, and I'm sure that is missing some, as the original tweet is itself an RT.
#42: The MD5 algorithm takes a text string and generates an effectively unique hexadecimal string based on it - change any character and the MD5 value changes. Helpful as a way to take a variable-length string of characters and boil it down to a fixed-length identifier. Often used to sign digital files: change one bit in a file and the MD5 checksum changes, so you can detect if it's the slightest bit different from the official value.
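In Ruby this is the standard library's Digest module; the two properties being relied on here are that the digest is fixed-length and that any change to the input changes it:

```ruby
require 'digest'

# MD5 maps any string to a fixed-length 32-character hex digest;
# the same input always gives the same digest, and changing a single
# character gives a completely different one.
original = 'simon twigger'
altered  = 'simon twigner'   # one character changed

digest_a = Digest::MD5.hexdigest(original)
digest_b = Digest::MD5.hexdigest(altered)
```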
#44: An MD5 hash of the full text shows there are 3 other tweets in the database that are identical copies - this gets rid of obvious duplicates, but it's very conservative. Using just the hashtags isn't much good - it matches 13 other tweets, many of which are different from this one; too promiscuous. Using the keywords from the tweet (remove hashtags, @names, stop words and other short strings and take what is left) does a better job, identifying 4 duplicates.
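The keyword-hash idea can be sketched in a few lines; this is a simplified version of the approach, with an illustrative stop-word list, not the exact normalization tweetneed.org uses. The key property is that an RT prefix no longer changes the hash:

```ruby
require 'digest'

# Illustrative stop-word list ('rt' included so retweets normalize away)
STOP_WORDS = %w[the at in of and a an rt to is].freeze

def keyword_hash(tweet)
  keywords = tweet.downcase
                  .gsub(/[#@]\S+/, ' ')  # remove hashtags and @names
                  .scan(/[a-z]+/)        # keep alphabetic tokens
                  .reject { |w| STOP_WORDS.include?(w) || w.length < 3 }
                  .sort.uniq
  Digest::MD5.hexdigest(keywords.join(' '))
end
```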
#45: You can explore how this works for a particular tweet on tweetneed.org. The tweets identified by the keyword hash include the original tweet.
#46: A variety of organizations and groups are following the tweet stream and extracting useful facts that are stored in their local databases; here are just a few, with Ushahidi and Sahana being two of the more central locations. Ushahidi is developing Swift River, a specific app to filter the stream of information from Twitter and other sources. This is still in development.
A benefit of multiple organizations tracking the same source of data is that each may add unique and useful information to the original source - one site may verify the info, another may find the lat/long of the location, a third may have other info that increases the urgency of a particular report. They may also serve and reach different communities with different needs. However, one downside is that each organization may do the same work multiple times, and the new info added by one organization may not be available to the others.
Bringing this all back together to avoid duplication of effort and share data is not a trivial task.
#47: One potential solution is to export data in RDF using shared ontologies to describe common attributes. Place it into a triple store (or federate multiple triple stores), integrate around common identifiers, and use ontologies and reasoners to infer additional information not otherwise present. This could be a central location that people could query (RSS feed, REST, etc.) to access additional data added by their colleagues and to access novel inferred information that was not apparent until the different data sources were merged.
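To make the merge idea concrete, here is a sketch of pooling exports from two hypothetical sources around a shared tweet URI; the predicates and values are illustrative (loosely following the earlier RDF graph slides), and a real system would use a triple store and shared ontology URIs rather than Ruby arrays:

```ruby
# Two sources each know different things about the same tweet URI
site_a_triples = [
  ['tweet:8550350793', 'has need',    'Insulin'],
  ['tweet:8550350793', 'has contact', 'maison des anges']
]
site_b_triples = [
  ['tweet:8550350793', 'has latitude',  '18.588724'],
  ['tweet:8550350793', 'has longitude', '-72.275065'],
  ['tweet:8550350793', 'verified',      'true']
]

# "Federation" here is just the union of the two sets of triples
STORE = (site_a_triples + site_b_triples).uniq

# Everything known about one subject, pooled across sources
def describe_subject(store, subject)
  store.select { |s, _p, _o| s == subject }
       .map { |_s, p, o| [p, o] }.to_h
end
```

The merged record is richer than what either source held alone: one contributed the need and contact, the other the verification and coordinates.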
#50: For some perspective, here’s an animated timeline of incident reports flowing into the haiti.ushahidi.com site since January 12th. These reports come from a wide variety of sources, SMS messages, individuals entering data on the ushahidi website and also Twitter.
#51: TtT is one of a variety of Crisis Commons projects where developers from around the world are volunteering and getting engaged building software, some related to Haiti but also for use in the next crisis that comes along.