-
Apache Lucene
July 23rd, 2009
http://lucene.apache.org/
I’ll be playing with Apache Lucene (http://lucene.apache.org/). It looks perfect for what I do. More notes as I work with it.
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
-
Python versus Perl for text mining
July 19th, 2009
I love Python. However, when I’m working with large files (say > 100,000 records) I find that Perl is considerably faster for processing that text and uses much less memory. Perl was originally designed for text processing so I guess it is more optimized for that task. So my preprocessing is usually a mix of Perl and Python … then the app to automate and integrate that with a database is usually all Python and the front end is usually XHTML, CSS, Php and Javascript.
Different tools for different jobs I guess?
-
Downloading Project Gutenberg with wget
July 18th, 2009
I’m doing a text analysis project to look at word frequency and co-occurrence. I’m thinking of using 7 million medical abstracts from PubMed, 100,000 random web pages, a million random blog entries, Wikipedia and Project Gutenberg. I looked for some mirrors of Project Gutenberg and was going to wget the .txt files. A simple Linux statement like the following for each directory:
wget -nd -r -l1 --no-parent -A.txt "http://www.bsu.edu/libraries/gutenberg/default.asp?path=/gutenberg/etext00"
-nd no directory; by default wget creates a dir
-r recursively download
-l1 (L one) level 1; download only from that particular folder, don’t go any deeper
--no-parent I definitely don’t want the parent’s files
-A.txt means take only files with .txt extension
… but it turns out that Project Gutenberg makes it even easier. Info is on the page
http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages
To get all of the English language txt files I just need the following statement (the URL is quoted so the shell does not interpret the & and [] characters):
wget -w 2 -m "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"
This statement even creates a nice logical directory organization of the downloaded files on your system!!
Project Gutenberg is sweet!!!!
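For the word frequency and co-occurrence counting mentioned at the top of this post, here is a minimal Python sketch of one way it could be done. It is not the script I actually run. The function names, the simple regex tokenizer and the choice to count co-occurrence within a sentence are all assumptions for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Assumption: a "word" is a run of lowercase letters or apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and return its words in order."""
    return WORD_RE.findall(text.lower())


def word_frequency(texts):
    """Count how often each word appears across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def cooccurrence(texts):
    """Count unordered word pairs that share a sentence.

    Assumption: sentences end at '.', '!' or '?'. Each distinct pair
    is counted once per sentence, as an alphabetically sorted tuple.
    """
    pairs = Counter()
    for text in texts:
        for sentence in re.split(r"[.!?]+", text):
            words = sorted(set(tokenize(sentence)))
            pairs.update(combinations(words, 2))
    return pairs


if __name__ == "__main__":
    sample = ["The cat sat. The cat ran.", "A dog sat."]
    print(word_frequency(sample).most_common(3))
    print(cooccurrence(sample).most_common(3))
```

For a corpus the size of Project Gutenberg you would read the downloaded .txt files one at a time and feed them in, rather than holding everything in a list.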
-
Wordpress wp_posts SQL Hack
July 17th, 2009
Speeding up Wordpress by deleting all revisions.
The Wordpress db keeps all revisions. If you have a large database (say 100,000 records) you can speed it up by deleting the revisions. The code for that is:
DELETE FROM wp_posts WHERE post_type = 'revision';
Your actual wp_posts table may have a name like wp_xd7tjs_posts if your install uses a custom table prefix, which some people set to improve Wordpress security. If your wp_posts is named wp_xd7tjs_posts then the code is:
DELETE FROM wp_xd7tjs_posts WHERE post_type = 'revision';
-
wordpress.org
July 17th, 2009
wordpress.org is great software for developing most simple web sites. Wordpress’ strength lies in its simplicity. Many CMSs are over-engineered. Wordpress gets the job done for most web sites and is a great way to learn XHTML and Php.
-
Privacy Policy
July 17th, 2009
Abacadaba.com only keeps web traffic logs and uses Google AdSense. We require no information other than name and e-mail to register for the site.
Abacadaba.com uses Google AdSense. AdSense uses cookies to serve ads to this site. The following explains the Google AdSense use of cookies better than Abacadaba.com can. The following is quoted from the Google AdSense website:
What should I put in my privacy policy?
Your posted privacy policy should include the following information about Google and the DoubleClick DART cookie:
- Google, as a third party vendor, uses cookies to serve ads on your site.
- Google’s use of the DART cookie enables it to serve ads to your users based on their visit to your sites and other sites on the Internet.
- Users may opt out of the use of the DART cookie by visiting the Google ad and content network privacy policy.
-
peacesymbol.org
July 17th, 2009
abacadaba wrote all scripts and did all the design for peacesymbol.org
PeaceSymbol.org is a tribute to the peace symbol (aka the peace sign or CND logo). Corporations have used logos to effectively promote consumption. Likewise the peace symbol is a logo that can be used to promote peace and social justice.
PeaceSymbol.org believes that by providing peace related clip art and inspirational images one can evangelize the use of the peace sign on websites, posters, t-shirts, mugs, etc. which in turn will promote peace and social justice. PeaceSymbol.org welcomes contributions of peace related art and photos.
-
clipartist.net
July 17th, 2009
abacadaba wrote all scripts and did all the design for clipartist.net
clipartist.net aggregates, annotates and enhances public domain and creative commons clip art. This is an experiment in developing search for SVG images. commons.wikimedia.org and openclipart.org are great sites but difficult to search. clipartist.net develops Python and Perl scripts to collect, enhance (e.g. resize, recolor) and annotate (e.g. determine the colors in an SVG image) and add those as meta information for the image. clipartist.net also has Python scripts to automatically add posts to a Wordpress database.
This is an ongoing experiment and a sandbox for playing with SVG images. Exactly what clipartist.net does to work with SVG based clip art will depend on what seems doable and useful. clipartist.net basically tries stuff by writing scripts to play with SVG data and looks to see whether those scripts add any value to what exists on commons.wikimedia.org and openclipart.org
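As an example of the kind of annotation described above (determining the colors in an SVG image), here is a rough Python sketch. It is not the actual clipartist.net script. The function name svg_colors is made up for illustration, and it assumes colors are written as hex values in fill, stroke or inline style attributes. Named colors, rgb() values, gradients and CSS stylesheets are ignored.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Matches 6-digit or 3-digit hex colors such as #ff0000 or #f00.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")


def svg_colors(svg_text):
    """Count hex colors found in fill, stroke and style attributes."""
    counts = Counter()
    root = ET.fromstring(svg_text)
    for element in root.iter():
        for name in ("fill", "stroke", "style"):
            value = element.get(name, "")
            for color in HEX_COLOR.findall(value):
                # Normalize case so #FF0000 and #ff0000 count together.
                counts[color.lower()] += 1
    return counts


SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg">
  <rect fill="#FF0000" stroke="#000"/>
  <circle style="fill:#ff0000;stroke:#00ff00"/>
</svg>"""

if __name__ == "__main__":
    for color, count in svg_colors(SAMPLE).most_common():
        print(color, count)
```

The resulting counts could then be stored as meta information alongside the image.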
-
Abacadaba.com – Who we are?
July 17th, 2009
Abacadaba.com is a web consulting company for Nik Bear Brown, a UCLA Computer Science PhD. Abacadaba.com does teaching, web development and data analysis/data mining jobs. Contact us at info@abacadaba.com if you have a web project that requires more machine learning, data mining, or text mining expertise than is typically found in web programmers.

Mr Brown’s UCLA Computer Science research focus is on knowledge aggregation for biomedicine. Knowledge aggregation involves Internet programming, machine learning, data mining, text mining, distributed algorithms, and Web search. His major field is Computational and Systems Biology and his minor fields are Artificial Intelligence and Statistics.
Nik Bear Brown – Partial CV (full CV available upon request):
Education:
Ph.D. student in Computer Science, University of California, Los Angeles.
Fall 2007 to current. PhD expected Fall 2009.
GPA 3.6
M.S. in Computer Science, University of California, Los Angeles. December 2005. GPA 3.6
B.A. in Biochemistry and Molecular Biology, University of California, Santa Cruz
Fellowships:
NSF IGERT Fellow – Integrative Bioinformatics Training Program 2000-2003
Data Mining Skills:
Extensive experience in the analysis of gene expression data and the text mining of public data sources, data analysis and data mining. Strong database development and web development skills. I’ve used Python quite extensively in my own text mining research as well as for text mining/web robot and database development.
Web Development Skills:
I’ve taught “Programming for the Internet” for UCLA’s Department of Mathematics (PIC40A), CS 31 Introduction to Computer Science I and CS 32 Introduction to Computer Science II (first and second quarter C++) for UCLA’s Department of Computer Science. I have extensive knowledge of: Python, C++, Objective-C (iPhone), Php, Perl, LAMP (Linux, Apache, MySQL and PHP/Python), SQL, web development frameworks (Zend Framework, Ruby on Rails, Symfony, Django), Php templating engines (Smarty, PHPTemplate, XTemplate), JavaScript libraries (Dojo, Mootools, Prototype, MochiKit, script.aculo.us, Yahoo! User Interface Library – YUI, Google Maps), version control (Git, Subversion, CVS), CSS, XHTML, Javascript, Php, Ajax, JSON, XML, XML Schema Development, XSLT, DTD’s, DOM and Web Standards.
Relevant Course Work:
Computer Science
UCLA Computer Science 118 – Networking
UCLA Computer Science 130 – Software Engineering
UCLA Computer Science 161 – Artificial Intelligence
UCLA Computer Science 180 – Algorithms
UCLA Computer Science 181 – Complexity & Automata
UCLA Computer Science 240 – Data Bases
UCLA Computer Science 249 – Data Mining
UCLA Computer Science 263A – Statistical Language Processing
UCLA Computer Science 263B – Connectionist Language Processing
UCLA Computer Science 268 – Machine Perception
UCLA Computer Science 269 – Artificial Intelligence
UCLA Computer Science 286L – Biological Modeling
UCLA Computer Science M296A – Mathematical Modeling in Medicine
UCLA Computer Science M296B – Optimal Parameter Estimation
Mathematics & Statistics
UCLA Mathematics 131A – Real Analysis
UCLA Mathematics 151A&B – Applied Numerical Methods
UCLA Statistics 165 – Data Mining
UCLA Mathematics 170A – Probability Theory
UCLA Statistics 180 – Bayesian Statistics
UCLA Biomathematics 203 – Stochastic Models in Biology
UCLA Statistics 216 – High Dimensional Data Analysis
UCLA Biomathematics 220 – Kinetic Steady State Models
Statistics M254 – Statistical Methods in Computational Biology
UCLA Mathematics 270A – Mathematics of Scientific Computing
UCLA Biostatistics 278 – Analysis of DNA Microarray Data
Biology & Chemistry
UCLA Chemistry 202 – Bioinformatics
UCLA Microbiology CM233 – Biotechnology
UCLA Microbiology CM234 – Ethics in Biomedical Research
UCLA Physiological Science 235 – Dynamical Systems Modeling
UCLA Human Genetics 236 – Advanced Human Genetics
UCLA Physiology 250C – Critical Topics in Physiology
UCLA Chemistry M252 – Advanced Methodology in Computational Biology
UCLA Pathology 255 – Mapping the Human Genome
UCLA Microbiology & Immunology M262A – Immunobiology of Cancer
Computer Skills:
Databases
Oracle (UCLA Extension Oracle 8 Database Administration Certification – Summer 2000)
Microsoft Access, SQL, ODBC, mySQL, PostgreSQL, UML, XML
Computer languages
HTML/XHTML/XML (7+ years experience)
Python (5+ years experience)
C++/C (5+ years experience)
Perl (5+ years experience)
Javascript (3+ years experience)
Php (3+ years experience)
Other languages: LSL (Linden Scripting Language), Ruby, Java, Lisp, ColdFusion, Shell Scripts, S-plus, R, MatLab
Software Design
I understand and develop requirements and design documents. I understand object-oriented programming, design patterns and service-oriented architectures.
Development IDE
Eclipse, Emacs, Notepad++
Operating Systems
Windows 9x/NT/XP, Macintosh, UNIX, Linux
Graphics
Extensive experience with Adobe Illustrator and Photoshop. SVG programming. ImageMagick, Apache Batik.
Other
I understand Ajax, XHTML, Javascript, DOM programming, Web Services, Service Oriented Architectures, MVC Architectures.
-
Abacadaba.com – What we do?
July 16th, 2009
Abacadaba.com writes backend tools (usually in Python) that aggregate knowledge for web sites. Knowledge aggregation involves Internet programming, machine learning, data mining, text mining, distributed algorithms, and Web search.
Usually we display that data via a Wordpress front end, but as these tools interact directly with a database, any front end can be used.
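To give a feel for what "interacting directly with a database" can look like, here is a small Python sketch that inserts a post row. It is only an illustration, not one of our production tools. It uses an in-memory SQLite database as a stand-in for the MySQL database behind Wordpress, and the table is a cut-down version of wp_posts with only a handful of columns. The helper names create_posts_table and add_post are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def create_posts_table(conn):
    """Create a simplified stand-in for the Wordpress wp_posts table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS wp_posts (
               ID INTEGER PRIMARY KEY AUTOINCREMENT,
               post_date TEXT NOT NULL,
               post_title TEXT NOT NULL,
               post_content TEXT NOT NULL,
               post_status TEXT NOT NULL,
               post_type TEXT NOT NULL
           )"""
    )


def add_post(conn, title, content, status="publish"):
    """Insert one post and return its new ID."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    cursor = conn.execute(
        "INSERT INTO wp_posts "
        "(post_date, post_title, post_content, post_status, post_type) "
        "VALUES (?, ?, ?, ?, 'post')",
        (now, title, content, status),
    )
    conn.commit()
    return cursor.lastrowid


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_posts_table(conn)
    post_id = add_post(conn, "Peace symbol clip art", "<p>New SVG added.</p>")
    print("Inserted post", post_id)
```

Because the tool only writes rows, whatever front end reads the same table can display them.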
If you have a web project that needs data or automated tools please contact us at info@abacadaba.com